unit-5 notes
5.1 HBase
HBase is an open-source, distributed, non-relational, and scalable NoSQL database
system built on top of Apache Hadoop. It provides real-time read and write access to
large datasets, making it suitable for handling massive amounts of structured or semi-
structured data. HBase is modeled after Google's Bigtable and is often used for
applications that require low-latency access to vast amounts of data.
5.1.1 Key features of HBase include:
Column-Family Based Storage: HBase organizes data into tables, which consist of rows
and column families. Column families can have multiple columns, and each column can
have multiple versions, which allows efficient storage and retrieval of sparse data.
Linear Scalability: HBase is designed to scale horizontally across multiple nodes,
making it suitable for big data scenarios. As data grows, you can add more nodes
to the HBase cluster to handle the increased workload.
High Availability: HBase keeps its data in HDFS, which replicates every block across
multiple nodes. If a region server fails, its regions are reassigned to other servers
and served from the replicated data, so the data remains available.
Consistency: HBase provides strongly consistent reads and writes at the row level:
once a write to a row completes, any subsequent read of that row returns the
updated value. It is therefore not an eventually consistent store.
Fault Tolerance: HBase handles node failures by replicating data and
redistributing regions across the cluster. This fault-tolerance mechanism
ensures data durability.
Data Model: HBase is a column-oriented database, where each row key is
associated with multiple column families, and each column family can contain
multiple columns. Data in HBase is stored in a sorted order based on the row
keys, allowing efficient range scans.
Integration with Hadoop Ecosystem: HBase is part of the Apache Hadoop
ecosystem and can work seamlessly with other components like HDFS (Hadoop
Distributed File System), Hive, MapReduce, and Apache Spark.
Typical use cases for HBase include time-series data storage, sensor data storage,
Internet of Things (IoT) applications, real-time analytics, and other scenarios where
low-latency access to large-scale data is crucial.
HBase provides a Java API for data manipulation and can also be accessed using HBase
Shell or other client libraries. The query language used in HBase is not SQL-based like
traditional relational databases, but it offers filtering and scanning capabilities to
retrieve data based on row keys and column values.
5.2 HBASE DATA MODEL:
The data model of HBase is different from traditional relational databases and is
based on the principles of a column-family-based storage system. HBase organizes data
into tables, which consist of rows and column families. Understanding the HBase data
model is crucial for efficiently storing, accessing, and querying data. Here are the key
components of the HBase data model.
5.2.1 Table:
An HBase database consists of one or more tables. Each table is identified by a
unique name and contains rows of data. Tables in HBase are sparse, meaning they
don't require a fixed schema. Different rows can have different columns, and you can
add columns on the fly without affecting other rows.
5.2.2 Row Key:
Each row in an HBase table is uniquely identified by a row key. Row keys are used
to store and retrieve data and are generally sorted in lexicographic order.
Efficient row key design is crucial for optimal data retrieval and performance.
Row keys are typically strings or binary data.
5.2.3 Column Families:
HBase stores data in column families, which are groups of related columns. Each
table can have one or more column families. Column families must be defined when
creating a table, and once defined, the number of column families cannot be
changed. All rows in an HBase table share the same set of column families, though
not necessarily the same columns.
5.2.4 Columns:
Columns within a column family are identified by unique names. Unlike column
families, columns can be added or removed dynamically for each row without
affecting other rows. Columns are addressed using their column family and column
qualifier (name).
5.2.5 Versions:
HBase allows the storage of multiple versions of a cell (value) for a given row,
column family, and column qualifier. Each version of a cell is timestamped, allowing
data to be versioned and historically tracked. By default, HBase retains only the
most recent version, but you can configure the number of versions to keep.
5.2.6 Cells:
Cells are the basic unit of data storage in HBase. A cell consists of a combination
of row key, column family, column qualifier, timestamp, and value. The row key,
column family, and column qualifier together are called the "cell address" or "cell
key."
5.2.7 Regions:
To enable scalability and distribution, HBase divides a table into regions. Each
region is a subset of the table's data, and each region is stored on a separate
region server. As data grows, HBase dynamically splits regions to distribute the
data evenly across the cluster.
The HBase data model, with its column-family-based design and distributed
architecture, allows for scalable and efficient storage and retrieval of vast amounts
of data. When designing an HBase data model, careful consideration of row key design,
column family layout, and access patterns is essential to achieve optimal performance
and scalability for specific use cases.
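As a small illustration of this model, the following HBase shell session sketches a table with two column families, one of which keeps three versions of each cell (the table, family, and row names are made up for the example):
create 'users', {NAME => 'info', VERSIONS => 3}, {NAME => 'activity'}
put 'users', 'user1', 'info:name', 'Alice'
put 'users', 'user1', 'info:name', 'Alicia'
put 'users', 'user1', 'activity:last_login', '2023-01-01'
get 'users', 'user1', {COLUMN => 'info:name', VERSIONS => 3}
The final get returns both stored versions of info:name, newest first, each tagged with its timestamp.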
5.3 HBase implementation:
Implementing HBase involves setting up a distributed HBase cluster, designing
the data model, and interacting with the database using appropriate APIs or client
libraries. Below are the general steps to implement HBase:
Set Up a Hadoop Cluster:
HBase is built on top of Apache Hadoop, so you need to have a working Hadoop
cluster before setting up HBase. Install Hadoop on each node of the cluster
and ensure that the HDFS (Hadoop Distributed File System) is properly
configured and running.
Install HBase:
Download the latest version of HBase from the Apache HBase website.
Extract the HBase package on each node of the Hadoop cluster.
Configure HBase:
HBase comes with several configuration files, such as hbase-site.xml and hbase-env.sh
(hbase-default.xml holds the read-only defaults and should not be edited). Customize
hbase-site.xml and hbase-env.sh based on your cluster requirements, such as specifying
the ZooKeeper quorum, the HBase root directory on HDFS, and other HBase settings.
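A minimal hbase-site.xml sketch for a fully distributed cluster (the host names and the HDFS path are placeholders, not values taken from this text):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1,zk2,zk3</value>
  </property>
</configuration>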
Start HBase Services:
Start the HBase services on each node of the cluster. HBase has several
daemons, including the HMaster, RegionServers, and ZooKeeper, which work
together to manage the data storage and distribution.
Design the Data Model:
Design the HBase data model based on the requirements of your application.
Determine the tables, row keys, column families, and columns that will be used
to store the data. Careful consideration of data access patterns and
performance requirements is crucial in this step.
Create HBase Tables:
Using the HBase shell or HBase APIs, create the tables with the defined data
model. Specify the column families and other table properties during table
creation.
Interact with HBase:
To interact with HBase, you can use the HBase shell for simple operations or
use programming languages like Java, Python, or other supported languages to
connect to HBase using the appropriate client libraries (e.g., HBase Java API).
Through the client libraries, you can perform CRUD (Create, Read, Update,
Delete) operations, scan data, and interact with HBase tables
programmatically.
Monitor and Maintain the Cluster:
Regularly monitor the health and performance of the HBase cluster using
various monitoring tools provided with HBase. Keep an eye on cluster metrics,
node status, and data distribution to ensure smooth operation. Regularly
maintain the cluster by performing tasks like region splitting and compacting
to optimize data storage.
Backup and Disaster Recovery:
Implement a backup and disaster recovery strategy to ensure data safety in
case of node failures or other critical issues. Consider using Hadoop's HDFS
snapshot feature or external backup solutions for HBase data.
It's important to note that implementing HBase can be complex, especially in large-
scale production environments. It's advisable to refer to the official Apache HBase
documentation and seek expert guidance when deploying HBase in a production
environment.
5.3 HBase clients
HBase provides several client libraries and interfaces that allow applications
to interact with the HBase database. These clients enable developers to perform
CRUD (Create, Read, Update, Delete) operations, scanning, and other data
manipulation tasks. Here are some of the common HBase clients:
HBase Java API:
The HBase Java API is one of the primary and most commonly used client
libraries for HBase. It provides a comprehensive set of classes and methods to
interact with HBase programmatically using the Java programming language. The
Java API offers features like table creation, data insertion, data retrieval,
filtering, and administrative operations.
HBase Shell:
HBase Shell is a command-line interface that comes bundled with HBase. It
allows users to interact with HBase using simple commands. The shell provides
basic CRUD operations, scanning, and table administration commands. It's useful
for quick testing and prototyping.
HBase REST API:
HBase also provides a RESTful web service interface, known as the HBase REST
API. This allows applications to interact with HBase using HTTP methods (GET,
PUT, POST, DELETE) and JSON or XML payloads. The REST API is suitable for
web and mobile applications that need to access HBase data over the web.
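As a sketch, assuming the REST server has been started (for example with hbase rest start) and is listening on its default port 8080, a single row can be fetched over HTTP like this (the table, row, and host names are illustrative):
curl -H "Accept: application/json" http://localhost:8080/my_table/row1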
HBase Thrift API:
The HBase Thrift API is a cross-language interface that enables applications
to access HBase using Thrift, which is a software framework for scalable cross-
language services development. Thrift allows clients in different programming
languages (e.g., Java, Python, Ruby, C++, etc.) to communicate with HBase using
a common interface.
HBase Async API:
The HBase Async API is an asynchronous Java client library that provides non-
blocking access to HBase. It allows developers to perform operations
concurrently, which can be beneficial for applications that require high-
performance, asynchronous data access.
HBase MapReduce Integration:
HBase integrates with Apache Hadoop's MapReduce framework, allowing
MapReduce jobs to read data from HBase tables and write results back to
HBase. This integration is particularly useful for large-scale data processing
tasks that require data residing in HBase.
HBase Spark Integration:
Similar to HBase's integration with MapReduce, HBase can also be integrated
with Apache Spark. This allows Spark applications to read and write data from
HBase directly, facilitating real-time data processing and analytics.
When selecting an HBase client, consider the programming language and the specific
requirements of your application. For Java-based applications, the HBase Java API is
the most popular choice. For web applications, the HBase REST API might be more
suitable. Thrift API and other language-specific clients are helpful when working with
languages other than Java.
5.4 HBASE EXAMPLES:
Here are some examples of how to use HBase with the HBase Java API:
5.4.1 Initializing HBase Configuration:
Before using the HBase Java API, you need to initialize the HBase
configuration and create an HBase connection.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
public class HBaseExample {
    public static void main(String[] args) {
        try {
            // Initialize HBase configuration
            org.apache.hadoop.conf.Configuration config = HBaseConfiguration.create();
            // Create HBase connection
            Connection connection = ConnectionFactory.createConnection(config);
            // Use the connection for HBase operations
            // Don't forget to close the connection when done
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
5.4.2 Creating a Table and Adding Data:
This example creates a table with one column family, inserts two rows, and reads one of them back.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
public class HBaseExample {
    public static void main(String[] args) {
        try {
            // Initialize HBase configuration and create a connection (as shown in the previous example)
            org.apache.hadoop.conf.Configuration config = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(config);
            // Create an HBase table with one column family, cf1, if it does not already exist
            TableName tableName = TableName.valueOf("my_table");
            Admin admin = connection.getAdmin();
            if (!admin.tableExists(tableName)) {
                HTableDescriptor desc = new HTableDescriptor(tableName);
                desc.addFamily(new HColumnDescriptor("cf1"));
                admin.createTable(desc);
            }
            admin.close();
            Table table = connection.getTable(tableName);
            // Add data to the table
            Put put1 = new Put("row1".getBytes());
            put1.addColumn("cf1".getBytes(), "col1".getBytes(), "value1".getBytes());
            table.put(put1);
            Put put2 = new Put("row2".getBytes());
            put2.addColumn("cf1".getBytes(), "col1".getBytes(), "value2".getBytes());
            table.put(put2);
            // Retrieve data from the table
            Get get = new Get("row1".getBytes());
            Result result = table.get(get);
            byte[] value = result.getValue("cf1".getBytes(), "col1".getBytes());
            System.out.println("Value for row1: " + new String(value));
            // Don't forget to close the table and connection when done
            table.close();
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
5.4.3 Scanning Data:
You can use the HBase Scan class to perform a range scan on the table.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseExample {
    public static void main(String[] args) {
        try {
            // Initialize HBase configuration and create a connection (as shown in the first example)
            org.apache.hadoop.conf.Configuration config = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(config);
            // Open the HBase table
            TableName tableName = TableName.valueOf("my_table");
            Table table = connection.getTable(tableName);
            // Define the scan range: rows from "row1" (inclusive) up to "row3" (exclusive)
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes("row1"));
            scan.withStopRow(Bytes.toBytes("row3"));
            // Retrieve data using the scan
            ResultScanner scanner = table.getScanner(scan);
            for (Result result : scanner) {
                byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
                System.out.println("Value: " + new String(value));
            }
            // Don't forget to close the scanner, table, and connection when done
            scanner.close();
            table.close();
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
These are some basic examples of how to interact with HBase using the HBase Java
API.
5.5 PRAXIS:
"praxis" refers to applying the theoretical understanding of HBase's data
model, architecture, and features to real-world scenarios and use cases. It involves
practical implementation and utilization of HBase in various applications, enabling
developers and data engineers to leverage its capabilities effectively.
Here are some examples of praxis in HBase:
Data Modeling: Designing the HBase data model based on the specific
requirements of the application is a crucial aspect of praxis. This involves
determining the row key design, column families, and columns based on the
access patterns and query requirements. Praxis in data modeling ensures
efficient data storage and retrieval.
Table Creation and Management: Practicing the creation and management of
HBase tables involves defining schema, column families, and other table
properties using the HBase Java API or HBase shell. This praxis ensures that
tables are created optimally to suit the application's needs.
Data Ingestion: Implementing praxis in HBase data ingestion involves loading
data from various sources into HBase tables. It may include batch data loading
using tools like Apache HBase Bulk Load or real-time data ingestion using
frameworks like Apache Kafka and Apache HBase Kafka Connector.
Data Retrieval: Utilizing HBase APIs to perform CRUD operations and
retrieve data based on row keys, column families, and column qualifiers is a
practical application of praxis. This ensures that data is retrieved efficiently
for specific application use cases.
Secondary Indexing: Praxis in secondary indexing involves setting up secondary
indexes on HBase tables to facilitate efficient querying and searching based
on non-row-key attributes. This can be accomplished using techniques like HBase
Coprocessors or integrating with external indexing systems.
Data Versioning: Understanding and implementing data versioning in HBase is a
praxis that enables applications to maintain historical data and track changes
over time. It involves using timestamps for cells and efficiently managing data
versions.
Bulk and Incremental Processing: Leveraging HBase's integration with Apache
Hadoop and Apache Spark for bulk and incremental data processing is a praxis
to achieve efficient analytics and data transformations.
Fault Tolerance and Replication: Implementing praxis in HBase fault tolerance
involves setting up data replication across HBase regions and nodes, ensuring
data availability and durability in case of node failures.
Overall, praxis in HBase involves hands-on experience in designing data models,
creating tables, loading data, querying, and understanding the performance
implications of various HBase operations. It enables practitioners to effectively use
HBase in real-world applications and leverage its strengths in managing large-scale,
distributed data.
5.6 PIG
Pig provides an engine for executing data flows in parallel on Hadoop. It
includes a language, Pig Latin, for expressing these data flows. Pig Latin includes
operators for many of the traditional data operations (join, sort, filter, etc.), as
well as the ability for users to develop their own functions for reading, processing,
and writing data.
Pig is an Apache open source project. This means users are free to download it
as source or binary, use it for themselves, contribute to it, and—under the terms of
the Apache License—use it in their products and change it as they see fit.
5.6.1 Pig on Hadoop
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce. HDFS is a distributed filesystem
that stores files across all of the nodes in a Hadoop cluster. It handles breaking the
files into large blocks and distributing them across different machines, including
making multiple copies of each block so that if any one machine fails no data is lost.
By default, Pig reads input files from HDFS, uses HDFS to store intermediate data
between MapReduce jobs, and writes its output to HDFS.
MapReduce is a simple but powerful parallel data-processing paradigm. Every job in
MapReduce consists of three main phases: map, shuffle, and reduce. In the map phase,
the application has the opportunity to operate on each record in the input
separately. In the shuffle phase, which happens after the map phase, data is collected
together by the key the user has chosen and distributed to different machines for
the reduce phase. Every record for a given key will go to the same reducer. In the
reduce phase, the application is presented each key, together with all of the records
containing that key. Again this is done in parallel on many machines. After processing
each group, the reducer can write its output.
5.6.2 MapReduce’s hello world
Consider a simple MapReduce application that counts the number of times each
word
appears in a given text. This is the “hello world” program of MapReduce. In this
example the map phase will read each line in the text, one at a time. It will then split
out each word into a separate string, and, for each word, it will output the word and
a 1 to indicate it has seen the word one time. The shuffle phase will use the word as
the key, hashing the records to reducers. The reduce phase will then sum up the
number of times each word was seen and write that together with the word as
output. Let’s consider the case of the nursery rhyme “Mary Had a Little Lamb.” Our
input will be:
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go.
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin
scripts that users write into a series of one or more MapReduce jobs that it then
executes. As an example, consider a Pig Latin script that does a word count of "Mary Had a Little Lamb."
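A sketch of such a script, assuming the rhyme is stored in a file named mary (the alias names are illustrative):
lines = load 'mary' as (line:chararray);
words = foreach lines generate flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;
TOKENIZE splits each line into a bag of words, and flatten turns that bag into one record per word, so the group and COUNT steps mirror the shuffle and reduce phases described above.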
As a second example, consider a program that finds the five pages most visited by
users aged 18 to 25 (a reconstruction of the script appears after this walkthrough).
The first line of this program loads the file users and declares that this data
has two fields: name and age. It assigns the name Users to the input. The second
line applies a filter to Users that passes through records with an age between 18 and
25, inclusive. All other records are discarded. Now the data has only records of users
in the age range we are interested in. The results of this filter are named Fltrd.
The second load statement loads pages and names it Pages. It declares its
schema to
have two fields, user and url. The line Jnd = join joins together Fltrd and Pages using
Fltrd.name and Pages.user as the key. After this join we have found all the URLs each
user has visited. The line Grpd = group collects records together by URL. So for each
value of url, such as pignews.com/frontpage, there will be one record with a
collection of all records that
have that value in the url field. The next line then counts how many records are
collected together for each URL. So after this line we now know, for each URL, how
many times it was visited by users aged 18–25.
The next thing to do is to sort this from most visits to least. The line Srtd = order
sorts on the count value from the previous line and places it in desc (descending)
order. Thus the largest value will be first. Finally, we need only the top five pages, so
the last line limits the sorted results to only five records. The results of this are
then stored back to HDFS in the file top5sites.
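Putting the steps described in this walkthrough together, the script looks roughly like this (aliases not named above, such as Smmd and Top5, are illustrative):
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';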
5.7 GRUNT
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin
interactively and provides a shell for users to interact with HDFS.
To enter Grunt, invoke Pig with no script or command to run. Typing:
pig -x local
will result in the prompt:
grunt>
If you omit the -x
local and have a cluster configuration set in PIG_CLASSPATH, this will put you in a
Grunt shell that will interact with HDFS on your cluster. Grunt provides command-
line history and editing, as well as Tab completion. It does not provide filename
completion via the Tab key.
That is, if you type kil and then press the Tab key, it will complete the command as
kill. But if you have a file foo in your local directory and type ls fo, and then hit Tab,
it will not complete it as ls foo.
To exit Grunt you can type quit or enter Ctrl-D.
5.7.1 Entering Pig Latin Scripts in Grunt
One of the main uses of Grunt is to enter Pig Latin in an interactive session.
You can enter Pig Latin directly into Grunt. Pig will not start executing the Pig
Latin you enter until it sees either a store or dump. However, it will do basic syntax
and semantic checking to help you catch errors quickly. If you do make a mistake while
entering a line of Pig Latin in Grunt, you can reenter the line using the same alias,
and Pig will take the last instance of the line you enter. For example:
pig -x local
grunt> dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grunt> symbols = foreach dividends generate symbl;
...Error during parsing. Invalid alias: symbl ...
grunt> symbols = foreach dividends generate symbol;
...
5.7.2 HDFS Commands in Grunt
Grunt’s other major use is to act as a shell for HDFS. In versions 0.5 and later
of Pig, all hadoop fs shell commands are available. They are accessed using the keyword
fs. The dash (-) used in the hadoop fs is also required:
grunt> fs -ls
A number of the commands come directly from Unix shells and will operate in
ways that are familiar: chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, and stat. A few of
them either look like Unix commands you are used to but behave slightly differently
or are unfamiliar, including:
cat filename
Print the contents of a file to stdout. You can apply this command to a
directory and it will apply itself in turn to each file in the directory.
copyFromLocal localfile hdfsfile
Copy a file from your local disk to HDFS. This is done serially, not in parallel.
copyToLocal hdfsfile localfile
Copy a file from HDFS to your local disk. This is done serially, not in parallel.
rmr filename
Remove files recursively. This is equivalent to rm -r in Unix. Use this with caution.
In versions of Pig before 0.5, hadoop fs commands were not available. Instead, Grunt
had its own implementation of some of these commands: cat, cd, copyFromLocal,
copyToLocal, cp, ls, mkdir, mv, pwd, rm (which acted like Hadoop's rmr, not Hadoop's rm),
and rmf. As of Pig 0.8, all of these commands are still available. However, with the
exception of cd and pwd, these commands are deprecated in favor of using hadoop fs,
and they might be removed at some point in the future. In version 0.8, a new command
was added to Grunt: sh. This command gives you access to the local shell, just as fs
gives you access to HDFS.
5.7.3 Controlling Pig from Grunt
Grunt also provides commands for controlling Pig and MapReduce:
1. kill jobid
2. exec
3. run
1. kill jobid:
Kill the MapReduce job associated with jobid. The output of the pig command
that spawned the job will list the ID of each job it spawns. You can also find the job’s
ID by looking at Hadoop’s JobTracker GUI, which lists all jobs currently running on the
cluster. If your Pig job contains other MapReduce jobs that do not depend on the
killed MapReduce job, these jobs will still continue. If you want to kill all of the
MapReduce jobs associated with a particular Pig job, it is best to terminate the process
running Pig, and then use this command to kill any MapReduce jobs that are still
running. Make sure to terminate the Pig process with a Ctrl-C or a Unix kill, not a
Unix kill -9.
2. exec [-param param_name = param_value] [-param_file filename] script
Execute the Pig Latin script script. Aliases defined in script are not imported
into Grunt. This command is useful for testing your Pig Latin scripts while inside a
Grunt session.
3. run [-param param_name = param_value] [-param_file filename] script
Execute the Pig Latin script script in the current Grunt shell. Thus all aliases
referenced in script are available to Grunt, and the commands in script are
accessible via the shell history.
5.8 Pig's Data Model
5.8.1 Pig data types
Pig’s data types can be divided into two categories: scalar types and complex
types.
5.8.1.1 Scalar Types
Pig’s scalar types are simple types that appear in most programming languages.
With
the exception of bytearray, they are all represented in Pig interfaces by java.lang
classes, making them easy to work with in UDFs:
1.int
2.long
3.float
4.double
5.chararray
6.bytearray
1. int:
An integer. Ints are represented in interfaces by java.lang.Integer. They store
a four-byte signed integer. Constant integers are expressed as integer numbers, for
example, 42.
2. long
A long integer. Longs are represented in interfaces by java.lang.Long. They
store an eight-byte signed integer. Constant longs are expressed as integer numbers
with an L appended, for example, 5000000000L.
3.float
A floating-point number. Floats are represented in interfaces by
java.lang.Float and use four bytes to store their value. Constant floats are
expressed as a floating-point number with an f appended. Floating-point numbers can
be expressed in simple format, 3.14f, or in exponent format, 6.022e23f.
4. double
A double-precision floating-point number. Doubles are represented in
interfaces by java.lang.Double and use eight bytes to store their value. Constant
doubles are expressed as a floating-point number in either simple format, 2.71828, or
in exponent format, 6.626e-34.
5. chararray
A string or character array. Chararrays are represented in interfaces by
java.lang.String. Constant chararrays are expressed as string literals with single
quotes, for example, 'fred'. In addition to standard alphanumeric and symbolic
characters, we can express certain characters in chararrays by using backslash codes,
such as \t for Tab and \n for Return. Unicode characters can be expressed as \u
followed by their four-digit hexadecimal Unicode value. For example, the value for
Ctrl-A is expressed as \u0001.
6. bytearray
A blob or array of bytes. Bytearrays are represented in interfaces by a Java
class DataByteArray that wraps a Java byte[]. There is no way to specify a constant
bytearray.
5.8.1.2 Complex Types
Pig has three complex data types: maps, tuples, and bags
1. Maps
A map in Pig is a chararray to data element mapping, where that element can
be any Pig type, including a complex type. The chararray is called a key and is used as
an index to find the element, referred to as the value. Because Pig does not know
the type of the value, it will assume it is a bytearray. If the value is of a type other
than bytearray, Pig will figure that out at runtime and handle it. Map constants are
formed using brackets to delimit the map, a hash between keys and values, and a
comma between key-value pairs. For example, ['name'#'bob','age'#55] will create a map
with two keys, “name” and “age”. The first value is a chararray, and the second is an
integer.
2.Tuple:
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are
divided into fields, with each field containing one data element. These elements can
be of any type—they do not all need to be the same type. A tuple is analogous to a
row in SQL, with the fields being SQL columns. Because tuples are ordered, it is
possible to refer to the fields by position; Tuple constants use parentheses to
indicate the tuple and commas to delimit fields in the tuple. For example, ('bob', 55)
describes a tuple constant with two fields.
3. Bag:
A bag is an unordered collection of tuples. Because it has no order, it is not
possible to reference tuples in a bag by position. Like tuples, a bag can, but is not
required to, have a schema associated with it. In the case of a bag, the schema
describes all tuples within the bag.
Bag constants are constructed using braces, with tuples in the bag separated by
commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three
tuples, each with two fields. It is possible to mimic a set type using the bag, by
wrapping the desired type in a tuple of one field. Because bags are used to store
collections when grouping, bags can
become quite large. Pig has the ability to spill bags to disk when necessary, keeping
only partial sections of the bag in memory. The size of the bag is limited to the
amount of local disk available for spilling the bag.
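For example, a single load statement can declare fields of all three complex types in its schema (the field and file names here are illustrative; schema declaration is discussed in more detail in the schema section below):
players = load 'baseball' as (name:chararray,
stats:tuple(hr:int, rbi:int),
positions:bag{t:(p:chararray)},
bat:map[]);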
5.8.2 Nulls
Pig includes the concept of a data element being null. Data of any type can be
null. In Pig a null data element means the value is unknown. This might be because the
data is missing, an error occurred in processing it, etc. In most procedural languages, a
data value is said to be null when it is unset or does not point to a valid address or
object. This difference in the concept of null is important and affects the way Pig
treats null data, especially when operating on it.
5.8.3 Schemas
Pig has a very lax attitude when it comes to schemas. If a schema for the data
is available, Pig will make use of it, both for up-front error checking and for
optimization. But if no schema is available, Pig will still process the data, making the
best guesses it can based on how the script treats the data. The easiest way to
communicate the schema of your data to Pig is to explicitly tell Pig what it is when
you load the data:
dividends = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray, date:chararray, dividend:float);
Pig now expects your data to have four fields. If it has more, it will truncate the
extra ones. If it has fewer, it will pad the end of the record with nulls. It is also
possible to specify the schema without giving explicit data types, in which case the
data type is assumed to be bytearray:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
5.8.3.1 Schema syntax
when you declare a schema, you do not have to declare the schema of complex types,
but you can if you want to. For example, if your data has a tuple in it, you can
declare that field to be a tuple without specifying the fields it contains. You can
also declare that field to be a tuple that has three columns, all of which are
integers. The runtime declaration of schemas is very nice. It makes it easy for users
to operate on data without having to first load it into a metadata system. But for
production systems that run over the same data every hour or every day, it has a
couple of significant drawbacks. One, whenever your data changes, you have to change
your Pig Latin. Two, although this works fine on data with 5 columns, it is painful when
your data has 100 columns. To address these issues, there is another way to load
schemas in Pig. If the load function you are using already knows the schema of the
data, the function can communicate that to Pig. Load functions might already know
the schema because it is stored in a metadata repository such as HCatalog, or it might
be stored in the data itself. You can still refer to fields by name because Pig will
fetch the schema from the load function before doing error checking on your script:
mdata = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;
...
If you give a schema in the as clause and the load function also provides one, Pig will determine whether it can adapt the one returned by the loader to match
the one you gave. For example, if you specified a field as a long and the loader said it
was an int, Pig can and will do that cast. However, if it cannot determine a way to
make the loader’s schema fit the one you gave, it will give an error.
Consider what happens when no schema is given at all:
--no_schema.pig
daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;
In the expression $7 / 1000, 1000 is an integer, so it is a safe guess that the eighth
field of NYSE_daily is an integer or something that can be cast to an integer. In the
same way, $3 * 100.0 indicates $3 is a double, and the use of $0 in a function that
takes a chararray as an argument indicates the type of $0. But what about the last
expression, $6 - $3? The - operator is used only with numeric types in Pig, so Pig can
safely guess that $3 and $6 are numeric. But should it treat them as integers or
floating-point numbers? Here Pig plays it safe and guesses that they are floating
points, casting them to doubles. This is the safer bet because if they actually are
integers, those can be represented as floating-point numbers, but the reverse is not
true. However, because floating-point arithmetic is much slower and subject to loss of
precision, if these values really are integers, you should cast them so that Pig uses
integer types in this case. There are also cases where Pig cannot make any intelligent
guess:
--no_schema_filter
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;
The > operator used here is valid on numeric, chararray, and bytearray types in Pig Latin. So, Pig
has no way to make a guess. In this case, it treats these fields as if they were
bytearrays, which means it will do a byte-to-byte comparison of the data in these
fields. Pig also has to handle the case where it guesses wrong and must adapt on the
fly. Consider the following:
--unintended_walks.pig
player = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';
Because the values in maps can be of any type, Pig has no idea what
type bat#'base_on_balls' and bat#'ibbs' are. By the rules laid out previously, Pig will
assume they are doubles. But let’s say they actually turn out to be represented
internally as integers. Pig will need to adapt at runtime and convert what it thought
was a cast from bytearray to double into a cast from int to double. Note that it will
still produce a double output and not an int output. This might seem nonintuitive, but it follows from the fact that the expression's output type was already fixed as double when the script was parsed.
Finally, Pig’s knowledge of the schema can change at different points in the Pig Latin
script. In all of the previous examples where we loaded data without a schema and
then passed it to a foreach statement, the data started out without a schema. But
after the foreach, the schema is known. Similarly, Pig can start out knowing the
schema, but if the data is mingled with other data without a schema, the schema can
be lost. That is, lack of schema is contagious:
--no_schema_join.pig
divs = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily';
jnd = join divs by stock_symbol, daily by $1;
In this example, because Pig does not know the schema of daily, it cannot know the
schema of the join of divs and daily.
5.8.4 Casts
The unintended_walks example from the previous section can be made to use integer arithmetic by telling Pig explicitly what types to expect, via casts:
--unintended_walks_cast.pig
player = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';
The syntax for specifying types in casts is exactly the same as specifying them in
schemas. Not all conceivable casts are allowed. Casts to bytearray are never allowed,
because Pig does not know how to represent the various data types in binary format,
while casts from bytearray to any type are allowed. Casts to and from complex types
are currently not allowed, except from bytearray.
One type of casting that requires special treatment is casting from bytearray to
other types. Because bytearray indicates a string of bytes, Pig does not know how to
convert its contents to any other type. Continuing the previous example,
both bat#'base_on_balls' and bat#'ibbs' were loaded as bytearrays. The casts in the
script indicate that you want them treated as ints.
Pig does not know whether integer values in baseball are stored as ASCII strings,
Java serialized values, binary-coded decimal, or some other format. So it asks the load
function, because it is that function’s responsibility to cast bytearrays to other
types. In general this works nicely, but it does lead to a few corner cases where Pig
does not know how to cast a bytearray. In particular, if a UDF returns a bytearray,
Pig will not know how to perform casts on it because that bytearray is not
generated by a load function.
Before leaving the topic of casts, we need to consider cases where Pig inserts casts
for the user. These casts are implicit, compared to explicit casts where the user
indicates the cast. Consider the following:
--total_trade_estimate.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
rough = foreach daily generate volume * close;
In this case, Pig will change the second line to (float)volume * close to do the
operation without losing precision. In general, Pig will always widen types to fit when
it needs to insert these implicit casts. So, int and long together will result in a long;
int or long and float will result in a float; and int, long, or float and double will
result in a double. There are no implicit casts between numeric types and chararrays
or other types.
5.9 Pig Latin
5.9.1 Preliminary Matters
Pig Latin is a dataflow language. Each processing step results in a new data set,
or relation. In input = load 'data', input is the name of the relation that results
from loading the data set data. A relation name is referred to as an alias. Relation
names look like variables, but they are not. Once made, an assignment is permanent. It
is possible to reuse relation names; for example, this is legitimate:
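A sketch, using the NYSE_dividends data that appears elsewhere in these notes:
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);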
However, it is not recommended. It looks here as if you are reassigning A, but really
you are creating new relations called A, losing track of the old relations called A. It
leads to confusion when trying to read your programs and when reading error
messages.
Both relation and field names must start with an alphabetic character, and then they
can have zero or more alphabetic, numeric, or _ (underscore) characters. All
characters in the name must be ASCII.
Pig Latin cannot decide whether it is case-sensitive. Keywords in Pig Latin are
not case-sensitive; for example, LOAD is equivalent to load. But relation and field
names are. So A = load 'foo'; is not equivalent to a = load 'foo';. UDF names are also
case-sensitive, thus COUNT is not the same UDF as count.
5.9.3 Comments
Pig Latin has two types of comment operators: SQL-style single-line comments
(--) and Java-style multiline comments (/* */). For example:
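A sketch showing both styles (the load statements are just placeholders):
-- This single-line comment runs to the end of the line.
A = load 'foo'; -- it can also follow a statement
/* A multiline comment
   can span several lines. */
B = load 'bar';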
5.9.4.1 Load
The first step in a data flow is to tell Pig what to read. By default, the load
statement reads tab-delimited text from HDFS using the built-in load function
PigStorage, but a using clause lets you name a different load function. For example,
if you wanted to load your data from HBase, you would use the loader for HBase:
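A sketch using Pig's built-in HBaseStorage load function (the table name and column-family specification are illustrative; HBaseStorage takes the columns to read as its argument):
divs = load 'hbase://NYSE_dividends'
using org.apache.pig.backend.hadoop.hbase.HBaseStorage('div:*');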
For example, if you are reading comma-separated text data, PigStorage takes
an argument to indicate which character to use as a separator:
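For instance, to read the dividends file as comma-separated text:
divs = load 'NYSE_dividends' using PigStorage(',');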
The load statement also can have an as clause, which allows you to specify the
schema of the data you are loading.
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
PigStorage and TextLoader, the two built-in Pig load functions that operate
on HDFS files, support globs, so a single load statement can read multiple files
whose names match a pattern.
5.9.4.2 Store
After you have finished processing your data, you will want to write it out
somewhere. Pig provides the store statement for this purpose. In many ways it is the
mirror image of the load statement. By default, Pig stores your data on HDFS in a
tab-delimited file using PigStorage.
If you do not specify a store function, PigStorage will be used. You can specify a
different store function with a using clause:
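A sketch (the alias processed and the output locations are illustrative):
store processed into '/data/examples/processed';
store processed into 'processed' using PigStorage(',');
The first statement uses the default PigStorage and writes tab-delimited text; the second writes comma-separated text instead.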
5.9.4.3 Dump
In most cases you will want to store your data somewhere when you are done
processing it. But occasionally you will want to see it on the screen. This is
particularly useful during debugging and prototyping sessions. It can also be useful for
quick ad hoc jobs. dump directs the output of your script to your screen:
dump processed;
Up through version 0.7, the output of dump matches the format of constants in Pig
Latin. So, longs are followed by an L and floats by an F; maps are surrounded
by [] (brackets), tuples by () (parentheses), and bags by {} (braces).
5.9.4.4 Relational Operations
Relational operators are the main tools Pig Latin provides to operate on
your data. They allow you to transform it by sorting, grouping, joining, projecting, and
filtering. This section covers the basic relational operators
1.foreach
foreach takes a set of expressions and applies them to every record in the data
pipeline, hence the name foreach. For example, the following code loads an entire
record, but then removes all but the user and id fields from each record:
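A sketch of what that looks like (the field list and the input file name are assumptions):
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray, preferences:map[]);
B = foreach A generate user, id;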
2.Expressions in foreach
foreach supports an array of expressions. The simplest are constants and field
references.
prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
gain = foreach prices generate close - open;
gain2 = foreach prices generate $6 - $3;
Null values are viral for all arithmetic operators. That is, x + null = null for all values
of x.
Pig also provides a binary condition operator, often referred to as bincond. It
begins with a Boolean test, followed by a ?, then the value to return if the test is
true, then a :, and finally the value to return if the test is false.
2 == 2 ? 1 : 4 --returns 1
2 == 3 ? 1 : 4 --returns 4
To extract data from complex types, use the projection operators. For maps this
is # (the pound or hash), followed by the name of the key as a string:
bball = load 'baseball' as (name:chararray, team:chararray,
position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';
3.UDFs in foreach
User Defined Functions (UDFs) can be invoked in foreach. These are
called evaluation functions, or eval funcs.
-- udf_in_foreach.pig
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
--make sure all strings are uppercase
upped = foreach divs generate UPPER(symbol) as symbol, dividends;
grpd = group upped by symbol; --output a bag upped for each value of symbol
5.Filter
The filter statement allows you to select which records will be retained in
your data pipeline. A filter contains a predicate. If that predicate evaluates to true
for a given record, that record will be passed down the pipeline. Otherwise, it will
not.
Predicates can contain the equality operators you expect, including == to
test equality, and !=, >, >=, <, and <=. These comparators can be used on any scalar
data type. == and != can be applied to maps and tuples.
Pig Latin follows the operator precedence that is standard in most
programming languages, where arithmetic operators have precedence over equality
operators. So, x + y == a + b is equivalent to (x + y) == (a + b).
For chararrays, users can test to see whether the chararray matches a regular
expression:
-- filter_matches.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';
6.Group
The group statement collects together records with the same key. It is the first
operator we have looked at that shares its syntax with SQL, but it is important to
understand that the grouping operator in Pig Latin is fundamentally different than
the one in SQL.
-- count.pig
daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily by stock;
cnt = foreach grpd generate group, COUNT(daily);
That example groups records by the key stock and then counts them. It is just
as legitimate to group them and then store them for processing at a later time:
-- group.pig
daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily by stock;
store grpd into 'by_group';
You can also group on multiple keys, but the keys must be surrounded by
parentheses.
--twokey.pig
daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd = group daily by (exchange, stock);
avg = foreach grpd generate group, AVG(daily.dividends);
describe grpd;
grpd: {group: (exchange: bytearray,stock: bytearray),daily: {exchange: bytearray,
stock: bytearray,date: bytearray,dividends: bytearray}}
You can also use all to group together all of the records in your pipeline:
--countall.pig
daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily all;
cnt = foreach grpd generate COUNT(daily);
The record coming out of group all has the chararray literal all as a key.
7. Order by
The order statement sorts your data for you, producing a total order of your
output data. Total order means that not only is the data sorted in each partition of
your data, it is also guaranteed that all records in partition n are less than all
records in partition n - 1 for all n.
--order.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
bydate = order daily by date;
--order2key.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
bydatensymbol = order daily by date, symbol;
8.Distinct
The distinct statement is very simple. It removes duplicate records. It works
only on entire records, not on individual fields:
--distinct.pig
-- find a distinct list of ticker symbols for each exchange
-- This load will truncate the records, picking up just the first two fields.
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;
9.Join
join is one of the workhorses of data processing, and it is likely to be in many
of your Pig Latin scripts. join selects records from one input to put together with
records from another input. This is done by indicating keys for each input.
--join.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by symbol, divs by symbol;
Like foreach, join preserves the names of the fields of the inputs passed to
it. It also prepends the name of the relation the field came from, followed by a ::.
Adding describe jnd; to the end of the previous example produces:
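Because neither load statement above declared types, every field is a bytearray, and the output looks roughly like:
jnd: {daily::exchange: bytearray,daily::symbol: bytearray,daily::date: bytearray,daily::open: bytearray,daily::high: bytearray,daily::low: bytearray,daily::close: bytearray,daily::volume: bytearray,daily::adj_close: bytearray,divs::exchange: bytearray,divs::symbol: bytearray,divs::date: bytearray,divs::dividends: bytearray}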
Pig also supports outer joins, in which records that do not have a match on the other
side are still included, with nulls filled in for the missing fields. Outer joins can
be left, right, or full. For example, a left outer join keeps every record from the
left input even when it has no match on the right:
--leftjoin.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
Pig can also do multiple joins in a single operation, as long as they are all being joined
on the same key(s). This can be done only for inner joins:
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = load 'input3' as (e, f);
alpha = join A by x, B by u, C by e;
Self joins are supported, though the data must be loaded twice:
--selfjoin.pig
-- For each stock, find all dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends);
jnd = join divs1 by symbol, divs2 by symbol;
increased = filter jnd by divs1::date < divs2::date and
divs1::dividends < divs2::dividends;
10.Limit
Sometimes you want to see only a limited number of results. limit allows you to do this:
--limit.pig
divs = load 'NYSE_dividends';
first10 = limit divs 10;
The example here will return at most 10 lines (if your input has fewer than 10
lines total, it will return them all).
11.Sample
Sample offers a simple way to get a sample of your data. It reads through all
of your data but returns only a percentage of rows. What percentage it returns is
expressed as a double value, between 0 and 1. So, in the following
example, 0.1 indicates 10%:
--sample.pig
divs = load 'NYSE_dividends';
some = sample divs 0.1;
12.Parallel
One of Pig’s core claims is that it provides a language for parallel data
processing.
The parallel clause can be attached to any relational operator in Pig Latin. However,
it controls only reduce-side parallelism, so it makes sense only for operators that
force a reduce phase: group, order, distinct, join, limit, cogroup, and cross.
--parallel.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
bysymbl = group daily by symbol parallel 10;
To use a UDF packaged in a JAR, such as the Piggybank collection of user-contributed functions, first register the JAR so that Pig can find the class:
--register.pig
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
backwards = foreach divs generate
org.apache.pig.piggybank.evaluation.string.Reverse(symbol);
UDFs can also be written in scripting languages such as Python. Registering the script with a using jython clause gives its functions a namespace (here bballudfs) that is used when invoking them:
--batting_production.pig
register 'production.py' using jython as bballudfs;
players = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
nonnull = filter players by bat#'slugging_percentage' is not null and
bat#'on_base_percentage' is not null;
calcs = foreach nonnull generate name, bballudfs.production(
(float)bat#'slugging_percentage',
(float)bat#'on_base_percentage');
define can be used to give a UDF's long, fully qualified name a shorter alias:
--define.pig
register 'your_path_to_piggybank/piggybank.jar';
define reverse org.apache.pig.piggybank.evaluation.string.Reverse();
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
backwards = foreach divs generate reverse(symbol);
17. Calling Static Java Functions
Java has a rich collection of utilities and libraries. Because Pig is implemented
in Java, some of these functions can be exposed to Pig users. Any public static Java
function that takes no arguments or some combination
of int, long, float, double, String, or arrays thereof and
returns int, long, float, double, or String can be invoked in this way. Because Pig
Latin does not support overloading on return types, there is an invoker for each
return type: InvokeForInt, InvokeForLong, InvokeForFloat, InvokeForDouble,
and InvokeForString. You must pick the appropriate invoker for the type you wish
to return. For example, if you wanted to use Java’s Integer class to translate
decimal values to hexadecimal values, you could do:
--invoker.pig
define hex InvokeForString('java.lang.Integer.toHexString', 'int');
divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
close, volume, adj_close);
nonnull = filter divs by volume is not null;
inhex = foreach nonnull generate symbol, hex((int)volume);
5.10 DEVELOPING AND TESTING PIG LATIN SCRIPTS
5.10.1.1 Syntax Highlighting and Checking
If you add -c or -check to the command line, Pig will just parse and run semantic
checks on your script. The -dryrun command-line option will also check your syntax,
expand any macros and imports, and perform parameter substitution.
5.10.1.2 Describe
describe shows you the schema of a relation in your script. This can be very
helpful as you are developing your scripts. It is especially useful as you are learning
Pig Latin and understanding how various operators change the data. describe can be
applied to any relation in your script, and you can have multiple describes in a script:
--describe.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
trimmed = foreach divs generate symbol, dividends;
grpd = group trimmed by symbol;
avgdiv = foreach grpd generate group, AVG(trimmed.dividends);
describe trimmed;
describe grpd;
describe avgdiv;
5.10.1.3 Explain
Explain is particularly helpful when you are trying to optimize your scripts
or debug errors. There are two ways to use explain. You can explain any alias in your
Pig Latin script, which will show the execution plan Pig would use if you stored that
relation. You can also take an existing Pig Latin script and apply explain to the
whole script in Grunt. This has a couple of advantages.
--explain.pig
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
grpd = group divs by symbol;
avgdiv = foreach grpd generate group, AVG(divs.dividends);
store avgdiv into 'average_dividend';
5.10.1.4 Illustrate
illustrate takes a sample of your data and runs it through your script, adjusting the sample as it goes so that every operator sees some records; this makes it a quick way to check a script's logic on a small amount of data:
--illustrate.pig
divs = load 'NYSE_dividends' as (e:chararray, s:chararray, d:chararray, div:float);
recent = filter divs by d > '2009-01-01';
trimmd = foreach recent generate s, div;
grpd = group trimmd by s;
avgdiv = foreach grpd generate group, AVG(trimmd.div);
illustrate avgdiv;
5.10.1.5 Pig Statistics and MapReduce Job Status
At the end of every run, Pig produces a summary of statistics for the run, and the MapReduce jobs it spawns can be examined in Hadoop's JobTracker GUI. As an example, consider the following script:
--stats.pig
a = load '/user/pig/tests/data/singlefile/studenttab20m' as (name, age, gpa);
b = load '/user/pig/tests/data/singlefile/votertab10k'
as (name, age, registration, contributions);
c = filter a by age < '50';
d = filter b by age < '50';
e = cogroup c by (name, age), d by (name, age) parallel 20;
f = foreach e generate flatten(c), flatten(d);
g = group f by registration parallel 20;
h = foreach g generate group, SUM(f.d::contributions);
i = order h by $1, $0 parallel 20;
store i into 'student_voter_info';
Each of the MapReduce jobs spawned by this script appears in Hadoop's JobTracker GUI.
Clicking on the job ID will take you to a screen that summarizes the execution of the
job, including when the job started and stopped, how many maps and reduces it ran,
and the results of all of the counters.
5.10.2 Testing Your Scripts with PigUnit
PigUnit is a unit-testing framework for Pig Latin scripts that works together with JUnit. First, you need a script to test; the test shown below uses the following script:
--pigunit.pig
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
grpd = group divs all;
avgdiv = foreach grpd generate AVG(divs.dividends);
store avgdiv into 'average_dividend';
Second, you will need the pigunit.jar JAR file. This is not distributed as part of
the standard Pig distribution, but you can build it from the source code included in
your distribution. To do this, go to the directory your distribution is in and type ant
jar pigunit-jar. Once this is finished, there should be two files in the
directory: pig.jar and pigunit.jar. You will need to place these in your classpath when
running PigUnit tests. Third, you need data to run through your script. You can use an
existing input file, or you can manufacture some input in your test and run that
through your script.
// java/example/PigUnitExample.java
import java.io.IOException;
import org.apache.pig.pigunit.Cluster;
import org.apache.pig.pigunit.PigTest;
import org.apache.pig.tools.parameters.ParseException;
import org.junit.Test;

public class PigUnitExample {
    private PigTest test;
    private static Cluster cluster;

    @Test
    public void testDataInFile() throws ParseException, IOException {
        // Construct an instance of PigTest that will use the script pigunit.pig.
        test = new PigTest("../pigunit.pig");

        // Specify our expected output. The format is a string for each line.
        // In this particular case we expect only one line of output.
        String[] output = { "(0.27305267014925455)" };

        // Run the test and check that the output matches our expectation.
        // The "avgdiv" tells PigUnit what alias to check the output value
        // against. It inserts a store for that alias and then checks the
        // contents of the stored file against output.
        test.assertOutput("avgdiv", output);
    }
}
5.11 HIVE
Hive is a data warehousing and SQL-like data processing tool built on top of
Apache Hadoop. It was developed by Facebook to simplify querying and analyzing
large-scale datasets stored in the Hadoop Distributed File System (HDFS) or other compatible
storage systems.
Key features of Hive include:
HiveQL: Hive Query Language (HiveQL) is a SQL-like language used to write
queries for data processing. It allows users to express complex data
transformations and analytics tasks in a familiar SQL syntax.
Schema on Read: Hive provides a schema-on-read approach, which means the
schema is applied when data is read, rather than when it is ingested. This
flexibility allows Hive to handle semi-structured and unstructured data
efficiently.
Metastore: Hive maintains a metastore, typically backed by a relational
database, to store metadata about the tables, columns, partitions, and other
relevant information. This enables Hive to understand the structure of the
data and optimize query execution.
Data Partitioning and Buckets: Hive supports data partitioning and bucketing,
which improve query performance by organizing data into smaller, manageable
parts (a brief HiveQL sketch follows this list).
Integration with Hadoop Ecosystem: Hive seamlessly integrates with other
components of the Hadoop ecosystem, such as Hadoop Distributed File System
(HDFS), Apache HBase, and Apache Spark.
Extensibility: Hive is extensible, allowing users to add custom user-defined
functions (UDFs) and user-defined aggregates (UDAs) to perform specialized
operations on data.
Optimization: Hive optimizes query execution using techniques such as predicate
pushdown, partition pruning, and join optimization.
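As a rough HiveQL sketch of what partitioning and bucketing look like (the table and
column names here are purely illustrative, not taken from these notes):
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ip      STRING)
PARTITIONED BY (view_date STRING)       -- one warehouse subdirectory per date value
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- rows hashed on user_id into 32 files per partition
STORED AS ORC;
Queries that filter on view_date only read the matching partition directories, which is
where much of the performance benefit comes from.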
Hive is particularly useful for analysts and data engineers who are familiar with
SQL and want to leverage their SQL skills to work with big data. It abstracts
the complexities of the underlying distributed computing infrastructure and
allows users to focus on data analysis.
To use Hive, you typically interact with it using its command-line interface (CLI)
or through various data processing tools that support Hive connectivity. Hive
queries are translated into MapReduce jobs (or other processing engines like
Apache Tez or Apache Spark) for execution on the Hadoop cluster.
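To give a flavor of this, a statement can be typed at the CLI prompt, or a single query
or a script file can be submitted from the operating-system shell using the CLI's -e and
-f options (the table and file names here are hypothetical):
hive> SELECT symbol, AVG(price_close) FROM stocks GROUP BY symbol;
$ hive -e 'SHOW TABLES;'
$ hive -f daily_report.hql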
Keep in mind that Hive might not be the best choice for real-time data processing
due to its batch-oriented nature. For real-time or interactive analytics, other
technologies like Apache Spark with SparkSQL or Apache Impala might be more
suitable.
5.12 HIVE DATA TYPES AND FILE FORMATS:
In Hive, data types define the type of data that can be stored in a column,
and file formats determine how data is stored physically on disk. Hive supports various
data types and file formats to accommodate different use cases and optimize data
storage and processing. Below are some commonly used data types and file formats in
Hive:
5.12.1 Hive Data Types:
1. Primitive Data Types:
TINYINT: 1-byte signed integer (-128 to 127)
SMALLINT: 2-byte signed integer (-32,768 to 32,767)
INT or INTEGER: 4-byte signed integer (-2,147,483,648 to 2,147,483,647)
BIGINT: 8-byte signed integer (-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
FLOAT: 4-byte single-precision floating-point number
DOUBLE: 8-byte double-precision floating-point number
BOOLEAN: Boolean (true or false)
STRING: Variable-length character string
CHAR: Fixed-length character string
VARCHAR: Variable-length character string with a specified maximum length
DATE: Date value in the format 'YYYY-MM-DD'
TIMESTAMP: Timestamp value in the format 'YYYY-MM-DD HH:MM:SS.sss'
2. Complex Data Types:
ARRAY: Ordered collection of elements of the same data type
MAP: Collection of key-value pairs, where keys and values can have different
data types
STRUCT: Similar to a struct or record in programming, can have multiple named
fields with different data types
UNIONTYPE: A union of multiple data types
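A hedged sketch of how these complex types are declared and accessed (the table name,
columns, and map key are illustrative only):
CREATE TABLE employees_demo (
  name         STRING,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);
SELECT name,
       subordinates[0],              -- array element by index
       deductions['Federal Taxes'],  -- map value by key
       address.city                  -- struct field by name
FROM employees_demo;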
5.12.2 Hive File Formats:
TextFile: Default file format in Hive, which stores data in plain text format.
It is human-readable but not the most space-efficient format for large
datasets.
SequenceFile: A binary file format optimized for large datasets, offering
better compression and efficient serialization/deserialization. It is widely used
in the Hadoop ecosystem.
ORC (Optimized Row Columnar): ORC is a columnar storage format that
provides better compression and improved query performance. It organizes
data into columns, enabling efficient data retrieval for specific columns during
query execution.
Parquet: Parquet is another columnar storage format that offers efficient
compression and encoding techniques. It is commonly used in conjunction with
Apache Spark and other big data processing frameworks.
Avro: Avro is a data serialization system that allows schema evolution. It is a
binary format with a JSON-like schema definition, making it compact and
versatile.
RCFile (Record Columnar File): RCFile is a columnar storage format that splits
data into row groups, reducing the overhead of reading unnecessary columns
during query execution.
Choosing the appropriate data type and file format depends on your data
characteristics, query patterns, and storage and performance requirements. For
example, for analytical workloads with large datasets, ORC or Parquet are often
preferred due to their superior compression and columnar storage optimizations. On
the other hand, for smaller datasets or when human readability is a priority, TextFile
might be suitable.
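As a simple illustration (the tables below are hypothetical), the file format is chosen
with the STORED AS clause when a table is created; in reasonably recent Hive versions
the shorthand format names shown here are accepted:
CREATE TABLE sales_text (id BIGINT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE)
STORED AS ORC;
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE)
STORED AS PARQUET;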
5.13 HIVEQL DATA DEFINITION:
HiveQL is the Hive query language. Hive offers no support for row-level
inserts, updates, and deletes, and it does not support transactions. This section covers
the HiveQL data definition statements, which are used for creating, altering, and
dropping databases, tables, views, functions, and indexes.
5.13.1 Databases in Hive
The Hive concept of a database is essentially just a catalog or namespace of tables.
If you don’t specify a database, the default database is used. The simplest syntax for
creating a database is shown in the following example:
5.13.1.1 CREATE DATABASE
hive> CREATE DATABASE financials;
hive> CREATE DATABASE IF NOT EXISTS financials;
You can also use the keyword SCHEMA instead of DATABASE in all the
database-related commands.
hive> CREATE DATABASE human_resources;
hive> SHOW DATABASES;
default
financials
human_resources
Hive creates a directory for each database under its warehouse directory (for
example, /user/hive/warehouse/financials.db for the financials database). You can
override this default location for the new directory as shown in this example:
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';
You can add a descriptive comment to the database, which will be shown by the
DESCRIBE DATABASE <database> command:
hive> CREATE DATABASE financials
> COMMENT 'Holds all financial tables';
hive> DESCRIBE DATABASE financials;
financials Holds all financial tables
hdfs://master-server/user/hive/warehouse/financials.db
Note that DESCRIBE DATABASE also shows the directory location for the database.
If you are running in pseudo-distributed mode, then the master server will be
localhost. For local mode, the path will be a local path,
file:///user/hive/warehouse/financials.db.
The USE command sets a database as your working database, analogous to changing
working directories in a filesystem:
hive> USE financials;
Now, commands such as SHOW TABLES; will list the tables in this database. Finally, you
can drop a database:
hive> DROP DATABASE IF EXISTS financials;
hive> DROP DATABASE IF EXISTS financials CASCADE;
The CASCADE keyword tells Hive to drop any tables in the database first. Using the
RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where
existing tables must be dropped before the database can be dropped.
5.13.1.2 Alter Database
We can set key-value pairs in the DBPROPERTIES associated with a database
using the ALTER DATABASE command. No other metadata about the database can be
changed, including its name and directory location:
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');
There is no way to delete or “unset” a DBPROPERTY.
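The properties that have been set can be inspected with the EXTENDED form of
DESCRIBE DATABASE; the DBPROPERTIES appear at the end of its output:
hive> DESCRIBE DATABASE EXTENDED financials;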
5.13.1.3 Creating Tables
The CREATE TABLE statement follows SQL conventions, but Hive’s version
offers significant extensions, giving you flexibility in where the data files for a
table are stored, which file formats are used, and so on.
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
Hive automatically adds two table properties: last_modified_by holds the username of
the last user to modify the table, and last_modified_time holds the epoch time in
seconds of that modification.
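These automatic properties, along with anything set in TBLPROPERTIES, can be listed
with SHOW TBLPROPERTIES:
hive> USE mydb;
hive> SHOW TBLPROPERTIES employees;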
The SHOW TABLES command lists the tables. With no additional arguments, it shows
the tables in the current working database.
hive> USE mydb;
hive> SHOW TABLES;
employees
table1
table2
If we aren’t in the same database, we can still list the tables in that database:
hive> USE default;
hive> SHOW TABLES IN mydb;
employees
We can also use the DESCRIBE EXTENDED mydb.employees command to show details
about the table.
hive> DESCRIBE EXTENDED mydb.employees;
name string Employee name
salary float Employee salary
subordinates array<string> Names of subordinates
deductions map<string,float> Keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> Home address
Detailed Table Information Table(tableName:employees, dbName:mydb, owner:me,
...
location:hdfs://master-server/user/hive/warehouse/mydb.db/employees,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
last_modified_user=me, last_modified_time=1337544510,
comment:Description of the table, ...}, ...)
If you only want to see the schema for a particular column, append the column name
to the table name. Here, EXTENDED adds no additional output:
hive> DESCRIBE mydb.employees.salary;
salary float Employee salary
5.13.1.4 Managed Tables
The tables we have created so far are called managed tables or sometimes called
internal tables, because Hive controls the lifecycle of their data. When we drop a
managed table Hive deletes the data in the table.
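For example, dropping the managed employees table removes its metadata and also
deletes the data files under the table's warehouse directory:
hive> DROP TABLE IF EXISTS mydb.employees;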
5.13.1.5 External Tables
The following table declaration creates an external table that can read all
the data files for this comma-delimited data in /data/stocks: