

UNIT V HADOOP RELATED TOOLS

HBase – data model and implementations – HBase clients – HBase examples – praxis. Cassandra –
Cassandra data model – Cassandra examples – Cassandra clients – Hadoop integration. Pig –
Pig data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file
formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries.

HBASE

HBase is a column-oriented non-relational database management system that runs on top of


Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse
data sets, which are common in many big data use cases. It is well suited for real-time data
processing or random read/write access to large volumes of data.

Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java,
much like a typical Apache MapReduce application. HBase also supports writing applications in
Apache Avro, REST, and Thrift.

An HBase system is designed to scale linearly. It comprises a set of standard tables with rows
and columns, much like a traditional database. Each table must have an element defined as a
primary key, and all access attempts to HBase tables must use this primary key.

Avro, as a component, supports a rich set of primitive data types including: numeric, binary data
and strings; and a number of complex types including arrays, maps, enumerations and records. A
sort order can also be defined for the data.

HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase,
but if you're running a production cluster, it's suggested that you have a dedicated ZooKeeper
cluster that's integrated with your HBase cluster.

HBase works well with Hive, a query engine for batch processing of big data, to enable fault-
tolerant big data applications.

HBASE DATA MODEL AND IMPLEMENTATION

As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a
relational database, which contains rows and columns, it is not one: relational databases are
row-oriented while HBase is column-oriented. So, let us first understand the difference between
column-oriented and row-oriented databases:


Row-oriented vs column-oriented Databases:

 Row-oriented databases store table records in a sequence of rows. Whereas column-


oriented databases store table records in a sequence of columns, i.e. the entries in a
column are stored in contiguous locations on disks.

To understand this better, let us take an example and consider the table below.

If this table is stored in a row-oriented database, it will store the records as shown below:

1, Paul Walker, US, 231, Gallardo,

2, Vin Diesel, Brazil, 520, Mustang

In row-oriented databases data is stored on the basis of rows or tuples as you can see above.

While the column-oriented databases store this data as:

1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang

In a column-oriented database, all the values of a column are stored together: the first column's
values are stored together, then the second column's values are stored together, and the data in the
other columns is stored in a similar manner.

 When the amount of data is very huge, in terms of petabytes or exabytes, we use the
column-oriented approach, because the data of a single column is stored together and can
be accessed faster.
 The row-oriented approach comparatively handles a smaller number of rows and columns
efficiently, as a row-oriented database stores data in a structured format.
 When we need to process and analyze a large set of semi-structured or unstructured data,
we use the column-oriented approach, such as in applications dealing with Online Analytical
Processing (OLAP) like data mining, data warehousing, and analytics.


 Whereas Online Transactional Processing (OLTP) applications, such as those in the banking
and finance domains, which handle structured data and require transactional (ACID)
properties, use the row-oriented approach.

An HBase table has the following components:

 Tables: Data is stored in a table format in HBase, but here the tables are in column-oriented
format.
 Row Key: Row keys are used to search records, which makes searches fast.
 Column Families: Various columns are combined into a column family. These column
families are stored together, which makes the searching process faster because data
belonging to the same column family can be accessed together in a single seek.
 Column Qualifiers: Each column's name is known as its column qualifier.
 Cell: Data is stored in cells. The data is dumped into cells, which are specifically
identified by the row key and column qualifier.
 Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is
stored with its timestamp. This makes it easy to search for a particular version of the data.

More simply, we can say HBase consists of:

 Set of tables
 Each table with column families and rows
 The row key acts as a primary key in HBase.
 Any access to HBase tables uses this primary key.
 Each column qualifier present in HBase denotes an attribute of the object that resides
in the cell.

Now that you know about HBase Data Model, let us see how this data model falls in line with
HBase Architecture and makes it suitable for large storage and faster processing.


HBASE CLIENTS

5.1 CRUD Operations


The initial set of basic operations is often referred to as CRUD, which stands for create, read,
update, and delete.

5.1.1 Put Method


This group of operations can be split into separate types: those that work on single rows and
those that work on lists of rows.

 Single Puts

void put(Put put) throws IOException


It expects one or a list of Put objects that, in turn, are created with one of these constructors:

Put(byte[] row)
Put(byte[] row, RowLock rowLock)
Put(byte[] row, long ts)
Put(byte[] row, long ts, RowLock rowLock)

You need to supply a row to create a Put instance. A row in HBase is identified by a unique row
key and—as is the case with most values in HBase—this is a Java byte[] array.
Once you have created the Put instance you can add data to it. This is done using these methods:

Put add(byte[] family, byte[] qualifier, byte[] value)


Put add(byte[] family, byte[] qualifier, long ts, byte[] value)
Put add(KeyValue kv) throws IOException

Each call to add() specifies exactly one column or, in combination with an optional timestamp,
one single cell. Note that if you do not specify the timestamp with the add() call, the Put instance
will use the optional timestamp parameter from the constructor (also called ts); if that was not set
either, the region server sets the timestamp for you.
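For illustration, a minimal sketch of a single put is shown below. It assumes an HBase 0.90-style
cluster and an already created table named testtable with a column family colfam1 (both names
are made up for this example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "testtable");        // the table is assumed to exist already
    Put put = new Put(Bytes.toBytes("row1"));            // row key
    // one cell: column family, column qualifier, value
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
    table.put(put);                                      // single put
    table.close();
  }
}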


 The KeyValue class

From your code you may have to deal with KeyValue instances directly. These instances contain
the data as well as the coordinates of one specific cell. The coordinates are the row key, name of
the column family, column qualifier, and timestamp. The class provides a plethora of
constructors that allow you to combine all of these in many variations. The fully specified
constructor looks like this:

KeyValue(byte[] row, int roffset, int rlength,


byte[] family, int foffset, int flength, byte[] qualifier, int qoffset,
int qlength, long timestamp, Type type, byte[] value, int voffset, int vlength)


The client API has the ability to insert single Put instances, but it also has the advanced feature
of batching operations together. This comes in the form of the following call:

void put(List<Put> puts) throws IOException
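A short sketch of this list-based variant, reusing the hypothetical table and column family from the
single-put example above (java.util.List and java.util.ArrayList are assumed to be imported):

List<Put> puts = new ArrayList<Put>();
Put put1 = new Put(Bytes.toBytes("row1"));
put1.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
puts.add(put1);
Put put2 = new Put(Bytes.toBytes("row2"));
put2.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val2"));
puts.add(put2);
table.put(puts);   // both mutations are sent to the server in one client call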

 Atomic compare-and-set

There is a special variation of the put calls that warrants its own section: check and put. The
method signature is:

boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier,


byte[] value, Put put) throws IOException

This call allows you to issue atomic, server-side mutations that are guarded by an accompanying
check. If the check passes successfully, the put operation is executed; otherwise, it aborts the
operation completely. It can be used to update data based on current, possibly related, values.
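A sketch of how checkAndPut() might be used against the same hypothetical table: the new value
is written only if colfam1:qual1 of row1 currently holds "val1":

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val2"));
boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"),
    Bytes.toBytes("val1"),   // expected current value; the put is applied only if it matches
    put);
System.out.println("Put applied: " + applied);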

5.1.2 Get Method


The next step in a client API is to retrieve what was just saved. For that, HTable provides
you with the Get call and matching classes. The operations are split into those that
operate on a single row and those that retrieve multiple rows in one call.

 Single Gets

First, the method that is used to retrieve specific values from an HBase table:

Result get(Get get) throws IOException


Similar to the Put class for the put() call, there is a matching Get class used by the
aforementioned get() function. As another similarity, you will have to provide a row key when
creating an instance of Get, using one of these constructors:

Get(byte[] row)
Get(byte[] row, RowLock rowLock)

 The Result class

When you retrieve data using the get() calls, you receive an instance of the Result class that
contains all the matching cells. It provides you with the means to access everything that was
returned from the server for the given row and matching the specified query, such as column
family, column qualifier, timestamp, and so on.
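A sketch of a single get and of reading the value back out of the Result, again using the hypothetical
table from the put examples (Get and Result come from org.apache.hadoop.hbase.client):

Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));   // restrict the get to one column
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
System.out.println("Value: " + Bytes.toString(val));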

 List of Gets

Another similarity to the put() calls is that you can ask for more than one row using a single
request. This allows you to quickly and efficiently retrieve related—but also completely random,
if required—data from the remote servers.
The method provided by the API has the following signature:

Result[] get(List<Get> gets) throws IOException
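A sketch of a batched read for the two hypothetical rows inserted earlier:

List<Get> gets = new ArrayList<Get>();
gets.add(new Get(Bytes.toBytes("row1")));
gets.add(new Get(Bytes.toBytes("row2")));
Result[] results = table.get(gets);         // one round trip for both rows
for (Result r : results) {
  System.out.println(Bytes.toString(r.getRow()));
}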

5.1.3 Delete Method

This method is used to delete data from HBase tables.

 Single Deletes

The variant of the delete() call that takes a single Delete instance is:

void delete(Delete delete) throws IOException

Just as with the get() and put() calls you saw already, you will have to create a Delete instance
and then add details about the data you want to remove. The constructors are:

Delete(byte[] row)
Delete(byte[] row, long timestamp, RowLock rowLock)

You need to provide the row you want to modify, and optionally provide a rowLock, an instance
of RowLock to specify your own lock details, in case you want to modify the same row more
than once subsequently.


 List of Deletes

The list-based delete() call works very similarly to the list-based put(). You need to create a list
of Delete instances, configure them, and call the following method:

void delete(List<Delete> deletes) throws IOException

 Atomic compare-and-delete

There is an equivalent call for deletes that gives you access to server-side, read-and-modify
functionality:

boolean checkAndDelete(byte[] row, byte[] family, byte[] qualifier,


byte[] value, Delete delete) throws IOException

You need to specify the row key, column family, qualifier, and value to check before the actual
delete operation is performed. Should the check fail, nothing is deleted and the call returns false.
If the check is successful, the delete is applied and true is returned.
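A sketch of a plain delete followed by a guarded checkAndDelete(), continuing the hypothetical
table from the earlier examples (Delete comes from org.apache.hadoop.hbase.client):

Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumns(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));   // remove all versions of this column
table.delete(delete);

// guarded delete: row2 is removed only if colfam1:qual1 currently equals "val2"
Delete guarded = new Delete(Bytes.toBytes("row2"));
boolean deleted = table.checkAndDelete(Bytes.toBytes("row2"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"),
    Bytes.toBytes("val2"), guarded);
System.out.println("Delete applied: " + deleted);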

5.1.4 Row Locks

Mutating operations—like put(), delete(), checkAndPut(), and so on —are executed exclusively,


which means in a serial fashion, for each row, to guarantee row-level atomicity. The region
servers provide a row lock feature ensuring that only a client holding the matching lock can
modify a row. In practice, though, most client applications do not provide an explicit lock, but
rather rely on the mechanism in place that guards each operation separately. When you send, for
example, a put() call to the server with an instance of Put, created with the following constructor:

Put(byte[] row)

which does not provide a RowLock instance parameter, the servers will create a lock on your
behalf, just for the duration of the call. In fact, from the client API you cannot even retrieve this
short-lived, server-side lock instance.


HBase Example

Let's see an HBase example that imports data from a file into an HBase table.

Use Case

We have to import the data present in the file into an HBase table by creating the table through the Java API.

Data_file.txt contains the below data

1,India,Bihar,Champaran,2009,April,P1,1,5
2,India,Bihar,Patna,2009,May,P1,2,10
3,India,Bihar,Bhagalpur,2010,June,P2,3,15
4,United States,California,Fresno,2009,April,P2,2,5
5,United States,California,Long Beach,2010,July,P2,4,10
6,United States,California,San Francisco,2011,August,P1,6,20

This data has to be loaded into a new HBase table created through the Java API. The following
column families have to be created:

sample, region, time, product, sale, profit

Column family region has three column qualifiers: country, state, city

Column family time has two column qualifiers: year, month

The Java code is outlined below, after the required jar files.

Jar Files

Make sure that the following jars are present while writing the code, as they are required by
HBase:

a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0
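The original Java listing is not reproduced in this extract; the following is only a rough sketch of
how such an import could look with the 0.90-era client API. The table name sample and the
qualifier names used for the product, sale, and profit families are assumptions made for
illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseImportSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // create the table with the column families described above
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("sample");          // assumed table name
    for (String family : new String[] { "region", "time", "product", "sale", "profit" }) {
      desc.addFamily(new HColumnDescriptor(family));
    }
    admin.createTable(desc);

    // read Data_file.txt line by line and insert one row per line
    HTable table = new HTable(conf, "sample");
    BufferedReader reader = new BufferedReader(new FileReader("Data_file.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] f = line.split(",");
      Put put = new Put(Bytes.toBytes(f[0].trim()));                 // id as the row key
      put.add(Bytes.toBytes("region"), Bytes.toBytes("country"), Bytes.toBytes(f[1].trim()));
      put.add(Bytes.toBytes("region"), Bytes.toBytes("state"), Bytes.toBytes(f[2].trim()));
      put.add(Bytes.toBytes("region"), Bytes.toBytes("city"), Bytes.toBytes(f[3].trim()));
      put.add(Bytes.toBytes("time"), Bytes.toBytes("year"), Bytes.toBytes(f[4].trim()));
      put.add(Bytes.toBytes("time"), Bytes.toBytes("month"), Bytes.toBytes(f[5].trim()));
      put.add(Bytes.toBytes("product"), Bytes.toBytes("name"), Bytes.toBytes(f[6].trim()));    // assumed qualifier
      put.add(Bytes.toBytes("sale"), Bytes.toBytes("quantity"), Bytes.toBytes(f[7].trim()));   // assumed qualifier
      put.add(Bytes.toBytes("profit"), Bytes.toBytes("amount"), Bytes.toBytes(f[8].trim()));   // assumed qualifier
      table.put(put);
    }
    reader.close();
    table.close();
  }
}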


PRAXIS.CASSANDRA

Apache Cassandra is an open source, distributed and decentralized storage system (database) for
managing very large amounts of structured data spread out across the world. It provides a highly
available service with no single point of failure.

Listed below are some of the notable points of Apache Cassandra −

 It is scalable, fault-tolerant, and consistent.


 It is a column-oriented database.
 Its distribution design is based on Amazon's Dynamo and its data model on Google's
Bigtable.
 Created at Facebook, it differs sharply from relational database management systems.
 Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful "column family" data model.
 Cassandra is used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra

Cassandra has become so popular because of its outstanding technical features. Given below are
some of the features of Cassandra:

 Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.
 Always on architecture − Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
 Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
 Flexible data storage − Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes
to your data structures according to your need.
 Easy data distribution − Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.
 Transaction support − Cassandra supports properties like Atomicity, Consistency,
Isolation, and Durability (ACID).
 Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.

History of Cassandra

 Cassandra was developed at Facebook for inbox search.


 It was open-sourced by Facebook in July 2008.
 Cassandra was accepted into Apache Incubator in March 2009.


 It has been an Apache top-level project since February 2010.

The design goal of Cassandra is to handle big data workloads across multiple nodes without any
single point of failure. Cassandra has peer-to-peer distributed system across its nodes, and data is
distributed among all the nodes in a cluster.

 All the nodes in a cluster play the same role. Each node is independent and at the same
time interconnected to other nodes.
 Each node in a cluster can accept read and write requests, regardless of where the data is
actually located in the cluster.
 When a node goes down, read/write requests can be served from other nodes in the
network.

Data Replication in Cassandra

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is
detected that some of the nodes responded with an out-of-date value, Cassandra will return the
most recent value to the client. After returning the most recent value, Cassandra performs a read
repair in the background to update the stale values.

The following figure shows a schematic view of how Cassandra uses data replication among the
nodes in a cluster to ensure no single point of failure.


Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.

Components of Cassandra

The key components of Cassandra are as follows −

 Node − It is the place where data is stored.


 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data centers.
 Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
 Mem-table − A mem-table is a memory-resident data structure. After commit log, the
data will be written to the mem-table. Sometimes, for a single-column family, there will
be multiple mem-tables.
 SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
 Bloom filter − These are nothing but quick, nondeterministic algorithms for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters are
accessed after every query.

Cassandra Query Language

Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers use cqlsh, a prompt to
work with CQL, or separate application language drivers.

Clients approach any of the nodes for their read-write operations. That node (coordinator) plays a
proxy between the client and the nodes holding the data.

Write Operations

Every write activity of nodes is captured by the commit logs written in the nodes. Later the data
will be captured and stored in the mem-table. Whenever the mem-table is full, data will be
written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary
data.

Read Operations

During read operations, Cassandra gets values from the mem-table and checks the bloom filter to
find the appropriate SSTable that holds the required data.


CASSANDRA DATA MODEL

The data model of Cassandra is significantly different from what we normally see in an RDBMS.
This chapter provides an overview of how Cassandra stores its data.

Cluster

Cassandra database is distributed over several machines that operate together. The outermost
container is known as the Cluster. For failure handling, every node contains a replica, and in case
of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format,
and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in
Cassandra are −

 Replication factor − It is the number of machines in the cluster that will receive copies
of the same data.
 Replica placement strategy − It is nothing but the strategy to place replicas in the ring.
We have strategies such as simple strategy (rack-unaware strategy), old network topology
strategy (rack-aware strategy), and network topology strategy (datacenter-shared
strategy).
 Column families − Keyspace is a container for a list of one or more column families. A
column family, in turn, is a container of a collection of rows. Each row contains ordered
columns. Column families represent the structure of your data. Each keyspace has at least
one and often many column families.

The syntax of creating a Keyspace is as follows −

CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
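As an illustration only, the same keyspace (and a table inside it) could be created from a Java
client roughly as follows, assuming the DataStax Java driver (3.x series) is on the classpath and a
Cassandra node is listening on 127.0.0.1; the keyspace and table names are examples:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateKeyspaceSketch {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();
    // the same CQL as above, executed through the driver
    session.execute("CREATE KEYSPACE IF NOT EXISTS tutorial "
        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};");
    session.execute("CREATE TABLE IF NOT EXISTS tutorial.emp (emp_id int PRIMARY KEY, emp_name text);");
    cluster.close();
  }
}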

The following illustration shows a schematic view of a Keyspace.


Column Family

A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns. The following table lists the points that differentiate a column family from
a table of relational databases.

Relational Table vs. Cassandra Column Family

 Relational table: A schema in a relational model is fixed; once we define certain columns for a
table, while inserting data, every row must have all the columns filled, at least with a null value.
Cassandra column family: Although the column families are defined, the columns are not; you can
freely add any column to any column family at any time.
 Relational table: Relational tables define only columns, and the user fills in the table with
values. Cassandra column family: In Cassandra, a table contains columns, or can be defined as a
super column family.

A Cassandra column family has the following attributes −

 keys_cached − It represents the number of locations to keep cached per SSTable.


 rows_cached − It represents the number of rows whose entire contents will be cached in
memory.
 preload_row_cache − It specifies whether you want to pre-populate the row cache.

Note − Unlike relational tables, a column family's schema is not fixed; Cassandra does not
force individual rows to have all the columns.

The following figure shows an example of a Cassandra column family.

Column

A column is the basic data structure of Cassandra with three values, namely key or column name,
value, and a time stamp. Given below is the structure of a column.


SuperColumn

A super column is a special column, therefore, it is also a key-value pair. But a super column
stores a map of sub-columns.

Generally, column families are stored on disk in individual files. Therefore, to optimize
performance, it is important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful here. Given below is the structure of a super
column.

CASSANDRA EXAMPLES

Cassandra Data Model Rules

In Cassandra, writes are not expensive. Cassandra does not support joins, group by, OR clause,
aggregations, etc. So you have to store your data in such a way that it should be completely
retrievable. So these rules must be kept in mind while modelling data in Cassandra.

Maximize the number of writes

In Cassandra, writes are very cheap, and Cassandra is optimized for high write performance. So try
to maximize your writes for better read performance and data availability. There is a tradeoff
between data writes and data reads. So, optimize your data read performance by maximizing the
number of data writes.

Maximize Data Duplication

Data denormalization and data duplication are the de facto approach in Cassandra. Disk space is
cheaper than memory, CPU processing, and I/O operations. As Cassandra is a distributed
database, data duplication provides instant data availability and no single point of failure.


Cassandra Data Modeling Goals

You should have following goals while modelling data in Cassandra:

Spread Data Evenly Around the Cluster

You want an equal amount of data on each node of the Cassandra cluster. Data is spread to different
nodes based on the partition key, which is the first part of the primary key. So, try to choose a
primary key that spreads the data evenly around the cluster.

Minimize number of partitions read while querying data

A partition is a group of records with the same partition key. When a read query is issued, it
collects data from different nodes from different partitions.

If there will be many partitions, then all these partitions need to be visited for collecting the
query data.

It does not mean that partitions should not be created. If your data is very large, you can't keep
that huge amount of data on a single partition; the single partition would be slowed down.

So try to choose a balanced number of partitions.

Good Primary Key in Cassandra

Let's take an example and find which primary key is good.

Here is the table MusicPlaylist.

Create table MusicPlaylist


(
SongId int,
SongName text,
Year int,
Singer text,
Primary key(SongId, SongName)
);

In the above example, for the table MusicPlaylist:

 SongId is the partition key, and
 SongName is the clustering column.
 Data will be clustered on the basis of SongName. Only one partition will be created with
each SongId; there will not be any other partition in the table MusicPlaylist.

Data retrieval will be slow with this data model due to the bad primary key.


Here is another table MusicPlaylist.

Create table MusicPlaylist


(
SongId int,
SongName text,
Year int,
Singer text,
Primary key((SongId, Year), SongName)
);

In the above example, for the table MusicPlaylist:

 SongId and Year together form the partition key, and
 SongName is the clustering column.
 Data will be clustered on the basis of SongName. In this table, a new partition will be
created for each year. All the songs of the year will be on the same node. This primary key will
be very useful for the data.

Our data retrieval will be fast with this data model.

Model Your Data in Cassandra

Following things should be kept in mind while modelling your queries:

Determine what queries you want to support

First of all, determine what queries you want.

For example, do you need:

 Joins
 Group by
 Filtering, and on which columns, etc.

Create table according to your queries

Create table according to your queries. Create a table that will satisfy your queries. Try to create
a table in such a way that a minimum number of partitions needs to be read.

Handling One to One Relationship in Cassandra

A one-to-one relationship means two tables have one-to-one correspondence. For example, a
student can register for only one course, and we want to search for the course in which a
particular student is registered.


So in this case, your table schema should encompass all the details of the student corresponding
to that particular course, like the name of the course, roll number of the student, student
name, etc.

Create table Student_Course
(
Student_rollno int primary key,
Student_name text,
Course_name text
);

Handling One to Many Relationship in Cassandra

A one-to-many relationship means having one-to-many correspondence between two tables.

For example, a course can be studied by many students. I want to search all the students that are
studying a particular course.

So by querying on course name, I will have many student names that will be studying a
particular course.

Create table Student_Course
(
Student_rollno int,
Student_name text,
Course_name text,
Primary key(Course_name, Student_rollno)
);

I can retrieve all the students for a particular course by the following query.

Select * from Student_Course where Course_name='Course Name';

Handling Many to Many Relationship in Cassandra

A many-to-many relationship means having many-to-many correspondence between two tables.

For example, a course can be studied by many students, and a student can also study many
courses.

Many to Many
Relationship in Cassandra

I want to search all the students that are studying a particular course. Also, I want to search all
the courses that a particular student is studying.

So in this case, I will have two tables, i.e., I divide the problem into two cases.

First, I will create a table by which you can find courses by a particular student.

Create table Student_Course
(
Student_rollno int primary key,
Student_name text,
Course_name text
);

I can find all the courses by a particular student by the following query.

Select * from Student_Course where student_rollno=rollno;


Second, I will create a table by which you can find the students who are studying a particular
course.

Create table Course_Student


(
Course_name text primary key,
Student_name text,
student_rollno int
);

I can find the students in a particular course with the following query.

Select * from Course_Student where Course_name='CourseName';
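If the same lookup were issued from Java instead of cqlsh, it might look roughly like the sketch
below (again assuming the DataStax Java driver 3.x; the keyspace name school and the course
name are made up for the example):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CourseStudentQuerySketch {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("school");   // hypothetical keyspace holding Course_Student
    ResultSet rs = session.execute(
        "SELECT * FROM Course_Student WHERE Course_name = 'Data Science';");
    for (Row row : rs) {
      System.out.println(row.getString("student_name") + " - " + row.getInt("student_rollno"));
    }
    cluster.close();
  }
}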

CASSANDRA CLIENT

The RazorSQL Apache Cassandra database client and query tool includes a Cassandra database
browser, SQL editor, table editor, Cassandra import and export tools, Cassandra backup tools,
and other custom Cassandra GUI tools. Listed below are more details on these features.

Cassandra SQL Editor


Execute SQL select, insert, update, and delete statements against Cassandra tables. The SQL
editor includes auto column lookup, auto table lookup, and support for over 20 programming
languages such as SQL, PHP, HTML, XML, Java, and more.

Cassandra Database Browser

Browse the details of a Cassandra database cluster including schemas, tables, materialized
views, columns, and key information. Easily edit, describe, backup, query tables, create
tables, and compare tables or query results with the click of the mouse.

Cassandra GUI Tools

 A GUI Cassandra create table tool that generates Cassandra-specific create table SQL
that includes such elements as column names, column types, primary key, and identity
data.
 A GUI Cassandra create keyspace / schema tool that generates Cassandra create
keyspace / schema statements using replication class, replication center, data center
information, etc.

Cassandra Export Tool


The Cassandra export tool in RazorSQL allows users to export data from Cassandra in
various formats such as delimited files, Excel spreadsheets, SQL insert statements, HTML,
XML, and text.


Cassandra Import Tool


The Cassandra import tool in RazorSQL allows users to import data into the Cassandra
database from Excel files, delimited files, fixed-width files, and files containing SQL
statements.

Cassandra Table Editor

The Cassandra table editor allows users to edit Cassandra data in a spreadsheet-like format.
Users can edit individual cells, delete rows, copy rows, and update, insert, or delete data.
The table editor automatically generates the appropriate SQL insert, update, or delete
statements to be executed.

Cassandra Backup Tools

Backup individual tables or all tables in a database/schema with the RazorSQL Cassandra
backup tools. The tools generate SQL DDL and SQL insert statements for all tables and
data in the database.

HADOOP DATA INTEGRATION CONCEPTS

This chapter provides an introduction to the basic concepts of Hadoop data integration using
Oracle Data Integrator.

This chapter includes the following sections:

 Section 2.1, "Hadoop Data Integration with Oracle Data Integrator"


 Section 2.2, "Generate Code in Different Languages with Oracle Data Integrator"
 Section 2.3, "Leveraging Apache Oozie to execute Oracle Data Integrator Projects"
 Section 2.4, "Oozie Workflow Execution Modes"
2.1 Hadoop Data Integration with Oracle Data Integrator

Typical processing in Hadoop includes data validation and transformations that are programmed
as MapReduce jobs. Designing and implementing a MapReduce job requires expert
programming knowledge. However, when you use Oracle Data Integrator, you do not need to
write MapReduce jobs. Oracle Data Integrator uses Apache Hive and the Hive Query Language
(HiveQL), a SQL-like language for implementing MapReduce jobs.

When you implement a big data processing scenario, the first step is to load the data into
Hadoop. The data source is typically in Files or SQL databases.

After the data is loaded, you can validate and transform it by using HiveQL like you use SQL.
You can perform data validation (such as checking for NULLS and primary keys), and
transformations (such as filtering, aggregations, set operations, and derived tables). You can also
include customized procedural snippets (scripts) for processing the data.


When the data has been aggregated, condensed, or processed into a smaller data set, you can load
it into an Oracle database, other relational database, HDFS, HBase, or Hive for further
processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into an
Oracle database.

2.2 Generate Code in Different Languages with Oracle Data Integrator

By default, Oracle Data Integrator (ODI) uses HiveQL to implement the mappings. However,
Oracle Data Integrator also lets you implement the mappings using Pig Latin and Spark
Python. Once your mapping is designed, you can either implement it using the default HiveQL,
or choose to implement it using Pig Latin or Spark Python.

Support for Pig Latin and Spark Python in ODI is achieved through a set of component KMs that
are specific to these languages. These component KMs are used only when a Pig data server or a
Spark data server is used as the staging location for your mapping.

For example, if you use a Pig data server as the staging location, the Pig related KMs are used to
implement the mapping and Pig Latin code is generated. Similarly, to generate Spark Python
code, you must use a Spark data server as the staging location for your mapping.

2.3 Leveraging Apache Oozie to execute Oracle Data Integrator Projects

Apache Oozie is a workflow scheduler system that helps you orchestrate actions in Hadoop. It is
a server-based Workflow Engine specialized in running workflow jobs with actions that run
Hadoop MapReduce jobs. Implementing and running Oozie workflow requires in-depth
knowledge of Oozie.

However, Oracle Data Integrator does not require you to be an Oozie expert. With Oracle Data
Integrator you can easily define and execute Oozie workflows.

Oracle Data Integrator allows you to automatically generate an Oozie workflow definition by
executing an integration project (package, procedure, mapping, or scenario) on an Oozie engine.
The generated Oozie workflow definition is deployed and executed into an Oozie workflow
system. You can also choose to only deploy the Oozie workflow to validate its content or
execute it at a later time.

Information from the Oozie logs is captured and stored in the ODI repository along with links to
the Oozie UIs. This information is available for viewing within ODI Operator and Console.

2.4 Oozie Workflow Execution Modes

ODI provides the following two modes for executing the Oozie workflows:

 TASK

Task mode generates an Oozie action for every ODI task. This is the default mode.


The task mode cannot handle the following:

o KMs with scripting code that spans across multiple tasks.
o KMs with transactions.
o KMs with file system access that cannot span file access across tasks.
o ODI packages with looping constructs.
 SESSION

Session mode generates an Oozie action for the entire session.

ODI automatically uses this mode if any of the following conditions is true:

o Any task opens a transactional connection.


o Any task has scripting.
o A package contains loops.

Note that loops in a package are not supported by Oozie engines and may not
function properly in terms of execution and/or session log content retrieval, even
when running in SESSION mode.

PIG

What is Pig in Hadoop?

Pig Hadoop is basically a high-level programming language that is helpful for the analysis of
huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to
perform a lot of data administration operations.

For writing data analysis programs, Pig renders a high-level programming language called Pig
Latin. Several operators are provided by Pig Latin using which personalized functions for
writing, reading, and processing of data can be developed by programmers.

For analyzing data through Apache Pig, we need to write scripts using Pig Latin. Then, these
scripts need to be transformed into MapReduce tasks. This is achieved with the help of Pig
Engine.


Why Apache Pig?

By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java
programming language. Now, the question that arises in our minds is 'Why Pig?' The need for
Apache Pig came up when many programmers weren't comfortable with Java and were facing a
lot of struggle working with Hadoop, especially when MapReduce tasks had to be performed.
Apache Pig came into the Hadoop world as a boon for all such programmers.

 After the introduction of Pig Latin, now, programmers are able to work
on MapReduce tasks without the use of complicated codes as in Java.
 To reduce the length of code, Apache Pig uses a multi-query approach, which
reduces development time by about 16-fold.
 Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we
have a little knowledge of SQL.

Features of Pig Hadoop

There are several features of Apache Pig:

1. In-built operators: Apache Pig provides a very good set of operators for performing several
data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to write a
Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized. This makes
the programmers concentrate only on the semantics of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured data
and store the results in HDFS.

Apache Pig Architecture

The main reason why programmers have started using Hadoop Pig is that it converts the scripts
into a series of MapReduce tasks making their job easy. Below is the architecture of Pig Hadoop:


Pig Hadoop framework has four main components:

1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The
parser is responsible for checking the syntax of the script, along with other miscellaneous
checks. Parser gives an output in the form of a Directed Acyclic Graph (DAG) that contains
Pig Latin statements, together with other logical operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is passed to
a logical optimizer. The optimizer is responsible for carrying out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer. The logical plan is
then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs are
sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop for
yielding the desired result.


GRUNT

After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, there
are certain useful shell and utility commands provided by the Grunt shell. This chapter explains
the shell and utility commands provided by the Grunt shell.

Note − In some portions of this chapter, commands like Load and Store are used. Refer to the
respective chapters to get detailed information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. In addition to that, we can
invoke shell commands using sh and fs.

sh Command

Using the sh command, we can invoke shell commands from the Grunt shell. However, we cannot
execute commands that are a part of the shell environment (e.g., cd) using the sh command.

Syntax

Given below is the syntax of sh command.

grunt> sh shell command parameters

Example

We can invoke the ls command of Linux shell from the Grunt shell using the sh option as shown
below. In this example, it lists out the files in the /pig/bin/ directory.

grunt> sh ls

pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command

Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax

Given below is the syntax of fs command.


grunt> fs File System command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using fs command. In the
following example, it lists the files in the HDFS root directory.

grunt> fs -ls

Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell
using the fs command.

Utility Commands

The Grunt shell provides a set of utility commands. These include utility commands such
as clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig
from the Grunt shell. Given below is the description of the utility commands provided by the
Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the grunt shell using the clear command as shown below.

grunt> clear

help Command

The help command gives you a list of Pig commands and Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

grunt> help


Commands: <pig latin statement>; - See the PigLatin manual for details:
http://hadoop.apache.org/pig

File system commands: fs <fs arguments> - Equivalent to Hadoop dfs command:


http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic Commands: describe <alias>[::<alias>] - Show the schema for the alias.
Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml]
[-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] -
Show the execution plan to compute the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-xml - Generate the output in .xml format. Default is text format.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands: exec [-param <param_name>=param_value] [-param_file <file_name>]


<script> -
Execute the script with access to grunt environment including aliases.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell command.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used
by default.


debug - Set debug on or off. Default is off.


job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high.
Default is normal.
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
history [-n] - Display the list statements in cache.
-n Hide line numbers.
quit - Quit the grunt shell.
history Command

This command displays a list of statements executed/used so far since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

set Command

The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys.


Key - Description and values

default_parallel - You can set the number of reducers for a map job by passing any whole number
as a value to this key.

debug - You can turn the debugging feature in Pig off or on by passing on/off to this key.

job.name - You can set the job name for the required job by passing a string value to this key.

job.priority - You can set the job priority of a job by passing one of the following values to this
key −
 very_low
 low
 normal
 high
 very_high

stream.skippath - For streaming, you can set the path from where the data is not to be transferred,
by passing the desired path in the form of a string to this key.

quit Command

You can quit from the Grunt shell using this command.

Usage

Quit from the Grunt shell as shown below.

grunt> quit

Let us now take a look at the commands using which you can control Apache Pig from the Grunt
shell.

exec Command

Using the exec command, we can execute Pig scripts from the Grunt shell.

Syntax


Given below is the syntax of the utility command exec.

grunt> exec [-param param_name = param_value] [-param_file file_name] [script]

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the
following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS
with the following content

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')


as (id:int,name:chararray,city:chararray);

Dump student;

Now, let us execute the above script from the Grunt shell using the exec command as shown
below.

grunt> exec /sample_script.pig

Output

The exec command executes the script in the sample_script.pig. As directed in the script, it
loads the student.txt file into Pig and gives you the result of the Dump operator displaying the
following content.

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)


Kill Command

You can kill a job from the Grunt shell using this command.

Syntax

Given below is the syntax of the kill command.

grunt> kill JobId

Example

Suppose there is a running Pig job having id Id_0055, you can kill it from the Grunt shell using
the kill command, as shown below.

grunt> kill Id_0055


run Command

You can run a Pig script from the Grunt shell using the run command

Syntax

Given below is the syntax of the run command.

grunt> run [-param param_name = param_value] [-param_file file_name] script

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the
following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local filesystem with the
following content

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING


PigStorage(',') as (id:int,name:chararray,city:chararray);


Now, let us run the above script from the Grunt shell using the run command as shown below.

grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown below

grunt> Dump student;

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

PIG’S DATA MODEL

Types

Pig's data types can be divided into two categories: scalar types, which contain a single
value, and complex types, which contain other types.

Scalar Types

Pig's scalar types are simple types that appear in most programming languages. With the
exception of bytearray, they are all represented in Pig interfaces by java.lang classes, making
them easy to work with in UDFs:

int

An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte


signed integer. Constant integers are expressed as integer numbers, for example, 42.

long

A long integer. Longs are represented in interfaces by java.lang.Long. They store an


eight-byte signed integer. Constant longs are expressed as integer numbers with
an L appended, for example, 5000000000L.

float


A floating-point number. Floats are represented in interfaces by java.lang.Float and use


four bytes to store their value. For calculations that require no loss of precision, you
should use an int or long instead. Constant floats are expressed as a floating-point number
with an f appended. Floating-point numbers can be expressed in simple format, 3.14f, or
in exponent format, 6.022e23f.

double

A double-precision floating-point number. Doubles are represented in interfaces


by java.lang.Double and use eight bytes to store their value. Note that because this is a
floating-point number, in some calculations it will lose precision. For calculations that
require no loss of precision, you should use an int or long instead. Constant doubles are
expressed as a floating-point number in either simple format, 2.71828, or in exponent
format, 6.626e-34.

chararray

A string or character array. Chararrays are represented in interfaces by java.lang.String.


Constant chararrays are expressed as string literals with single quotes, for example, 'fred'.
In addition to standard alphanumeric and symbolic characters, you can express certain
characters in chararrays by using backslash codes, such as \t for Tab and \n for Return.
Unicode characters can be expressed as \u followed by their four-digit hexadecimal
Unicode value. For example, the value for Ctrl-A is expressed as \u0001.

bytearray

A blob or array of bytes. Bytearrays are represented in interfaces by a Java


class DataByteArray that wraps a Java byte[]. There is no way to specify a constant
bytearray.


Complex Types

Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of
any type, including other complex types. So it is possible to have a map where the value field is a
bag, which contains a tuple where one of the fields is a map.

Map

A map in Pig is a chararray to data element mapping, where that element can be any Pig type,
including a complex type. The chararray is called a key and is used as an index to find the
element, referred to as the value.

Because Pig does not know the type of the value, it will assume it is a bytearray. However, the
actual value might be something different. If you know what the actual type is (or what you want
it to be), you can cast it; see Casts. If you do not cast the value, Pig will make a best guess based
on how you use the value in your script. If the value is of a type other than bytearray, Pig will
figure that out at runtime and handle it. See Schemas for more information on how Pig handles
unknown types.

Map constants are formed using brackets to delimit the map, a hash between keys and values,
and a comma between key-value pairs. For example, ['name'#'bob', 'age'#55] will create a map
with two keys, "name" and "age". The first value is a chararray, and the second is an integer.

Tuple

A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields,
with each field containing one data element. These elements can be of any type —they do not all
need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL
columns. Because tuples are ordered, it is possible to refer to the fields by position;
see Expressions in foreach for details. A tuple can, but is not required to, have a schema
associated with it that describes each field's type and provides a name for each field. This allows
Pig to check that the data in the tuple is what the user expects, and it allows the user to reference
the fields of the tuple by name.


Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple.
For example, ('bob', 55) describes a tuple constant with two fields.

Bag

A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference
tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema
associated with it. In the case of a bag, the schema describes all tuples within the bag.

Bag constants are constructed using braces, with tuples in the bag separated by commas. For
example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three tuples, each with two
fields.

Pig users often notice that Pig does not provide a list or set type that can store items of any type.
It is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one
field. For instance, if you want to store a set of integers, you can create a bag with a tuple with
one field, which is an int. This is a bit cumbersome, but it works.

Bag is the one type in Pig that is not required to fit into memory. As you will see later, because
bags are used to store collections when grouping, bags can become quite large. Pig has the ability
to spill bags to disk when necessary, keeping only partial sections of the bag in memory. The
size of the bag is limited to the amount of local disk available for spilling the bag.
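Inside a Java UDF, these complex types show up as Pig's own data classes. A minimal sketch of
building the constants discussed above is given below (the class name is made up; it only assumes
the pig jar on the classpath):

import java.util.HashMap;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigTypesSketch {
  public static void main(String[] args) throws Exception {
    TupleFactory tf = TupleFactory.getInstance();
    BagFactory bf = BagFactory.getInstance();

    // the tuple ('bob', 55)
    Tuple bob = tf.newTuple(2);
    bob.set(0, "bob");
    bob.set(1, 55);

    // a bag {('bob', 55), ('sally', 52)}
    Tuple sally = tf.newTuple(2);
    sally.set(0, "sally");
    sally.set(1, 52);
    DataBag bag = bf.newDefaultBag();
    bag.add(bob);
    bag.add(sally);

    // the map ['name'#'bob', 'age'#55]
    Map<String, Object> map = new HashMap<String, Object>();
    map.put("name", "bob");
    map.put("age", 55);

    System.out.println(bag);   // DataBag implementations provide a readable toString()
  }
}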

PIG LATIN

The Pig Latin is a data flow language used by Apache Pig to analyze the data in Hadoop. It is a
textual language that abstracts the programming from the Java MapReduce idiom into a notation.

Pig Latin Statements

The Pig Latin statements are used to process the data. It is an operator that accepts a relation as
an input and generates another relation as an output.

o It can span multiple lines.


o Each statement must end with a semi-colon.
o It may include expression and schemas.

about:blank 35/60
4/14/25, 3:33 PM BDM unit 5 - unit 5

o By default, these statements are processed using multi-query execution. A short example follows this list.
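As a brief, hedged illustration (the file name, field names, and filter condition are hypothetical), each statement below accepts a relation as input, produces a relation as output, and ends with a semicolon:

students = load 'student_data.txt' using PigStorage(',') as (name:chararray, age:int, gpa:float);
adults = filter students by age >= 18;
names = foreach adults generate name;
dump names;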


Pig Latin Conventions

Convention Description

()         The parentheses enclose one or more items. They can also be used to indicate the tuple data type.
           Example - (10, xyz, (3,6,9))

[]         The straight brackets enclose one or more items. They can also be used to indicate the map data type.
           Example - [INNER | OUTER]

{}         The curly brackets enclose two or more items. They can also be used to indicate the bag data type.
           Example - { block | nested_block }

...        The horizontal ellipsis points indicate that you can repeat a portion of the code.
           Example - cat path [path ...]

Pig Latin Data Types

Simple Data Types

Type       Description

int        It defines a signed 32-bit integer.
           Example - 2

long       It defines a signed 64-bit integer.
           Example - 2L or 2l

float      It defines a 32-bit floating point number.
           Example - 2.5F or 2.5f or 2.5e2f or 2.5E2F

double     It defines a 64-bit floating point number.
           Example - 2.5 or 2.5e2 or 2.5E2

chararray  It defines a character array in Unicode UTF-8 format.
           Example - javatpoint

bytearray  It defines a byte array.

boolean    It defines a boolean type value.
           Example - true/false

datetime   It defines a value in datetime order.
           Example - 1970-01-01T00:00:00.000+00:00

biginteger It defines a Java BigInteger value.
           Example - 5000000000000

bigdecimal It defines a Java BigDecimal value.
           Example - 52.232344535345

Complex Types

Type       Description

tuple      It defines an ordered set of fields.
           Example - (15,12)

bag        It defines a collection of tuples.
           Example - {(15,12), (12,15)}

map        It defines a set of key-value pairs.
           Example - [open#apache]
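As a hedged sketch of how these complex types can be declared together in a load schema (the file and field names are hypothetical):

--complex_schema.pig
A = load 'data' as (t:tuple(x:int, y:int),
                    b:bag{tp:tuple(a:int, c:int)},
                    m:map[]);
describe A;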

DEVELOPING AND TESTING PIG LATIN SCRIPTS

Development Tools

Pig provides several tools and diagnostic operators to help you develop your applications. In this
section we will explore these and also look at some tools others have written to make it easier to
develop Pig with standard editors and integrated development environments (IDEs).

Syntax Highlighting and Checking

Syntax highlighting often helps users write code correctly, at least syntactically, the first time
around. Syntax highlighting packages exist for several popular editors. The packages listed in
Table 7-1 were created and added at various times, so how their highlighting conforms with
current Pig Latin syntax varies.

Table 7-1. Pig Latin syntax highlighting packages

Tool URL

Eclipse http://code.google.com/p/pig-eclipse

Emacs http://github.com/cloudera/piglatin-mode, http://sf.net/projects/pig-m

TextMate http://www.github.com/kevinweil/pig.tmbundle

Vim http://www.vim.org/scripts/script.php?script_id=2186

In addition to these syntax highlighting packages, Pig will also let you check the syntax of your
script without running it. If you add -c or -check to the command line, Pig will just parse and run
semantic checks on your script. The -dryrun command-line option will also check your syntax,
expand any macros and imports, and perform parameter substitution.
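For instance, either of the following invocations (the script name is hypothetical) checks a script without executing it; the -dryrun form also expands macros, imports, and parameters:

pig -check myscript.pig
pig -dryrun myscript.pig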

describe

describe shows you the schema of a relation in your script. This can be very helpful as you are
developing your scripts. It is especially useful as you are learning Pig Latin and understanding
how various operators change the data. describe can be applied to any relation in your script, and
you can have multiple describes in a script:

--describe.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
trimmed = foreach divs generate symbol, dividends;
grpd = group trimmed by symbol;
avgdiv = foreach grpd generate group, AVG(trimmed.dividends);

describe trimmed;
describe grpd;
describe avgdiv;

trimmed: {symbol: chararray,dividends: float}


grpd: {group: chararray,trimmed: {(symbol: chararray,dividends: float)}}
avgdiv: {group: chararray,double}

describe uses Pig‘s standard schema syntax. For information on this syntax, see Schemas. So, in
this example, the relation trimmed has two fields: symbol, which is a chararray, and dividends,
which is a float. grpd also has two fields, group (the name Pig always assigns to the group by
key) and a bag trimmed, which matches the name of the relation that Pig grouped to produce the
bag. Tuples in trimmed have two fields: symbol and dividends. Finally, in avgdiv there are two
fields, group and a double, which is the result of the AVG function and is unnamed.

explain

One of Pig‘s goals is to allow you to think in terms of data flow instead of MapReduce. But
sometimes you need to peek into the barn and see how Pig is compiling your script into
MapReduce jobs. Pig provides explain for this. explain is particularly helpful when you are
trying to optimize your scripts or debug errors. It was written so that Pig developers could
examine how Pig handled various scripts, thus its output is not the most user-friendly. But with
some effort, explain can help you write better Pig Latin.

There are two ways to use explain. You can explain any alias in your Pig Latin script, which will
show the execution plan Pig would use if you stored that relation. You can also take an existing
Pig Latin script and apply explain to the whole script in Grunt. This has a couple of advantages.
One, you do not have to edit your script to add the explain line. Two, it will work with scripts
that do not have a single store, showing how Pig will execute the entire script:

--explain.pig
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
grpd = group divs by symbol;
avgdiv = foreach grpd generate group, AVG(divs.dividends);
store avgdiv into 'average_dividend';

bin/pig -x local -e 'explain -script explain.pig'

This will produce a printout of several graphs in text format; we will examine this output
momentarily. When using explain on a script in Grunt, you can also have it print out the plan in
graphical format. To do this, add -dot -out filename to the preceding command line. This prints
out a file in DOT language containing diagrams explaining how your script will be executed.
Tools that can read this language and produce graphs can then be used to view the graphs. For
some tools, you might need to split the three graphs in the file into separate files.

Pig goes through several steps to transform a Pig Latin script to a set of MapReduce jobs. After
doing basic parsing and semantic checking, it produces a logical plan. This plan describes the
logical operators that Pig will use to execute the script. Some optimizations are done on this plan.
For example, filters are pushed as far up[19] as possible in the logical plan. The logical plan for
the preceding example is shown in Figure 7-1. I have trimmed a few extraneous pieces to make
the output more readable (scary that this is more readable, huh?). If you are using Pig 0.9, the
output will look slightly different, but close enough that it will be recognizable.

The flow of this chart is bottom to top so that the Load operator is at the very bottom. The lines
between operators show the flow. Each of the four operators created by the script (Load,
CoGroup, ForEach, and Store) can be seen. Each of these operators also has a schema, described
in standard schema syntax. The CoGroup and ForEach operators also have expressions attached
to them (the lines dropping down from those operators). In the CoGroup operator, the projection
indicates which field is the grouping key (in this case, field 1). The ForEach operator has a
projection expression that projects field 0 (the group field) and a UDF expression, which
indicates that the UDF being used is org.apache.pig.builtin.AVG. Notice how each of the Project
operators has an Input field, indicating from which operator they are drawing their input.
Figure 7-2 shows how this plan looks when the -dot option is used instead.

Figure 7-1. Logical plan diagram

HIVE

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name               Operation

User Interface          Hive is a data warehouse infrastructure software that can create interaction
                        between the user and HDFS. The user interfaces that Hive supports are Hive
                        Web UI, Hive command line, and Hive HD Insight (in Windows Server).

Meta Store              Hive chooses respective database servers to store the schema or metadata of
                        tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine   HiveQL is similar to SQL for querying on schema information in the Metastore.
                        It is one of the replacements of the traditional approach for the MapReduce
                        program. Instead of writing a MapReduce program in Java, we can write a query
                        for the MapReduce job and process it.

Execution Engine        The conjunction part of the HiveQL Process Engine and MapReduce is the Hive
                        Execution Engine. The execution engine processes the query and generates
                        results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE           Hadoop Distributed File System or HBASE are the data storage techniques used
                        to store data into the file system.

HIVE DATA TYPES AND FILE FORMATS

The different data types in Hive, which are involved in the table creation. All the data types in
Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types

Column Types

Column type are used as column data types of Hive. They are as follows:

Integral Types

Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

Type Postfix Example


TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L

String Types

String type data types can be specified using single quotes (' ') or double quotes (" "). It contains
two data types: VARCHAR and CHAR. Hive follows C-types escape characters.

The following table depicts various CHAR data types:

Data Type Length

VARCHAR 1 to 65535

CHAR 255

Timestamp

It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and format "yyyy-mm-dd
hh:mm:ss.ffffffffff".

Dates

DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals

The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:

DECIMAL(precision, scale)
decimal(10,0)
Union Types

Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and example is as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:

Floating Point Types

Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.

Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^308 to 10^308.

Null Value

Missing values are represented by the special value NULL.

Complex Types

The Hive complex data types are as follows:

Arrays

Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>
Maps

Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>

Structs

Structs in Hive group a set of named fields, each of which can have its own data type and an optional comment.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
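As a brief, hedged sketch combining the three complex types in one table declaration (the table and column names are illustrative):

CREATE TABLE employee_details (
  name STRING,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);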

HIVEQL DATA DEFINITION

Databases in Hive

The Hive concept of a database is essentially just a catalog or namespace of tables. However,
they are very useful for larger clusters with multiple teams and users, as a way of avoiding table
name collisions. It‘s also common to use databases to organize production tables into logical
groups.

If you don‘t specify a database, the default database is used.

The simplest syntax for creating a database is shown in the following example:

hive> CREATE DATABASE financials;

Hive will throw an error if financials already exists. You can suppress these warnings with this
variation:

hive> CREATE DATABASE IF NOT EXISTS financials;

While normally you might like to be warned if a database of the same name already exists, the IF
NOT EXISTS clause is useful for scripts that should create a database on-the-fly, if necessary,
before proceeding.

You can also use the keyword SCHEMA instead of DATABASE in all the database-related
commands.

At any time, you can see the databases that already exist as follows:

hive> SHOW DATABASES;


default
financials

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;


default
financials
human_resources

If you have a lot of databases, you can restrict the ones listed using a regular expression, a
concept we‘ll explain in LIKE and RLIKE, if it is new to you. The following example lists only
those databases that start with the letter h and end with any other characters (the .* part):

hive> SHOW DATABASES LIKE 'h.*';


human_resources
hive> ...

Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database, which
doesn‘t have its own directory.

The database directory is created under a top-level directory specified by the property
hive.metastore.warehouse.dir, which we discussed in Local Mode Configuration and Distributed
and Pseudodistributed Mode Configuration. Assuming you are using the default value for this
property, /user/hive/warehouse, when the financials database is created, Hive will create the
directory /user/hive/warehouse/financials.db. Note the .db extension.

You can override this default location for the new directory as shown in this example:

hive> CREATE DATABASE financials


> LOCATION '/my/preferred/directory';

You can add a descriptive comment to the database, which will be shown by the DESCRIBE
DATABASE <database> command.

hive> CREATE DATABASE financials


> COMMENT 'Holds all financial tables';

hive> DESCRIBE DATABASE financials;


financials Holds all financial tables
hdfs://master-server/user/hive/warehouse/financials.db

Note that DESCRIBE DATABASE also shows the directory location for the database. In this
example, the URI scheme is hdfs. For a MapR installation, it would be maprfs. For an Amazon
Elastic MapReduce (EMR) cluster, it would also be hdfs, but you could set
hive.metastore.warehouse.dir to use Amazon S3 explicitly (i.e., by specifying
s3n://bucketname/… as the property value). You could use s3 as the scheme, but the newer s3n is
preferred.

In the output of DESCRIBE DATABASE, we‘re showing master-server to indicate the URI
authority, in this case a DNS name and optional port number (i.e., server:port) for the "master
node" of the filesystem (i.e., where the NameNode service is running for HDFS). If you are
running in pseudo-distributed mode, then the master server will be localhost. For local mode, the
path will be a local path, file:///user/hive/warehouse/financials.db.

If the authority is omitted, Hive uses the master-server name and port defined by the property
fs.default.name in the Hadoop configuration files, found in the $HADOOP_HOME/conf
directory.

To be clear, hdfs:///user/hive/warehouse/financials.db is equivalent to
hdfs://master-server/user/hive/warehouse/financials.db, where master-server is your master node's
DNS name and optional port.

For completeness, when you specify a relative path (e.g., some/relative/path), Hive will put this
under your home directory in the distributed filesystem (e.g., hdfs:///user/<user-name> ) for
HDFS. However, if you are running in local mode, your current working directory is used as the
parent of some/relative/path.

For script portability, it‘s typical to omit the authority, only specifying it when referring to
another distributed filesystem instance (including S3 buckets).

Lastly, you can associate key-value properties with the database, although their only function
currently is to provide a way of adding information to the output of DESCRIBE DATABASE
EXTENDED <database>:

hive> CREATE DATABASE financials


> WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date' = '2012-01-02');

hive> DESCRIBE DATABASE financials;


financials hdfs://master-server/user/hive/warehouse/financials.db

hive> DESCRIBE DATABASE EXTENDED financials;


financials hdfs://master-server/user/hive/warehouse/financials.db
{date=2012-01-02, creator=Mark Moneybags}

The USE command sets a database as your working database, analogous to changing working
directories in a filesystem:

hive> USE financials;

Now, commands such as SHOW TABLES; will list the tables in this database.

Unfortunately, there is no command to show you which database is your current working
database! Fortunately, it‘s always safe to repeat the USE … command; there is no concept in
Hive of nesting of databases.

Recall that we pointed out a useful trick in Variables and Properties for setting a property to print
the current database as part of the prompt (Hive v0.8.0 and later):

hive> set hive.cli.print.current.db=true;

hive (financials)> USE default;

hive (default)> set hive.cli.print.current.db=false;

hive> ...

Finally, you can drop a database:

hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials doesn‘t exist.

By default, Hive won‘t permit you to drop a database if it contains tables. You can either drop
the tables first or append the CASCADE keyword to the command, which will cause the Hive to
drop the tables in the database first:

hive> DROP DATABASE IF EXISTS financials CASCADE;

Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior,
where existing tables must be dropped before dropping the database.

When a database is dropped, its directory is also deleted.

Alter Database

You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be changed,
including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

There is no way to delete or "unset" a DBPROPERTY.

Creating Tables

The CREATE TABLE statement follows SQL conventions, but Hive‘s version offers significant
extensions to support a wide range of flexibility where the data files for tables are stored, the
formats used, etc. We discussed many of these options in Text File Encoding of Data Values and
we'll return to more advanced options later in Chapter 15. In this section, we describe the other
options available for the CREATE TABLE statement, adapting the employees table declaration
we used previously in Collection Data Types:

CREATE TABLE IF NOT EXISTS mydb.employees (


name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...);

First, note that you can prefix a database name, mydb in this case, if you‘re not currently
working in the target database.

If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table already
exists. This is useful in scripts that should create a table the first time they run.

However, the clause has a gotcha you should know. If the schema specified differs from the
schema in the table that already exists, Hive won‘t warn you. If your intention is for this table to
have the new schema, you‘ll have to drop the old table, losing your data, and then re-create it.
Consider if you should use one or more ALTER TABLE statements to change the existing table
schema instead. See Alter Table for details.

Dropping Tables

The familiar DROP TABLE command from SQL is supported:

DROP TABLE IF EXISTS employees;

The IF EXISTS keywords are optional. If not used and the table doesn‘t exist, Hive returns an
error.

For managed tables, the table metadata and data are deleted.

Alter Table

Most table properties can be altered with ALTER TABLE statements, which change metadata
about the table but not the data itself. These statements can be used to fix mistakes in schema,
move partition locations (as we saw in External Partitioned Tables), and do other operations.

Renaming a Table

Use this statement to rename the table log_messages to logmsgs:

ALTER TABLE log_messages RENAME TO logmsgs;

Adding, Modifying, and Dropping a Table Partition

As we saw previously, ALTER TABLE table ADD PARTITION … is used to add a new
partition to a table (usually an external table). Here we repeat the same command shown
previously with the additional options available:

ALTER TABLE log_messages ADD IF NOT EXISTS


PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
...;

Multiple partitions can be added in the same query when using Hive v0.8.0 and later. As always,
IF NOT EXISTS is optional and has the usual meaning.

Changing Columns

You can rename a column, change its position, type, or comment:

ALTER TABLE log_messages


CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type is not
changing. The keyword COLUMN is optional as is the COMMENT clause. If you aren‘t moving
the column, the AFTER other_column clause is not necessary. In the example shown, we move
the column after the severity column. If you want to move the column to the first position, use
FIRST instead of AFTER other_column.

As always, this command changes metadata only. If you are moving columns, the data must
already match the new schema or you must change it to match by some other means.

Adding Columns

You can add new columns to the end of the existing columns, before any partition columns.

ALTER TABLE log_messages ADD COLUMNS (


app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

The COMMENT clauses are optional, as usual. If any of the new columns are in the wrong
position, use an ALTER COLUMN table CHANGE COLUMN statement for each one to move
it to the correct position.

Deleting or Replacing Columns

The following example removes all the existing columns and replaces them with the new
columns specified:

ALTER TABLE log_messages REPLACE COLUMNS (


hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and
process_id columns from the original schema definition. As for all ALTER statements, only the
table metadata is changed.

The REPLACE statement can only be used with tables that use one of the native SerDe modules:
DynamicSerDe or MetadataTypedColumnsetSerDe. Recall that the SerDe determines how
records are parsed into columns (deserialization) and how a record‘s columns are written to
storage (serialization). See Chapter 15 for more details on SerDes.

Alter Table Properties

You can add additional table properties or modify existing properties, but not remove them:

ALTER TABLE log_messages SET TBLPROPERTIES (


'notes' = 'The process id is no longer captured; this column is always NULL');

Alter Storage Properties

There are several ALTER TABLE statements for modifying format and SerDe properties.

The following statement changes the storage format for a partition to be SEQUENCEFILE, as
we discussed in Creating Tables (see Sequence Files and Chapter 15 for more information):

ALTER TABLE log_messages


PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

The PARTITION clause is required if the table is partitioned.

You can specify a new SerDe along with SerDe properties or change the properties for the
existing SerDe. The following example specifies that a table will use a Java class named
com.example.JSONSerDe to process a file of JSON-encoded records:

ALTER TABLE table_using_JSON_storage


SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2');

The SERDEPROPERTIES are passed to the SerDe module (the Java class
com.example.JSONSerDe, in this case). Note that both the property names (e.g., prop1) and the
values (e.g., value1) must be quoted strings.

The SERDEPROPERTIES feature is a convenient mechanism that SerDe implementations can


exploit to permit user customization. We‘ll see a real-world example of a JSON SerDe and how
it uses SERDEPROPERTIES in JSON SerDe.

The following example demonstrates how to add new SERDEPROPERTIES for the current
SerDe:

ALTER TABLE table_using_JSON_storage


SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4');

You can alter the storage properties that we discussed in Creating Tables:

ALTER TABLE stocks


CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;

The SORTED BY clause is optional, but the CLUSTERED BY and INTO … BUCKETS clauses are
required. (See also Bucketing Table Data Storage for information on the use of data bucketing.)

Miscellaneous Alter Table Statements

In Execution Hooks, we‘ll discuss a technique for adding execution ―hooks‖ for various
operations. The ALTER TABLE … TOUCH statement is used to trigger these hooks:

ALTER TABLE log_messages TOUCH


PARTITION(year = 2012, month = 1, day = 1);

The PARTITION clause is required for partitioned tables. A typical scenario for this statement is
to trigger execution of the hooks when table storage files have been modified outside of Hive.
For example, a script that has just written new files for the 2012/01/01 partition for log_message
can make the following call to the Hive CLI:

hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'

This statement won‘t create the table or partition if it doesn‘t already exist. Use the appropriate
creation commands in that case.

The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a
Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, reducing the
load on the NameNode, but doesn‘t provide any space savings (e.g., through compression):

ALTER TABLE log_messages ARCHIVE


PARTITION(year = 2012, month = 1, day = 1);

To reverse the operation, substitute UNARCHIVE for ARCHIVE. This feature is only available
for individual partitions of partitioned tables.

Finally, various protections are available. The following statements prevent the partition from
being dropped and queried:

ALTER TABLE log_messages


PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages


PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;

To reverse either operation, replace ENABLE with DISABLE. These operations also can‘t be
used with nonpartitioned tables.

HIVEQL DATA MANIPULATION

After learning basic Commands in Hive, let us now study Hive DML Commands. Hive Data
Manipulation Language commands are used for inserting, retrieving, modifying, deleting, and
updating data in the Hive table.

There are many Hive DML commands like LOAD, INSERT, UPDATE, etc. We will explore
each of these DML commands individually, along with their syntax and examples.

Introduction to Hive DML commands

Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and
delete data from the Hive table once the table and database schema has been defined using Hive
DDL commands.

The various Hive DML commands are:

1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT

Let us now learn each DML command individually.

1. LOAD Command

The LOAD statement in Hive is used to move data files into the locations corresponding to Hive
tables.

 If a LOCAL keyword is specified, then the LOAD command will look for the file path in
the local filesystem.
 If the LOCAL keyword is not specified, then the Hive will need the absolute URI of the
file.
 In case the keyword OVERWRITE is specified, then the contents of the target
table/partition will be deleted and replaced by the files referred by filepath.
 If the OVERWRITE keyword is not specified, then the files referred by filepath will be
appended to the table.


Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename


[PARTITION (partcol1=val1, partcol2=val2 ...)];
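For example, a hedged usage of LOAD (the file path, table name, and partition value below are illustrative):

LOAD DATA LOCAL INPATH '/home/user/emp_2012.csv'
OVERWRITE INTO TABLE employee
PARTITION (year = 2012);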

2. SELECT Command

The SELECT statement in Hive is similar to the SELECT statement in SQL used for retrieving
data from the database.

Syntax:

SELECT col1,col2 FROM tablename;
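For example, assuming a hypothetical employee table with name and salary columns:

SELECT name, salary FROM employee WHERE salary > 40000;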

3. INSERT Command

The INSERT command in Hive loads the data into a Hive table. We can do insert to both the
Hive table or partition.

a. INSERT INTO

The INSERT INTO statement appends the data into existing data in the table or partition.
INSERT INTO statement works from Hive version 0.8.

Syntax:

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]


select_statement1 FROM from_statement;

b. INSERT OVERWRITE

The INSERT OVERWRITE table overwrites the existing data in the table or partition.

Syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, ..) [IF NOT


EXISTS]] select_statement FROM from_statement;

c. INSERT .. VALUES

INSERT ..VALUES statement in Hive inserts data into the table directly from SQL. It is
available from Hive 0.14.

Syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)]


VALUES values_row [, values_row ...];
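As a hedged illustration of these forms (all table, column, and partition names below are hypothetical):

-- append rows selected from a staging table into a static partition
INSERT INTO TABLE employee PARTITION (year = 2012)
SELECT name, salary FROM employee_staging WHERE join_year = 2012;

-- insert literal rows directly (Hive 0.14 or later)
INSERT INTO TABLE department VALUES (101, 'Accounts'), (102, 'Research');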

4. DELETE Command

The DELETE statement in Hive deletes the table data. If the WHERE clause is specified, then it
deletes the rows that satisfy the condition in where clause.

The DELETE statement can only be used on the hive tables that support ACID.

Syntax:

DELETE FROM tablename [WHERE expression];
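For example, assuming a hypothetical transactional (ACID-enabled) employee table:

DELETE FROM employee WHERE emp_id = 105;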

5. UPDATE Command

The update can be performed on the hive tables that support ACID.

The UPDATE statement in Hive modifies the table data. If the WHERE clause is specified, then it
updates the columns of the rows that satisfy the condition in the WHERE clause.

Partitioning and Bucketing columns cannot be updated.

Syntax:

UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
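For example, again assuming a hypothetical ACID-enabled employee table (the column being updated must not be a partitioning or bucketing column):

UPDATE employee SET salary = 55000 WHERE emp_id = 105;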

6. EXPORT Command

The Hive EXPORT statement exports the table or partition data along with the metadata to the
specified output location in the HDFS.

Metadata is exported in a _metadata file, and data is exported in a subdirectory named 'data'.

Syntax:

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]

TO 'export_target_path' [ FOR replication('eventid') ];
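For example, a hedged usage (the table name, partition value, and HDFS path are illustrative):

EXPORT TABLE employee PARTITION (year = 2012)
TO '/user/hive/exports/employee_2012';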

7. IMPORT Command

The Hive IMPORT command imports the data from a specified location to a new table or already
existing table.

Syntax:

IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION


(part_column="value"[, ...])]]

FROM 'source_path' [LOCATION 'import_target_path'];
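For example, importing the data exported above into a new table (the names and paths are illustrative):

IMPORT TABLE employee_copy FROM '/user/hive/exports/employee_2012';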

TYPES OF HIVEQL QUERIES

Given below are the types of HiveQL queries that are widely used:

1. HiveQL query for information_schema database

Hive queries can be written to get information about Hive privileges, tables, views or columns.
Information_schema data is a read-only and user-friendly way to know the state of the system
similar to sys database data.

Example:

Code:

Select * from information_schema.columns where table_schema = 'database_name'

This will retrieve all the columns in the database table specified.

2. Creation and loading of data into a table

The bulk load operation is used to insert data into managed tables as Hive does not support row-
level insert, delete or update.

Code:

LOAD DATA LOCAL INPATH '$Home/students_address' OVERWRITE INTO TABLE students
PARTITION (class = "12", section = "science");

With the above command, a directory is first created for the partition, and then all the files are
copied into the directory. The keyword "local" is used to specify that the data is present in the local
file system. The "partition" keyword can be omitted if the table does not have a partition key. The Hive
query will not check whether the data being loaded matches the schema of the table.

The "INSERT" command is used to load data from a query into a table. The "OVERWRITE"
keyword is used to replace the data in a table. In Hive v0.8.0 or later, data will get appended to
a table if the OVERWRITE keyword is omitted.

Code:

INSERT OVERWRITE TABLE students
PARTITION (class = "12", section = "science")
Select * from students_data where class = "12" and section = "science";

All the partitions of the table students_data can be dynamically inserted by setting below
properties:

SET hive.exec.dynamic.partition = true;

SET hive.exec.dynamic.partition.mode = nonstrict;

SET hive.exec.max.dynamic.partitions.pernode = 1000;

CREATE TABLE clause will also create a table, and schema will be taken from the select
clause.

3. Merge data in tables

Data can be merged from tables using classic SQL joins like inner, full outer, left, right join.

Code:

Select a.roll_number, class, section from students as a


inner join pass_table as b
on a.roll_number = b.roll_number

This will return the class and section of all the roll numbers who have passed. Using a left join
instead will return the "grade" only for passed students and "NULL" for the failed ones.

Code:

Select a.roll_number, class, section, b.grade from students as a


Left join pass_table as b
on a.roll_number = b.roll_number

UNION ALL and UNION are also used to append data present in two tables. However, a few
things need to be taken care of when doing so; for example, the schema of both tables should be the
same. UNION is used to append the tables and return unique records, while UNION ALL returns all
the records, including duplicates.

4. Ordering a table

ORDER BY clause enables total ordering of the data set by passing all data through one reducer.
This may take a long time for large data tables, so SORT BY clause can be used to achieve
partial sorting, by sorting each reducer.

Code:

Select customer_id, spends from customer as a order by spends DESC limit 100

This will return the top 100 customers with highest spends.

5. Aggregation of data in a table

Aggregation is done using aggregate functions that return a single value after doing a
computation over many rows. These include count(col), sum(col), avg(col), min(col), max(col),
stddev_pop(col), percentile_approx(int_expr, P, NB) (where NB is the number of histogram bins used
for estimation), and collect_set(col), which returns a set of the column's elements with duplicates removed.

The set property which helps in improving the performance of aggregation is hive.map.aggr =
true.

The "GROUP BY" clause is used with an aggregate function.

Example:

Code:

Select year(date_yy), avg(spends) from customer_spends where merchant = "Retail" group by
year(date_yy);

The HAVING clause is used to restrict the output of GROUP BY, which would otherwise require a subquery.

6. Conditional statements

CASE…WHEN…THEN clause is similar to if-else statements to perform a conditional


operation on any column in a query.

For example:

Code:

Select customer,
Case when percentage < 40 then "Fail"
when percentage >= 40 and percentage < 80 then "Average"
else "Excellent"
end as rank
From students;

7. Filtering of data

WHERE clause is used to filter data in HiveQL. LIKE is used along with WHERE clause as a
predicate operator to match a regular expression in a record.
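For example, a hedged query combining WHERE and LIKE (the table and column names are illustrative):

Select roll_number, name from students where section = 'science' and name LIKE 'A%';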

8. Way to escape an illegal identifier

There is a way to use special characters or keywords or space in columns or partition names by
enclosing it in backticks ( ` ).
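For example, a hedged query that references a column named after a keyword and another containing a hyphen (both names are hypothetical):

Select a.`date`, a.`user-id` from events as a;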

Comments in Hive Scripts:

There is a way to add comment lines to a Hive script by starting them with the string '--' (two hyphens).

Example:

Below is the code to display the students data, with a comment line added above the query.

Code:

-- display the students data
Select * from student_table;

Comments only work in scripts; if they are pasted into the CLI, error messages will be displayed.
