BDA Unit-5
Applying Structure to Hadoop Data with Hive: Saying Hello to Hive, Seeing How the Hive is
Put Together, Getting Started with Apache Hive, Examining the Hive Clients, Working with
Hive Data Types, Creating and Managing Databases and Tables, Seeing How the Hive Data
Manipulation Language Works, Querying and Analyzing Data
Apache Hive is a data warehousing package built on top of Hadoop and is used for data
analysis. Hive targets users who are comfortable with SQL. Its language, similar to SQL and
called HiveQL, is used for managing and querying structured data. Apache Hive abstracts the
complexity of Hadoop. The language also allows traditional map/reduce programmers to plug in
their custom mappers and reducers. A popular feature of Hive is that there is no need to learn
Java.
Hive, an open-source petabyte-scale data warehousing framework based on Hadoop, was
developed by the Data Infrastructure Team at Facebook. Hive is also one of the technologies
being used to address the requirements at Facebook. Hive is very popular with users
internally at Facebook and is used to run thousands of jobs on the cluster for hundreds of
users, across a wide variety of applications. The Hive-Hadoop cluster at Facebook stores more
than 2 PB of raw data and regularly loads 15 TB of data on a daily basis.
Before implementing Hive, Facebook faced a lot of challenges as the size of the data being
generated increased, or rather exploded, making it really difficult to handle. The traditional
RDBMS could not handle the pressure, and as a result Facebook was looking for better
options. To solve this impending issue, Facebook initially tried Hadoop MapReduce, but it was
difficult to program, and since most users were comfortable with SQL rather than Java, it
proved an impractical solution. Hive, with its SQL-like language, allowed Facebook to
overcome these challenges.
With Hive, they are now able to perform the following:
Data Mining
Log Processing
Document Indexing
Customer Facing Business Intelligence
Predictive Modelling
Hypothesis Testing
Hive Architecture:
Figure 5: Hive architecture
● Hive Clients – Apache Hive supports applications written in languages like C++, Java,
Python, etc. using JDBC, Thrift, and ODBC drivers. Thus, one can easily write a Hive
client application in a language of their choice.
● Hive Services – Hive provides various services like the web interface, CLI, etc. to perform
queries.
● Processing Framework and Resource Management – Hive internally uses the Hadoop
MapReduce framework to execute the queries.
● Distributed Storage – As seen above, Hive is built on top of Hadoop, so it uses
the underlying HDFS for distributed storage.
Components of Hive:
Hive Clients
Hive supports different types of client applications for performing queries. These clients are
categorized into three types:
● Thrift Clients – As the Apache Hive server is based on Thrift, it can serve requests
from all languages that support Thrift.
● JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC
driver. It is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
● ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to
connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with
the Hive server.
Hive Services
Apache Hive provides various services, as shown in the above diagram (Figure 5). Now, let us
look at each in detail:
a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you
can execute your Hive queries and commands directly.
b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and
commands.
c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift server. It
allows different clients to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver is responsible for receiving the queries submitted through the
Thrift, JDBC, ODBC, CLI, or Web UI interface by a Hive client.
● Compiler – The driver then passes the query to the compiler, where parsing, type
checking, and semantic analysis take place with the help of the schema present in the
metastore.
● Optimizer – It generates the optimized logical plan in the form of a DAG (Directed
Acyclic Graph) of MapReduce and HDFS tasks.
● Executor – Once compilation and optimization are complete, the execution engine
executes these tasks in the order of their dependencies using Hadoop (see the EXPLAIN
sketch after this list).
e) Metastore – The Metastore is the central repository of Apache Hive metadata in the Hive
architecture. It stores the metadata for Hive tables (such as their schema and location) and
partitions in a relational database, and it provides client access to this information through the
metastore service API. There are three ways of deploying the Metastore: Embedded Metastore,
Local Metastore, and Remote Metastore. The Remote Metastore is the one mostly used in
production.
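To see the kind of plan that the driver, compiler, and optimizer produce for a query, you can
prefix the query with the EXPLAIN keyword. A minimal sketch, assuming the employee table
that is created later in this unit:

hive> -- EXPLAIN prints the query plan instead of running the query
hive> EXPLAIN SELECT count(*) FROM employee;

The output lists the stages of the plan as a DAG, typically a map-reduce stage followed by a
fetch stage, mirroring the compile-optimize-execute flow described above.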
Limitations of Hive:
Hive has the following limitations and cannot be used under such circumstances:
● It is not designed for online transaction processing (OLTP).
● It does not offer real-time queries or row-level updates.
● Latency for Hive queries is generally very high, so it provides at best acceptable, not
interactive, latency for data browsing.
Working of Hive
The following Figure 8 depicts the workflow between Hive and Hadoop.
Figure 8: Workflow between Hive and Hadoop
The following table 1 defines how Hive interacts with the Hadoop framework:

Step 1 – Execute Query: The Hive interface, such as the Command Line or Web UI, sends the
query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 – Get Plan: The driver takes the help of the query compiler, which parses the query to
check the syntax and the query plan or the requirement of the query.
Step 3 – Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Step 4 – Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5 – Send Plan: The compiler checks the requirement and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.
Step 6 – Execute Plan: The driver sends the execute plan to the execution engine.
Step 7 – Execute Job: Internally, the execution of the job is a MapReduce job. The execution
engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the
TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1 – Metadata Ops: Meanwhile, during execution, the execution engine can execute
metadata operations with the Metastore.
Step 8 – Fetch Result: The execution engine receives the results from the Data nodes.
Step 9 – Send Results: The execution engine sends those resultant values to the driver.
Step 10 – Send Results: The driver sends the results to the Hive interfaces.
Table 1
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        –         10
BIGINT     L         10L
Table 2
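As a small sketch of these integral types in use, the following creates a hypothetical table (the
table name numbers_demo and its column names are illustrative assumptions, not from the
original text):

hive> CREATE TABLE numbers_demo (
    tiny_col  TINYINT,   -- 1-byte signed integer
    small_col SMALLINT,  -- 2-byte signed integer
    int_col   INT,       -- 4-byte signed integer
    big_col   BIGINT     -- 8-byte signed integer
);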
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It contains
two data types: VARCHAR and CHAR. Hive follows C-style escape characters.

Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Table 3
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision decimal numbers. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
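To show the date/time and decimal types side by side, here is a brief sketch of a hypothetical
table (the name transactions_demo and its columns are assumptions for illustration):

hive> CREATE TABLE transactions_demo (
    txn_time TIMESTAMP,     -- e.g. '2014-01-01 12:00:00.123'
    txn_date DATE,          -- e.g. '2014-01-01'
    amount   DECIMAL(10,2)  -- up to 10 digits, 2 after the decimal point
);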
Union Types
Union is a collection of heterogeneous data types. You can create an instance using
create_union. The syntax and example are as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
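A minimal sketch of how such a column could be declared (the table name union_demo is
hypothetical; the type parameters are chosen to match the tagged values shown above):

hive> CREATE TABLE union_demo (
    -- tag 0: INT, tag 1: DOUBLE, tag 2: ARRAY<STRING>, tag 3: STRUCT
    rec UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
);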
Literals
The following literals are used in Hive:
Decimal Type
Decimal type data is nothing but a floating-point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive group named fields of possibly different types, each of which can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
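To make the three complex types concrete, here is a short sketch of a table declaration (the
table name employee_complex and its columns are illustrative assumptions):

hive> CREATE TABLE employee_complex (
    name    STRING,
    phones  ARRAY<STRING>,                     -- ordered list of values
    skills  MAP<STRING, INT>,                  -- key-value pairs
    address STRUCT<street:STRING, city:STRING> -- named fields
);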
Create Database
Create Database is a statement used to create a database in Hive. Its syntax is as follows:

CREATE DATABASE [IF NOT EXISTS] <database_name>;

Here, IF NOT EXISTS is an optional clause that prevents an error when a database with the
same name already exists. We can use SCHEMA in place of DATABASE in this command.
The following query is executed to create a database named userdb:

hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;

Drop Database
Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:

DROP DATABASE [IF EXISTS] <database_name> [RESTRICT|CASCADE];

The following query drops the database using CASCADE. It means dropping the respective
tables before dropping the database:

hive> DROP DATABASE IF EXISTS userdb CASCADE;

Create Table
Create Table is a statement used to create a table in Hive. The syntax and example are as
follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table named employee using CREATE TABLE statement.
The following table 4 lists the fields and their data types in employee table:
Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   String
Table 4
The following query creates the table, together with a comment, row-format fields such as the
field terminator and lines terminator, and the stored file type.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary Float,
designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists. On successful creation of table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
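As a quick illustrative check that the table was created with the intended schema, you can
describe it:

hive> DESCRIBE employee;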
Generally, after creating a table in SQL, we can insert data using the Insert statement. But in
Hive, we can insert data using the LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are
two ways to load data: one is from the local file system and the other is from the Hadoop file
system (HDFS).
Syntax
The syntax for LOAD DATA is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)];

The following query (the file path shown is illustrative) loads a local text file into the
employee table:

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;

On successful execution, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
Change Statement
The following table 5 shows the fields of the employee table to be changed:

Field Name   Convert from Data Type   Change Field Name   Convert to Data Type
name         String                   ename               String
salary       Float                    salary              Double
Table 5

The following queries rename the column name and change the column data type using the
above data:

hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
Replace Statement
The following query deletes all the columns from the employee table and replaces them
with empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table
called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
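For instance, the following query adds a partition on the yoj field of the employee table (the
location path shown is illustrative):

hive> ALTER TABLE employee ADD PARTITION (yoj='2012')
LOCATION '/2012/part2012';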
Dropping a Partition
The following syntax is used to drop a partition:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...];

Relational Operators
These operators are used to compare two operands; each returns TRUE or FALSE depending on
whether the comparison holds (table 6).
Table 6
Example
Let us assume the employee table is composed of fields named Id, Name, Salary, Designation,
and Dept as shown below. The following query retrieves the employee details whose Id is 1205:

hive> SELECT * FROM employee WHERE Id=1205;

The following query retrieves the employee details whose salary is more than or
equal to Rs 40000:

hive> SELECT * FROM employee WHERE Salary>=40000;
Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of them
return number types. The following table 7 describes the arithmetic operators available in Hive:

Operator   Operand            Description
A + B      all number types   Gives the result of adding A and B.
A - B      all number types   Gives the result of subtracting B from A.
A * B      all number types   Gives the result of multiplying A and B.
A / B      all number types   Gives the result of dividing A by B.
A % B      all number types   Gives the remainder resulting from dividing A by B.
A & B      all number types   Gives the result of bitwise AND of A and B.
A | B      all number types   Gives the result of bitwise OR of A and B.
A ^ B      all number types   Gives the result of bitwise XOR of A and B.
~A         all number types   Gives the result of bitwise NOT of A.
Table 7
Example
The following query adds two numbers, 20 and 30.
hive> SELECT 20+30 ADD FROM temp;
On successful execution of the query, you get to see the following response:
ADD
50
Logical Operators
The operators are logical expressions, shown in table 8. All of them return either TRUE or
FALSE.

Operator   Operand   Description
A AND B    boolean   TRUE if both A and B are TRUE, otherwise FALSE.
A && B     boolean   Same as A AND B.
A OR B     boolean   TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B     boolean   Same as A OR B.
NOT A      boolean   TRUE if A is FALSE, otherwise FALSE.
!A         boolean   Same as NOT A.
Table 8
Example
The following query is used to retrieve employee details whose Department is TP and Salary is
more than Rs 40000:

hive> SELECT * FROM employee WHERE Dept='TP' AND Salary>40000;
Complex Operators
These operators provide an expression to access the elements of complex types.

Operator   Operand                                Description
A[n]       A is an Array and n is an int          It returns the nth element in the array A.
                                                  The first element has index 0.
M[key]     M is a Map<K, V> and key has type K    It returns the value corresponding to the
                                                  key in the map.
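As a brief illustration, reusing the hypothetical employee_complex table sketched in the
Complex Types section, the following query accesses an array element and a map value:

hive> SELECT phones[0], skills['sql'] FROM employee_complex;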
Built-In Functions
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
floor() function
hive> SELECT floor(2.6) from temp;
ceil() function
hive> SELECT ceil(2.6) from temp;
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of these functions is the
same as the SQL aggregate functions.
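For instance, assuming the employee table used throughout this unit, the following query
applies several aggregate functions at once:

hive> SELECT count(*), max(Salary), avg(Salary) FROM employee;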
Creating a View
You can create a view at the time of executing a SELECT statement.
Example
Let us take an example for a view. Assume the employee table as given below, with the fields
Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
The following query creates the view using the above scenario:

hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;
Creating an Index
An index is nothing but a pointer on a particular column of a table. Creating an index means
creating a pointer on a particular column of a table. Its syntax is as follows:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name';

Example
Let us take an example for an index. Use the same employee table that we have used earlier
with the fields Id, Name, Salary, Designation, and Dept. The following query creates an index
named index_salary on the salary column of the employee table:

hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

It is a pointer to the salary column. If the column is modified, the changes are stored using an
index value.
Dropping an Index
The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:

hive> DROP INDEX index_salary ON employee;
HiveQL - Select-Where
Example
Let us take an example for the SELECT…WHERE clause. Assume we have the employee table as
given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details who earn a salary of more than Rs 30000.
The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;
HiveQL - Select-Order By
The ORDER BY clause is used to retrieve the details based on one column and sort the result
set by ascending or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[ORDER BY col_list [ASC | DESC]]
[LIMIT number];
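As a brief example, reusing the employee table from earlier, the following query sorts the
employee details by department:

hive> SELECT Id, Name, Dept FROM employee ORDER BY Dept;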
HiveQL - Select-Group By
The GROUP BY clause is used to group all the records in a result set using a particular
collection column. It is used to query a group of records.
Syntax
The syntax of the GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
GROUP BY col_list
[HAVING having_condition]
[LIMIT number];
Example
Let us take an example of SELECT…GROUP BY clause. Assume employee table as given
below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the
number of employees in each department.
The following query retrieves the number of employees in each department using the above
scenario:

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;
HiveQL - Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database. It is
more or less similar to SQL JOIN.
Syntax
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]
Example
We will use the following two tables in this chapter: a table named CUSTOMERS (with fields
ID, NAME, AGE, ADDRESS, and SALARY) and a table named ORDERS (with fields OID,
DATE, CUSTOMER_ID, and AMOUNT). There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same
as INNER JOIN in SQL. A JOIN condition is to be raised using the primary keys and foreign
keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves
the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN
still returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
The HiveQL FULL OUTER JOIN combines the records of both the left and the right tables that
fulfil the JOIN condition. The joined table contains either all the records from both the tables,
or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and
ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);