
Chapter 9

Introduction to Hive

Copyright 2015, WILEY INDIA PVT. LTD.


Learning Objectives and Learning Outcomes

Introduction to Hive

Learning Objectives
1. To study the Hive architecture.
2. To study the Hive file formats.
3. To study the Hive Query Language.

Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.



Agenda

 What is Hive?
 Hive Architecture
 Hive Data Types
 Primitive Data Types
 Collection Data Types
 Hive File Formats
 Text File
 Sequence File
 RCFile (Record Columnar File)
 Hive Queries



What is Hive?

Hive is a data warehousing tool built on top of Hadoop and used to query structured data. Facebook created Hive to manage its ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage

2. MapReduce for execution

3. An RDBMS for storing metadata

Hive is suitable for data warehousing applications that process batch jobs over huge volumes of immutable data. Examples: analysis of web logs and application logs.
Features of Hive

1. HQL is similar to SQL.

2. HQL is easy to code.

3. Hive supports rich data types/collection data types such as structs, lists (arrays), and maps.

4. Custom types and custom functions can be defined.

5. Hive supports SQL filters, group-by, and order-by clauses.

Points to remember

 Hive Query Language is similar to SQL; queries get compiled into MapReduce jobs that are then run on the Hadoop cluster.

 Hive's default metastore database is Derby.


Hive Integration and Workflow

Hourly log data can be stored directly into HDFS, and data cleansing is then performed on the log files. Finally, Hive tables can be created to query the log files.
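As an illustration of this workflow, the sketch below creates an external table over an HDFS log directory and queries it; the path and column names here are hypothetical, chosen only for illustration:

```sql
-- Assumes hourly logs have already been landed in /logs/weblogs on HDFS
-- (path and columns are hypothetical).
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  ip      STRING,
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/weblogs';

-- Query the cleansed log data
SELECT status, count(*) FROM weblogs GROUP BY status;
```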
Hive Data Units
• Databases

• Tables

• Partitions

• Buckets (Clusters)

• In Hive, a table is stored as a folder, a partition as a sub-directory within it, and a bucket as a file.

[Diagram: Database -> Tables -> Partitions / Buckets -> Columns -> Data]
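The folder / sub-directory / file mapping above can be seen in the warehouse layout on HDFS. The sketch below shows illustrative paths (the warehouse root and all names are assumptions):

```sql
-- Typical warehouse layout for a partitioned, bucketed table
-- (paths are illustrative; the warehouse root is configurable):
--   /user/hive/warehouse/students.db/          -- database: folder
--   /user/hive/warehouse/students.db/marks/    -- table: folder
--   .../marks/gpa=6.5/                         -- partition: sub-directory
--   .../marks/gpa=6.5/000000_0                 -- bucket: file
CREATE DATABASE IF NOT EXISTS students;
CREATE TABLE IF NOT EXISTS students.marks (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
CLUSTERED BY (rollno) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```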



Hive Architecture

CLI: Command-line interface to interact with Hive.

Web interface: A graphical user interface to interact with Hive.

Hive server (Thrift): Used to submit Hive jobs from a client. Apache Thrift is a software framework for scalable cross-language services development; it lets a web service written in one language be accessed from clients written in another.

JDBC/ODBC: Java (or other client) code can be written to connect to Hive and submit jobs on it.

Driver: Compiles, optimizes, and executes Hive queries.
HIVE METASTORE
The Hive metastore is a database that stores metadata about your Hive tables (e.g. table name, column names and types, table location, storage handler being used, number of buckets in the table, sorting columns if any, partition columns if any, etc.).

When you create a table, the metastore is updated with the information about the new table; this information is then consulted when you issue queries on that table.

Metadata includes:

Table definitions and mappings to the data

IDs of databases, tables, and indexes

Table creation time

Input/output format used for a table

Metadata is updated whenever a table is created in or deleted from Hive.


HIVE METASTORE

Three kinds:

Embedded metastore: Both the metastore database and the metastore service run embedded in the main HiveServer process.

Local metastore: The Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC.

Remote metastore: The Hive driver and the metastore interface run in different JVMs.
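The metastore database connection is configured in hive-site.xml. The fragment below is a sketch for a local metastore backed by MySQL; the host, database name, and credentials are placeholders:

```xml
<!-- Sketch of hive-site.xml settings for a local metastore backed by
     MySQL. Host, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```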
HIVE METASTORE

Embedded metastore: Use this mode for experimental purposes only. This is the default metastore deployment mode. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process. This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use.
HIVE METASTORE

Local metastore: In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process and can be on a separate host. Metadata can be stored in any RDBMS, such as MySQL.
Hive File Formats
A file format specifies how records are encoded in a file.
• Text File
The default file format is the text file, in which each record is a line in the file.

In a text file, different control characters are used as delimiters:

^A (octal 001) separates fields
^B (octal 002) separates elements in an array/struct
^C (octal 003) separates key-value pairs
\n is the record delimiter
Formats supported: CSV, TSV, XML, JSON
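A sketch of a table stored as a text file; the STORED AS TEXTFILE clause is optional since text is the default, and the table name is illustrative:

```sql
-- Text file is the default; STORED AS TEXTFILE makes it explicit.
-- Without a ROW FORMAT clause, fields are separated by ^A (octal 001).
CREATE TABLE IF NOT EXISTS logs_txt (line STRING)
STORED AS TEXTFILE;
```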

• Sequence File
Sequence files are flat files that store binary key-value pairs. They include compression support, which reduces I/O requirements.

• RCFile (Record Columnar File)
RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive.
RCFile
• It stores the data in a column-oriented manner to ensure that aggregation operations are not expensive.
• In Table 9.2, the table from Table 9.1 is first partitioned horizontally into row groups, like a row-oriented DBMS.
• Then, within every row group, RCFile partitions the data vertically, like a column store.
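The file format is chosen with a STORED AS clause at table creation time; a minimal sketch (the table name is illustrative):

```sql
-- Store a table in the RCFile format; SEQUENCEFILE or TEXTFILE
-- can be substituted in the same position.
CREATE TABLE IF NOT EXISTS student_rc (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;
```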
Hive Query Language (HQL)

 Create and manage tables and partitions.

 Support various relational, arithmetic, and logical operators.

 Evaluate functions.
 Download the contents of a table to a local directory, or the results of queries to an HDFS directory.
DDL and DML statements

• DDL statements are used to build and modify tables and other objects. They deal with:

1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe

• DML statements are used to retrieve, store, modify, delete, and update data in the database. They deal with:
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
Database

Starting the Hive shell

Go to the Hive installation path and type hive (Hadoop needs to be started in another terminal).

To create a database named "STUDENTS" with a comment and database properties:

CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');

hive> show databases;

hive> describe database students; //shows DB name, comment, and directory
hive> describe database extended students; //also shows the properties
Database

To alter the database properties:

hive> alter database students set dbproperties ('edited by' = 'David');

hive> describe database extended students;

hive> use students; //to make it the current database

hive> drop database students;



Tables
Hive provides two kinds of tables: managed and external tables.
Metadata of a table: When we create a table, the metastore keeps information about its location, schema, list of partitions, etc.

Managed Table
When a managed (internal) table is dropped, both the data and the metadata are dropped.
The table is stored under Hive's warehouse folder.
The life cycle of the table is managed by Hive.

External Table
When you drop an external table, the data is retained in the underlying location.
The EXTERNAL keyword is used to create an external table.
A location needs to be specified to store the dataset in that particular location.
hive> describe formatted <tablename>; //to see the table type
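A sketch contrasting the two drop behaviours (table names and the path are illustrative):

```sql
-- Managed: dropping removes the data under the warehouse folder as well.
CREATE TABLE managed_demo (id INT);
DROP TABLE managed_demo;              -- data and metadata both removed

-- External: dropping removes only the metadata; /demo_data survives.
CREATE EXTERNAL TABLE external_demo (id INT) LOCATION '/demo_data';
DROP TABLE external_demo;             -- files in /demo_data are retained
```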
Tables

To create a managed table named 'STUDENT':

Format
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] STUDENT(rollno INT, name STRING, gpa FLOAT) [COMMENT] ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> describe STUDENT; //shows the schema of the table



Tables

To create an external table named 'EXT_STUDENT':

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';

The default location of a Hive table is overridden by using LOCATION.

ALTER TABLE name RENAME TO new_name;

hive> ALTER TABLE employee RENAME TO emp;

DROP TABLE [IF EXISTS] table_name;

hive> DROP TABLE IF EXISTS employee;
Loading data into a table from a file
To load data into the table from a file named student.tsv:

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;

LOCAL is used to load data from the local file system. To load the data from HDFS, remove the LOCAL keyword.

To retrieve the student details from the EXT_STUDENT table:

hive> show tables;
hive> SELECT * FROM EXT_STUDENT;

hive> select rollno, name from EXT_STUDENT; //to view selected fields
Collection Data Types

ARRAY : ARRAY<data_type>
MAP : MAP<primitive_type, data_type>
STRUCT : STRUCT<col_name : data_type, ...>

hive> create table student_info(rollno INT, name STRING, sub ARRAY<STRING>, marks MAP<STRING,INT>, addr STRUCT<city:STRING, state:STRING, pin:BIGINT>) row format delimited fields terminated by ',' collection items terminated by ':' map keys terminated by '!';

hive> load data local inpath '/home/hduser/hive/collection_demo.csv' overwrite into table student_info;

Contents of collection_demo.csv (note: the raw file contains only the delimiters, no brackets or braces):
1001,John,English:Hindi,Mark1!45:Mark2!65,delhi:delhi:897654
1002,Jill,Physics:Maths,Mark1!43:Mark2!89,bhopal:MP:432024



Collection Data Types - Array
hive> select * from student_info;
1001 John ["English","Hindi"] {"Mark1":45,"Mark2":65}
1002 Jill ["Physics","Maths"] {"Mark1":43,"Mark2":89}

//Accessing complex data types

hive> select name, sub from student_info; //select the whole array
John ["English","Hindi"]
Jill ["Physics","Maths"]

hive> select name, sub[1] from student_info where rollno=1001;

hive> select name, sub[0] from student_info; //a single array element



Collection Data Types - Map
hive> select name, marks from student_info; //display the whole map collection
John {"Mark1":45,"Mark2":65}
Jill {"Mark1":43,"Mark2":89}

hive> select name, marks["Mark1"], marks["Mark2"] from student_info;
John 45 65
Jill 43 89

hive> select name, addr.city from student_info; //struct field access
John delhi
Jill bhopal

https://acadgild.com/blog/hive-complex-data-types-with-examples



Partitions -> To execute Hive queries faster
Very often users need to filter the data on specific column values. Example: select the employees whose salary is above 50000 from a table of 10 lakh entries.
If users understand the domain of the data on which they are doing analysis, they can identify frequently queried columns, and partitioning can then be applied on those columns.
When a WHERE clause is applied, Hive reads the entire dataset.
This decreases efficiency and becomes a bottleneck when we are required to run queries on large tables.
This issue can be overcome by implementing partitions in Hive.
Partition splits the large dataset into more meaningful chunks.

Two kinds of partitions:

1) Static partition (comprises columns whose values are known at compile time)

2) Dynamic partition (comprises columns whose values are known only at execution time)

Static partition:

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/hduser/hive/student_data.csv' OVERWRITE INTO TABLE student;

create table if not exists static_student(rollno int, name string) partitioned by (gpa float) row format delimited fields terminated by ',';

insert overwrite table static_student partition (gpa=6.5) select rollno, name from student where gpa=6.5;

alter table static_student add partition (gpa=8.0);

select * from static_student;
Dynamic partition:

To use dynamic partitioning in Hive we need to set the parameters below in the Hive shell or in the hive-site.xml file:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

create table if not exists dynamic_student(rollno int, name string) partitioned by (gpa float) row format delimited fields terminated by ',';

insert overwrite table dynamic_student partition (gpa) select rollno, name, gpa from student;

select * from dynamic_student;
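To verify which partitions the insert created, Hive provides the SHOW PARTITIONS statement; a short sketch:

```sql
-- Lists one row per partition directory, e.g. gpa=6.5
SHOW PARTITIONS dynamic_student;

-- A query filtering on the partition column reads only the
-- matching sub-directory instead of the whole dataset.
SELECT * FROM dynamic_student WHERE gpa = 6.5;
```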
Buckets

To use bucketing in Hive we need to set the parameter below in the Hive shell or in the hive-site.xml file:
set hive.enforce.bucketing=true;

• To create a bucketed table having 3 buckets:

CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, gpa FLOAT) CLUSTERED BY (gpa) INTO 3 BUCKETS;

• To load data into the bucketed table:

FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET SELECT rollno, name, gpa;

• To display the content of the first bucket:

SELECT DISTINCT gpa FROM STUDENT_BUCKET TABLESAMPLE(BUCKET 1 OUT OF 3 ON gpa);
When you create a table and bucket it using the CLUSTERED BY clause into 32 buckets (as an example), Hive distributes your data into 32 buckets using deterministic hash functions. Then when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x'th bucket of each group. For example:

If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive would divide your 32 buckets into groups of 8 buckets, resulting in 4 groups of 8 buckets, and then pick the 6th bucket of each group, hence picking buckets 6, 14, 22, and 30.

If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive would divide your 32 buckets into groups of 32, resulting in only 1 group of 32 buckets, and then pick the 23rd bucket as your result.
Partitioning vs Bucketing

Both partitioning and bucketing slice the data so that queries execute much more efficiently than on the non-sliced data.

The major difference is that the number of slices keeps changing in the case of partitioning as the data is modified, while with bucketing the number of slices is fixed, specified while creating the table.

Bucketing happens by applying a hash function and then a modulo on the number of buckets, so a row may get inserted into any of the buckets. Bucketing can be used for sampling data, as well as for joining two data sets much more effectively, among other uses.

When we partition, we create a partition for each unique value of the column, but there may be situations where this creates a lot of tiny partitions.

If you use bucketing instead, you can limit the slices to a number you choose and decompose your data into that many buckets. In Hive a partition is a directory, but a bucket is a file.
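The two techniques can also be combined on one table; a sketch, with illustrative column choices:

```sql
-- Partition on a low-cardinality column, bucket on a high-cardinality one.
CREATE TABLE IF NOT EXISTS student_part_bucket (
  rollno INT,
  name   STRING
)
PARTITIONED BY (gpa FLOAT)           -- one sub-directory per gpa value
CLUSTERED BY (rollno) INTO 4 BUCKETS -- four files inside each partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```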
Aggregations
Hive supports aggregation functions like avg, count, etc.

To use the average and count aggregation functions:

SELECT avg(gpa) FROM STUDENT;

SELECT count(*) FROM STUDENT;



Group by and Having

The HAVING clause enables you to specify conditions that filter which group results appear in the final results.

select gpa from student group by gpa having count(gpa) >= 3;



VIEWS: Views in Hive, as in SQL, are a kind of virtual table. A view has rows and columns just like a real table in the database. We can create a view by selecting fields from one or more tables present in the database. A view can contain either all the rows of a table or specific rows selected by a condition.

CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM StudentDetails WHERE S_ID < 5;

CREATE VIEW MarksView AS SELECT StudentDetails.NAME, StudentDetails.ADDRESS, StudentMarks.MARKS FROM StudentDetails, StudentMarks WHERE StudentDetails.NAME = StudentMarks.NAME;



Join
Join - similar to the SQL JOIN.
JOIN is a clause used for combining specific fields from two tables by using values common to each. It combines records from two or more tables in the database.
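A sketch of an inner join between the two hypothetical tables used on the Views slide:

```sql
-- Inner join: one result row per student present in both tables.
SELECT d.NAME, d.ADDRESS, m.MARKS
FROM StudentDetails d
JOIN StudentMarks m
  ON d.NAME = m.NAME;
```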



SerDe
A serializer converts Java objects into something that Hive can write to HDFS. A deserializer converts the binary representation of records back into Java objects.

• SerDe stands for Serializer/Deserializer.

• It contains the logic to convert unstructured data into records.

• It is implemented using Java.

• Serializers are used at the time of writing.

• Deserializers are used at query time (SELECT statements).
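A custom SerDe is attached with the ROW FORMAT SERDE clause. The sketch below uses Hive's built-in OpenCSVSerde; the table name and location are illustrative:

```sql
-- OpenCSVSerde handles quoted CSV fields that the default
-- LazySimpleSerDe would split incorrectly. Note that OpenCSVSerde
-- exposes all columns as STRING.
CREATE EXTERNAL TABLE IF NOT EXISTS student_csv (
  rollno STRING, name STRING, gpa STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/student_csv';
```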



Answer a few quick questions …



Fill in the blanks

 The metastore consists of __________ and a __________.
 The most commonly used interface to interact with Hive is __________.
 The default metastore for Hive is __________.
 Metastore contains __________ of Hive tables.
 __________ is responsible for compilation, optimization, and execution of Hive queries.



Summary please…

Ask a few participants of the learning program to summarize the lecture.



References …



Further Readings

 http://en.wikipedia.org/wiki/RCFile
 https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML



Thank you

