
Chapter 9

Introduction to Hive

Copyright 2015, WILEY INDIA PVT. LTD.


Learning Objectives and Learning Outcomes

Introduction to Hive

Learning Objectives
1. To study the Hive architecture.
2. To study the Hive file formats.
3. To study the Hive Query Language.

Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation statements on them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.



Agenda

 What is Hive?
 Hive Architecture
 Hive Data Types
 Primitive Data Types
 Collection Data Types
 Hive File Formats
 Text File
 Sequence File
 RCFile (Record Columnar File)
 Hive Queries



What is Hive?

Hive is a data warehousing tool built on top of Hadoop and used to query structured data. Facebook created Hive to manage its ever-growing volumes of data. Hive makes use of the following:
1. HDFS for storage

2. MapReduce for execution

3. An RDBMS for storing metadata

Hive is suitable for data warehousing applications that process batch jobs over huge volumes of immutable data. Examples: analysis of web logs and application logs.
Features of Hive

1. HQL is similar to SQL.

2. HQL is easy to code.

3. Hive supports rich data types/collection data types such as structs, lists (arrays), and maps.

4. Custom types and custom functions can be defined.

5. Hive supports SQL filters, group-by, and order-by clauses.

Points to remember

 Hive Query Language is similar to SQL; queries get compiled into MapReduce jobs that are then run on the Hadoop cluster.

 Hive's default metastore database is Derby.


Hive Integration and Workflow

Hourly log data can be stored directly into HDFS, and data cleansing is then performed on the log files. Finally, Hive tables can be created to query the log files.
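As an illustration of this workflow, the sketch below creates an external table over an HDFS log directory and queries it; the path and column names here are hypothetical, chosen only for illustration:

```sql
-- Assumes hourly logs have already been landed in /logs/weblogs on HDFS
-- (path and columns are hypothetical).
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  ip      STRING,
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/weblogs';

-- Query the cleansed log data
SELECT status, count(*) FROM weblogs GROUP BY status;
```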
Hive Data Units
• Databases

• Tables

• Partitions

• Buckets (Clusters)

• In Hive, a table is stored as a folder, a partition as a sub-directory within it, and a bucket as a file.

[Diagram: Database -> Tables -> Partitions / Buckets -> Columns -> Data]
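The folder / sub-directory / file mapping above can be seen in the warehouse layout on HDFS. The sketch below shows illustrative paths (the warehouse root and all names are assumptions):

```sql
-- Typical warehouse layout for a partitioned, bucketed table
-- (paths are illustrative; the warehouse root is configurable):
--   /user/hive/warehouse/students.db/          -- database: folder
--   /user/hive/warehouse/students.db/marks/    -- table: folder
--   .../marks/gpa=6.5/                         -- partition: sub-directory
--   .../marks/gpa=6.5/000000_0                 -- bucket: file
CREATE DATABASE IF NOT EXISTS students;
CREATE TABLE IF NOT EXISTS students.marks (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
CLUSTERED BY (rollno) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```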



Hive Architecture

CLI: Command-line interface to interact with Hive.

Web interface: A graphical user interface to interact with Hive.

Hive server (Thrift): Used to submit Hive jobs from a client. Apache Thrift is a software framework for scalable cross-language services development; it lets a web service written in one language be accessed from clients written in another.

JDBC/ODBC: Java (or other client) code can be written to connect to Hive and submit jobs on it.

Driver: Compiles, optimizes, and executes Hive queries.
HIVE METASTORE
The Hive metastore is a database that stores metadata about your Hive tables (e.g. table name, column names and types, table location, storage handler being used, number of buckets in the table, sorting columns if any, partition columns if any, etc.).

When you create a table, the metastore is updated with the information about the new table; this information is then consulted when you issue queries on that table.

Metadata includes:

Table definitions and mappings to the data

IDs of databases, tables, and indexes

Table creation time

Input/output format used for a table

Metadata is updated whenever a table is created in or deleted from Hive.


HIVE METASTORE

Three kinds:

Embedded metastore: Both the metastore database and the metastore service run embedded in the main HiveServer process.

Local metastore: The Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC.

Remote metastore: The Hive driver and the metastore interface run in different JVMs.
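The metastore database connection is configured in hive-site.xml. The fragment below is a sketch for a local metastore backed by MySQL; the host, database name, and credentials are placeholders:

```xml
<!-- Sketch of hive-site.xml settings for a local metastore backed by
     MySQL. Host, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```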
HIVE METASTORE

Embedded metastore: Use this mode for experimental purposes only. This is the default metastore deployment mode. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process. This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use.
HIVE METASTORE

Local metastore: In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process and can be on a separate host. Metadata can be stored in any RDBMS, such as MySQL.
Hive File Formats
A file format specifies how records are encoded in a file.
• Text File
The default file format is the text file, in which each record is a line in the file.

In a text file, different control characters are used as delimiters:

^A (octal 001) separates fields
^B (octal 002) separates elements in an array/struct
^C (octal 003) separates key-value pairs
\n is the record delimiter
Formats supported: CSV, TSV, XML, JSON
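A sketch of a table stored as a text file; the STORED AS TEXTFILE clause is optional since text is the default, and the table name is illustrative:

```sql
-- Text file is the default; STORED AS TEXTFILE makes it explicit.
-- Without a ROW FORMAT clause, fields are separated by ^A (octal 001).
CREATE TABLE IF NOT EXISTS logs_txt (line STRING)
STORED AS TEXTFILE;
```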

• Sequence File
Sequence files are flat files that store binary key-value pairs. They include compression support, which reduces I/O requirements.

• RCFile (Record Columnar File)
RCFile stores the data in a column-oriented manner, which ensures that aggregation operations are not expensive.
RCFile
• It stores the data in a column-oriented manner to ensure that aggregation operations are not expensive.
• In Table 9.2, the table from Table 9.1 is first partitioned horizontally into row groups, like a row-oriented DBMS.
• Then, within every row group, RCFile partitions the data vertically, like a column store.
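The file format is chosen with a STORED AS clause at table creation time; a minimal sketch (the table name is illustrative):

```sql
-- Store a table in the RCFile format; SEQUENCEFILE or TEXTFILE
-- can be substituted in the same position.
CREATE TABLE IF NOT EXISTS student_rc (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;
```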
Hive Query Language (HQL)

 Create and manage tables and partitions.

 Support various relational, arithmetic, and logical operators.

 Evaluate functions.
 Download the contents of a table to a local directory, or the results of queries to an HDFS directory.
DDL and DML statements

• DDL statements are used to build and modify tables and other objects. They deal with:

1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe

• DML statements are used to retrieve, store, modify, delete, and update data in the database. They deal with:
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
Database

Starting the Hive shell

Go to the Hive installation path and type hive (Hadoop needs to be started in another terminal).

To create a database named "STUDENTS" with a comment and database properties:

CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');

hive> show databases;

hive> describe database students; //shows DB name, comment, and directory
hive> describe database extended students; //also shows the properties
Database

To alter the database properties:

hive> alter database students set dbproperties ('edited by' = 'David');

hive> describe database extended students;

hive> use students; //to make it the current database

hive> drop database students;



Tables
Hive provides two kinds of tables: managed and external tables.
Metadata of a table: When we create a table, the metastore keeps information about its location, schema, list of partitions, etc.

Managed Table
When a managed (internal) table is dropped, both the data and the metadata are dropped.
The table is stored under Hive's warehouse folder.
The life cycle of the table is managed by Hive.

External Table
When you drop an external table, the data is retained in the underlying location.
The EXTERNAL keyword is used to create an external table.
A location needs to be specified to store the dataset in that particular location.
hive> describe formatted <tablename>; //to see the table type
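A sketch contrasting the two drop behaviours (table names and the path are illustrative):

```sql
-- Managed: dropping removes the data under the warehouse folder as well.
CREATE TABLE managed_demo (id INT);
DROP TABLE managed_demo;              -- data and metadata both removed

-- External: dropping removes only the metadata; /demo_data survives.
CREATE EXTERNAL TABLE external_demo (id INT) LOCATION '/demo_data';
DROP TABLE external_demo;             -- files in /demo_data are retained
```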
Tables

To create a managed table named 'STUDENT':

Format
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] STUDENT(rollno INT, name STRING, gpa FLOAT) [COMMENT] ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> describe STUDENT; //shows the schema of the table



Tables

To create an external table named 'EXT_STUDENT':

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';

The default location of a Hive table is overridden by using LOCATION.

ALTER TABLE name RENAME TO new_name;

hive> ALTER TABLE employee RENAME TO emp;

DROP TABLE [IF EXISTS] table_name;

hive> DROP TABLE IF EXISTS employee;
Loading data into a table from a file
To load data into the table from a file named student.tsv:

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;

LOCAL is used to load data from the local file system. To load the data from HDFS, remove the LOCAL keyword.

To retrieve the student details from the EXT_STUDENT table:

hive> show tables;
hive> SELECT * FROM EXT_STUDENT;

hive> select rollno, name from EXT_STUDENT; //to view selected fields
Collection Data Types

ARRAY : ARRAY<data_type>
MAP : MAP<primitive_type, data_type>
STRUCT : STRUCT<col_name : data_type, ...>

hive> create table student_info(rollno INT, name STRING, sub ARRAY<STRING>, marks MAP<STRING,INT>, addr STRUCT<city:STRING, state:STRING, pin:BIGINT>) row format delimited fields terminated by ',' collection items terminated by ':' map keys terminated by '!';

hive> load data local inpath '/home/hduser/hive/collection_demo.csv' overwrite into table student_info;

Contents of collection_demo.csv (note: the raw file contains only the delimiters, no brackets or braces):
1001,John,English:Hindi,Mark1!45:Mark2!65,delhi:delhi:897654
1002,Jill,Physics:Maths,Mark1!43:Mark2!89,bhopal:MP:432024



Collection Data Types - Array
hive> select * from student_info;
1001 John ["English","Hindi"] {"Mark1":45,"Mark2":65}
1002 Jill ["Physics","Maths"] {"Mark1":43,"Mark2":89}

//Accessing complex data types

hive> select name, sub from student_info; //select the whole array
John ["English","Hindi"]
Jill ["Physics","Maths"]

hive> select name, sub[1] from student_info where rollno=1001;

hive> select name, sub[0] from student_info; //a single array element



Collection Data Types - Map
hive> select name, marks from student_info; //display the whole map collection
John {"Mark1":45,"Mark2":65}
Jill {"Mark1":43,"Mark2":89}

hive> select name, marks["Mark1"], marks["Mark2"] from student_info;
John 45 65
Jill 43 89

hive> select name, addr.city from student_info; //struct field access
John delhi
Jill bhopal

https://acadgild.com/blog/hive-complex-data-types-with-examples



Partitions -> To execute Hive queries faster
Very often users need to filter the data on specific column values. Example: select the employees whose salary is above 50000 from a table of 10 lakh entries.
If users understand the domain of the data on which they are doing analysis, they can identify frequently queried columns, and partitioning can then be applied on those columns.
When a WHERE clause is applied, Hive reads the entire dataset.
This decreases efficiency and becomes a bottleneck when we are required to run queries on large tables.
This issue can be overcome by implementing partitions in Hive.
Partition splits the large dataset into more meaningful chunks.

Two kinds of partitions:

1) Static partition (comprises columns whose values are known at compile time)

2) Dynamic partition (comprises columns whose values are known only at execution time)

Static partition:

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT, name STRING, gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/hduser/hive/student_data.csv' OVERWRITE INTO TABLE student;

create table if not exists static_student(rollno int, name string) partitioned by (gpa float) row format delimited fields terminated by ',';

insert overwrite table static_student partition (gpa=6.5) select rollno, name from student where gpa=6.5;

alter table static_student add partition (gpa=8.0);

select * from static_student;
Dynamic partition:

To use dynamic partitioning in Hive we need to set the parameters below in the Hive shell or in the hive-site.xml file:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

create table if not exists dynamic_student(rollno int, name string) partitioned by (gpa float) row format delimited fields terminated by ',';

insert overwrite table dynamic_student partition (gpa) select rollno, name, gpa from student;

select * from dynamic_student;
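To verify which partitions the insert created, Hive provides the SHOW PARTITIONS statement; a short sketch:

```sql
-- Lists one row per partition directory, e.g. gpa=6.5
SHOW PARTITIONS dynamic_student;

-- A query filtering on the partition column reads only the
-- matching sub-directory instead of the whole dataset.
SELECT * FROM dynamic_student WHERE gpa = 6.5;
```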
Buckets

To use bucketing in Hive we need to set the parameter below in the Hive shell or in the hive-site.xml file:
set hive.enforce.bucketing=true;

• To create a bucketed table having 3 buckets:

CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, gpa FLOAT) CLUSTERED BY (gpa) INTO 3 BUCKETS;

• To load data into the bucketed table:

FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET SELECT rollno, name, gpa;

• To display the content of the first bucket:

SELECT DISTINCT gpa FROM STUDENT_BUCKET TABLESAMPLE(BUCKET 1 OUT OF 3 ON gpa);
When you create a table and bucket it using the CLUSTERED BY clause into 32 buckets (as an example), Hive distributes your data into 32 buckets using deterministic hash functions. Then when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x'th bucket of each group. For example:

If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive would divide your 32 buckets into groups of 8 buckets, resulting in 4 groups of 8 buckets, and then pick the 6th bucket of each group, hence picking buckets 6, 14, 22, and 30.

If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive would divide your 32 buckets into groups of 32, resulting in only 1 group of 32 buckets, and then pick the 23rd bucket as your result.
Partitioning vs Bucketing

Both partitioning and bucketing slice the data so that queries execute much more efficiently than on the non-sliced data.

The major difference is that the number of slices keeps changing in the case of partitioning as the data is modified, while with bucketing the number of slices is fixed, specified while creating the table.

Bucketing happens by applying a hash function and then a modulo on the number of buckets, so a row may get inserted into any of the buckets. Bucketing can be used for sampling data, as well as for joining two data sets much more effectively, among other uses.

When we partition, we create a partition for each unique value of the column, but there may be situations where this creates a lot of tiny partitions.

If you use bucketing instead, you can limit the slices to a number you choose and decompose your data into that many buckets. In Hive a partition is a directory, but a bucket is a file.
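The two techniques can also be combined on one table; a sketch, with illustrative column choices:

```sql
-- Partition on a low-cardinality column, bucket on a high-cardinality one.
CREATE TABLE IF NOT EXISTS student_part_bucket (
  rollno INT,
  name   STRING
)
PARTITIONED BY (gpa FLOAT)           -- one sub-directory per gpa value
CLUSTERED BY (rollno) INTO 4 BUCKETS -- four files inside each partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```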
Aggregations
Hive supports aggregation functions like avg, count, etc.

To use the average and count aggregation functions:

SELECT avg(gpa) FROM STUDENT;

SELECT count(*) FROM STUDENT;



Group by and Having

The HAVING clause enables you to specify conditions that filter which group results appear in the final results.

select gpa from student group by gpa having count(gpa) >= 3;



VIEWS: Views in Hive, as in SQL, are a kind of virtual table. A view has rows and columns just like a real table in the database. We can create a view by selecting fields from one or more tables present in the database. A view can contain either all the rows of a table or specific rows selected by a condition.

CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM StudentDetails WHERE S_ID < 5;

CREATE VIEW MarksView AS SELECT StudentDetails.NAME, StudentDetails.ADDRESS, StudentMarks.MARKS FROM StudentDetails, StudentMarks WHERE StudentDetails.NAME = StudentMarks.NAME;



Join
Join - similar to the SQL JOIN.
JOIN is a clause used for combining specific fields from two tables by using values common to each. It combines records from two or more tables in the database.
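A sketch of an inner join between the two hypothetical tables used on the Views slide:

```sql
-- Inner join: one result row per student present in both tables.
SELECT d.NAME, d.ADDRESS, m.MARKS
FROM StudentDetails d
JOIN StudentMarks m
  ON d.NAME = m.NAME;
```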



SerDe
A serializer converts Java objects into something that Hive can write to HDFS. A deserializer converts the binary representation of records back into Java objects.

• SerDe stands for Serializer/Deserializer.

• It contains the logic to convert unstructured data into records.

• It is implemented using Java.

• Serializers are used at the time of writing.

• Deserializers are used at query time (SELECT statements).
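A custom SerDe is attached with the ROW FORMAT SERDE clause. The sketch below uses Hive's built-in OpenCSVSerde; the table name and location are illustrative:

```sql
-- OpenCSVSerde handles quoted CSV fields that the default
-- LazySimpleSerDe would split incorrectly. Note that OpenCSVSerde
-- exposes all columns as STRING.
CREATE EXTERNAL TABLE IF NOT EXISTS student_csv (
  rollno STRING, name STRING, gpa STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/student_csv';
```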



Answer a few quick questions …



Fill in the blanks

 The metastore consists of __________ and a __________.
 The most commonly used interface to interact with Hive is __________.
 The default metastore for Hive is __________.
 Metastore contains __________ of Hive tables.
 __________ is responsible for compilation, optimization, and execution of Hive queries.



Summary please…

Ask a few participants of the learning program to summarize the lecture.



References …



Further Readings

 http://en.wikipedia.org/wiki/RCFile
 https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML



Thank you

