Hive
Introduction to Hive
What is Hive?
Hive Architecture
Hive Data Types
Primitive Data Types
Collection Data Types
Hive File Format
Text File
Sequential File
RCFile (Record Columnar File)
Hive Queries
HiveQL, the Hive query language, is similar to SQL. Hive organizes its data into:
• Tables
• Partitions
• Buckets (Clusters)
When you create a table, this metastore gets updated with the
information related to the new table which gets queried when you
issue queries on that table.
Metadata includes: table and partition definitions, column names and types, and the location of the data in HDFS.
HIVE METASTORE
Three kinds of metastore deployment modes:
1. Embedded metastore: Use this mode for experimental purposes only. This is the default metastore deployment mode. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process. This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use.
2. Local metastore: In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host. Metadata can be stored in any RDBMS, such as MySQL.
3. Remote metastore: In this mode the metastore service runs in its own separate process, and HiveServer and other clients connect to it over the network.
Hive File Format: It specifies how records are encoded in a file.
• Text File
The default file format is text file. Each record is a line in the file.
• Sequential File
Sequential files are flat files that store binary key–value pairs. They include compression support, which reduces the I/O requirement.
• RCFile (Record Columnar File)
RCFile stores records in a column-oriented fashion: rows are partitioned into row groups, and within each row group the data is stored column by column, which improves compression and lets queries read only the columns they need.
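The file format is chosen at table-creation time with the STORED AS clause. A minimal sketch (the table and column names here are illustrative, not from the deck):

hive> CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
hive> CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;
hive> CREATE TABLE logs_rc (id INT, msg STRING) STORED AS RCFILE;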
• Evaluate functions.
• Download the contents of a table to a local directory, or write the result of queries to an HDFS directory.
DDL and DML statements
• DDL statements are used to build and modify tables and other
objects. They deal with:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe
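A few of the DDL statements listed above, sketched against an illustrative database and table (names are assumptions, not from the deck):

hive> CREATE DATABASE IF NOT EXISTS college;
hive> USE college;
hive> CREATE TABLE student (rollno INT, name STRING);
hive> ALTER TABLE student ADD COLUMNS (gpa FLOAT);
hive> SHOW TABLES;
hive> DESCRIBE student;
hive> DROP TABLE IF EXISTS student;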
Managed Table
• When the internal (managed) table is dropped, both the data and the metadata are dropped.
• The table is stored under Hive's warehouse folder.
• The life cycle of the table is managed by Hive.
External Table
• When you drop this table, the data in the underlying location is retained.
• The EXTERNAL keyword is used to create an external table.
• A LOCATION needs to be specified to store the dataset in that particular location.
hive> DESCRIBE FORMATTED <tablename>;   (to see the table type)
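A sketch of the external-table life cycle; the HDFS path is illustrative:

hive> CREATE EXTERNAL TABLE ext_student (rollno INT, name STRING, gpa FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/hduser/student_data';
hive> DROP TABLE ext_student;
-- only the metadata is removed; the files under /user/hduser/student_data
-- remain in HDFS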
Copyright 2015, WILEY INDIA PVT. LTD.
Tables
Format
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] STUDENT (rollno INT, name
STRING, gpa FLOAT) [COMMENT 'table_comment'] ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t';
LOCAL is used to load data from the local file system. To load the data from
HDFS, remove the LOCAL keyword.
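Both variants side by side (the file paths are illustrative); note that loading from HDFS moves the file into the table's directory rather than copying it:

hive> LOAD DATA LOCAL INPATH '/home/hduser/student.tsv' INTO TABLE student;
hive> LOAD DATA INPATH '/user/hduser/student.tsv' INTO TABLE student;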
hive>show tables;
hive> SELECT * FROM ext_student;
hive> SELECT rollno, name FROM student;   // to view selected fields
Collection Data Types
hive> CREATE TABLE student_info (rollno INT, name STRING, sub ARRAY<STRING>,
marks MAP<STRING,INT>, addr STRUCT<city:STRING, state:STRING, pin:BIGINT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '!';
1001,John,[English:Hindi],{Mark1!45:Mark2!65},{delhi:delhi:897654}
1002,Jill,[Physics:Maths],{Mark1!43:Mark2!89},{bhopal:MP:432024}
https://acadgild.com/blog/hive-complex-data-types-with-examples
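Individual elements of the collection columns can then be addressed by index, key, and field name (a sketch against the student_info table above; ARRAY indexes start at 0):

hive> SELECT name, sub[0], marks['Mark1'], addr.city FROM student_info;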
Two kinds of partitions:
1) Static partition (it comprises columns whose values are known at compile time)
2) Dynamic partition (it comprises columns whose values are known only at execution time)
Static partition:
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/hduser/hive/student_data.csv' OVERWRITE
INTO TABLE student;

CREATE TABLE IF NOT EXISTS static_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT OVERWRITE TABLE static_student PARTITION (gpa=6.5)
SELECT rollno, name FROM student WHERE gpa=6.5;

ALTER TABLE static_student ADD PARTITION (gpa=8.0);

SELECT * FROM static_student;
Dynamic partition:
l
To use the dynamic partitioning in hive we need to set the
below parameters in hive shell or in hive-site.xml file.
l
Now we will enable the dynamic partition using the following commands are as follows.
select * from
dynamic_student;
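A sketch of populating a dynamically partitioned table; dynamic_student is assumed to mirror static_student, and Hive takes the partition value from the last column of the SELECT:

hive> CREATE TABLE IF NOT EXISTS dynamic_student (rollno INT, name STRING)
    > PARTITIONED BY (gpa FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> INSERT OVERWRITE TABLE dynamic_student PARTITION (gpa)
    > SELECT rollno, name, gpa FROM student;
    -- one partition is created per distinct gpa value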
Buckets
To use bucketing in Hive we need to set the parameter below in the Hive shell or in the hive-site.xml file.
set hive.enforce.bucketing=true;
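A bucketed table might then be created and populated as follows — a sketch using a hypothetical bucketed_student table clustered into 32 buckets on rollno, matching the sampling examples that follow:

hive> CREATE TABLE bucketed_student (rollno INT, name STRING, gpa FLOAT)
    > CLUSTERED BY (rollno) INTO 32 BUCKETS
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> INSERT OVERWRITE TABLE bucketed_student
    > SELECT rollno, name, gpa FROM student;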
Assuming a table clustered into 32 buckets:

If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive would divide your 32
buckets into 4 groups of 8 buckets and then pick the 6th bucket of each
group, hence picking buckets 6, 14, 22 and 30.

If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive would divide your
32 buckets into a single group of 32 buckets and then pick the 23rd
bucket as your result.
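As a concrete query (against a hypothetical bucketed_student table clustered on rollno into 32 buckets):

hive> SELECT * FROM bucketed_student TABLESAMPLE(BUCKET 6 OUT OF 8 ON rollno);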
Partitioning VS Bucketing
Basically, both partitioning and bucketing slice the data so that queries execute much more efficiently than on the non-sliced data.
The major difference is that the number of slices keeps changing in the case of partitioning as data is modified, whereas with bucketing the number of slices is fixed and specified while creating the table.
Bucketing happens by applying a hash function to the bucketing column and then taking the result modulo the number of buckets, so a row may get inserted into any of the buckets. Bucketing can be used for sampling data, as well as for joining two data sets much more effectively.
When we do partitioning, we create a partition for each unique value of the column, but there may be situations where this creates a lot of tiny partitions. If you use bucketing, you can limit the slices to a number you choose and decompose your data into that many buckets. In Hive a partition is a directory, but a bucket is a file.
Aggregations
Hive supports aggregation functions like avg, count, etc.
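For example, against the student table used earlier (the grouping column is illustrative):

hive> SELECT count(*), avg(gpa), max(gpa) FROM student;
hive> SELECT gpa, count(*) FROM student GROUP BY gpa;   -- grouped aggregation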
http://en.wikipedia.org/wiki/RCFile
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML