6.1 NoSQL: Apache Hive
Vinu Venugopal
ScaDS Lab, IIIT Bangalore
NoSQL Systems
Hive and HiveQL
• Hive is an ETL and data-warehouse tool on top of the Hadoop ecosystem
• developed at Facebook in 2007-2008
• used for querying and analyzing large datasets stored in Hadoop files
• provides a SQL-like declarative language called HiveQL
• the Hive engine compiles these queries into MapReduce jobs to be executed
  on Hadoop
Hive Access Modes
• Hive shell
• Interactive mode
% beeline -u jdbc:hive2://
hive> SHOW TABLES;
OK
Time taken: 18.457 seconds
Hive Data Model
The basic data model of Hive is, again, nested relations (called “tables”).
Elements of a relation may be of a simple type or of a complex type.
• 10 simple (atomic) data types:
TINYINT (1-byte signed integer), SMALLINT (2-byte signed integer), INT
(4-byte signed integer), BIGINT (8-byte signed integer), FLOAT (4-byte
single-precision float), DOUBLE (8-byte double-precision float), BOOLEAN (true/false,
1-byte), STRING (char array), BINARY (byte array), TIMESTAMP (8-bytes)
• 3 complex types:
ARRAY, STRUCT, MAP which are created via the built-in functions array(),
struct(), map().
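The constructor functions can be used directly inside a query; the values and the table name below are illustrative, not from the slides:

```sql
SELECT array(1, 2, 3),            -- ARRAY<INT>
       map('a', 1, 'b', 2),       -- MAP<STRING, INT>
       struct('x', 10, 1.5)       -- STRUCT<col1:STRING, col2:INT, col3:DOUBLE>
FROM some_tbl LIMIT 1;
```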
Operators and Built-In Functions
HiveQL supports the usual set of SQL operators:
SELECT… FROM… WHERE… GROUP BY… HAVING… ORDER BY… LIMIT;
Built-in functions:
• Mathematical functions, e.g., round(DOUBLE a)
• Collection functions, e.g., size(Map<K,V>)
• Type-conversion functions, e.g., binary(string|binary)
• Date functions, e.g., to_date(string timestamp)
• Conditional functions, e.g., isnull(a)
• String functions, e.g., reverse(string A)
• Data-masking functions, e.g., mask_first_n(string str, int n)
Example: Table with Complex Types
hive> CREATE TABLE complex_table (
col0 INT,
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>);
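Complex-type columns are accessed with index, key, and field syntax; a short sketch against the table above (the map key is illustrative):

```sql
SELECT col1[0],       -- array element by index
       col2['key'],   -- map value by key
       col3.a         -- struct field by name
FROM complex_table;
```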
Tables
Hive distinguishes two kinds of tables: managed tables and external tables.
• Managed Tables
  • internal tables: Hive controls the lifecycle of their data
  • data is physically moved into Hive's warehouse location in HDFS
  • when loading data into a table, Hive keeps it under a default directory,
    hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), which can again
    be a distributed file-system location
  • we can override this location using the LOCATION keyword
  • as Hive has complete control over the data, dropping a managed table
    deletes the entire data
  • this is inconvenient when we want to share the data among multiple tools
    (e.g., Pig); if we don't want to give ownership of the data to Hive alone,
    we use an external table instead
Tables
A managed table is, by default, stored at Hive's own internal location (another
location in HDFS); the default can be overridden at creation time:

hive> CREATE TABLE managed_tbl (field0 STRING)
      LOCATION '/mytables/managed_tbl_table';
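For sharing data with other tools, Hive also supports external tables: Hive records the schema but does not own the files, so dropping the table removes only the metadata. A minimal sketch (the path is illustrative):

```sql
-- External table: Hive stores only the schema in the Metastore.
CREATE EXTERNAL TABLE external_tbl (field0 STRING)
LOCATION '/shared/external_tbl';

-- DROP TABLE external_tbl;  -- deletes the Metastore entry; the HDFS files stay
```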
Schema Management
Issuing a CREATE TABLE statement in Hive (both for a managed and for an
external table) performs two basic steps:
1. The physical files (and subdirectory structure) for storing the
table are created.
2. The new table's schema (i.e., its attributes, attribute types,
and file mappings) is stored in Hive's so-called Metastore
(which is simply another DBMS service, usually MySQL).
Partitions & Buckets
• Partitions
  • even a simple query in Hive reads the entire dataset
  • partitions are horizontal slices of the data

Input file (schema: id, name, dept, yoj):

  tab1/clientdata/file1
  1, sunny, SC, 2009
  2, sam, HR, 2009
  3, bob, SC, 2010
  4, claire, TP, 2010

Load these files as different partitions:

  tab1/clientdata/2009/file2        tab1/clientdata/2010/file3
  1, sunny, SC, 2009                3, bob, SC, 2010
  2, sam, HR, 2009                  4, claire, TP, 2010

CREATE TABLE studentTab (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);
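A partition can then be populated statically, and a query filtering on the partition column reads only that slice (paths are illustrative):

```sql
-- Load one year's file into its partition:
LOAD DATA INPATH '/clientdata/2009/file2'
INTO TABLE studentTab PARTITION (year = '2009');

-- Partition pruning: only the year=2009 directory is scanned.
SELECT name FROM studentTab WHERE year = '2009';
```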
• Buckets
  • partitions can be further subdivided into more manageable parts known as
    buckets (or clusters)
  • rows are assigned to buckets by a hash function, which depends on the type
    of the bucketing column
  • the CLUSTERED BY clause is used to divide the table into buckets
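A partitioned and bucketed variant of the student table above might look as follows (the bucket count is illustrative):

```sql
CREATE TABLE studentTabBuck (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;   -- rows assigned by hash(id) mod 4
```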
Map-Side Join in Hive
One of the tables to be joined should be small enough to fit into main memory:

SELECT /*+ MAPJOIN(s) */ b.*, s.*
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);

If both tables are bucketed on the join column, the bucketed variant of the
map-side join can be enabled:

SET hive.optimize.bucketmapjoin=true;
Note on Updates, Transactions and Indexes
Updating and Appending Data
• INSERT OVERWRITE replaces the contents of the target table (or partition,
resp.) with those of the source table (or subquery, see later slides).
• Just like in SQL, you can also create a new table directly from a subquery.
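Both forms can be sketched as follows (table names are illustrative):

```sql
-- Replace the target table's contents with the query result:
INSERT OVERWRITE TABLE target_tbl
SELECT * FROM source_tbl;

-- Create a new table directly from a subquery (CTAS):
CREATE TABLE target_copy AS
SELECT id, name FROM source_tbl WHERE id > 0;
```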
Multi-Table Insert
Inserting the contents of one source table into multiple target tables!

Input schema: records(year int, station int, temperature float, quality int)

FROM records
INSERT OVERWRITE TABLE stations_by_year
  SELECT year, COUNT(DISTINCT station) GROUP BY year;

hive> FROM records
      SELECT year, temperature
      DISTRIBUTE BY year
      SORT BY year ASC, temperature DESC;
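A full multi-table insert repeats the INSERT clause once per target after a single FROM, so the source is scanned only once; a sketch with a second, hypothetical target table:

```sql
FROM records
INSERT OVERWRITE TABLE stations_by_year
  SELECT year, COUNT(DISTINCT station) GROUP BY year
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1) GROUP BY year;
```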
Joins
hive> SELECT * FROM sales;        hive> SELECT * FROM things;
Joe    2                          2  Tie
Hank   4                          4  Coat
Ali    0                          3  Hat
Eve    3                          1  Scarf
Hank   2

hive> SELECT sales.*, things.*
      FROM sales JOIN things ON (sales.id = things.id);
Joe   2  2  Tie
Hank  2  2  Tie
Eve   3  3  Hat
Hank  4  4  Coat

Note: the SQL-style abbreviation for an inner join

  SELECT * FROM sales, things WHERE sales.id = things.id;

is not allowed!
Views in HiveQL
As in SQL, views are "virtual tables" which are not materialized. (Use CREATE
TABLE if you want materialization instead.)

Input schema: records(year int, station int, temperature float, quality int)

Such a series of views results in the same execution plan as the equivalent
nested query, using one MapReduce job for each GROUP-BY clause!
User-Defined Functions in Hive
UDFs in Hive are potentially more powerful than in Pig!
UDFs are implemented as Java classes.
2 types of UDFs:
• Basic UDFs: take a single row as input & produce a single row
• Aggregate UDFs (UDAFs): take multiple rows as input iteratively & produce a
  single row
Example: Basic UDF
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();

  // One-argument variant: strip whitespace from both ends.
  public Text evaluate(Text str) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // Two-argument variant: strip the given characters from both ends.
  public Text evaluate(Text str, String stripChars) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
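Once compiled and packaged, the class can be registered and called from HiveQL; the jar path and table name below are illustrative:

```sql
ADD JAR /path/to/hive-udfs.jar;
CREATE TEMPORARY FUNCTION strip AS 'Strip';
SELECT strip('  bee  ') FROM dummy_tbl;
```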
Example: Aggregate UDF (UDAF)
• Suppose we would like to (again) calculate the mean of a set of input
  numbers in a distributed fashion.
• Further assume that the input numbers are split across different files.
• User-defined aggregate functions (UDAFs) in Hive may run in an arbitrary,
  distributed fashion inside a MapReduce job.
Assignment III
Part-B: Hive and HiveQL
Problem-1: Processing YAGO dataset using HIVE
1. Load the YAGO dataset and find out the top three frequently
occurring predicates in the YAGO dataset using operators
available in HiveQL. (3 points)
Problem-2: Write a HiveQL query to find all the subjects (x) and
objects (y and z) matching the pattern: ?x <hasGivenName> ?y.
?x <livesIn> ?z., from the Yago dataset.
For case (i):
• Load the entire triples from the given yago.tsv files into a table
named yago having three columns: subject, predicate and
object.
• Create a new table, named yago_part_buck, with a partition
based on the predicate column and clustered based on the
subject column.
• Load data (statically) into the partitions for all the 29
predicates (listed in the next slide) in the dataset. This loading
could be done by inserting data into the partitioned table from
the yago table specifying the partition key – you may write all
the insert statements into a single HiveQL script for loading
data.
• Write a HiveQL query to find the required pattern from the
yago_part_buck table.
Note: You can set the following hive parameters to true (as given
below), to enable the bucketized merge-join.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;