
NoSQL Systems

Apache Hive & HiveQL


“Data Warehouse and Query Language for Hadoop”

Vinu Venugopal
ScaDS Lab, IIIT Bangalore
NoSQL Systems
Hive and HiveQL
• Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem
• developed at Facebook in 2007-2008
• used for querying and analyzing large datasets stored in Hadoop files (HDFS)
• provides a SQL-like declarative language, called HiveQL
• the Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop
• constantly being developed and available from the Apache Software Foundation
  (see http://hive.apache.org; as of 29 March 2024, release 4.0.0 is available)

2
Hive Reference

“Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale”

3
Hive Access Modes
• Hive shell
• Interactive mode
% beeline -u jdbc:hive2://
hive> SHOW TABLES;
OK
Time taken: 18.457 seconds

• Non-Interactive mode (running HiveQL script)


% beeline -u jdbc:hive2://
hive> !run myscript.hiveql
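A HiveQL script is just a plain text file of statements; a minimal sketch of what a hypothetical myscript.hiveql might contain is shown below (the same file can also be passed directly on the command line with beeline's -f option):

-- myscript.hiveql (hypothetical contents)
SHOW DATABASES;
USE default;
SELECT COUNT(*) FROM some_table;   -- some_table is a placeholder name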

• Other access modes:


• Apache Thrift clients
• JDBC/ODBC clients
4
Hive Architecture

• The Hive Server is a single process running in a Java Virtual Machine (JVM).

• It communicates with the Hadoop FileSystem (HDFS), the Hadoop JobClient, and its own MetaStore (an actual DBMS).
5
Hive Architecture
Hive Metastore Server (HMS)
• A central repository of metadata for Hive tables and partitions in
a relational database
• provides clients (including Hive, Impala and Spark) access to this
information using the metastore service API.

6
Hive Data Model
The basic data model of Hive, again, is nested relations (called “tables”).
Elements of a relation may be of a simple type or of a complex type.
• 10 simple (atomic) data types:
TINYINT (1-byte signed integer), SMALLINT (2-byte signed integer), INT
(4-byte signed integer), BIGINT (8-byte signed integer), FLOAT (4-byte
single-precision float), DOUBLE (8-byte double-precision float), BOOLEAN (true/false,
1-byte), STRING (char array), BINARY (byte array), TIMESTAMP (8-bytes)

• 3 complex types:
ARRAY, STRUCT, MAP which are created via the built-in functions array(),
struct(), map().
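A quick sketch of these constructors; the literal values are made up for illustration, and element access (via [ ] and .) is shown on the next slides:

hive> SELECT array(1, 2, 3),              -- ARRAY<INT>
             map('a', 1, 'b', 2),         -- MAP<STRING, INT>
             struct('sunny', 7, 3.14);    -- STRUCT with fields col1, col2, col3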

7
Operators and Built-In Functions
HiveQL supports the usual set of SQL operators:
SELECT… FROM… WHERE… GROUP BY… HAVING… ORDER BY… LIMIT;

Built-in functions:
• Mathematical Functions e.g., round(DOUBLE a)
• Collection Functions e.g., size(Map<K,V>)
• Type Conversion Functions e.g., binary(string|binary)
• Date Functions e.g., to_date(string timestamp)
• Conditional Functions e.g., isnull(a)
• String Functions e.g., reverse(string A)
• Data Masking Functions e.g., mask_first_n(string str, int n)

hive> SHOW FUNCTIONS;


hive> DESCRIBE FUNCTION length;
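A small, hedged sketch combining a few of these built-ins; the employees table and its columns are hypothetical:

hive> SELECT reverse(name),
             round(salary, 2),
             to_date(hired_ts)
      FROM employees
      WHERE NOT isnull(name);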
8
Example: Table with Complex Types
hive> CREATE TABLE complex_table (
col0 INT,
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>);
• After loading data:
hive> SELECT col0, col1[0], col2['b'], col3.c FROM
complex_table;

9
Example: Table with Complex Types
The above CREATE TABLE statement is actually short for:


CREATE TABLE …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE | [BINARY] RCFILE | [BINARY] SEQUENCEFILE;

12
Example: Table with Complex Types
Nested (i.e., complex) types are serialized/de-serialized by Hive using the built-in class:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

13
Tables
Managed Table vs. External Table

• Managed Tables
  • Internal tables - Hive controls the lifecycle of their data
  • Data is physically moved from its original HDFS location into Hive's warehouse
  • When loading data into a Hive table, it is kept under a default directory:
    hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), which can again be a
    distributed file system location.
  • We can override this location by using the LOCATION keyword.
  • As Hive has complete control over the data, dropping a managed table deletes the
    entire data.
  • But this is not so convenient when we want to share the data among multiple tools
    (e.g., Pig) while using a managed table.
  • If we don't want to give ownership of the data to Hive alone…

14
Tables
Managed Table vs. External Table

• Managed Tables
• Internal tables - Hive controls the lifecycle of their data
• Data is physically moved from its original HDFS location
• Stored at Hive's own internal location (another location in HDFS)
CREATE TABLE managed_tbl (field0 STRING)
LOCATION '/mytables/managed_tbl_table';

LOAD DATA LOCAL INPATH '/myfiles/data.txt' INTO TABLE managed_tbl;

• DROP TABLE managed_tbl;
  • physically deletes all data in the Hive table!
15
Tables
• External Tables
• Data is NOT physically moved from HDFS
• Merely keeps a reference to the file holding the data, together with
the schema

CREATE EXTERNAL TABLE external_tbl (dummy STRING)
LOCATION '/hdfs/data';
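A hedged illustration of the difference from managed tables: dropping the external table removes only Hive's metadata, and the files under the LOCATION path remain in HDFS.

hive> DROP TABLE external_tbl;
-- the files under /hdfs/data are left untouched in HDFS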

16
Schema Management
Issuing a CREATE TABLE statement in Hive (both for a managed and for an
external table) performs two basic steps:
1. The physical files (and subdirectory structure) for storing the
table are created.
2. The new table's schema (i.e., its attributes, attribute types,
and file mappings) is stored in Hive's so-called Metastore
(which is simply another DBMS service, usually MySQL).

17
Schema Management

Schema-On-Write vs. Schema-On-Read

• Schema-On-Write: data is verified when it is initially written into the database (traditional DBMSs, Hive's managed tables)

• Schema-On-Read: data is verified when a query is issued (Hive's external tables, Pig's relations, etc.)
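A small sketch of schema-on-read with an external table (path and schema are hypothetical): the CREATE statement does not read or check any data; the bytes are interpreted against the schema only when a query runs, and fields that cannot be parsed come back as NULL.

hive> CREATE EXTERNAL TABLE raw_events (id INT, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/events';       -- no data is verified at this point

hive> SELECT * FROM raw_events;      -- parsing (schema-on-read) happens here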

18
Partitions & Buckets
• Partitions
• even a simple query in Hive reads the entire dataset
• horizontal slices of data
tab1/clientdata/file1 (id, name, dept, yoj):
  1, sunny, SC, 2009
  2, sam, HR, 2009
  3, bob, SC, 2010
  4, claire, TP, 2010

Load these files as different partitions:

tab1/clientdata/2009/file2:
  1, sunny, SC, 2009
  2, sam, HR, 2009

tab1/clientdata/2010/file3:
  3, bob, SC, 2010
  4, claire, TP, 2010

CREATE TABLE studentTab (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);

LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' INTO TABLE studentTab PARTITION (year='2009');

LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' INTO TABLE studentTab PARTITION (year='2010');
19
Partitions & Buckets
• Types of Partitioning:
  • Static
    • Specify the partition key(s) and manually ("statically") move the data into each partition of the table
    • Static partitioning can be performed on Hive managed or external tables
  • Dynamic
    • Automatically load data into the partitions from a non-partitioned table or file
    • Takes more time in loading data compared to static partitioning
    • Dynamic partitioning can be performed on Hive managed and external tables
    • Set the property hive.exec.dynamic.partition.mode=nonstrict (see the sketch below)
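A minimal sketch of a dynamic-partition insert, reusing studentTab from the previous slide; the unpartitioned staging table student_stage(id, name, dept, yoj) is an assumption for illustration:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the partition column (year) must be the last expression in the SELECT
INSERT OVERWRITE TABLE studentTab PARTITION (year)
SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
FROM student_stage;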

20
Partitions & Buckets
• Buckets
• partitions can be further subdivided into more manageable parts known as Buckets or Clusters
• bucketing is based on a hash function of the bucketing column
• the CLUSTERED BY clause is used to divide the table into buckets

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

• partitions are stored in separate HDFS subdirectories, whereas buckets are kept in separate HDFS files (within each subdirectory).
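Because each row is assigned to a bucket by hashing the bucketing column, a query can be restricted to a single bucket; a sketch using the bucketed_users table above:

hive> SELECT * FROM bucketed_users
      TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);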

21
Map-Side Join in Hive
One of the tables to be joined should be small enough to fit into main
memory
SELECT /*+ MAPJOIN(s)*/ b.*, s.*
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);

/*+ MAPJOIN(s) */ is a hint (similar to an optimizer hint, e.g., in Oracle) for loading the data of small_tbl into main memory, and for performing a Map-side join instead of the default Reduce-side join.

22
Map-Side Join in Hive

This can be further optimized by bucketizing both the input tables:

CREATE TABLE big_tbl (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

CREATE TABLE small_tbl (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

SET hive.optimize.bucketmapjoin=true;
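With both tables bucketed and sorted on the join key, the join statement itself is unchanged; the parameters below (they reappear in the Assignment III note) allow Hive to execute it as a bucketized sort-merge map join. A sketch:

SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT /*+ MAPJOIN(s) */ b.*, s.*
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);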

23
Note on Updates, Transactions and Indexes

…only have limited support in Hive


• Updates are supported via INSERT statements (thus via appending
tuples to a table) only.
• Transactions are implemented via table- and partition-level locking.
• A simple form of an index is available via table partitions and
buckets; more support for indexes is "in progress".

24
Updating and Appending Data
• INSERT OVERWRITE replaces the contents of the target table (or partition,
resp.) with those of the source table (or subquery, see later slides).

• If OVERWRITE is omitted, records are appended.

INSERT OVERWRITE TABLE target
SELECT col1, col2 FROM source;
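For the append case mentioned above, a minimal sketch:

INSERT INTO TABLE target
SELECT col1, col2 FROM source;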

• For partitioned tables, the target partition can be specified manually.

INSERT OVERWRITE TABLE target
PARTITION (date='2013-04-25')
SELECT col1, col2 FROM source;

• Just like in SQL, you can also create a new table directly from a subquery.

CREATE TABLE target AS SELECT col1, col2 FROM source;

25
Multi-Table Insert
Inserting the contents of one source table into multiple target
tables!

FROM records
INSERT OVERWRITE TABLE stations_by_year
  SELECT year, COUNT(DISTINCT station) GROUP BY year
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1) GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
  SELECT year, COUNT(1)
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4)
  GROUP BY year;

(→ Compare this feature to multi-query execution in Pig!)


26
Sorting
• Using standard “ORDER BY”
• Results would be totally sorted
• Sets the no. of reducers to one – inefficient for large datasets

• Using nonstandard “SORT BY”


• Results would not be globally sorted
• Does not set the no. of reducers to one
• SORT BY produces a sorted file per reducer.
• Shard data using “DISTRIBUTE BY”

Input schema: records(year int, station int, temperature float, quality int)
hive> FROM records
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;

27
Joins
hive> SELECT * FROM sales;
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2

hive> SELECT * FROM things;
2       Tie
4       Coat
3       Hat
1       Scarf

hive> SELECT sales.*, things.*
      FROM sales JOIN things ON (sales.id = things.id);
Joe     2       2       Tie
Hank    2       2       Tie
Eve     3       3       Hat
Hank    4       4       Coat

Note: the SQL-style abbreviation for an inner join,
SELECT * FROM sales, things WHERE sales.id = things.id;
is not allowed!

• By default translated into a Reduce-side join (unless specified otherwise) in MapReduce!
• The EXPLAIN command shows details about the join execution in MR.
• Hive supports only equi-joins.
EXPLAIN SELECT sales.*, things.* FROM sales JOIN things ON
(sales.id = things.id);
28
Outer and Semi-Joins
hive> SELECT * FROM sales;
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2

hive> SELECT * FROM things;
2       Tie
4       Coat
3       Hat
1       Scarf

hive> SELECT sales.*, things.*
      FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali     0       NULL    NULL
Joe     2       2       Tie
Hank    2       2       Tie
Eve     3       3       Hat
Hank    4       4       Coat

hive> SELECT * FROM things
      LEFT SEMI JOIN sales ON (sales.id = things.id);
2       Tie
3       Hat
4       Coat
• LEFT/RIGHT/FULL outer and semi-joins again follow the usual SQL semantics and are automatically translated into a Reduce-side join.

• Hive supports only outer and semi-joins with an equality condition.

29
Subqueries in HiveQL
• Hive has limited support for subqueries
• only permitting a subquery in the FROM clause of a SELECT statement.

SELECT station, year, AVG(max_temperature)


FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records
WHERE temperature != 9999 AND (quality = 0 OR quality = 1)
GROUP BY station, year
DISTRIBUTE BY station, year
) max_temp
GROUP BY station, year
DISTRIBUTE BY station;

The above query is translated into two MapReduce jobs, one for each
GROUP-BY clause.

32
Views in HiveQL
As in SQL, views are "virtual tables" which are not materialized. (Use CREATE
TABLE if you want materialization instead.)

Input schema: records(year int, station int, temperature float, quality int)

CREATE VIEW valid_records AS
SELECT * FROM records
WHERE temperature != 9999 AND (quality = 0 OR quality = 1);

CREATE VIEW max_temperatures (station, year, max_temperature) AS
SELECT station, year, MAX(temperature)
FROM valid_records GROUP BY station, year
DISTRIBUTE BY station, year;

SELECT station, year, AVG(max_temperature)
FROM max_temperatures GROUP BY station, year
DISTRIBUTE BY station;

The above series of views results in the same execution plan, using one MR job for each GROUP-BY clause!
33
User-Defined Functions in Hive
UDFs in Hive are potentially more powerful than in Pig!
UDFs are implemented as Java classes.
2 types of UDFs:
• Basic UDFs: obtain a single row as input & produce a single row
• Aggregate UDFs: obtain multiple rows as input iteratively & produce a single row

34
Example: Basic UDF
import org.apache.commons.lang.StringUtils;   // Apache Commons Lang
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();
  public Text evaluate(Text str) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
  public Text evaluate(Text str, String stripChars) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

Strip(' 00a11b00 ')      => '00a11b00'
Strip('00a11b00', '01')  => 'a11b' (removes the chars given in the 2nd argument from the prefix/suffix of the 1st argument)

35
Example: Basic UDF

ADD JAR ./stripexample.jar;
CREATE TEMPORARY FUNCTION strip AS 'Strip';
SELECT strip('00a11b00', '01');

36
Example: Aggregate UDF (UDAF)
• Suppose we would like to (again) calculate the mean of a set of input numbers in a distributed fashion.
• Further assume that the input numbers are split across different files.
• User-defined aggregate functions (UDAFs) in Hive may run in an arbitrary, distributed fashion inside a MapReduce job.

37
Example: Aggregate UDF (UDAF)

public class Mean extends UDAF {

  public static class PartialResult {
    double sum;
    long count;
  }

  public void init() {…}

  public boolean iterate() {…}

  public PartialResult terminatePartial() {…}

  public boolean merge(PartialResult other) {…}

  public DoubleWritable terminate() {…}
}

The init() method is called by Hive to initialize an instance of the UDAF evaluator class for each file.

iterate() and terminatePartial() are used on the MAP side.

terminate() and merge() are used on the REDUCE side.

38
Example: Aggregate UDF (UDAF)


The PartialResult class defines the structure of the partial results.

In this example, since the intention is to find the mean of all the values, the map side returns the running sum and the count of numbers.
39
Example: Aggregate UDF (UDAF)


iterate() is executed over each number in the file; the partial results are aggregated, and terminatePartial() is called once the input split is finished.

The partial results are then passed to merge(). Once all partial results have been merged, terminate() is called to produce the final result.

40
Example: Aggregate UDF (UDAF)
See Mean.java
on Moodle!

Now calculate the mean of a set of temperature values stored in the records table in a distributed way:
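Registering the UDAF works just like for the basic UDF; a sketch in which the jar name is hypothetical:

hive> ADD JAR ./meanexample.jar;
hive> CREATE TEMPORARY FUNCTION mean AS 'Mean';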

hive> SELECT mean(temperature) FROM records;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1

23.4

41
Assignment III
Part-B: Hive and HiveQL
Problem-1: Processing YAGO dataset using HIVE
1. Load the YAGO dataset and find out the top three frequently occurring predicates in the YAGO dataset using operators available in HiveQL. (3 points)

2. Identify all the given names (i.e., object values of the hasGivenName predicate) of persons who are associated with more than one livesIn predicate in the YAGO dataset, using the relational operators (join, grouping, etc.) in HiveQL. (4 points)

Please find Problem-2 on the next slide.

42
Assignment III
Part-B: Hive and HiveQL
Problem-2: Write a HiveQL query to find all the subjects (x) and
objects (y and z) matching the pattern: ?x <hasGivenName> ?y.
?x <livesIn> ?z., from the Yago dataset.

Implement this problem:
(i) by considering partitioning and bucketing;
(ii) by considering partitioning but not bucketing;
(iii) by considering neither partitioning nor bucketing.

• For the first case alone, perform a Bucketized Merge-Join by enabling the necessary parameters (see the note below).
• Compare the run time of the three cases by performing your experiments on your local system.

43
Assignment III
Part-B: Hive and HiveQL
For case (i):
• Load the entire set of triples from the given yago.tsv files into a table named yago having three columns: subject, predicate and object.
• Create a new table, named yago_part_buck, with a partition based on the predicate column and clustered based on the subject column (see the sketch after this list).
• Load data (statically) into the partitions for all the 29 predicates (listed in the next slide) in the dataset. This loading could be done by inserting data into the partitioned table from the yago table, specifying the partition key; you may write all the insert statements into a single HiveQL script for loading the data.
• Write a HiveQL query to find the required pattern from the yago_part_buck table.
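A hedged sketch of what the DDL and one static partition load might look like; the column types, the number of buckets, and the bucketing-enforcement setting are assumptions, not part of the assignment statement:

SET hive.enforce.bucketing=true;   -- may be needed on older Hive versions

CREATE TABLE yago_part_buck (subject STRING, object STRING)
PARTITIONED BY (predicate STRING)
CLUSTERED BY (subject) INTO 4 BUCKETS;

-- one static load per predicate, e.g. for <livesIn>:
INSERT OVERWRITE TABLE yago_part_buck PARTITION (predicate='<livesIn>')
SELECT subject, object FROM yago WHERE predicate = '<livesIn>';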
44
Assignment III
Part-B: Hive and HiveQL
Note: You can set the following hive parameters to true (as given
below), to enable the bucketized merge-join.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

29 predicates: <actedIn>, <hasAcademicAdvisor>, <hasChild>, <hasFamilyName>, <hasWebsite>, <hasWonPrize>, <isInterestedIn>, <isKnownFor>, <directed>, <edited>, <graduatedFrom>, <hasGender>, <hasMusicalRole>, <isCitizenOf>, <isMarriedTo>, <isPoliticianOf>, <playsFor>, <worksAt>, <wroteMusicFor>, <created>, <diedIn>, <hasGivenName>, <influences>, <isAffiliatedTo>, <isLeaderOf>, <livesIn>, <owns>, <participatedIn>, <wasBornIn>

45
