
NoSQL Systems

Apache Hive & HiveQL


“Data Warehouse and Query Language for Hadoop”

Vinu Venugopal
ScaDS Lab, IIIT Bangalore
NoSQL Systems
Hive and HiveQL
• Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem
• developed at Facebook in 2007-2008
• used for querying and analyzing large datasets stored in Hadoop files (HDFS)
• provides a SQL-like declarative language, called HiveQL
• the Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop
• constantly being developed and available from the Apache Software Foundation
  (see http://hive.apache.org; as of 29 March 2024, release 4.0.0 is available)

2
Hive Reference

“Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale”

3
Hive Access Modes
• Hive shell
• Interactive mode
% beeline -u jdbc:hive2://
hive> SHOW TABLES;
OK
Time taken: 18.457 seconds

• Non-Interactive mode (running HiveQL script)


% beeline -u jdbc:hive2://
hive> !run myscript.hiveql
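A HiveQL script is just a plain text file of statements; a minimal sketch of what a hypothetical myscript.hiveql might contain is shown below (the same file can also be passed directly on the command line with beeline's -f option):

-- myscript.hiveql (hypothetical contents)
SHOW DATABASES;
USE default;
SELECT COUNT(*) FROM some_table;   -- some_table is a placeholder name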

• Other access modes:


• Apache Thrift clients
• JDBC/ODBC clients
4
Hive Architecture

• The Hive Server is a single process running in a Java Virtual Machine (JVM).

• It communicates with the Hadoop FileSystem (HDFS), the Hadoop JobClient, and its own MetaStore (an actual DBMS).
5
Hive Architecture
Hive Metastore Server (HMS)
• A central repository of metadata for Hive tables and partitions in
a relational database
• provides clients (including Hive, Impala and Spark) access to this
information using the metastore service API.

6
Hive Data Model
The basic data model of Hive, again, is nested relations (called “tables”).
Elements of a relation may be of a simple type or of a complex type.
• 10 simple (atomic) data types:
TINYINT (1-byte signed integer), SMALLINT (2-byte signed integer), INT
(4-byte signed integer), BIGINT (8-byte signed integer), FLOAT (4-byte
single-precision float), DOUBLE (8-byte double-precision float), BOOLEAN (true/false,
1-byte), STRING (char array), BINARY (byte array), TIMESTAMP (8-bytes)

• 3 complex types:
ARRAY, STRUCT, MAP which are created via the built-in functions array(),
struct(), map().
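A quick sketch of these constructors; the literal values are made up for illustration, and element access (via [ ] and .) is shown on the next slides:

hive> SELECT array(1, 2, 3),              -- ARRAY<INT>
             map('a', 1, 'b', 2),         -- MAP<STRING, INT>
             struct('sunny', 7, 3.14);    -- STRUCT with fields col1, col2, col3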

7
Operators and Built-In Functions
HiveQL supports the usual set of SQL operators:
SELECT… FROM… WHERE… GROUP BY… HAVING… ORDER BY… LIMIT;

Built-in functions:
• Mathematical Functions e.g., round(DOUBLE a)
• Collection Functions e.g., size(Map<K,V>)
• Type Conversion Functions e.g., binary(string|binary)
• Date Functions e.g., to_date(string timestamp)
• Conditional Functions e.g., isnull(a)
• String Functions e.g., reverse(string A)
• Data Masking Functions e.g., mask_first_n(string str, int n)

hive> SHOW FUNCTIONS;


hive> DESCRIBE FUNCTION length;
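A small, hedged sketch combining a few of these built-ins; the employees table and its columns are hypothetical:

hive> SELECT reverse(name),
             round(salary, 2),
             to_date(hired_ts)
      FROM employees
      WHERE NOT isnull(name);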
8
Example: Table with Complex Types
hive> CREATE TABLE complex_table (
col0 INT,
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>);
• After loading data:
hive> SELECT col0, col1[0], col2['b'], col3.c FROM
complex_table;

9
Example: Table with Complex Types
The above CREATE TABLE statement is actually short for:


CREATE TABLE …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE | [BINARY] RCFILE | [BINARY] SEQUENCEFILE;

12
Example: Table with Complex Types
Nested (i.e., complex) types are serialized/de-serialized by Hive using the built-in class:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

13
Tables
Managed Table vs. External Table

• Managed Tables
  • Internal tables - Hive controls the lifecycle of their data
  • Data is physically moved from its original HDFS location into Hive's warehouse
  • When loading data into a Hive table, it is kept under a default directory:
    hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), which can again be a
    distributed file system location.
  • We can override this location by using the LOCATION keyword.
  • As Hive has complete control over the data, dropping a managed table deletes the
    entire data.
  • But this is not so convenient when we want to share the data among multiple tools
    (e.g., Pig) while using a managed table.
  • If we don't want to give ownership of the data to Hive alone…

14
Tables
Managed Table vs. External Table

• Managed Tables
• Internal tables - Hive controls the lifecycle of their data
• Data is physically moved from its original HDFS location
• Stored at Hive's own internal location (another location in HDFS)
CREATE TABLE managed_tbl (field0 STRING)
LOCATION '/mytables/managed_tbl_table';

LOAD DATA LOCAL INPATH '/myfiles/data.txt' INTO TABLE managed_tbl;

• DROP TABLE managed_tbl;
  • physically deletes all data in the Hive table!
15
Tables
• External Tables
• Data is NOT physically moved from HDFS
• Merely keeps a reference to the file holding the data, together with
the schema

CREATE EXTERNAL TABLE external_tbl (dummy STRING)
LOCATION '/hdfs/data';
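A hedged illustration of the difference from managed tables: dropping the external table removes only Hive's metadata, and the files under the LOCATION path remain in HDFS.

hive> DROP TABLE external_tbl;
-- the files under /hdfs/data are left untouched in HDFS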

16
Schema Management
Issuing a CREATE TABLE statement in Hive (both for a managed and for an
external table) performs two basic steps:
1. The physical files (and subdirectory structure) for storing the
table are created.
2. The new table's schema (i.e., its attributes, attribute types,
and file mappings) is stored in Hive's so-called Metastore
(which is simply another DBMS service, usually MySQL).

17
Schema Management

Schema-On-Write vs. Schema-On-Read

• Schema-On-Write: data is verified when it is initially written into the database (traditional DBMSs, Hive's managed tables)

• Schema-On-Read: data is verified when a query is issued (Hive's external tables, Pig's relations, etc.)
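A small sketch of schema-on-read with an external table (path and schema are hypothetical): the CREATE statement does not read or check any data; the bytes are interpreted against the schema only when a query runs, and fields that cannot be parsed come back as NULL.

hive> CREATE EXTERNAL TABLE raw_events (id INT, payload STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/events';       -- no data is verified at this point

hive> SELECT * FROM raw_events;      -- parsing (schema-on-read) happens here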

18
Partitions & Buckets
• Partitions
• even a simple query in Hive reads the entire dataset
• horizontal slices of data
tab1/clientdata/file1 (id, name, dept, yoj):
  1, sunny, SC, 2009
  2, sam, HR, 2009
  3, bob, SC, 2010
  4, claire, TP, 2010

Load these files as different partitions:

tab1/clientdata/2009/file2:
  1, sunny, SC, 2009
  2, sam, HR, 2009

tab1/clientdata/2010/file3:
  3, bob, SC, 2010
  4, claire, TP, 2010

CREATE TABLE studentTab (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);

LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' INTO TABLE studentTab PARTITION (year='2009');

LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' INTO TABLE studentTab PARTITION (year='2010');
19
Partitions & Buckets
• Types of Partitioning:
  • Static
    • Specify the partition key(s) and manually ("statically") move the data into each partition of the table
    • Static partitioning can be performed on Hive managed or external tables
  • Dynamic
    • Automatically load data into the partitions from a non-partitioned table or file
    • Takes more time in loading data compared to static partitioning
    • Dynamic partitioning can be performed on Hive managed and external tables
    • Set the property hive.exec.dynamic.partition.mode=nonstrict (see the sketch below)
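A minimal sketch of a dynamic-partition insert, reusing studentTab from the previous slide; the unpartitioned staging table student_stage(id, name, dept, yoj) is an assumption for illustration:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the partition column (year) must be the last expression in the SELECT
INSERT OVERWRITE TABLE studentTab PARTITION (year)
SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
FROM student_stage;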

20
Partitions & Buckets
• Buckets
• partitions can be further subdivided into more manageable parts known as Buckets or Clusters
• bucketing is based on a hash function of the bucketing column
• the CLUSTERED BY clause is used to divide the table into buckets

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

• partitions are stored in separate HDFS subdirectories, whereas buckets are kept in separate HDFS files (within each subdirectory).
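Because each row is assigned to a bucket by hashing the bucketing column, a query can be restricted to a single bucket; a sketch using the bucketed_users table above:

hive> SELECT * FROM bucketed_users
      TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);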

21
Map-Side Join in Hive
One of the tables to be joined should be small enough to fit into main
memory
SELECT /*+ MAPJOIN(s)*/ b.*, s.*
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);

/*+ MAPJOIN(s) */ is a hint (similar to an optimizer hint, e.g., in Oracle) for loading the data of small_tbl into main memory, and for performing a Map-side join instead of the default Reduce-side join.

22
Map-Side Join in Hive

This can be further optimized by bucketizing both the input tables:

CREATE TABLE big_tbl (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

CREATE TABLE small_tbl (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

SET hive.optimize.bucketmapjoin=true;
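With both tables bucketed and sorted on the join key, the join statement itself is unchanged; the parameters below (they reappear in the Assignment III note) allow Hive to execute it as a bucketized sort-merge map join. A sketch:

SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT /*+ MAPJOIN(s) */ b.*, s.*
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);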

23
Note on Updates, Transactions and Indexes

…only have limited support in Hive


• Updates are supported via INSERT statements (thus via appending
tuples to a table) only.
• Transactions are implemented via table- and partition-level locking.
• A simple form of an index is available via table partitions and
buckets; more support for indexes is "in progress".

24
Updating and Appending Data
• INSERT OVERWRITE replaces the contents of the target table (or partition,
resp.) with those of the source table (or subquery, see later slides).

• If OVERWRITE is omitted, records are appended.

INSERT OVERWRITE TABLE target
SELECT col1, col2 FROM source;
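For the append case mentioned above, a minimal sketch:

INSERT INTO TABLE target
SELECT col1, col2 FROM source;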

• For partitioned tables, the target partition can be specified manually.

INSERT OVERWRITE TABLE target
PARTITION (date='2013-04-25')
SELECT col1, col2 FROM source;

• Just like in SQL, you can also create a new table directly from a subquery.

CREATE TABLE target AS SELECT col1, col2 FROM source;

25
Multi-Table Insert
Inserting the contents of one source table into multiple target
tables!

FROM records
INSERT OVERWRITE TABLE stations_by_year
  SELECT year, COUNT(DISTINCT station) GROUP BY year
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1) GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
  SELECT year, COUNT(1)
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4)
  GROUP BY year;

(→ Compare this feature to multi-query execution in Pig!)


26
Sorting
• Using standard “ORDER BY”
• Results would be totally sorted
• Sets the no. of reducers to one – inefficient for large datasets

• Using nonstandard “SORT BY”


• Results would not be globally sorted
• Does not set the no. of reducers to one
• SORT BY produces a sorted file per reducer.
• Shard data using “DISTRIBUTE BY”

Input schema: records(year int, station int, temperature float, quality int)
hive> FROM records
SELECT year, temperature
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;

27
Joins
hive> SELECT * FROM sales;
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2

hive> SELECT * FROM things;
2       Tie
4       Coat
3       Hat
1       Scarf

hive> SELECT sales.*, things.*
      FROM sales JOIN things ON (sales.id = things.id);
Joe     2       2       Tie
Hank    2       2       Tie
Eve     3       3       Hat
Hank    4       4       Coat

Note: the SQL-style abbreviation for an inner join,
SELECT * FROM sales, things WHERE sales.id = things.id;
is not allowed!

• By default translated into a Reduce-side join (unless specified otherwise) in MapReduce!
• The EXPLAIN command shows details about the join execution in MR.
• Hive supports only equi-joins.
EXPLAIN SELECT sales.*, things.* FROM sales JOIN things ON
(sales.id = things.id);
28
Outer and Semi-Joins
hive> SELECT * FROM sales;
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2

hive> SELECT * FROM things;
2       Tie
4       Coat
3       Hat
1       Scarf

hive> SELECT sales.*, things.*
      FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali     0       NULL    NULL
Joe     2       2       Tie
Hank    2       2       Tie
Eve     3       3       Hat
Hank    4       4       Coat

hive> SELECT * FROM things
      LEFT SEMI JOIN sales ON (sales.id = things.id);
2       Tie
3       Hat
4       Coat
• LEFT/RIGHT/FULL outer and semi-joins again follow the usual SQL semantics and are automatically translated into a Reduce-side join.

• Hive supports only outer and semi-joins with an equality condition.

29
Subqueries in HiveQL
• Hive has limited support for subqueries
• only permitting a subquery in the FROM clause of a SELECT statement.

SELECT station, year, AVG(max_temperature)


FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records
WHERE temperature != 9999 AND (quality = 0 OR quality = 1)
GROUP BY station, year
DISTRIBUTE BY station, year
) max_temp
GROUP BY station, year
DISTRIBUTE BY station;

The above query is translated into two MapReduce jobs, one for each
GROUP-BY clause.

32
Views in HiveQL
As in SQL, views are "virtual tables" which are not materialized. (Use CREATE
TABLE if you want materialization instead.)

Input schema: records(year int, station int, temperature float, quality int)

CREATE VIEW valid_records AS
SELECT * FROM records
WHERE temperature != 9999 AND (quality = 0 OR quality = 1);

CREATE VIEW max_temperatures (station, year, max_temperature) AS
SELECT station, year, MAX(temperature)
FROM valid_records GROUP BY station, year
DISTRIBUTE BY station, year;

SELECT station, year, AVG(max_temperature)
FROM max_temperatures GROUP BY station, year
DISTRIBUTE BY station;

The above series of views results in the same execution plan, using one MR job for each GROUP-BY clause!
33
User-Defined Functions in Hive
UDFs in Hive are potentially more powerful than in Pig!
UDFs are implemented as Java classes.
2 types of UDFs:
• Basic UDFs: obtain a single row as input & produce a single row
• Aggregate UDFs: obtain multiple rows as input iteratively & produce a single row

34
Example: Basic UDF
import org.apache.commons.lang.StringUtils;   // Apache Commons Lang
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();
  public Text evaluate(Text str) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
  public Text evaluate(Text str, String stripChars) {
    if (str == null) return null;
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

Strip(' 00a11b00 ')      => '00a11b00'
Strip('00a11b00', '01')  => 'a11b' (removes the chars given in the 2nd argument from the prefix/suffix of the 1st argument)

35
Example: Basic UDF

ADD JAR ./stripexample.jar;
CREATE TEMPORARY FUNCTION strip AS 'Strip';
SELECT strip('00a11b00', '01');

36
Example: Aggregate UDF (UDAF)
• Suppose we would like to (again) calculate the mean of a set of input numbers in a distributed fashion.
• Further assume that the input numbers are split across different files.
• User-defined aggregate functions (UDAFs) in Hive may run in an arbitrary, distributed fashion inside a MapReduce job.

37
Example: Aggregate UDF (UDAF)

public class Mean extends UDAF {

  public static class PartialResult {
    double sum;
    long count;
  }

  public void init() {…}

  public boolean iterate() {…}

  public PartialResult terminatePartial() {…}

  public boolean merge(PartialResult other) {…}

  public DoubleWritable terminate() {…}
}

The init() method is called by Hive to initialize an instance of the UDAF evaluator class for each file.

iterate() and terminatePartial() are used on the MAP side.

terminate() and merge() are used on the REDUCE side.

38
Example: Aggregate UDF (UDAF)


The PartialResult class defines the structure of the partial results.

In this example, since the intention is to find the mean of all the values, the map side returns the running sum and the count of numbers.
39
Example: Aggregate UDF (UDAF)


iterate() is executed over each number in the file; the partial results are aggregated, and terminatePartial() is called once the input split is finished.

The partial results are then passed to merge(). Once all partial results have been merged, terminate() is called to produce the final result.

40
Example: Aggregate UDF (UDAF)
See Mean.java
on Moodle!

Now calculate the mean of a set of temperature values stored in the records table in a distributed way:
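Registering the UDAF works just like for the basic UDF; a sketch in which the jar name is hypothetical:

hive> ADD JAR ./meanexample.jar;
hive> CREATE TEMPORARY FUNCTION mean AS 'Mean';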

hive> SELECT mean(temperature) FROM records;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1

23.4

41
Assignment III
Part-B: Hive and HiveQL
Problem-1: Processing YAGO dataset using HIVE
1. Load the YAGO dataset and find out the top three frequently occurring predicates in the YAGO dataset using operators available in HiveQL. (3 points)

2. Identify all the given names (i.e., object values of the hasGivenName predicate) of persons who are associated with more than one livesIn predicate in the YAGO dataset, using the relational operators (join, grouping, etc.) in HiveQL. (4 points)

Please find Problem-2 on the next slide.

42
Assignment III
Part-B: Hive and HiveQL
Problem-2: Write a HiveQL query to find all the subjects (x) and
objects (y and z) matching the pattern: ?x <hasGivenName> ?y.
?x <livesIn> ?z., from the Yago dataset.

Implement this problem:
(i) by considering partitioning and bucketing;
(ii) by considering partitioning but not bucketing;
(iii) by considering neither partitioning nor bucketing.

• For the first case alone, perform a Bucketized Merge-Join by enabling the necessary parameters (see the note below).
• Compare the run time of the three cases by performing your experiments on your local system.

43
Assignment III
Part-B: Hive and HiveQL
For case (i):
• Load the entire set of triples from the given yago.tsv files into a table named yago having three columns: subject, predicate and object.
• Create a new table, named yago_part_buck, with a partition based on the predicate column and clustered based on the subject column (see the sketch after this list).
• Load data (statically) into the partitions for all the 29 predicates (listed in the next slide) in the dataset. This loading could be done by inserting data into the partitioned table from the yago table, specifying the partition key; you may write all the insert statements into a single HiveQL script for loading the data.
• Write a HiveQL query to find the required pattern from the yago_part_buck table.
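A hedged sketch of what the DDL and one static partition load might look like; the column types, the number of buckets, and the bucketing-enforcement setting are assumptions, not part of the assignment statement:

SET hive.enforce.bucketing=true;   -- may be needed on older Hive versions

CREATE TABLE yago_part_buck (subject STRING, object STRING)
PARTITIONED BY (predicate STRING)
CLUSTERED BY (subject) INTO 4 BUCKETS;

-- one static load per predicate, e.g. for <livesIn>:
INSERT OVERWRITE TABLE yago_part_buck PARTITION (predicate='<livesIn>')
SELECT subject, object FROM yago WHERE predicate = '<livesIn>';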
44
Assignment III
Part-B: Hive and HiveQL
Note: You can set the following hive parameters to true (as given
below), to enable the bucketized merge-join.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

29 predicates: <actedIn>, <hasAcademicAdvisor>, <hasChild>, <hasFamilyName>, <hasWebsite>, <hasWonPrize>, <isInterestedIn>, <isKnownFor>, <directed>, <edited>, <graduatedFrom>, <hasGender>, <hasMusicalRole>, <isCitizenOf>, <isMarriedTo>, <isPoliticianOf>, <playsFor>, <worksAt>, <wroteMusicFor>, <created>, <diedIn>, <hasGivenName>, <influences>, <isAffiliatedTo>, <isLeaderOf>, <livesIn>, <owns>, <participatedIn>, <wasBornIn>

45
