Unit 5 PIG&HIVE
• Relation
– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs. The key needs to be of type chararray
and should be unique. The value can be of any type. A map is written with square brackets ‘[]’ – Example:
[name#Raja, age#30]
Pig Data Model
Applications on Big Data Using Hive
Introduction to Hive
What is Hive?
• Hive is a data warehouse infrastructure tool to process structured
data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Initially Hive was developed by Facebook; later the Apache
Software Foundation took it up and developed it further as an
open-source project under the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level
updates
Features of Hive:
• It stores the schema in a database and the processed
data in HDFS.
• It is designed for OLAP (Online Analytical Processing).
• It provides an SQL-like language for querying
called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Hive Architecture:
• User Interface
❖ Hive is data warehouse infrastructure software that enables
interaction between the user and HDFS. The user interfaces that Hive supports
are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows
Server).
• Meta Store
❖ Hive uses a database server to store the schema or
metadata of tables, databases, columns in a table, their data types, and the
HDFS mapping.
• HiveQL Process Engine
❖ HiveQL is similar to SQL and is used for querying the schema information in the Metastore.
❖ It is one of the replacements for the traditional approach of writing MapReduce
programs.
❖ Instead of writing a MapReduce program in Java, we can write a query for the
MapReduce job and process it.
• Execution Engine
❖ The Hive Execution Engine is the component that connects the HiveQL Process
Engine with MapReduce.
❖ The execution engine processes the query and generates the same results
as MapReduce.
❖ It uses the MapReduce style of processing.
• HDFS or HBASE
❖ The Hadoop Distributed File System (HDFS) or HBase is the data storage
technique used to store the data in the file system.
Working of Hive:
Execution of Hive:
• Execute Query
- The Hive interface, such as the Command Line or Web UI, sends the query to the
Driver (using a database interface such as JDBC or ODBC) to execute.
• Get Plan
- The driver takes the help of the query compiler, which parses the query to
check the syntax and build the query plan, i.e. the requirements of the query.
• Get Metadata
- The compiler sends metadata request to Metastore (any database).
• Send Metadata
- Metastore sends metadata as a response to the compiler.
• Send Plan
- The compiler checks the requirement and resends the plan to
the driver. Up to here, the parsing and compiling of a query is
complete.
• Execute Plan
- The driver sends the execution plan to the execution engine.
• Execute Job
- Internally, the execution of the job is a MapReduce job.
- The execution engine sends the job to the JobTracker, which resides on the
NameNode, and
- the JobTracker assigns this job to TaskTrackers, which reside on the DataNodes.
- Here, the query runs as a MapReduce job.
• Metadata Ops
- Meanwhile, during execution, the execution engine can
perform metadata operations with the Metastore.
• Fetch Result
- The execution engine receives the results from
Data nodes.
• Send Results
- The execution engine sends those resultant values
to the driver.
• Send Results
- The driver sends the results to Hive Interfaces.
Applications on Big Data using Hive
🠶 When to use Hive
• Most suitable for data warehouse applications where relatively static data is
analyzed.
• Fast response time is not required.
• Data is not changing rapidly.
• It provides an abstraction over the underlying MapReduce programs.
• Hive of course is a good choice for queries that lend themselves to being
expressed in SQL, particularly long-running queries, where fault tolerance is
desirable.
• Hive can be a good choice if you’d like to write feature-rich, fault-tolerant,
batch (i.e., not near-real-time) transformation or ETL jobs in a pluggable
SQL engine.
Applications Supported by Hive are:-
⮚ Log Processing
⮚ Text Mining
⮚ Document Indexing
⮚ Customer-facing business intelligence (e.g., Google Analytics)
⮚ Predictive Modeling
⮚ Hypothesis Testing
Hive Services
The Hive shell is only one of several services that you can run using the hive
command.
You can specify the service to run using the --service option.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range of clients
written in different languages.
Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to
communicate with Hive.
Set the HIVE_PORT environment variable to specify the port the server will listen on
(defaults to 10000).
hwi
The Hive Web Interface is an alternative to using the Hive command line
interface.
Using the web interface is a great way to get started with Hive.
As an alternative to the shell, you might want to try Hive’s simple web interface. Start
it using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes
on the classpath.
metastore
• By default, the metastore is run in the same process as the Hive service.
• Using this service, it is possible to run the metastore as a standalone (remote) process.
• Set the METASTORE_PORT environment variable to specify the port the server will
listen on.
FileSystem:
The /user/hive and /user/hive/warehouse directories need to be created if they do not already exist.
Make sure this location (or any path you specify as hive.metastore.warehouse.dir in your hive-site.xml) exists and
is writable by the users whom you expect to be creating tables.
JobClient:
Package: org.apache.hadoop.mapred
Class hierarchy: java.lang.Object → org.apache.hadoop.mapred.JobClient
@InterfaceStability.Stable
public class JobClient extends CLI implements AutoCloseable
JobClient is the primary interface for the user-job to interact with the cluster.
JobClient provides facilities to submit jobs, track their progress, access component-tasks'
reports/logs, get the Map-Reduce cluster status information etc.
Job submission: Normally the user creates the application, describes various facets of the
job via JobConf, and then uses the JobClient to submit the job and
monitor its progress.
Here is an example of how to use JobClient (using the classic org.apache.hadoop.mapred
API: JobConf, JobClient, FileInputFormat, FileOutputFormat, Path, and Configuration):
// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);
Job Control
At times clients would chain map-reduce jobs to accomplish complex tasks which cannot be done via a single map-reduce job.
This is fairly easy since the output of the job, typically, goes to distributed file-system and that can be used as the input for the next job.
However, this also means that the onus on ensuring jobs are complete (success/failure) lies squarely on the clients.
In such situations the various job-control options are:
1. runJob(JobConf) : submits the job and returns only after the job has completed.
2. submitJob(JobConf) : only submits the job; the client then polls the returned RunningJob handle to query status and make scheduling
decisions (see the sketch after this list).
3. JobConf.setJobEndNotificationURI(String) : set up a notification on job completion, thus avoiding polling.
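Below is a minimal sketch of option 2, assuming the classic org.apache.hadoop.mapred API and the JobConf named job from the earlier example; the 5-second polling interval is an arbitrary illustrative choice.
// Submit the job without blocking, then poll its status ourselves
// (the enclosing method should declare throws IOException, InterruptedException)
JobClient client = new JobClient(job);
RunningJob running = client.submitJob(job);
while (!running.isComplete()) {
    // Report map/reduce progress, then wait before polling again
    System.out.printf("map %.0f%% reduce %.0f%%%n",
        running.mapProgress() * 100, running.reduceProgress() * 100);
    Thread.sleep(5000);
}
System.out.println(running.isSuccessful() ? "Job succeeded" : "Job failed");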
Hive clients
If you run Hive as a server (hive --service hiveserver), then there are a number of different mechanisms for connecting to it from
applications. The relationship between Hive clients and Hive services is illustrated in Figure 12-1.
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages.
Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby.
They can be found in the src/service/src subdirectory in the Hive distribution.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to
a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface
implemented by the Hive Thrift Client using the Java Thrift bindings.)
You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive://.
In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a
standalone server since it does not use the Thrift service or the Hive Thrift Client.
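As a minimal client sketch: the driver class and URI form below come from the text above, while the host, port 10000 (the default noted earlier), database name, and emp table (the one used in the HQL examples later in this unit) are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive Type 4 (pure Java) JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to a standalone Hive server (use "jdbc:hive://" for embedded mode)
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");

        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("select ename, sal from emp");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        con.close();
    }
}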
ODBC Driver
The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC
driver uses Thrift to communicate with the Hive server.)
The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it.
There are more details on using these clients on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.
The Metastore
By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk.
This is called the embedded metastore configuration.
However, only one embedded Derby database can access the database files on disk at any one time,
which means you can have only one Hive session open at a time that shares the same metastore.
The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database.
This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service,
but connects to a database running in a separate process, either on the same machine or on a remote machine.
Any JDBC-compliant database may be used by setting the javax.jdo.option.* configuration properties listed in Table 12-1.
MySQL is a popular choice for the standalone metastore.
In this case, javax.jdo.option.ConnectionURL is set to
jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to
com.mysql.jdbc.Driver. (The user name and password should be set, too, of course.)
The JDBC driver JAR file for MySQL (Connector/J) must be on Hive’s classpath, which is simply
achieved by placing it in Hive’s lib directory.
Going a step further, there’s another metastore configuration called a remote metastore, where one
or more metastore servers run in separate processes to the Hive service.
This brings better manageability and security, since the database tier can be completely firewalled off,
and the clients no longer need the database credentials.
A Hive service is configured to use a remote metastore by setting hive.metastore.local to false,
and hive.metastore.uris to the metastore server URIs, separated by commas if there is more than one.
Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by
METASTORE_PORT when starting the metastore server.
HIVE (HQL) Commands
Q1: How to enter the HIVE Shell?
Go to the Terminal and type hive, you will see the hive on the prompt.
Use the emp_details database and create the emp table:
use emp_details;
create table emp(empno int, ename string, job string, sal int, deptno int)
row format delimited fields terminated by ',';
Suppose you created a comma-separated file in the local file system named empdetails.txt:
1,A,clerk,4000,10
2,A,clerk,4000,30
3,B,mgr,8000,20
4,C,peon,2000,40
5,D,clerk,4000,10
6,E,mgr,8000,50
Q7: How to see all the tables present in the database?
show tables;
The count aggregate function is used to count the total number of records in a table.
select count(1) from emp;
OR
Select count(*) from emp;
Q11: Group the sum of salaries as per the deptno
select deptno, sum(sal) from emp group by deptno;
Q13: Select the name of employees where job has exactly 5 characters
hive> select ename from emp where job LIKE '_____';
Q14: List the employee names where job has l as the second character
hive> select ename from emp where job LIKE '_l%';
Q15: Retrieve the total salary for each department
select deptno, sum(sal) from emp group by deptno;