
UNIT - 5

Applications on Big Data Using Pig & Hive
Hadoop Tools:
● The Hadoop ecosystem contains different subprojects (tools) such as Sqoop, Pig, and Hive that complement the core Hadoop modules.

➔ Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).
➔ Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
➔ Hive: It is a platform used to develop SQL-type scripts to perform MapReduce operations.
What is Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language known as Pig Latin.
• This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
Apache Pig
To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language.
● All these scripts are internally converted to Map and Reduce tasks.
● Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
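As a quick illustration (a minimal sketch; the file path and schema below are hypothetical), the following Pig Latin data flow is the kind of script the Pig Engine turns into MapReduce jobs:

-- load a comma-separated employee file (hypothetical path and schema)
emp = LOAD '/data/emp.csv' USING PigStorage(',')
      AS (empno:int, ename:chararray, sal:int, deptno:int);
-- keep only the well-paid employees
high_paid = FILTER emp BY sal > 5000;
-- DUMP (or STORE) triggers compilation of the flow into MapReduce jobs
DUMP high_paid;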
Pig Environment:

Pig is made up of two components:

● Pig Latin language: used to express data flows

● Execution environment: to run Pig Latin programs

Pig Components:
Installing and Running Pig

● Download a stable release from http://pig.apache.org/releases.html and unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
● It’s convenient to add Pig’s binary directory to your command-line path:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
● Set the JAVA_HOME environment variable to point to a suitable Java installation.
● Run pig -help to get usage instructions.
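Once installed, Pig can be run either locally or against a Hadoop cluster (a sketch; the script name is hypothetical):

% pig -x local myscript.pig       # local mode: runs against the local filesystem, handy for testing
% pig -x mapreduce myscript.pig   # MapReduce mode: runs on the Hadoop cluster (the default)
% pig                             # with no script, starts the interactive Grunt shell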
Why do we need Apache Pig?

Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig (see the word-count sketch after this list). Ultimately, Apache Pig reduces development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
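As a rough illustration of the line-count claim (a hedged sketch; the input and output paths are hypothetical), a complete word count fits in a handful of Pig Latin statements:

lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;   -- split each line into words
grpd   = GROUP words BY word;                                      -- one group per distinct word
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt; -- count occurrences
STORE counts INTO '/data/wordcount_out';                           -- write the result to HDFS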
Features of Pig:
• Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities: The tasks in Apache Pig optimize their execution automatically.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs: Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts (see the sketch after this list).
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
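A hedged sketch of how a Java UDF is used from a script (the JAR name and UDF class are hypothetical placeholders, not part of Pig itself):

REGISTER myudfs.jar;                          -- hypothetical JAR containing the compiled UDF
DEFINE TOUPPER com.example.pig.UpperCase();   -- alias for the hypothetical UDF class
emp_caps = FOREACH emp GENERATE TOUPPER(ename);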
Pig vs. MapReduce
Pig vs. SQL
Pig vs. Hive
Applications of Apache Pig:

• Processes large volumes of data
• Performs data processing in search platforms
• Processes time-sensitive data loads
• Used by telecom companies to de-identify user call data
• Used by data scientists for tasks involving ad-hoc processing and quick prototyping across large datasets
• Used to process huge data sources such as web logs
Applications of Apache Pig:

• Exploring large datasets with Pig scripting.
• Supporting ad-hoc queries across large datasets.
• Prototyping large datasets and processing algorithms.
• Collecting and processing large amounts of data in the form of search logs and web crawls.
• Deriving analytical insights using sampling.
• Processing time-sensitive data loads.
• Processing huge volumes of data.
• Performing data handling in search platforms.
• Supporting fast prototyping and impromptu (unplanned, ad-hoc) queries across huge datasets.
Apache Pig – History:

• In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on very large datasets.
• In 2007, Apache Pig was open sourced via the Apache Incubator.
• In 2008, the first release of Apache Pig came out.
• In 2010, Apache Pig graduated as an Apache top-level project.
Pig Architecture :
Apache Pig – Components :
Parser: Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
Optimizer: The logical plan (the DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
Apache Pig – Data Model :
Apache Pig – Elements:
• Atom
– Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
– It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig.
– A piece of data or a simple atomic value is known as a field.
– Example: ‘raja’ or ‘30’
• Tuple
– A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
– Example: (Raja, 30)
Apache Pig – Elements:
• Bag
– A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag.
– Each tuple can have any number of fields (flexible schema).
– A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is known as an inner bag.
– Example: (Raja, 30, {(9848022338, raja@gmail.com)})
Apache Pig – Elements:

• Relation
– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).

• Map
– A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’.
– Example: [name#Raja, age#30]
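These element types appear directly in a relation’s schema. A minimal sketch (the file name, delimiter, and fields below are hypothetical):

-- name and age are atoms (simple fields), contacts is an inner bag of tuples,
-- and details is a map of key-value pairs
students = LOAD 'student.txt' USING PigStorage('|')
           AS (name:chararray, age:int,
               contacts:bag{t:(phone:chararray)},
               details:map[]);
DUMP students;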
Pig Operators
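The most commonly used Pig Latin operators, shown as a hedged sketch (the relation and field names are hypothetical):

emp_10  = FILTER emp BY deptno == 10;          -- select tuples matching a condition
names   = FOREACH emp GENERATE ename, sal;     -- project/transform fields
by_dept = GROUP emp BY deptno;                 -- group tuples into bags per key
joined  = JOIN emp BY deptno, dept BY deptno;  -- join two relations on a key
sorted  = ORDER emp BY sal DESC;               -- sort a relation
top3    = LIMIT sorted 3;                      -- keep only the first n tuples
DUMP top3;                                     -- execute the flow and print the result
STORE top3 INTO '/out/top3';                   -- execute the flow and write to HDFS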
Applications on Big Data Using Hive
Introduction to Hive
What is Hive?
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.
• Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level
updates
Features of Hive:
• It stores schema in a database and processed data in HDFS.
• It is designed for OLAP(Online Analytical Processing).
• It provides SQL type language for querying
called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Hive Architecture:
Hive Architecture:
• User Interface
❖ Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
• Metastore
❖ Hive chooses a respective database server to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
● HiveQL Process Engine
❖ HiveQL is similar to SQL and is used for querying the schema information in the Metastore.
❖ It is one of the replacements for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
• Execution Engine
❖ The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine.
❖ The execution engine processes the query and generates the same results as MapReduce would. It uses the MapReduce paradigm.

• HDFS or HBase
❖ The Hadoop Distributed File System (HDFS) or HBase is the data storage layer used to store data in the file system.
Working of Hive:
Execution of Hive:
• Execute Query
- The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC or ODBC) to execute.
• Get Plan
- The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan (the requirements of the query).
• Get Metadata
- The compiler sends a metadata request to the Metastore (any database).
• Send Metadata
- The Metastore sends the metadata as a response to the compiler.
• Send Plan
- The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
• Execute Plan
- The driver sends the execute plan to the execution engine.
• Execute Job
- Internally, the execution of the job is a MapReduce job.
- The execution engine sends the job to the JobTracker, which is on the Name node, and the JobTracker assigns this job to the TaskTracker, which is on the Data node.
- Here, the query executes as a MapReduce job.
• Metadata Ops
- Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
• Fetch Result
- The execution engine receives the results from the Data nodes.
• Send Results
- The execution engine sends those resultant values to the driver.
• Send Results
- The driver sends the results to the Hive interfaces.
Applications on Big Data using Hive
🠶 When to use Hive
• Most suitable for data warehouse applications, where relatively static data is analyzed.
• Fast response time is not required.
• Data is not changing rapidly.
• It provides an abstraction over the underlying MapReduce program.
• Hive is of course a good choice for queries that lend themselves to being expressed in SQL, particularly long-running queries where fault tolerance is desirable.
• Hive can be a good choice if you’d like to write feature-rich, fault-tolerant, batch (i.e., not near-real-time) transformation or ETL jobs in a pluggable SQL engine.
Applications Supported by Hive are:-
⮚ Log Processing
⮚ Text Mining
⮚ Document Indexing
⮚ Google Analytics
⮚ Predictive Modeling
⮚ Hypothesis Testing
Hive Services
The Hive shell is only one of several services that you can run using the hive command. You can specify the service to run using the --service option. Type hive --service help to get a list of available service names; the most useful are described below.
cli
The command line interface to Hive (the shell). This is the default service.

hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range of clients
written in different languages.

Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to
communicate with Hive.

Set the HIVE_PORT environment variable to specify the port the server will listen on (it defaults to 10000).
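A minimal sketch of starting the Thrift server (the port value is only illustrative):

% export HIVE_PORT=10000
% hive --service hiveserver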
hwi
The Hive Web Interface is an alternative to using the Hive command line
interface.

Using the web interface is a great way to get started with Hive.

The Hive Web Interface, abbreviated as HWI, is a simple graphical user interface (GUI).
The Hive Web Interface (HWI)

As an alternative to the shell, you might want to try Hive’s simple web interface. Start
it using the following commands:
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes
on the classpath.
metastore

• By default, the metastore is run in the same process as the Hive service.
• Using this service, it is possible to run the metastore as a standalone (remote) process.
• Set the METASTORE_PORT environment variable to specify the port the server will
listen on.
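A sketch of running the metastore as a standalone service (the port shown is only illustrative):

% export METASTORE_PORT=9083
% hive --service metastore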
FileSystem :

Your Hive data is stored in HDFS, normally under /user/hive/warehouse.

The /user/hive and /user/hive/warehouse directories need to be created if they do not already exist.

Make sure this location (or any path you specify as hive.metastore.warehouse.dir in your hive-site.xml) exists and
is writable by the users whom you expect to be creating tables.
Attention:

● Cloudera recommends setting permissions on the Hive warehouse directory to 1777, making it accessible to all users, with the sticky bit set.
● This allows users to create and access their tables, but prevents them from deleting tables they do not own (see the commands after this list).
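A minimal sketch of creating the warehouse directory with these permissions (assuming the default location and a Hadoop 2-style shell):

% hadoop fs -mkdir -p /user/hive/warehouse
% hadoop fs -chmod 1777 /user/hive/warehouse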

JobClient:

org.apache.hadoop.mapred
Class JobClient

java.lang.Object
  org.apache.hadoop.mapred.JobClient

All Implemented Interfaces: AutoCloseable, Configurable, Tool

@InterfaceAudience.Public
@InterfaceStability.Stable
public class JobClient extends CLI implements AutoCloseable

JobClient is the primary interface for the user-job to interact with the cluster.
JobClient provides facilities to submit jobs, track their progress, access component-tasks' reports/logs, get the MapReduce cluster status information, etc.
The job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplits for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's jar and configuration to the MapReduce system directory on the distributed file system.
5. Submitting the job to the cluster and optionally monitoring its status.

Normally the user creates the application, describes the various facets of the job via JobConf, and then uses the JobClient to submit the job and monitor its progress.
Here is an example of how to use JobClient:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);
Job Control
At times, clients need to chain MapReduce jobs to accomplish complex tasks that cannot be done in a single MapReduce job. This is fairly easy, since the output of a job typically goes to the distributed file system and can be used as the input for the next job. However, this also means that the onus of ensuring jobs are complete (success/failure) lies squarely on the clients.
In such situations, the various job-control options are:
1. runJob(JobConf): submits the job and returns only after the job has completed.
2. submitJob(JobConf): only submits the job; the client then polls the returned RunningJob handle to query status and make scheduling decisions (see the sketch after this list).
3. JobConf.setJobEndNotificationURI(String): sets up a notification on job completion, thus avoiding polling.
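A hedged sketch of option 2, polling the RunningJob handle (assumes the JobConf has already been configured as in the earlier example; uses org.apache.hadoop.mapred.JobClient, RunningJob, and java.io.IOException):

public static void submitAndWait(JobConf job) throws Exception {
    JobClient client = new JobClient(job);
    RunningJob running = client.submitJob(job);   // returns immediately with a handle
    while (!running.isComplete()) {               // poll the handle for completion
        Thread.sleep(5000);                       // wait five seconds between polls
    }
    if (!running.isSuccessful()) {
        throw new IOException("Job " + running.getID() + " failed");
    }
}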

Hive clients
If you run Hive as a server (hive --service hiveserver), then there are a number of different mechanisms for connecting to it from
applications. The relationship between Hive clients and Hive services is illustrated in Figure 12-1.

Thrift Client
The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages.
Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby.
They can be found in the src/service/src subdirectory in the Hive distribution.
JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to
a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface
implemented by the Hive Thrift Client using the Java Thrift bindings.)

You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive://.
In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a
standalone server since it does not use the Thrift service or the Hive Thrift Client.
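A minimal sketch of connecting through the JDBC driver (the host, port, table, and credentials are illustrative placeholders):

import java.sql.*;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Driver class named in the text above
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ename, sal FROM emp");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        con.close();
    }
}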

ODBC Driver
The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC
driver uses Thrift to communicate with the Hive server.)

The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it.

There are more details on using these clients on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.
The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data.

By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk.
This is called the embedded metastore configuration.

Using an embedded metastore is a simple way to get started with Hive;

however, only one embedded Derby database can access the database files on disk at any one time,
which means you can only have one Hive session open at a time that shares the same metastore.

Trying to start a second session gives the error:

Failed to start database 'metastore_db'

when it attempts to open a connection to the metastore.

The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database.

This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service,

but connects to a database running in a separate process, either on the same machine or on a remote machine.
Any JDBC-compliant database may be used by setting the javax.jdo.option.* configuration properties listed in Table 12-1.
MySQL is a popular choice for the standalone metastore.
In this case, javax.jdo.option.ConnectionURL is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The user name and password should be set too, of course.)

The JDBC driver JAR file for MySQL (Connector/J) must be on Hive’s classpath, which is simply
achieved by placing it in Hive’s lib directory.
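A sketch of the corresponding hive-site.xml entries (the host name, database name, and credentials are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>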

Going a step further, there’s another metastore configuration called a remote metastore, where one
or more metastore servers run in separate processes to the Hive service.
This brings better manageability and security, since the database tier can be completely firewalled off,
and the clients no longer need the database credentials.

A Hive service is configured to use a remote metastore by setting hive.metastore.local to false and hive.metastore.uris to the metastore server URIs, separated by commas if there is more than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server.
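A sketch of the client-side configuration for a remote metastore (the host and port are placeholders):

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>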
HIVE (HQL) Commands
Q1: How to enter the HIVE Shell?

Go to the terminal and type hive; you will see the hive prompt.

[cloudera@quickstart Desktop]$ hive

Q2: Create a database

create database emp_details;

use emp_details;

Q3: How to create Managed Table in HIVE?

create table emp(empno int, ename string, job string, sal int, deptno int)
row format delimited fields terminated by ',';

Q4: How to load the data from LOCAL to HIVE TABLE

Suppose you created a comma-separated file in the local file system named empdetails.txt:

1,A,clerk,4000,10
2,A,clerk,4000,30
3,B,mgr,8000,20
4,C,peon,2000,40
5,D,clerk,4000,10
6,E,mgr,8000,50

hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/empdetails.txt' OVERWRITE INTO TABLE emp;
# Note: If 'LOCAL' is omitted then it looks for the file in HDFS.
The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.
Q5: How to check where the managed table is created in hive db
[cloudera@quickstart Desktop]$ hadoop fs -ls /user/hive/warehouse/emp_details.db
Found 2 items
drwxrwxrwx - cloudera supergroup 0 2018-07-24 02:40 /user/hive/warehouse/emp_details.db/emp
drwxrwxrwx - cloudera supergroup 0 2018-07-24 02:28 /user/hive/warehouse/emp_details.db/emp1
Also check the contents inside emp:
[cloudera@quickstart Desktop]$ hadoop fs -ls
/user/hive/warehouse/emp_details.db/emp
Found 1 items
-rwxrwxrwx 1 cloudera supergroup 104 2018-07-24 02:40 /user/hive/warehouse/emp_details.db/emp/empdetails.txt
Now see the contents inside empdetails.txt
[cloudera@quickstart Desktop]$ hadoop fs -cat /user/hive/warehouse/emp_details.db/emp/empdetails.txt
1,A,clerk,4000,10
2,A,clerk,4000,30
3,B,mgr,8000,20
4,C,peon,2000,40
5,D,clerk,4000,10
6,E,mgr,8000,50
Q6: Check the schema of the created table emp

describe emp;

For a detailed schema, use:
describe extended emp;

Q7: How to see all the tables present in database
show tables;

Q8: Select all the enames from emp table


select ename from emp;

Q9: Get the records where name is 'A'


select * from emp where ename='A';

Q10: Count the total number of records in the created table

The count aggregate function is used to count the total number of records in a table.
select count(1) from emp;
OR
select count(*) from emp;
Q11: Group the sum of salaries as per the deptno

select deptno, sum(sal) from emp group by deptno;

Q12: Get the salary of people between 1000 and 2000

select * from emp where sal between 1000 and 2000;

Q13: Select the name of employees where job has exactly 5 characters
hive> select ename from emp where job LIKE '_____';

Q14: List the employee names where job has l as the second character

hive> select ename from emp where job LIKE '_l%';

Q15: Retrieve the total salary for each department

select deptno, sum(sal) from emp group by deptno;

Q16: Add a column to the table


alter table emp add COLUMNS(lastname string);

Q17: How to Rename a table


alter table emp rename to emp1;

Q18: How to drop a table


drop table emp;
Fundamentals of HBase
Fundamentals of ZooKeeper
