IMTC634 - Data Science - Chapter 16

Chapter 16: Pig
Chapter Index
S.No. Reference Particulars Slide
No. From - To
1 Learning Objectives 3
2 Topic 1 Pig 4-6
3 Topic 2 Modes of running Pig Scripts 7-8
4 Topic 3 Running Pig problems 9

5 Topic 4 Getting started with Pig Latin 10-15
6 Topic 5 Working with Functions in Pig 16-17
Let’s Sum Up 18-19

Learning Objectives
 Describe the meaning of Pig
 Discuss the benefits of Pig
 Describe the properties of Pig
 Explain how to execute Pig programs
 Explain the working with functions in Pig

1.Pig
 Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
 Pig was developed in 2006 at Yahoo. Its aim, as a research project was to
provide a simple way to use Hadoop and focus on examining large datasets.
 Pig became an Apache project in 2007. By 2009, other companies started using
pig, making it a top-level Apache project in 2010.
 Pig can be divided into three categories: ETL (Extract, Transform, and Load),
research, and interactive data processing.
 Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
1.Pig
Benefits of Pig :
The Pig programming language offers the following benefits:
 Ease of coding: Using Pig Latin we can write complex programs. The
code is simple and easy to understand and maintain. It takes complex tasks
involving interrelated data transformations as data flow sequence and
explicitly encodes them.
 Optimization: Pig Latin encodes tasks in such a way that they can be
easily optimized for execution. This allows users to concentrate on the data
processing aspects without bothering about efficiency.
 Extensibility: Pig Latin is designed in such a way that it allows us to create

our own custom functions. These can be used for performing special tasks.
Custom functions are also called user- defined functions.
1.Pig
Properties of Pig :
 Pig programming language is written in Java; therefore, it supports the
properties of Java and uses the Java notation.
 Following table shows some important properties of Pig:
Property Description Values Default
stop.on.failure Decides whether to exit at True/ False
the first error False
udf.import.list Refers to a comma-
separated list of imports
for Universal Disk Format
(UDF)
pig.additional. Refers to a command- JAR
jars separated list of files
JavaArchive (JAR) files
2.Modes of Running Pig Scripts
Pig scripts can be run in the following two modes:
 Local mode: In this mode, several scripts can run on a single machine without
requiring Hadoop MapReduce and Hadoop Distributed File System (HDFS).
This mode is useful for developing and testing Pig logic.
 MapReduce mode: Also known as the Hadoop mode. The pig script, in this
mode, gets converted into a series of MapReduce jobs and then run on the
Hadoop cluster.
2.Modes of Running Pig Scripts
Following figure shows the modes available for running Pig scripts:
3.Running Pig Programs
 Before running the Pig program, it is necessary to know about the pig shell. As
we all know, without a shell, no one can access the pig’s in-built characteristics.
 Pig shell is known as “Grunt.” Grunt is a command shell, which is graphical in
nature and used for scripting of pig.
 Grunt saves the previously used command in “pig_history” file in Home
directory.
 There is one handier feature of Grunt: If you are writing a script on grunt, it will
automatically complete the keywords that you are typing.
 For example, if you are writing a script and you need to just type “for” and press
Tab button, it will automatically type “foreach” keyword.
4.Getting Started with Pig Latin
 Schema :For every scripting language, there is a schema definition that tells
everything about the script.
 Pig is a high-level programming platform that uses Pig Latin language for
developing Hadoop MapReduce programs.
 Pig Latin abstracts its programming from the Java MapReduce idiom and
extends itself by allowing direct calling of user-defined functions written in
Java, Python, JavaScript, Ruby, or Groovy.
 Pig translates the Pig Latin script into MapReduce jobs.

The main reasons for developing Pig Latin are as follows:
 Simple: A streamlined method is provided in Pig Latin for interacting with

MapReduce.
 Smart: A Pig Latin program is converted into a series of Java MapReduce jobs
using the Pig Latin Compiler.
 Extensible: Due to its extensible nature, Pig Latin enables the developers to
address specific business problems by adding functions
Pig Latin Application Flow :

 Pig Latin is regarded as a data flow language. This simply means that we can
use Pig Latin to define a data stream and a sequence of transformations, which
can be applied to the data as it moves throughout our application.
 In a control flow language, we can write a sequence of instructions. Also We
can use concepts such as conditional logic and loops.
 There are no if and loop statements in Pig Latin.
 The Pig syntax used for data processing flow is as follows:
X = LOAD ‘file_name.txt’ ...
Y = GROUP ... ;...
Z= FILTER ... ;...
DUMP Y; ..
STORE Z into ‘temp’;
 Working with Operators in Pig :
 In Pig Latin, relational operators are used for transforming data. Different
types of transformations include grouping, filtering, sorting, and joining.
 The following are some basic relational operators used in Pig:

1. FOREACH 2. ASSERT 3. FILTER
4. GROUP 5. ORDER BY 6. DISTINCT
7. JOIN 8. SAMPLE 9. SPLIT
10. LIMIT
Description of Operators:
 FOREACH : The FOREACH operator performs iterations over every record to

perform a transformation.
 ASSERT : The ASSERT operator asserts a condition on the data.
 FILTER : The FILTER operator enables us to use a predicate for selecting the
records that need to be retained in the pipeline.
 Working with Operators in Pig :

 GROUP : GROUP operator is used for grouping data in single or

multiple relations.
 GROUP : GROUP operator is used for grouping data in single or multiple
relations.
 ORDER BY : Depending on one or more fields, a given relation can be
sorted using the ORDER BY operator.
 DISTINCT : This operator is used for removing duplicate fields from a
given set of records.
 JOIN : For joining two or more relations, JOIN operator is used. Types of
joins can be performed in Pig Latin:
1. Inner join 2. Outer join
 LIMIT : The LIMIT operator in Pig allows a user to limit the number of
results.
 SA MPLE : SAMPLE operator can be used for selecting a random data
sample in Pig.
 SPLIT : SPLIT operator partitions a given relation into two or more relations.
5.Working with Functions in Pig
 A set of statements that can be used to perform a specific task or a set of tasks is
called a function.
 There are basically two types of functions in Pig: user-defined functions and
built-in functions.
 User-defined functions can be created by users as per their requirements. On the

other hand, built-in functions are already defined in Pig.
 There are mainly five categories of built-in functions in Pig.

5.Working with Functions in Pig
 There are mainly five categories of built-in functions in Pig, which are as
follows :
 Eval or Evaluation functions: These functions are used to evaluate a
value by using an expression.
 Math functions: These functions are used to perform mathematical

operations.
 String functions: These functions are used to perform various operations

on strings.
 Bag and Tuple functions: These functions are used to perform operations
on tuples and bags.
 Load and Store functions: These functions are used to load and extract
Let’s Sum Up
 Pig was designed and developed for performing a long series of data operations.
 The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
 Pig enables users to focus more on what to do than on how to do it.
 Pig was developed in 2006 at Yahoo.
 Pig can be divided into three categories: ETL (Extract, Transform, and Load),
research, and interactive data processing.
 Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
Let’s Sum Up
 Pig programming language is written in Java; therefore, it supports the

properties of Java and uses the Java notation.
 Pig shell is known as “Grunt.” Grunt is a command shell, which is graphical in

nature and used for scripting of pig.
 Grunt saves the previously used command in “pig_history” file in Home

directory.
 Pig is a high-level programming platform that uses Pig Latin language for
developing Hadoop MapReduce programs.
THANK YOU

IMTC634 - Data Science - Chapter 16

Uploaded by

Copyright:

Available Formats

IMTC634 - Data Science - Chapter 16

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

IMTC634 - Data Science - Chapter 16

Uploaded by

Copyright:

Available Formats

Chapter 16: Pig

4 Topic 3 Running Pig problems 9

6 Topic 5 Working with Functions in Pig 16-17

Let’s Sum Up 18-19

 Describe the meaning of Pig

 Discuss the benefits of Pig

 Describe the properties of Pig

 Explain how to execute Pig programs

 Explain the working with functions in Pig

 Extensibility: Pig Latin is designed in such a way that it allows us to create

Pig scripts can be run in the following two modes:

 Pig translates the Pig Latin script into MapReduce jobs.

The main reasons for developing Pig Latin are as follows:

 Simple: A streamlined method is provided in Pig Latin for interacting with

Pig Latin Application Flow :

 Working with Operators in Pig :

 The following are some basic relational operators used in Pig:

 FOREACH : The FOREACH operator performs iterations over every record to

 Working with Operators in Pig :

 GROUP : GROUP operator is used for grouping data in single or

 User-defined functions can be created by users as per their requirements. On the

 There are mainly five categories of built-in functions in Pig.

 Math functions: These functions are used to perform mathematical

 String functions: These functions are used to perform various operations

 Pig enables users to focus more on what to do than on how to do it.

 Pig was developed in 2006 at Yahoo.

 Pig programming language is written in Java; therefore, it supports the

 Pig shell is known as “Grunt.” Grunt is a command shell, which is graphical in

 Grunt saves the previously used command in “pig_history” file in Home

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

 Extensibility: Pig Latin is designed in such a way that it allows us to create