IMTC634 - Data Science - Chapter 16
Chapter Index
S.No.  Reference  Particulars
1                 Learning Objectives
2      Topic 1    Pig
3      Topic 2    Modes of Running Pig Scripts
Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
Pig was developed in 2006 at Yahoo as a research project. Its aim was to
provide a simple way to use Hadoop and to focus on examining large datasets.
Pig became an Apache project in 2007. By 2009, other companies had started
using Pig, and it became a top-level Apache project in 2010.
Uses of Pig can be divided into three categories: ETL (Extract, Transform, and
Load), research, and interactive data processing.
Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
1.Pig
Benefits of Pig :
The Pig programming language offers the following benefits:
Ease of coding: Pig Latin lets us write complex programs in code that is
simple and easy to understand and maintain. Complex tasks involving
interrelated data transformations are explicitly encoded as data flow
sequences.
Optimization: Pig Latin encodes tasks in such a way that they can be
easily optimized for execution. This allows users to concentrate on the data
processing aspects without worrying about efficiency.
Properties of Pig :
Pig is written in Java; therefore, it supports Java properties and uses the
Java property notation.
The following table shows some important properties of Pig:
Property             Description                                   Values      Default
stop.on.failure      Decides whether to exit at the first error    True/False  False
udf.import.list      Refers to a comma-separated list of imports
                     for user-defined functions (UDFs)
pig.additional.jars  Refers to a comma-separated list of           JAR files
                     Java Archive (JAR) files
2.Modes of Running Pig Scripts
Local mode: In this mode, scripts run on a single machine without requiring
Hadoop MapReduce and the Hadoop Distributed File System (HDFS). This mode is
useful for developing and testing Pig logic.
MapReduce mode: Also known as the Hadoop mode. In this mode, the Pig script is
converted into a series of MapReduce jobs, which are then run on the Hadoop
cluster.
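As a minimal sketch of how the two modes are selected in practice, the helper below builds the command line for the pig launcher, whose -x option chooses the execution type. The helper name pig_command and the script name wordcount.pig are hypothetical, and the code only assembles the arguments; it does not start Pig.

```python
def pig_command(script_path, mode="mapreduce"):
    """Build the argument list for running a Pig script in the given mode."""
    if mode not in ("local", "mapreduce"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # The -x option of the pig launcher selects the execution type.
    return ["pig", "-x", mode, script_path]


# wordcount.pig is a hypothetical script name used only for illustration.
local_cmd = pig_command("wordcount.pig", mode="local")
cluster_cmd = pig_command("wordcount.pig")  # defaults to MapReduce mode

print(" ".join(local_cmd))
print(" ".join(cluster_cmd))
```

Running the local command first is the usual way to test Pig logic on a small sample before submitting the same script to the cluster.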
Following figure shows the modes available for running Pig scripts:
3.Running Pig Programs
Before running a Pig program, it is necessary to know about the Pig shell,
because Pig's built-in features are accessed interactively through it.
The Pig shell is known as "Grunt." Grunt is an interactive command-line shell
used for entering and running Pig Latin statements.
Grunt saves previously used commands in the ".pig_history" file in the home
directory.
Grunt has another handy feature: while you are writing a script in Grunt, it
automatically completes the keywords that you are typing.
For example, if you type "for" and press the Tab key, Grunt completes it to
the "foreach" keyword.
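The completion behaviour described above can be pictured as a prefix match against a keyword list. The sketch below is only an illustration in plain Python, not Grunt's actual implementation; the keyword list is a small assumed subset and the function names are hypothetical.

```python
# A small, assumed subset of Pig Latin keywords, for illustration only.
KEYWORDS = ["filter", "foreach", "group", "join", "limit", "load", "order", "store"]


def complete(prefix, keywords=KEYWORDS):
    """Return every keyword that starts with the typed prefix."""
    prefix = prefix.lower()
    return [word for word in keywords if word.startswith(prefix)]


def press_tab(prefix):
    """Expand the prefix only when exactly one keyword matches it."""
    matches = complete(prefix)
    return matches[0] if len(matches) == 1 else prefix


print(press_tab("for"))  # a single match, so the prefix is expanded
print(press_tab("l"))    # "limit" and "load" both match, so nothing changes
```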
4.Getting Started with Pig Latin
Schema: A schema describes the structure of the data that a script works
with, that is, the names and types of the fields in a relation.
Pig is a high-level programming platform that uses the Pig Latin language for
developing Hadoop MapReduce programs.
Pig Latin abstracts its programming from the Java MapReduce idiom and
extends itself by allowing direct calling of user-defined functions written in
Java, Python, JavaScript, Ruby, or Groovy.
Smart: A Pig Latin program is converted into a series of Java MapReduce jobs
using the Pig Latin Compiler.
Extensible: Due to its extensible nature, Pig Latin enables developers to
address specific business problems by adding functions.
In Pig Latin, relational operators are used for transforming data. Different
types of transformations include grouping, filtering, sorting, and joining.
Description of Operators:
GROUP : GROUP operator is used for grouping data in single or multiple
relations.
ORDER BY : A given relation can be sorted on one or more fields using the
ORDER BY operator.
DISTINCT : This operator is used for removing duplicate records (tuples)
from a given relation.
JOIN : For joining two or more relations, the JOIN operator is used. Two
types of joins can be performed in Pig Latin:
1. Inner join 2. Outer join
LIMIT : The LIMIT operator in Pig allows a user to limit the number of
results.
SAMPLE : SAMPLE operator can be used for selecting a random data
sample in Pig.
SPLIT : SPLIT operator partitions a given relation into two or more relations.
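To make the operator descriptions concrete, the following sketch mimics several of them in plain Python on small in-memory lists of tuples. It is not Pig Latin and says nothing about how Pig executes these operators on Hadoop; the sample data and function names are hypothetical. Two simplifications are worth noting: GROUP in Pig yields tuples of a key and a bag, which is modelled here as a dictionary, and split_relation is a simple two-way partition.

```python
from collections import defaultdict

# Hypothetical sample relations: (name, subject, score) and (name, club).
students = [("alice", "math", 82), ("bob", "physics", 75),
            ("carol", "math", 91), ("bob", "physics", 75)]
clubs = [("alice", "chess"), ("carol", "robotics"), ("dave", "drama")]


def distinct(relation):
    """DISTINCT: drop duplicate tuples, keeping first occurrences."""
    seen, result = set(), []
    for row in relation:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def group_by(relation, key_index):
    """GROUP: collect the tuples that share the same key value."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)


def order_by(relation, key_index, descending=False):
    """ORDER BY: sort the relation on one field."""
    return sorted(relation, key=lambda row: row[key_index], reverse=descending)


def limit(relation, n):
    """LIMIT: keep at most n tuples."""
    return relation[:n]


def inner_join(left, left_key, right, right_key):
    """JOIN (inner): pair tuples whose key fields are equal."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


def split_relation(relation, predicate):
    """SPLIT: partition a relation into two relations by a condition."""
    matching = [row for row in relation if predicate(row)]
    rest = [row for row in relation if not predicate(row)]
    return matching, rest


unique = distinct(students)
by_subject = group_by(unique, 1)
top_two = limit(order_by(unique, 2, descending=True), 2)
joined = inner_join(unique, 0, clubs, 0)
high, low = split_relation(unique, lambda row: row[2] >= 80)

print(unique, by_subject, top_two, joined, high, low, sep="\n")
```

Chaining limit after order_by mirrors the common pattern of sorting a relation and then keeping only the first few results.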
5.Working with Functions in Pig
A set of statements that can be used to perform a specific task or a set of tasks is
called a function.
There are basically two types of functions in Pig: user-defined functions and
built-in functions.
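The difference between the two kinds of functions can be sketched in plain Python. Below, count_bag and avg_bag stand in for the behaviour of built-in evaluation functions such as COUNT and AVG applied to a bag of values, while letter_grade plays the role of a user-defined function that encodes a business rule. All names, thresholds, and data here are hypothetical, and the code does not use Pig's actual UDF interfaces.

```python
def count_bag(bag):
    """Stand-in for a built-in eval function: number of items in a bag."""
    return len(bag)


def avg_bag(bag):
    """Stand-in for a built-in eval function: mean of a bag, or None if empty."""
    return sum(bag) / len(bag) if bag else None


def letter_grade(score):
    """Hypothetical user-defined function: map a numeric score to a grade."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    return "C"


# Hypothetical grouped data: each key maps to a bag of scores.
scores_by_subject = {"math": [82, 91], "physics": [75]}

report = {
    subject: (count_bag(bag), avg_bag(bag), letter_grade(avg_bag(bag)))
    for subject, bag in scores_by_subject.items()
}
print(report)
```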
There are mainly five categories of built-in functions in Pig, which are as
follows :
Eval or Evaluation functions: These functions are used to evaluate a
value by using an expression.
Bag and Tuple functions: These functions are used to perform operations
on tuples and bags.
Load and Store functions: These functions are used to load and extract
Let’s Sum Up
Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.
Uses of Pig can be divided into three categories: ETL (Extract, Transform, and
Load), research, and interactive data processing.
Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.
Pig is a high-level programming platform that uses the Pig Latin language for
developing Hadoop MapReduce programs.