IMTC634 - Data Science - Chapter 16

Chapter 16: Pig

Chapter Index
S.No.  Reference  Particulars
1                 Learning Objectives
2      Topic 1    Pig
3      Topic 2    Modes of Running Pig Scripts
4      Topic 3    Running Pig Programs
5      Topic 4    Getting Started with Pig Latin
6      Topic 5    Working with Functions in Pig
                  Let’s Sum Up


Learning Objectives

 Describe the meaning of Pig

 Discuss the benefits of Pig

 Describe the properties of Pig

 Explain how to execute Pig programs

 Explain how to work with functions in Pig


1. Pig

 Pig was designed and developed for performing a long series of data operations.
The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.

 Pig was developed in 2006 at Yahoo as a research project. Its aim was to
provide a simple way to use Hadoop and to focus on examining large datasets.

 Pig became an Apache project in 2007. By 2009, other companies had started
using Pig, and it became a top-level Apache project in 2010.

 The uses of Pig can be divided into three categories: ETL (Extract, Transform,
and Load), research, and interactive data processing.

 Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.

Benefits of Pig:
The Pig programming language offers the following benefits:

 Ease of coding: Using Pig Latin, we can write complex programs. The
code is simple and easy to understand and maintain. It takes complex tasks
involving interrelated data transformations and explicitly encodes them as a
data flow sequence.

 Optimization: Pig Latin encodes tasks in such a way that they can be
easily optimized for execution. This allows users to concentrate on the data
processing aspects without bothering about efficiency.

 Extensibility: Pig Latin is designed in such a way that it allows us to create
our own custom functions. These can be used for performing special tasks.
Custom functions are also called user-defined functions.

Properties of Pig:
 Pig is written in Java; therefore, its configuration settings are Java
properties and are named using the Java property notation.
 The following table shows some important properties of Pig:

Property             Description                                  Values       Default
stop.on.failure      Decides whether to exit at the first error   True/False   False
udf.import.list      A comma-separated list of imports for
                     user-defined functions (UDFs)
pig.additional.jars  A list of additional Java Archive (JAR)      JAR files
                     files
2. Modes of Running Pig Scripts

Pig scripts can be run in the following two modes:

 Local mode: In this mode, scripts run on a single machine without requiring
Hadoop MapReduce or the Hadoop Distributed File System (HDFS).
This mode is useful for developing and testing Pig logic.

 MapReduce mode: Also known as the Hadoop mode. In this mode, the Pig script
is converted into a series of MapReduce jobs, which are then run on the
Hadoop cluster.
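
As a rough sketch of the two modes, the small script below could be saved under a hypothetical name such as modes_demo.pig and started in either mode with the -x option of the pig command. The script name, the input file students.txt, and its fields are assumptions made only for illustration.

```pig
-- modes_demo.pig (hypothetical file name)
--
-- Local mode: runs on a single machine against the local file system.
--     pig -x local modes_demo.pig
--
-- MapReduce mode: the script is converted into MapReduce jobs on the
-- cluster, and the paths below refer to HDFS.
--     pig -x mapreduce modes_demo.pig

-- Assumed input: comma-separated lines of the form name,age
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- Keep only the records whose age is at least 18.
adults = FILTER students BY age >= 18;

-- Print the result to the console.
DUMP adults;
```

The script itself is the same in both modes; only the place where the data lives and where the work runs changes, which is why local mode is convenient for testing logic on a small sample first.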

[Figure: Modes available for running Pig scripts]
3. Running Pig Programs

 Before running a Pig program, it is necessary to know about the Pig shell,
which gives interactive access to Pig’s built-in features.
 The Pig shell is known as “Grunt.” Grunt is an interactive command-line shell
used for entering Pig Latin statements and running Pig scripts.
 Grunt saves previously used commands in the “.pig_history” file in the home
directory.
 Grunt has another handy feature: while you are typing a script in Grunt, it can
automatically complete the keywords that you are typing.
 For example, if you type “for” and press the Tab key, Grunt completes it to
the “foreach” keyword.
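
A short Grunt session might look like the sketch below. The lines are meant to be typed one at a time at the grunt> prompt (the prompt itself is omitted), for example after starting the shell with pig -x local. The input file students.txt and its fields are assumptions for illustration.

```pig
-- Assumed input: a file students.txt with comma-separated lines name,age
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);

-- DESCRIBE prints the schema of a relation without running a job.
DESCRIBE students;

-- DUMP runs the data flow and prints the resulting tuples.
DUMP students;

-- Grunt utility command: list the statements entered so far.
history

-- Grunt utility command: leave the shell.
quit
```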
4. Getting Started with Pig Latin

 Schema: In Pig Latin, a schema describes the data that a script works on by
assigning names and data types to the fields of a relation.

 Pig is a high-level programming platform that uses the Pig Latin language for
developing Hadoop MapReduce programs.

 Pig Latin abstracts its programming from the Java MapReduce idiom and
extends itself by allowing direct calling of user-defined functions written in
Java, Python, JavaScript, Ruby, or Groovy.

 Pig translates the Pig Latin script into MapReduce jobs.
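
The idea of a schema mentioned above can be sketched as follows. The file employees.txt and its three fields are assumptions for illustration; the AS clause is where the schema is attached to the loaded relation.

```pig
-- Assumed input: tab-separated lines of the form id, name, salary.
-- With no USING clause, Pig falls back to PigStorage, which splits on tabs.
employees = LOAD 'employees.txt' AS (id:int, name:chararray, salary:double);

-- Shows the field names and types that the AS clause attached.
DESCRIBE employees;

-- With a schema, fields can be referred to by name.
names = FOREACH employees GENERATE name, salary;

-- Fields can also be referred to by position ($0 is the first field),
-- which works even when no schema has been declared.
ids = FOREACH employees GENERATE $0;

DUMP names;
```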


The main reasons for developing Pig Latin are as follows:

 Simple: Pig Latin provides a streamlined method for interacting with
MapReduce.

 Smart: A Pig Latin program is converted into a series of Java MapReduce jobs
using the Pig Latin compiler.

 Extensible: Due to its extensible nature, Pig Latin enables developers to
address specific business problems by adding functions.

Pig Latin Application Flow :


 Pig Latin is regarded as a data flow language. This simply means that we can
use Pig Latin to define a data stream and a sequence of transformations that
are applied to the data as it moves through our application.
 In a control flow language, by contrast, we write a sequence of instructions
and can use concepts such as conditional logic and loops.
 There are no if statements or loops in Pig Latin.
 The Pig syntax used for a data processing flow is as follows:
X = LOAD 'file_name.txt' ...
Y = GROUP ... ; ...
Z = FILTER ... ; ...
DUMP Y; ...
STORE Z INTO 'temp';
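
To make the outline above concrete, here is one possible filled-in version of the same LOAD, GROUP, FILTER, DUMP, STORE flow. The file sales.txt, its fields, and the threshold of 100 are assumptions for illustration and are not part of the original outline.

```pig
-- Assumed input: comma-separated lines of the form product,region,amount
sales = LOAD 'sales.txt' USING PigStorage(',')
        AS (product:chararray, region:chararray, amount:double);

-- Transformation 1: group the records by region.
by_region = GROUP sales BY region;

-- Transformation 2: keep only the larger sales.
large = FILTER sales BY amount > 100.0;

-- DUMP prints a relation to the console.
DUMP by_region;

-- STORE writes a relation to an output directory (here named 'temp').
STORE large INTO 'temp' USING PigStorage(',');
```

Each statement names a new relation built from an earlier one, so the script reads as a pipeline rather than as a set of instructions with branches and loops. Note that STORE expects the output directory not to exist yet.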

 Working with Operators in Pig :

 In Pig Latin, relational operators are used for transforming data. Different
types of transformations include grouping, filtering, sorting, and joining.

 The following are some basic relational operators used in Pig:


1. FOREACH    2. ASSERT      3. FILTER
4. GROUP      5. ORDER BY    6. DISTINCT
7. JOIN       8. SAMPLE      9. SPLIT
10. LIMIT

Description of Operators:

 FOREACH: The FOREACH operator iterates over every record to perform a
transformation.
 ASSERT: The ASSERT operator asserts a condition on the data.
 FILTER: The FILTER operator enables us to use a predicate for selecting the
records that need to be retained in the pipeline.
 GROUP: The GROUP operator is used for grouping data in one or more
relations.
 ORDER BY: A given relation can be sorted on one or more fields using the
ORDER BY operator.
 DISTINCT: This operator is used for removing duplicate records from a
given relation.
 JOIN: The JOIN operator is used for joining two or more relations. Two types
of joins can be performed in Pig Latin:
1. Inner join    2. Outer join
 LIMIT: The LIMIT operator in Pig allows a user to limit the number of
results.
 SAMPLE: The SAMPLE operator can be used for selecting a random data
sample in Pig.
 SPLIT: The SPLIT operator partitions a given relation into two or more relations.
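
The operators described above can be sketched together in one short script. The input files students.txt and depts.txt, their fields, and the thresholds used are assumptions for illustration only.

```pig
-- Assumed inputs (comma-separated):
--   students.txt : id,name,dept,marks
--   depts.txt    : dept,building
students = LOAD 'students.txt' USING PigStorage(',')
           AS (id:int, name:chararray, dept:chararray, marks:int);
depts = LOAD 'depts.txt' USING PigStorage(',')
        AS (dept:chararray, building:chararray);

-- ASSERT: make the script fail if any record violates the condition.
ASSERT students BY marks >= 0, 'marks must not be negative';

-- FILTER: keep only the records that satisfy the predicate.
passed = FILTER students BY marks >= 40;

-- FOREACH: transform every record (here, keep two of the fields).
name_marks = FOREACH passed GENERATE name, marks;

-- GROUP: collect the records of each dept into one bag per dept.
by_dept = GROUP students BY dept;
dept_counts = FOREACH by_dept GENERATE group AS dept, COUNT(students) AS total;

-- ORDER BY: sort on one or more fields.
ranked = ORDER students BY marks DESC, name ASC;

-- DISTINCT: remove duplicate records.
dept_names = FOREACH students GENERATE dept;
unique_depts = DISTINCT dept_names;

-- JOIN: an inner join by default; LEFT OUTER also keeps unmatched students.
inner_joined = JOIN students BY dept, depts BY dept;
outer_joined = JOIN students BY dept LEFT OUTER, depts BY dept;

-- LIMIT: keep at most 3 records.
top3 = LIMIT ranked 3;

-- SAMPLE: keep a random sample of roughly 10% of the records.
sampled = SAMPLE students 0.1;

-- SPLIT: partition one relation into several by condition.
SPLIT students INTO high IF marks >= 75, low IF marks < 75;

DUMP dept_counts;
DUMP top3;
```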
5. Working with Functions in Pig

 A set of statements that can be used to perform a specific task or a set of tasks is
called a function.

 There are basically two types of functions in Pig: user-defined functions and
built-in functions.

 User-defined functions can be created by users as per their requirements. On the
other hand, built-in functions are already defined in Pig.

 There are mainly five categories of built-in functions in Pig, which are as
follows:
 Eval or Evaluation functions: These functions are used to evaluate a
value by using an expression.

 Math functions: These functions are used to perform mathematical
operations.

 String functions: These functions are used to perform various operations
on strings.

 Bag and Tuple functions: These functions are used to perform operations
on tuples and bags.

 Load and Store functions: These functions are used to load and extract
data.
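
As an illustrative sketch, the script below uses at least one built-in function from each of the five categories. The input file scores.txt, its fields, and the output directory dept_stats are assumptions; the functions themselves (PigStorage, UPPER, ROUND, ABS, COUNT, AVG, MAX, TOTUPLE, TOBAG) ship with Pig.

```pig
-- Load/Store functions: PigStorage reads and writes delimited text.
-- Assumed input: comma-separated lines of the form name,dept,score
scores = LOAD 'scores.txt' USING PigStorage(',')
         AS (name:chararray, dept:chararray, score:double);

-- String function (UPPER) and math functions (ROUND, ABS) work on
-- individual field values.
cleaned = FOREACH scores GENERATE
              UPPER(name) AS name,
              dept,
              ROUND(score) AS rounded,
              ABS(score - 50.0) AS distance_from_50;

-- Eval functions such as COUNT, AVG, and MAX aggregate over a bag.
by_dept = GROUP scores BY dept;
stats = FOREACH by_dept GENERATE
            group AS dept,
            COUNT(scores) AS total,
            AVG(scores.score) AS mean_score,
            MAX(scores.score) AS best_score;

-- Bag and tuple functions: TOTUPLE builds a tuple, TOBAG builds a bag.
packed = FOREACH scores GENERATE TOTUPLE(name, dept), TOBAG(score);

STORE stats INTO 'dept_stats' USING PigStorage(',');
DUMP cleaned;
```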
Let’s Sum Up

 Pig was designed and developed for performing a long series of data operations.

 The Pig platform is specially designed for handling many kinds of data, be it
structured, semi-structured, or unstructured.

 Pig enables users to focus more on what to do than on how to do it.

 Pig was developed in 2006 at Yahoo.

 The uses of Pig can be divided into three categories: ETL (Extract, Transform,
and Load), research, and interactive data processing.

 Pig consists of a scripting language, known as Pig Latin, and a Pig Latin
compiler.

 Pig is written in Java; therefore, its configuration settings are Java
properties and are named using the Java property notation.

 The Pig shell is known as “Grunt.” Grunt is an interactive command-line shell
used for entering Pig Latin statements and running Pig scripts.

 Grunt saves previously used commands in the “.pig_history” file in the home
directory.

 Pig is a high-level programming platform that uses the Pig Latin language for
developing Hadoop MapReduce programs.
THANK YOU