The document discusses several big data tools and techniques, including Pig, Hive, HBase, Zookeeper, Mahout, and Storm. It provides an overview of what each tool is used for, how it works, and example code snippets. Key points include: Pig is a data flow language for expressing data analysis workflows as sequences of MapReduce jobs, and it allows relational operations on nested data structures. Hive provides a SQL-like interface to analyze large datasets stored in Hadoop; it includes a metastore to manage metadata and supports partitioning, clustering, and complex data types. HBase is a column-oriented NoSQL database modeled after Bigtable that stores billions of rows and millions of columns.

Acropolis Institute of Technology & Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-V:
BIG DATA TOOLS AND TECHNIQUES: Installing and Running Pig, Comparison with Databases, Pig Latin, User-Defined Functions, Data Processing Operators, Installing and Running Hive, Hive QL, Querying Data, User-Defined Functions, Oracle Big Data.

December 13, 2023


Hadoop Related Subprojects
Pig
 High-level language for data analysis
HBase
 Table storage for semi-structured data
Zookeeper
 Coordinating distributed applications
Hive
 SQL-like Query language and Metastore
Mahout
 Machine learning
Pig
Started at Yahoo! Research
Now runs about 30% of Yahoo!'s jobs
Features
Expresses sequences of MapReduce jobs
Data model: nested “bags” of items
Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Dataflow:
Load Users, Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In MapReduce
(The equivalent hand-written MapReduce Java code is far longer; it is not reproduced here.)

In Pig Latin
Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
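To see concretely what the Pig Latin pipeline computes, here is a plain-Python sketch of the same filter/join/group/count/order/limit dataflow, run on a few hypothetical in-memory records (the sample names, ages, and URLs are made up for illustration):

```python
# Simulate the Pig pipeline on toy data (illustrative values only).
from collections import Counter

users = [("alice", 20), ("bob", 30), ("carol", 19)]          # (name, age)
pages = [("alice", "/a"), ("carol", "/a"), ("carol", "/b"),  # (user, url)
         ("bob", "/c")]

# filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# join Filtered by name, Pages by user; group by url; count clicks
clicks = Counter(url for user, url in pages if user in filtered)

# order by clicks desc; limit 5
top5 = clicks.most_common(5)
print(top5)  # [('/a', 2), ('/b', 1)]
```

In Pig, each of these steps is a declarative statement that the engine compiles into MapReduce jobs; the Python version only mimics the semantics on one machine.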
Ease of Translation
Each step in the dataflow maps to one Pig Latin statement:
Load Users       →  Users = load …
Load Pages       →  Pages = load …
Filter by age    →  Fltrd = filter …
Join on name     →  Joined = join …
Group on url     →  Grouped = group …
Count clicks     →  Summed = … count() …
Order by clicks  →  Sorted = order …
Take top 5       →  Top5 = limit …
Ease of Translation
Pig compiles the same pipeline into three MapReduce jobs:
Job 1: Load Users, Load Pages, Filter by age, Join on name
Job 2: Group on url, Count clicks
Job 3: Order by clicks, Take top 5
HBase - What?
Modeled on Google's Bigtable
Row/column store
Billions of rows, millions of columns
Column-oriented - nulls are free
Untyped - stores byte[]
HBase - Data Model

Row         Timestamp   Column family animal:        Column family repairs:
                        animal:type   animal:size    repairs:cost
enclosure1  t2          zebra                        1000 EUR
            t1          lion          big
enclosure2  …           …             …              …
HBase - Data Storage
Column family animal:
(enclosure1, t2, animal:type) → zebra
(enclosure1, t1, animal:size) → big
(enclosure1, t1, animal:type) → lion

Column family repairs:
(enclosure1, t1, repairs:cost) → 1000 EUR
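The storage layout above can be modeled as a map from (row, column, timestamp) keys to untyped byte values, where a read returns the newest version of a cell. A minimal sketch, with all function names being illustrative rather than any real HBase API:

```python
# Toy model of HBase's cell storage: keys are (row, column, timestamp),
# values are raw bytes, and reads return the most recent version.
store = {}

def put(row, column, ts, value: bytes):
    store[(row, column, ts)] = value

def get_latest(row, column):
    versions = [(ts, v) for (r, c, ts), v in store.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None  # highest timestamp wins

put("enclosure1", "animal:type", 1, b"lion")
put("enclosure1", "animal:size", 1, b"big")
put("enclosure1", "animal:type", 2, b"zebra")

print(get_latest("enclosure1", "animal:type"))  # b'zebra' (newest version)
```

This also shows why "nulls are free" in a column-oriented store: a cell that was never written simply has no key, so it occupies no space.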
HBase - Code
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying
Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();

Retrieve a row
RowResult = table.getRow("enclosure1");

Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
Hive
Developed at Facebook
Used for the majority of Facebook jobs
"Relational database" built on Hadoop
Maintains list of table schemas
SQL-like query language (HiveQL)
Can call Hadoop Streaming scripts from HiveQL
Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
    /hive/page_view/dt=2008-06-08,country=CA
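The point of this layout is partition pruning: because the partition values appear in the path, a query with a predicate on dt or country can skip whole directories without reading them. A sketch of the idea (the flat `dt=…,country=…` layout mirrors the example above; real Hive nests each partition key as its own directory):

```python
# Sketch: map (dt, country) partition values to paths, then prune
# partitions for a query with a dt predicate.
def partition_path(table, dt, country):
    return f"/hive/{table}/dt={dt},country={country}"

partitions = [("2008-06-08", "USA"), ("2008-06-08", "CA"),
              ("2008-03-03", "USA")]

# A query "WHERE dt = '2008-03-03'" only has to read matching partitions:
pruned = [partition_path("page_view", dt, c)
          for dt, c in partitions if dt == "2008-03-03"]
print(pruned)  # ['/hive/page_view/dt=2008-03-03,country=USA']
```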
A Simple Query
Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the partitions for March 2008 (dt=2008-03-01,* through dt=2008-03-31,*) instead of scanning the entire table.
Aggregation and Joins
Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

(Note the clause order: WHERE must precede GROUP BY.)
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
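Hive's TRANSFORM streams the selected columns to the script as tab-separated lines on stdin and parses tab-separated lines from its stdout back into the AS columns. A hedged sketch of what a `map_script.py` matching the query above might look like (the actual script is not shown in the slides; this one simply swaps the column order to produce (dt, uid)):

```python
# Hypothetical streaming mapper for the TRANSFORM query: reads
# tab-separated (userid, date) rows from stdin, emits (date, userid).
import sys

def transform(lines):
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

if __name__ == "__main__":
    for out in transform(sys.stdin):
        print(out)
```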
Storm
Developed by BackType, which was acquired by Twitter
Lots of tools exist for batch data processing:
 Hadoop, Pig, HBase, Hive, …
None of them are realtime systems, and realtime processing is becoming a real requirement for businesses
Storm provides realtime computation:
 Scalable
 Guarantees no data loss
 Extremely robust and fault-tolerant
 Programming language agnostic
Before Storm
(Architecture diagram in the original slides.)

Before Storm - Adding a worker
Deploy
Reconfigure/Redeploy

Problems
Scaling is painful
Poor fault-tolerance
Coding is tedious

What we want
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
"Just works" !!
Storm Cluster
Master node (similar to the Hadoop JobTracker)
ZooKeeper: used for cluster coordination
Worker nodes: run worker processes
Concepts
Streams
Spouts
Bolts
Topologies

Streams
Tuple, Tuple, Tuple, Tuple, …
Unbounded sequence of tuples
Spouts
Source of streams

Bolts
Process input streams and produce new streams
Can implement functions such as filters, aggregation, join, etc.
Topology
Network of spouts and bolts
Spouts and bolts execute as many tasks across the cluster
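The spout/bolt pipeline can be illustrated with an in-process word-count sketch (no Storm involved; all names are made up). A spout emits a stream of tuples, a split bolt transforms each tuple into several word tuples, and a count bolt aggregates them:

```python
# Toy single-process simulation of a spout -> split bolt -> count bolt
# topology: streams are just iterators of tuples here.
from collections import Counter

def sentence_spout():
    # Source of the stream: emits one-field tuples.
    yield from [("the cow jumped",), ("the moon",)]

def split_bolt(stream):
    # Consumes sentence tuples, emits one tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    # Aggregates word tuples into counts.
    return Counter(word for (word,) in stream)

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

In a real topology each of these stages would run as many parallel tasks across the cluster, with stream groupings deciding which task receives each tuple.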
Stream Grouping
When a tuple is emitted, which task does it go to?
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
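The four groupings can be sketched as functions that return the destination task id(s) for a tuple, assuming n_tasks parallel tasks (illustrative code, not Storm's API). The key property of fields grouping is that tuples with equal values in the chosen fields always hash to the same task, so per-key state like word counts stays consistent:

```python
# Sketch of stream groupings: each returns the task id(s) that
# receive an emitted tuple, out of n_tasks parallel tasks.
import random
import zlib

def shuffle_grouping(n_tasks):
    return [random.randrange(n_tasks)]          # random task

def fields_grouping(tup, fields, keys, n_tasks):
    # Hash only the selected fields; deterministic across calls.
    key = "\x00".join(str(tup[fields.index(k)]) for k in keys)
    return [zlib.crc32(key.encode()) % n_tasks]

def all_grouping(n_tasks):
    return list(range(n_tasks))                 # every task

def global_grouping(n_tasks):
    return [0]                                  # lowest task id

# Same "word" field -> same task, regardless of the other fields:
a = fields_grouping(("cow", 1), ["word", "count"], ["word"], 4)
b = fields_grouping(("cow", 9), ["word", "count"], ["word"], 4)
print(a == b)  # True
```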
Demo
Word Count
hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
Hive
hive -f pagerank.hive
