The document discusses several big data tools and techniques, including Pig, Hive, HBase, Zookeeper, Mahout, and Storm. It provides an overview of what each tool is used for, how it works, and example code snippets. Key points include: Pig is a data flow language for expressing data analysis workflows as sequences of MapReduce jobs, and it allows relational operations on nested data structures. Hive provides a SQL-like interface to analyze large datasets stored in Hadoop; it includes a metastore to manage metadata and supports partitioning, clustering, and complex data types. HBase is a column-oriented NoSQL database modeled after Bigtable that stores billions of rows and millions of columns.

Acropolis Institute of Technology & Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-V:
BIG DATA TOOLS AND TECHNIQUES: Installing and Running Pig, Comparison with Databases, Pig Latin, User-Defined Functions, Data Processing Operators, Installing and Running Hive, Hive QL, Querying Data, User-Defined Functions, Oracle Big Data.

December 13, 2023


Hadoop Related Subprojects
Pig
 High-level language for data analysis
HBase
 Table storage for semi-structured data
Zookeeper
 Coordinating distributed applications
Hive
 SQL-like Query language and Metastore
Mahout
 Machine learning
Pig
Started at Yahoo! Research
Now runs about 30% of Yahoo!'s jobs
Features
Expresses sequences of MapReduce jobs
Data model: nested “bags” of items
Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Dataflow:
Load Users, Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In MapReduce
(The equivalent hand-written MapReduce Java code is far longer; it is not reproduced here.)

In Pig Latin
Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
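To see concretely what the Pig Latin pipeline computes, here is a plain-Python sketch of the same filter/join/group/count/order/limit dataflow, run on a few hypothetical in-memory records (the sample names, ages, and URLs are made up for illustration):

```python
# Simulate the Pig pipeline on toy data (illustrative values only).
from collections import Counter

users = [("alice", 20), ("bob", 30), ("carol", 19)]          # (name, age)
pages = [("alice", "/a"), ("carol", "/a"), ("carol", "/b"),  # (user, url)
         ("bob", "/c")]

# filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# join Filtered by name, Pages by user; group by url; count clicks
clicks = Counter(url for user, url in pages if user in filtered)

# order by clicks desc; limit 5
top5 = clicks.most_common(5)
print(top5)  # [('/a', 2), ('/b', 1)]
```

In Pig, each of these steps is a declarative statement that the engine compiles into MapReduce jobs; the Python version only mimics the semantics on one machine.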
Ease of Translation
Each step in the dataflow maps to one Pig Latin statement:
Load Users       →  Users = load …
Load Pages       →  Pages = load …
Filter by age    →  Fltrd = filter …
Join on name     →  Joined = join …
Group on url     →  Grouped = group …
Count clicks     →  Summed = … count() …
Order by clicks  →  Sorted = order …
Take top 5       →  Top5 = limit …
Ease of Translation
Pig compiles the same pipeline into three MapReduce jobs:
Job 1: Load Users, Load Pages, Filter by age, Join on name
Job 2: Group on url, Count clicks
Job 3: Order by clicks, Take top 5
HBase - What?
Modeled on Google's Bigtable
Row/column store
Billions of rows, millions of columns
Column-oriented - nulls are free
Untyped - stores byte[]
HBase - Data Model

Row         Timestamp   Column family animal:        Column family repairs:
                        animal:type   animal:size    repairs:cost
enclosure1  t2          zebra                        1000 EUR
            t1          lion          big
enclosure2  …           …             …              …
HBase - Data Storage
Column family animal:
(enclosure1, t2, animal:type) → zebra
(enclosure1, t1, animal:size) → big
(enclosure1, t1, animal:type) → lion

Column family repairs:
(enclosure1, t1, repairs:cost) → 1000 EUR
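The storage layout above can be modeled as a map from (row, column, timestamp) keys to untyped byte values, where a read returns the newest version of a cell. A minimal sketch, with all function names being illustrative rather than any real HBase API:

```python
# Toy model of HBase's cell storage: keys are (row, column, timestamp),
# values are raw bytes, and reads return the most recent version.
store = {}

def put(row, column, ts, value: bytes):
    store[(row, column, ts)] = value

def get_latest(row, column):
    versions = [(ts, v) for (r, c, ts), v in store.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None  # highest timestamp wins

put("enclosure1", "animal:type", 1, b"lion")
put("enclosure1", "animal:size", 1, b"big")
put("enclosure1", "animal:type", 2, b"zebra")

print(get_latest("enclosure1", "animal:type"))  # b'zebra' (newest version)
```

This also shows why "nulls are free" in a column-oriented store: a cell that was never written simply has no key, so it occupies no space.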
HBase - Code
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying
Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();

Retrieve a row
RowResult = table.getRow("enclosure1");

Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
Hive
Developed at Facebook
Used for the majority of Facebook jobs
"Relational database" built on Hadoop
Maintains list of table schemas
SQL-like query language (HiveQL)
Can call Hadoop Streaming scripts from HiveQL
Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
    /hive/page_view/dt=2008-06-08,country=CA
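The point of this layout is partition pruning: because the partition values appear in the path, a query with a predicate on dt or country can skip whole directories without reading them. A sketch of the idea (the flat `dt=…,country=…` layout mirrors the example above; real Hive nests each partition key as its own directory):

```python
# Sketch: map (dt, country) partition values to paths, then prune
# partitions for a query with a dt predicate.
def partition_path(table, dt, country):
    return f"/hive/{table}/dt={dt},country={country}"

partitions = [("2008-06-08", "USA"), ("2008-06-08", "CA"),
              ("2008-03-03", "USA")]

# A query "WHERE dt = '2008-03-03'" only has to read matching partitions:
pruned = [partition_path("page_view", dt, c)
          for dt, c in partitions if dt == "2008-03-03"]
print(pruned)  # ['/hive/page_view/dt=2008-03-03,country=USA']
```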
A Simple Query
Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the partitions for March 2008 (dt=2008-03-01,* through dt=2008-03-31,*) instead of scanning the entire table.
Aggregation and Joins
Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

(Note the clause order: WHERE must precede GROUP BY.)
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
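Hive's TRANSFORM streams the selected columns to the script as tab-separated lines on stdin and parses tab-separated lines from its stdout back into the AS columns. A hedged sketch of what a `map_script.py` matching the query above might look like (the actual script is not shown in the slides; this one simply swaps the column order to produce (dt, uid)):

```python
# Hypothetical streaming mapper for the TRANSFORM query: reads
# tab-separated (userid, date) rows from stdin, emits (date, userid).
import sys

def transform(lines):
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

if __name__ == "__main__":
    for out in transform(sys.stdin):
        print(out)
```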
Storm
Developed by BackType, which was acquired by Twitter
Lots of tools exist for batch data processing:
 Hadoop, Pig, HBase, Hive, …
None of them are realtime systems, and realtime processing is becoming a real requirement for businesses
Storm provides realtime computation:
 Scalable
 Guarantees no data loss
 Extremely robust and fault-tolerant
 Programming language agnostic
Before Storm
(Architecture diagram in the original slides.)

Before Storm - Adding a worker
Deploy
Reconfigure/Redeploy

Problems
Scaling is painful
Poor fault-tolerance
Coding is tedious

What we want
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
"Just works" !!
Storm Cluster
Master node (similar to the Hadoop JobTracker)
ZooKeeper: used for cluster coordination
Worker nodes: run worker processes
Concepts
Streams
Spouts
Bolts
Topologies

Streams
Tuple, Tuple, Tuple, Tuple, …
Unbounded sequence of tuples
Spouts
Source of streams

Bolts
Process input streams and produce new streams
Can implement functions such as filters, aggregation, join, etc.
Topology
Network of spouts and bolts
Spouts and bolts execute as many tasks across the cluster
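The spout/bolt pipeline can be illustrated with an in-process word-count sketch (no Storm involved; all names are made up). A spout emits a stream of tuples, a split bolt transforms each tuple into several word tuples, and a count bolt aggregates them:

```python
# Toy single-process simulation of a spout -> split bolt -> count bolt
# topology: streams are just iterators of tuples here.
from collections import Counter

def sentence_spout():
    # Source of the stream: emits one-field tuples.
    yield from [("the cow jumped",), ("the moon",)]

def split_bolt(stream):
    # Consumes sentence tuples, emits one tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    # Aggregates word tuples into counts.
    return Counter(word for (word,) in stream)

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

In a real topology each of these stages would run as many parallel tasks across the cluster, with stream groupings deciding which task receives each tuple.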
Stream Grouping
When a tuple is emitted, which task does it go to?
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
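The four groupings can be sketched as functions that return the destination task id(s) for a tuple, assuming n_tasks parallel tasks (illustrative code, not Storm's API). The key property of fields grouping is that tuples with equal values in the chosen fields always hash to the same task, so per-key state like word counts stays consistent:

```python
# Sketch of stream groupings: each returns the task id(s) that
# receive an emitted tuple, out of n_tasks parallel tasks.
import random
import zlib

def shuffle_grouping(n_tasks):
    return [random.randrange(n_tasks)]          # random task

def fields_grouping(tup, fields, keys, n_tasks):
    # Hash only the selected fields; deterministic across calls.
    key = "\x00".join(str(tup[fields.index(k)]) for k in keys)
    return [zlib.crc32(key.encode()) % n_tasks]

def all_grouping(n_tasks):
    return list(range(n_tasks))                 # every task

def global_grouping(n_tasks):
    return [0]                                  # lowest task id

# Same "word" field -> same task, regardless of the other fields:
a = fields_grouping(("cow", 1), ["word", "count"], ["word"], 4)
b = fields_grouping(("cow", 9), ["word", "count"], ["word"], 4)
print(a == b)  # True
```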
Demo
Word Count
hadoop jar hadoop-0.20.2-examples.jar wordcount <input dir> <output dir>
Hive
hive -f pagerank.hive
