Data Warehousing & Analytics On Hadoop: Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team
Data Warehousing & Analytics On Hadoop: Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team
Data Warehousing & Analytics On Hadoop: Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team
Filers
Hive on
Oracle RAC Federated MySQL
Hadoop Cluster
HIVE: Components
Mgmt. Web UI
Hive CLI
Browsing Queries DDL
Hive QL
SerDe
Thrift Jute JSON..
MetaStore
Data Model
#Buckets=32
Schema Bucketing Info
Partitioning Cols
Library
Hash clicks
Partitioning
Logical Partitioning
…
/hive/clicks
/hive/clicks/ds=2008-03-25 Tables
/hive/clicks/ds=2008-03-25/0
HDFS MetaStore
Dealing with Structured Data
Type system
– Primitive types
– Recursively build up using Composition/Maps/Lists
Generic (De)Serialization Interface (SerDe)
– To recursively list schema
– To recursively access fields within a row object
Serialization families implement interface
– Thrift DDL based SerDe
– Delimited text based SerDe
– You can write your own SerDe
Schema Evolution
MetaStore
DDL:
– create table/drop table/rename table
– alter table add column
Browsing:
– show tables
– describe table
– cat table
Loading Data
Queries
Hive Query Language
Philosophy
– SQL like constructs + Hadoop Streaming
Package these capabilities into a more formal SQL like query language
in next version
Introduce other important constructs:
– Ability to stream data thru custom mappers/reducers
– Multi table inserts
– Multiple group bys
– SQL like column expressions and some XPath like expressions
– Etc..
Joins
• Joins
FROM page_view pv JOIN user u ON (pv.userid = u.id)
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
WHERE pv.date = 2008-03-03;
• Outer Joins
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
WHERE pv.date = 2008-03-03;
Aggregations and Multi-Table Inserts
FROM pv_users
INSERT INTO TABLE pv_gender_uu
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY(pv_users.gender)
INSERT INTO TABLE pv_ip_uu
SELECT pv_users.ip, count(DISTINCT pv_users.id)
GROUP BY(pv_users.ip);
Running Custom Map/Reduce Scripts
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date) USING
'map_script'
AS(dt, uid)
CLUSTER BY(dt)) map
INSERT INTO TABLE pv_users_reduced
SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script'
AS (date, count);
Inserts into Files, Tables and Local Files
FROM pv_users
INSERT INTO TABLE pv_gender_sum
SELECT pv_users.gender, count_distinct(pv_users.userid)
GROUP BY(pv_users.gender)
INSERT INTO DIRECTORY ‘/user/facebook/tmp/pv_age_sum.dir’
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY(pv_users.age)
INSERT INTO LOCAL DIRECTORY ‘/home/me/pv_age_sum.dir’
FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY \013
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY(pv_users.age);
Hadoop Usage @ Facebook
Types of Applications:
– Summarization
Eg: Daily/Weekly aggregations of impression/click counts
– Ad hoc Analysis
Eg: how many group admins broken down by state/country
– Data Mining (Assembling training data)
Eg: User Engagement as a function of user attributes
Hadoop Usage @ Facebook
Usage statistics:
– Total Users: ~140 (about 50% of engineering !) in the last 1 ½
months
– Hive Data (compressed): 80 TB total, ~1TB incoming per day
– Job statistics:
~1000 jobs/day
~100 loader jobs/day
Hadoop Improvements @ Facebook
Some problems:
– No Fair Sharing: Big tasks can hog the cluster
– No snapshots: What if a software bug corrupts the NameNode
transaction log
Solutions:
– Simple fair sharing (Matie Zaharia)
– Investigating Snapshots (Dhrubha Bortharkur)
Conclusion
JIRA http://issues.apache.org/jira/browse/HADOOP-3601
Soon to be checked into hadoop trunk
Release available in hadoop version 0.19
People:
– Suresh Anthony
– Zheng Shao
– Prasad Chakka
– Pete Wyckoff
– Namit Jain
– Raghu Murthy
– Joydeep Sen Sarma
– Ashish Thusoo