12 SparkAggregatingData
201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Aggregating Data with Pair RDDs
Chapter Topics
Pair RDDs
Creating Pair RDDs
- The first step in most workflows is to get the data into key/value form
  - What should the RDD be keyed on?
  - What is the value?
- Commonly used functions to create Pair RDDs
  - map
  - flatMap / flatMapValues
  - keyBy
Example: A Simple Pair RDD
user001\tFred Flintstone -> (user001,Fred Flintstone)
user090\tBugs Bunny -> (user090,Bugs Bunny)
user111\tHarry Potter -> (user111,Harry Potter)
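The transformation above can be sketched in plain Python, assuming tab-separated input as shown. The sketch uses ordinary lists so it runs without a Spark installation; `to_pair` is a hypothetical helper name, and with an RDD the same function would typically be passed to `map`.

```python
# Plain-Python sketch (no Spark needed): turn "key<TAB>value" lines into pairs.
lines = [
    "user001\tFred Flintstone",
    "user090\tBugs Bunny",
    "user111\tHarry Potter",
]

def to_pair(line):
    """Split one line at the first tab into a (key, value) tuple."""
    key, value = line.split("\t", 1)
    return (key, value)

# With an RDD this would be roughly: sc.textFile(file).map(to_pair)
pairs = [to_pair(line) for line in lines]
print(pairs)
```

Returning a tuple rather than the list produced by `split` is what makes each element usable as a key/value pair.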
Example: Keying Web Logs by User ID
Python
> sc.textFile(logfile) \
  .keyBy(lambda line: line.split(' ')[2])
Scala
> sc.textFile(logfile).
  keyBy(line => line.split(' ')(2))
Sample log lines (the User ID is the number after the IP address)
56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0"
56.38.234.188 99788 "GET /theme.css HTTP/1.0"
203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0"
Question 1: Pairs With Complex Values
Input
00210 43.005895 -71.013202
00211 43.005895 -71.013202
00212 43.005895 -71.013202
00213 43.005895 -71.013202
00214 43.005895 -71.013202
?
Result
(00210,(43.005895,-71.013202))
(00211,(43.005895,-71.013202))
(00212,(43.005895,-71.013202))
(00213,(43.005895,-71.013202))
Answer 1: Pairs With Complex Values
Python
> sc.textFile(file) \
  .map(lambda line: line.split()) \
  .map(lambda fields: (fields[0],(fields[1],fields[2])))
Scala
> sc.textFile(file).
  map(line => line.split('\t')).
  map(fields => (fields(0),(fields(1),fields(2))))
00210 43.005895 -71.013202 -> (00210,(43.005895,-71.013202))
01014 42.170731 -72.604842 -> (01014,(42.170731,-72.604842))
01062 42.324232 -72.67915 -> (01062,(42.324232,-72.67915))
01263 42.3929 -73.228483 -> (01263,(42.3929,-73.228483))
Question 2: Mapping Single Rows to Multiple Pairs (1)
Question 2: Mapping Single Rows to Multiple Pairs (2)
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
(00001,(sku010,sku933,sku022))
(00002,(sku912,sku331))
(00003,(sku888,sku022,sku010,sku594))
(00004,(sku411))
Answer 2: Mapping Single Rows to Multiple Pairs (1)
> sc.textFile(file)
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Answer 2: Mapping Single Rows to Multiple Pairs (2)
> sc.textFile(file) \
  .map(lambda line: line.split('\t'))
Input
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
After split
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
Note that split returns 2-element arrays, not pairs/tuples
Answer 2: Mapping Single Rows to Multiple Pairs (3)
> sc.textFile(file) \
  .map(lambda line: line.split('\t')) \
  .map(lambda fields: (fields[0],fields[1]))
Input
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
After split
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
After map
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)
Map array elements to tuples to produce a Pair RDD
Answer 2: Mapping Single Rows to Multiple Pairs (4)
> sc.textFile(file) \
  .map(lambda line: line.split('\t')) \
  .map(lambda fields: (fields[0],fields[1])) \
  .flatMapValues(lambda skus: skus.split(':'))
Input
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
After split
[00001,sku010:sku933:sku022]
[00002,sku912:sku331]
[00003,sku888:sku022:sku010:sku594]
[00004,sku411]
After map
(00001,sku010:sku933:sku022)
(00002,sku912:sku331)
(00003,sku888:sku022:sku010:sku594)
(00004,sku411)
After flatMapValues
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
...
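To make the last step concrete, here is a plain-Python sketch of what `flatMapValues` does to each pair. It runs without Spark; `flat_map_values` is a hypothetical stand-in written for illustration, not Spark's implementation, and the sample orders are a subset of the data above.

```python
def flat_map_values(pairs, f):
    """Mimic flatMapValues: apply f to each value and emit one
    (key, item) pair per item f returns, keeping the key unchanged."""
    return [(key, item) for key, value in pairs for item in f(value)]

orders = [
    ("00001", "sku010:sku933:sku022"),
    ("00002", "sku912:sku331"),
    ("00004", "sku411"),
]

# One input pair can become several output pairs, all with the same key.
order_skus = flat_map_values(orders, lambda skus: skus.split(":"))
for pair in order_skus:
    print(pair)
```

The key is never passed to the function, which is why the order ID carries through to every SKU untouched.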
Chapter Topics
Map-Reduce
Map-Reduce in Spark
Map-Reduce Example: Word Count
Input Data
the cat sat on the mat
the aardvark sat on the sofa
?
Result
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
Example: Word Count (1)
Example: Word Count (2)
Example: Word Count (3)
Example: Word Count (4)
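The word count walk-through can be approximated with a plain-Python sketch that runs without Spark. Each stage is named after the Spark operation it stands in for (`flatMap`, `map`, `reduceByKey`); the variable names are illustrative, and the input is the two-line sample from the Word Count example.

```python
from collections import defaultdict

lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

# "flatMap" stage: split every line into words, giving one flat list.
words = [word for line in lines for word in line.split()]

# "map" stage: emit a (word, 1) pair for every word.
pairs = [(word, 1) for word in words]

# "reduceByKey" stage: add up the values that share the same key.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

for word in sorted(counts):
    print(word, counts[word])
```

In Spark the three stages would be chained on an RDD and the per-key addition would run in parallel across partitions; the sketch only shows the data flow.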
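ReduceByKey (1)
[Diagram: the values for each key are combined two at a time]
Input pairs: (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1) (the,1) (aardvark,1) (sat,1) (on,1) (the,1)
Intermediate values for "the": (the,2) -> (the,3) -> (the,4)
Result: (on,2) (sofa,1) (mat,1) (aardvark,1) (the,4) (cat,1) (sat,2)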
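ReduceByKey (2)
[Diagram: the values for each key are combined two at a time, in a different order]
Input pairs: (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1) (the,1) (aardvark,1) (sat,1) (on,1) (the,1)
Intermediate values for "the": (the,2) and (the,2) -> (the,4)
Result: (on,2) (sofa,1) (mat,1) (aardvark,1) (the,4) (cat,1) (sat,2)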
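The two ReduceByKey diagrams combine the values for "the" in different orders and still arrive at 4. A small plain-Python sketch of why, assuming addition as the reduce function: `reduceByKey` expects a function that is associative and commutative, so the grouping of intermediate results does not change the answer.

```python
from functools import reduce

values_for_the = [1, 1, 1, 1]  # the four (the,1) pairs

def add(v1, v2):
    """The reduce function: combines two values for the same key."""
    return v1 + v2

# One order: fold the values in one at a time (2, then 3, then 4).
sequential = reduce(add, values_for_the)

# Another order: reduce two halves separately, then merge the partial sums.
left = reduce(add, values_for_the[:2])
right = reduce(add, values_for_the[2:])
merged = add(left, right)

print(sequential, merged)
```

A function such as subtraction would give different results for the two orders, which is why it would not be a safe choice here.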
Word Count Recap (the Scala Version)
Why Do We Care About Counting Words?
Chapter Topics
Pair RDD Operations
Example: Pair RDD Operations
Input pairs
(00001,sku010)
(00001,sku933)
(00001,sku022)
(00002,sku912)
(00002,sku331)
(00003,sku888)
sortByKey(ascending=False)
(00004,sku411)
(00003,sku888)
(00003,sku022)
(00003,sku010)
(00003,sku594)
(00002,sku912)
groupByKey
(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
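The two operations in this example, sorting by key in descending order and grouping values by key, can be mimicked in plain Python as follows. This is only a sketch of the semantics on a small list, not Spark code; the sample pairs are a subset of the data above.

```python
from collections import defaultdict

pairs = [
    ("00004", "sku411"), ("00001", "sku010"), ("00001", "sku933"),
    ("00001", "sku022"), ("00002", "sku912"), ("00002", "sku331"),
    ("00003", "sku888"),
]

# Like sortByKey(ascending=False): order the pairs by key, highest first.
by_key_desc = sorted(pairs, key=lambda kv: kv[0], reverse=True)

# Like groupByKey: collect all values that share a key into one list.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(by_key_desc[0])
print(dict(grouped))
```

In this sketch the values inside each group keep their input order; on a distributed dataset the order of values within a group should not be relied on.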
Example: Joining by Key
(Casablanca,($3.7M,1942))
(Star Wars,($775M,1977))
(Annie Hall,($38M,1977))
(Argo,($232M,2012))
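The joined result above pairs two values under each movie title. A plain-Python sketch of an inner join by key follows; the two input lists are assumptions inferred from the joined output (one keyed to gross, one keyed to year), and `join` is a hypothetical helper that only imitates the shape of the result a Pair RDD `join` produces.

```python
def join(left, right):
    """Mimic an inner join of two lists of (key, value) pairs:
    emit (key, (left_value, right_value)) for every matching key."""
    return [(lk, (lv, rv)) for lk, lv in left for rk, rv in right if lk == rk]

# Assumed inputs, reconstructed from the joined result.
gross = [("Casablanca", "$3.7M"), ("Star Wars", "$775M"),
         ("Annie Hall", "$38M"), ("Argo", "$232M")]
year = [("Casablanca", 1942), ("Star Wars", 1977),
        ("Annie Hall", 1977), ("Argo", 2012)]

joined = join(gross, year)
for row in joined:
    print(row)
```

Keys present in only one of the inputs would be dropped by this inner join.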
Using Join
Example: Join Web Log With Knowledge Base Articles (1)
weblogs
56.38.234.188 99788 "GET /KBDOC-00157.html HTTP/1.0"
56.38.234.188 99788 "GET /theme.css HTTP/1.0"
203.146.17.59 25254 "GET /KBDOC-00230.html HTTP/1.0"
221.78.60.155 45402 "GET /titanic_4000_sales.html HTTP/1.0"
65.187.255.81 14242 "GET /KBDOC-00107.html HTTP/1.0"
Example: Join Web Log With Knowledge Base Articles (2)
Steps
1. Map separate datasets into key-value Pair RDDs
   a. Map web log requests to (docid,userid)
   b. Map KB Doc index to (docid,title)
2. Join by key: docid
3. Map joined data into the desired format: (userid,title)
4. Further processing: group titles by User ID
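The steps above can be sketched end to end in plain Python, without Spark. Several things here are assumptions made for illustration: the sample log lines include a "-" placeholder field so that the user ID is the third space-separated field (index 2), the KB index is given directly as (docid,title) pairs, and `get_request_doc` is a hypothetical variant of the helper that returns None when a line mentions no KB document.

```python
import re
from collections import defaultdict

# Assumed sample data; the "-" field is an assumption (see lead-in).
weblogs = [
    '56.38.234.188 - 99788 "GET /KBDOC-00157.html HTTP/1.0"',
    '56.38.234.188 - 99788 "GET /theme.css HTTP/1.0"',
    '203.146.17.59 - 25254 "GET /KBDOC-00230.html HTTP/1.0"',
]
# Step 1b (assumed already done): KB index as (docid, title) pairs.
kblist = [
    ("KBDOC-00157", "Ronin Novelty Note 3 - Back up files"),
    ("KBDOC-00230", "Sorrento F33L - Transfer Contacts"),
]

def get_request_doc(line):
    """Return the KBDOC id found in a log line, or None if there is none."""
    match = re.search(r"KBDOC-[0-9]*", line)
    return match.group() if match else None

# Step 1a: (docid, userid) for requests that mention a KB document.
kbreqs = [(get_request_doc(line), line.split(" ")[2])
          for line in weblogs if get_request_doc(line) is not None]

# Step 2: inner join on docid, giving (docid, (userid, title)).
joined = [(d, (u, t)) for d, u in kbreqs for d2, t in kblist if d == d2]

# Step 3: keep only (userid, title).
user_titles = [(u, t) for _, (u, t) in joined]

# Step 4: group titles by user ID.
titles_by_user = defaultdict(list)
for user, title in user_titles:
    titles_by_user[user].append(title)

print(dict(titles_by_user))
```

Requests for non-KB resources such as the stylesheet are filtered out in step 1a, so they never reach the join.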
Step 1a: Map Web Log Requests to (docid,userid)
> import re
> def getRequestDoc(s):
      return re.search(r'KBDOC-[0-9]*',s).group()
Step 1b: Map KB Index to (docid,title)
kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
Step 2: Join By Key docid
kbreqs
(KBDOC-00157,99788)
(KBDOC-00230,25254)
(KBDOC-00107,14242)
kblist
(KBDOC-00157,Ronin Novelty Note 3 - Back up files)
(KBDOC-00230,Sorrento F33L - Transfer Contacts)
(KBDOC-00050,Titanic 1000 - Transfer Contacts)
(KBDOC-00107,MeeToo 5.0 - Transfer Contacts)
Step 3: Map Result to Desired Format (userid,title)
Step 4: Continue Processing - Group Titles by User ID
Example Output
Aside: Anonymous Function Parameters
Python and Scala pattern matching can help improve code readability
(KBDOC-00157,(99788,title)) -> (99788,title)
(KBDOC-00230,(25254,title)) -> (25254,title)
(KBDOC-00107,(14242,title)) -> (14242,title)
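A short plain-Python sketch of the readability difference, using the joined pairs above as sample data. One caveat worth stating: Python 3 no longer accepts tuple patterns in a function or lambda signature, so this sketch unpacks inside a small named function instead; `to_user_title` is a hypothetical name.

```python
joined = [
    ("KBDOC-00157", ("99788", "title")),
    ("KBDOC-00230", ("25254", "title")),
]

# Positional indexing works, but hides what each field means.
by_index = [(row[1][0], row[1][1]) for row in joined]

def to_user_title(row):
    """Unpack the nested tuple into named parts before rebuilding it."""
    docid, (userid, title) = row
    return (userid, title)

by_name = [to_user_title(row) for row in joined]
print(by_name)
```

Both versions produce the same pairs; the second one documents the structure of each element in the code itself.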
Other Pair Operations
Chapter Topics
Essential Points
- Pair RDDs are a special form of RDD consisting of Key-Value pairs (tuples)
- Spark provides several operations for working with Pair RDDs
- Map-reduce is a generic programming model for distributed processing
  - Spark implements map-reduce with Pair RDDs
  - Hadoop MapReduce and other implementations are limited to a single map and single reduce phase per job
  - Spark allows flexible chaining of map and reduce operations
  - Spark provides operations to easily perform common map-reduce algorithms like joining, sorting, and grouping
Chapter Topics
Homework: Use Pair RDDs to Join Two Datasets