Developer Exercise Instructions

This document describes hands-on exercises for using HDFS and MapReduce with Hadoop. The first exercise has users explore HDFS by listing directories and uploading sample Shakespeare data files. Users learn how to browse HDFS with the hadoop fs command and upload local files into HDFS for distributed storage and processing.


201212

Cloudera Developer Training for
Apache Hadoop:
Hands-On Exercises

General Notes
Hands-On Exercise: Using HDFS
Hands-On Exercise: Running a MapReduce Job
Hands-On Exercise: Writing a MapReduce Program
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework
Hands-On Exercise: Writing and Implementing a Combiner
Hands-On Exercise: Writing a Partitioner
Hands-On Exercise: Using Counters and a Map-Only Job
Hands-On Exercise: Using SequenceFiles and File Compression
Hands-On Exercise: Creating an Inverted Index
Hands-On Exercise: Calculating Word Co-Occurrence
Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
Hands-On Exercise: Importing Data With Sqoop
Hands-On Exercise: Using a Mahout Recommender
Hands-On Exercise: Manipulating Data With Hive
Hands-On Exercise: Using Pig to Retrieve Movie Names From Our Recommender
Hands-On Exercise: Running an Oozie Workflow

Copyright © 2010-2012 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.

General Notes

Cloudera's training courses use a Virtual Machine running the CentOS 6.3 Linux distribution. This VM has CDH4.1 (Cloudera's Distribution, including Apache Hadoop, version 4.1) installed in Pseudo-Distributed mode. Pseudo-Distributed mode is a method of running Hadoop whereby all five Hadoop daemons run on the same machine. It is, essentially, a cluster consisting of a single machine. It works just like a larger Hadoop cluster, the only key difference (apart from speed, of course!) being that the block replication factor is set to 1, since there is only a single DataNode available.

Points to note while working in the VM

1. The VM is set to log in automatically as the user training. Should you log out at any time, you can log back in as the user training with the password training.

2. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password, since the training user has unlimited sudo privileges.

3. In some command-line steps in the exercises, you will see lines like this:

$ hadoop fs -put shakespeare \
    /user/training/shakespeare

The backslash at the end of the first line signifies that the command is not complete, and continues on the next line. You can enter the command exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type the backslash.
Points to note during the exercises

1. There are additional challenges for most of the Hands-On Exercises. If you finish the main exercise, please attempt the additional exercise.

2. Sample solutions are always available in the sample_solutions subdirectory of the exercise directory.

3. As the exercises progress and you gain more familiarity with Hadoop and MapReduce, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it's up to you to solve the problem! There are 'stub' files for each exercise to get you started, and you should feel free to ask your instructor for assistance at any time. We also provide hints in many of the exercises. And, of course, you can always consult with your fellow students.

4. If you would like more hints than are provided in the default stub files, you can use the stub files with additional hints in the stubs_with_hints subdirectory of each exercise directory.

5. If you are working in Eclipse and would prefer to use the stub files with additional hints, do the following:

a. Close Eclipse.

b. In a terminal window, change to the /home/training/scripts directory.

c. Run the following command:

$ ./eclipse_projects.sh hint

d. Restart Eclipse.

e. Refresh each of the Eclipse projects. (To refresh a project, right-click the project and select Refresh.)
Hands-On Exercise: Using HDFS

In this exercise you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

Hadoop

Hadoop is already installed, configured, and running on your virtual machine.

Most of your interaction with the system will be through a command-line wrapper called hadoop. If you start a terminal and run this program with no arguments, it prints a help message. To try this, run the following command:

$ hadoop

(Note: although your command prompt is more verbose, we use '$' to indicate the command prompt for brevity's sake.)

The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.

Step 1: Exploring HDFS

The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This subsystem can be invoked with the command hadoop fs.

1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.

2. In the terminal window, enter:

$ hadoop fs

You see a help message describing all the commands associated with this subsystem.

3. Enter:

$ hadoop fs -ls /

This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a "home" directory under this directory, named after their username; your home directory is /user/training.

4. Try viewing the contents of the /user directory by running:

$ hadoop fs -ls /user

You will see your home directory in the directory listing.

5. Try running:

$ hadoop fs -ls /user/training

There are no files yet, so the command silently exits. This is different from running hadoop fs -ls /foo, which refers to a directory that doesn't exist and therefore displays an error message.

Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.
Step 2: Uploading Files

Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.

1. Change directories to the directory containing the sample data we will be using in the course:

$ cd ~/training_materials/developer/data

If you perform a 'regular' ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.

2. Unzip shakespeare.tar.gz by running:

$ tar zxvf shakespeare.tar.gz

This creates a directory named shakespeare/ containing several files on your local filesystem.

3. Insert this directory into HDFS:

$ hadoop fs -put shakespeare /user/training/shakespeare

This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.

4. List the contents of your HDFS home directory now:

$ hadoop fs -ls /user/training

You should see an entry for the shakespeare directory.

5. Now try the same fs -ls command, but without a path argument:

$ hadoop fs -ls

You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.

Relative paths

If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory.

6. We also have a Web server log file, which we will put into HDFS for use in future exercises. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:

$ hadoop fs -mkdir weblog

7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash ('-') in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS:

$ gunzip -c access_log.gz \
    | hadoop fs -put - weblog/access_log

8. Run the hadoop fs -ls command to verify that the Apache log file is in your HDFS home directory.

9. The access log file is quite large: around 500 MB. Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises:

$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
    | hadoop fs -put - testlog/test_access_log

Step 3: Viewing and Manipulating Files

Now let's view some of the data copied into HDFS.

1. Enter:

$ hadoop fs -ls shakespeare

This lists the contents of the /user/training/shakespeare directory, which consists of the files comedies, glossary, histories, poems, and tragedies.

2. The glossary file included in the tarball you began with is not strictly a work of Shakespeare, so let's remove it:

$ hadoop fs -rm shakespeare/glossary

Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.

3. Enter:

$ hadoop fs -cat shakespeare/histories | tail -n 50

This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.

Note that when you pipe the output of the fs -cat command to a local UNIX command, the full contents of the file are still extracted from HDFS and sent to your local machine. Once on your local machine, the file contents are then modified before being displayed.

4. If you want to download a file and manipulate it in the local filesystem, you can use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:

$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt

Other Commands

There are several other commands associated with the FsShell subsystem, which perform most common filesystem manipulations: mv, cp, mkdir, and so on.

1. Enter:

$ hadoop fs

This displays a brief usage report of the commands within FsShell. Try playing around with a few of these commands if you like.

This is the end of the Exercise

Hands-On Exercise: Running a MapReduce Job

In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
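Before compiling the Java version, it can help to see the algorithm stripped to its essentials. The following is a minimal Python sketch of the same logic (our own illustration, not part of the course code): the map step emits a (word, 1) pair for every word, and the reduce step sums the values for each word.

```python
from collections import defaultdict

def map_phase(lines):
    # Like WordMapper: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Like SumReducer: sum the values for each distinct key.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(reduce_phase(map_phase(["to be or not to be"])))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Hadoop performs exactly this computation, but distributes the map and reduce work across the cluster and handles the grouping of values by key for you.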

Compiling and Submitting a MapReduce Job

1. In a terminal window, change to the working directory, and take a directory listing:

$ cd ~/training_materials/developer/exercises/wordcount
$ ls

This directory contains the following Java files:

WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.

Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.

2. Compile the three Java classes:

$ javac -classpath `hadoop classpath` *.java

Note: in the command above, the quotes around hadoop classpath are back quotes. This runs the hadoop classpath command and uses its output as part of the javac command.

Your command includes the classpath for the Hadoop core API classes. The compiled (.class) files are placed in your local directory.

3. Collect your compiled Java files into a JAR file:

$ jar cvf wc.jar *.class

4. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:

$ hadoop jar wc.jar WordCount shakespeare wordcounts

This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (WordCount), and the HDFS input and output directories to use for the MapReduce job.

Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.

5. Try running this same command again without any change:

$ hadoop jar wc.jar WordCount shakespeare wordcounts

Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.

6. Review the result of your MapReduce job:

$ hadoop fs -ls wordcounts

This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)

7. View the contents of the output for your job:

$ hadoop fs -cat wordcounts/part-r-00000 | less

You can page through a few screens to see words and their frequencies in the works of Shakespeare. Note that you could have specified wordcounts/* just as well in this command.

8. Try running the WordCount job against a single file:

$ hadoop jar wc.jar WordCount shakespeare/poems pwords

When the job completes, inspect the contents of the pwords directory.

9. Clean up the output files produced by your job runs:

$ hadoop fs -rm -r wordcounts pwords

Stopping MapReduce Jobs

It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself. The MapReduce job, once submitted to the Hadoop daemons, runs independently of any initiating process.

Losing the connection to the initiating process does not kill a MapReduce job. Instead, you need to tell the Hadoop JobTracker to stop the job.

1. Start another word count job like you did in the previous section:

$ hadoop jar wc.jar WordCount shakespeare count2

2. While this job is running, open another terminal window and enter:

$ mapred job -list

This lists the job ids of all running jobs. A job id looks something like job_200902131742_0002.

3. Copy the job id, and then kill the running job by entering:

$ mapred job -kill jobid

The JobTracker kills the job, and the program running in the original terminal, reporting its progress, informs you that the job has failed.

This is the end of the Exercise

Hands-On Exercise: Writing a MapReduce Program

In this exercise you write a MapReduce job that reads any text input and computes the average length of all words that start with each character. You can write the job in Java or using Hadoop Streaming.

For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for the input:

Now is definitely the time

the output would be:

N 3
d 10
i 2
t 3.5

(For the initial solution, your program can be case-sensitive, which is the case for Java string processing by default.)

The Algorithm

The algorithm for this program is a simple one-pass MapReduce program:

The Mapper

The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for the input value:

Now is definitely the time

your Mapper should emit:

N 3
i 2
d 10
t 3
t 4

The Reducer

Thanks to the sort/shuffle phase built in to MapReduce, the Reducer receives the keys in sorted order, and all the values for one key appear together. So, for the Mapper output above, the Reducer (if written in Java) receives this:

N (3)
d (10)
i (2)
t (3, 4)

If you will be writing your code using Hadoop Streaming, your Reducer would receive the following:

N 3
d 10
i 2
t 3
t 4

For either type of input, the final output should be:

N 3
d 10
i 2
t 3.5
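To check your understanding of the algorithm before writing it in Java or Streaming, here is a small Python simulation of the three phases (a sketch of ours, not the sample solution). Note that Python prints the averages as floats, e.g. 3.0 where the specification above shows 3.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (first letter of word, length of word) for each word.
    return [(word[0], len(word)) for word in line.split()]

def shuffle(pairs):
    # The sort/shuffle phase: sort by key, then group values per key.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(letter, lengths):
    # Emit the letter and the average length of its words.
    return letter, sum(lengths) / len(lengths)

for letter, group in shuffle(mapper("Now is definitely the time")):
    print(*reducer(letter, [length for _, length in group]))
```

Because the input is case-sensitive, 'N' sorts before the lowercase keys, matching the expected output order above.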

Choose Your Language

You can perform this exercise in Java or Hadoop Streaming (or both if you have the time). Your virtual machine has Perl, Python, PHP, and Ruby installed, so you can choose any of these, or even shell scripting, to develop a Streaming solution if you like. Following are a discussion of the program in Java, and then a discussion of the program in Streaming.

If you complete the first part of the exercise, there is a further exercise for you to try; see the Additional Exercise section at the end of this exercise for instructions.

Set up Eclipse

We have created Eclipse projects for each of the Hands-On Exercises. Using Eclipse will speed up your development time.

Even if you do not plan to use Eclipse to develop your code, you still need to set it up. In the "Writing Unit Tests With the MRUnit Framework" lab, you will use Eclipse to run unit tests.

Follow these instructions to set up Eclipse by importing projects into the environment:

1. Launch Eclipse.

2. Select Import from the File menu.

3. Select General -> Existing Projects into Workspace, and click Next.

4. Specify /home/training/workspace in the Select Root Directory field. All the exercise projects will appear in the Projects field.

5. Click OK, then click Finish. That will import all projects into your workspace.

The steps to export Java code to a JAR file and run the code are described in the slides for this chapter. The following is a quick review of the steps you will need to perform to run Hadoop source code developed in Eclipse:

1. Verify that your Java code does not have any compiler errors or warnings.

The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.

2. Right-click the default package entry for the Eclipse project (under the src entry).

3. Select Export.

4. Select Java > JAR File from the Export dialog box, then click Next.

5. Specify a location for the JAR file. You can place your JAR files wherever you like.

6. Run the hadoop jar command as you did previously in the Running a MapReduce Job exercise.

For more information about running a Hadoop job when working in Eclipse, including screen shots, refer to the slides for this chapter.

The Program in Java

Basic stub files for the exercise can be found in ~/training_materials/developer/exercises/averagewordlength. If you like, you can use the wordcount example (in ~/training_materials/developer/exercises/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:

1. Define the driver

This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.

2. Define the Mapper

Note these simple string operations in Java:

str.substring(0, 1) // String: first letter of str
str.length()        // int: length of str

3. Define the Reducer

In a single invocation the Reducer receives a string containing one letter along with an iterator of integers. For this call, the reducer should emit a single output of the letter and the average of the integers.

4. Test your program

Compile, jar, and test your program. You can use the entire Shakespeare dataset for your input, or you can try it with just one of the files in the dataset, or with your own test data.

Solution in Java

The directory ~/training_materials/developer/exercises/averagewordlength/sample_solution contains a set of Java class definitions that solve the problem.

The Program Using Hadoop Streaming

For your Hadoop Streaming program, launch a text editor to write your Mapper script and your Reducer script. Here are some notes about solving the problem in Hadoop Streaming:

1. The Mapper Script

The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:

key <tab> value <newline>

These strings should be written to stdout.

2. The Reducer Script

For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:

t 3
t 4
w 4
w 6

For this input, emit the following to stdout:

t 3.5
w 5

Observe that the reducer receives a key with each input line, and must "notice" when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted.
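One way to structure that key-change logic is shown below, as a hedged Python sketch (one possible approach, not the course's sample solution): the reducer reads tab-separated lines from stdin, accumulates values while the key stays the same, and flushes the group when the key changes or the input ends.

```python
#!/usr/bin/env python
import sys

def emit(key, lengths):
    # Average the collected values and write "key <tab> average".
    avg = sum(lengths) / float(len(lengths))
    print("%s\t%g" % (key, avg))

def reduce_stream(stream):
    current_key, values = None, []
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            emit(current_key, values)   # key changed: flush the previous group
            values = []
        current_key = key
        values.append(int(value))
    if current_key is not None:         # end of input: flush the final group
        emit(current_key, values)

if __name__ == "__main__":
    reduce_stream(sys.stdin)
```

You can test such a script locally, without Hadoop, by piping sorted mapper output into it, e.g. printf 't\t3\nt\t4\nw\t4\nw\t6\n' | ./reducer.py.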

3. Run the streaming program

You can run your program with Hadoop Streaming via:

$ hadoop jar \
    /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
    -input inputDir -output outputDir \
    -file pathToMapScript -file pathToReduceScript \
    -mapper mapBasename -reducer reduceBasename

(Note that the JAR path must be a single argument; do not break it across a word boundary. Remember also that you may need to delete any previous output before running your program, with hadoop fs -rm -r dataToDelete.)

Solution in Python

You can find a working solution to this exercise written in Python in the directory ~/training_materials/developer/exercises/averagewordlength/python.

Additional Exercise

If you have more time, attempt the additional exercise. In the log_file_analysis directory, you will find stubs for the Mapper and Driver. (There is also a sample solution available.)

Your task is to count the number of hits made from each IP address in the sample (anonymized) Apache log file that you uploaded to the /user/training/weblog directory in HDFS when you performed the Using HDFS exercise.

Note: If you want, you can test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.

1. Change directory to ~/training_materials/developer/exercises/log_file_analysis.

2. Using the stub files in that directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address. Note: You can re-use the Reducer provided in the WordCount hands-on exercise, or you can write your own if you prefer.

This is the end of the Exercise

Hands-On Exercise: Writing Unit Tests With the MRUnit Framework

In this exercise, you will write unit tests for the WordCount code.

Note: You should have set up Eclipse when you performed the "Writing a MapReduce Program" exercise. If you did not set up Eclipse, locate the section of that exercise titled "Set up Eclipse" and follow the steps to import projects into Eclipse.

1. Launch Eclipse and expand the mrunit folder.

2. Examine the TestWordCount.java file in the mrunit project. Notice that three tests have been created, one each for the Mapper, the Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.

3. Run the tests by right-clicking on TestWordCount.java and choosing 'Run As' -> 'JUnit Test'.

4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.

5. Now implement the three tests. If you need hints, there is a sample solution in the sample solution directory within the mrunit directory.

6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
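If the shape of these tests is unfamiliar: an MRUnit mapper test drives the mapper with a single input and asserts on the (key, value) pairs it emits, and the reducer and flow tests do the same for their stages. The same idea, sketched in plain Python against stand-in map and reduce functions (our own illustration, not the course's Java code):

```python
import unittest

def word_mapper(line):
    # Stand-in for WordMapper: emit (word, 1) for each word in the line.
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    # Stand-in for SumReducer: emit (key, sum of values).
    return (key, sum(values))

class TestWordCount(unittest.TestCase):
    def test_mapper(self):
        self.assertEqual(word_mapper("cat sat"), [("cat", 1), ("sat", 1)])

    def test_reducer(self):
        self.assertEqual(sum_reducer("cat", [1, 1, 1]), ("cat", 3))

    def test_map_reduce_flow(self):
        pairs = word_mapper("cat sat cat")
        cat_values = [v for k, v in pairs if k == "cat"]
        self.assertEqual(sum_reducer("cat", cat_values), ("cat", 2))

if __name__ == "__main__":
    unittest.main(argv=["TestWordCount"], exit=False)
```

MRUnit's MapDriver, ReduceDriver, and MapReduceDriver play the role of these three test methods for real Hadoop Mapper and Reducer classes.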

This is the end of the Exercise


Hands-On Exercise: Writing and Implementing a Combiner

In this Hands-On Exercise, you will write and implement a Combiner to reduce the amount of intermediate data sent from the Mapper to the Reducer. You will use the WordCount Mapper and Reducer from an earlier exercise.

There are further exercises if you have more time.

Implement a Combiner

1. Change directory to ~/training_materials/developer/exercises/combiner

2. Copy the WordCount Mapper and Reducer into this directory:

$ cp ../wordcount/WordMapper.java .
$ cp ../wordcount/SumReducer.java .

3. Complete the WordCountDriver.java code and, if necessary, the SumCombiner.java code to implement a Combiner for the WordCount program.

4. Compile and test your solution.
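As a mental model for what the Combiner buys you (our own sketch in Python, not the course's Java code): the Combiner runs a reduce-like aggregation on each Mapper's local output, so far fewer (word, count) pairs need to be shuffled across the network to the Reducers.

```python
from collections import Counter

def map_words(line):
    # Mapper output: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: sum the counts locally, on the mapper side.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining,", len(combined), "after")
# → 6 pairs before combining, 4 after
```

This works for WordCount because summing is associative and commutative, which is why a sum-based Combiner can safely run before the Reducer without changing the final result.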

Hands-On Exercise: Writing a Partitioner

In this Hands-On Exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.

We will be modifying the program you wrote for the Additional Exercise in the Writing a MapReduce Program exercise.

Prepare The Exercise

1. Change directories to ~/training_materials/developer/exercises/partitioner

2. If you completed the Additional Exercise, copy the code for your Mapper, Reducer and driver to this directory. If you did not, copy the sample solution from ~/training_materials/developer/exercises/log_file_analysis/sample_solution into the current directory.

The Problem

The code you now have in the partitioner directory counts all the hits made to a Web server from each different IP address. (If you did not complete the exercise, view the code and be sure you understand how it works.) Our aim in this exercise is to modify the code such that we get a different result: we want one output file per month, each file containing the number of hits from each IP address in that month. In other words, there will be 12 output files, one for each month.

Note: we are actually breaking the standard MapReduce paradigm here. The standard paradigm says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.

Writing The Solution
Before you start writing your code, ensure that you understand the required
outcome; ask your instructor if you are unsure.

1. You will need to modify your driver code to specify that you want 12 Reducers.
Hint: job.setNumReduceTasks() specifies the number of Reducers for the
job.

2. Change the Mapper so that instead of emitting a 1 for each value, it emits the
entire line (or just the month portion if you prefer).

3. Modify the MyPartitioner.java stub file to create a Partitioner that sends
the (key, value) pair to the correct Reducer based on the month. Remember that
the Partitioner receives both the key and value, so you can inspect the value to
determine which Reducer to choose.

4. Configure your job to use your custom Partitioner. Hint: use
job.setPartitionerClass() in your driver code.

5. Compile and test your code. Hints:

a. Write unit tests for your Partitioner!

b. Test your code against the smaller version of the access log in the
/user/training/testlog directory before you run your code
against the full log in the /user/training/weblog directory.

c. Remember that the log file may contain unexpected data; that is, lines
which do not conform to the expected format. Ensure that your code
copes with such lines.
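The month-to-Reducer mapping at the heart of step 3 can be sketched in plain Java. This sketch assumes the log records the month as a three-letter English abbreviation; in the real Partitioner, this logic would live inside getPartition(key, value, numReduceTasks):

```java
import java.util.Arrays;
import java.util.List;

public class MonthPartitionSketch {
    private static final List<String> MONTHS = Arrays.asList(
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Maps a month abbreviation to a Reducer number in [0, 11].
    // Malformed input falls back to partition 0 rather than failing,
    // per hint 5c about unexpected lines.
    public static int partitionFor(String month) {
        int index = MONTHS.indexOf(month);
        return index >= 0 ? index : 0;
    }
}
```

Whatever scheme you choose, getPartition must always return a value between 0 and numReduceTasks - 1.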

This is the end of the Exercise


Hands-On Exercise: Using Counters
and a Map-Only Job
In this Hands-On Exercise you will create a Map-only MapReduce job which will use
a Web server's access log to count the number of times gifs, jpegs and other
resources have been retrieved. Your job will report three figures: the number of gif
requests, the number of jpeg requests, and the number of other requests.

1. Change directory to the counters directory within the exercises directory:

$ cd ~/training_materials/developer/exercises/counters

2. Complete the stub files to provide the solution.

Hints
You should use a Map-only MapReduce job, by setting the number of Reducers to 0
in the driver code.

For input data, use the Web access log file that you uploaded to the HDFS
/user/training/weblog directory in the Using HDFS exercise.

Note: If you want, you can test your code against the smaller version of the access
log in the /user/training/testlog directory before you run your code against
the full log in the /user/training/weblog directory.

Use a counter group called something like 'ImageCounter', with names 'gif', 'jpeg'
and 'other'.

In your driver code, retrieve the values of the counters after the job has completed
and report them using System.out.println.

As always, a sample solution is available if you need more hints.
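In the Mapper, the usual pattern is context.getCounter("ImageCounter", type).increment(1); in the driver, job.getCounters().findCounter("ImageCounter", "gif").getValue() retrieves a count after the job finishes. The classification itself can be sketched in plain Java (the extension rules here are an assumption about what counts as a gif or jpeg request):

```java
public class ImageRequestClassifier {
    // Decides which counter name a requested resource belongs to,
    // matching the counter names suggested above.
    public static String classify(String resource) {
        String r = resource.toLowerCase();
        if (r.endsWith(".gif")) {
            return "gif";
        }
        if (r.endsWith(".jpeg") || r.endsWith(".jpg")) {
            return "jpeg";
        }
        return "other";
    }
}
```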

This is the end of the Exercise


Hands-On Exercise: Using
SequenceFiles and File Compression
In this exercise you will explore reading and writing uncompressed and compressed
SequenceFiles.

First, you will develop a MapReduce application to convert text data to a
SequenceFile. Then you will modify the application to compress the SequenceFile
using Snappy file compression.

When creating the SequenceFile, use the full access log file for input data. You
uploaded the access log file to the HDFS /user/training/weblog directory
when you performed the "Using HDFS" exercise.

After you have created the compressed SequenceFile, you will write a second
MapReduce application to read the compressed SequenceFile and write a text file
that contains the original log file text.

1. Determine the number of HDFS blocks occupied by the access log file:

a. In a browser window, start the NameNode Web UI. The URL is
http://localhost:50070.

b. Click "Browse the filesystem."

c. Navigate to the /user/training/weblog/access_log file.

d. Scroll down to the bottom of the page. The total number of blocks
occupied by the access log file appears in the browser window.

2. Change directory to the createsequencefile directory within the
exercises directory.

$ cd ~/training_materials/developer/exercises/createsequencefile

3. Complete the stub files to read the access log file and create a SequenceFile.
Records emitted to the SequenceFile can have any key you like, but the values
should match the text in the access log file.

Note: If you specify an output key type other than LongWritable, you must
call job.setOutputKeyClass, not job.setMapOutputKeyClass. If you
specify an output value type other than Text, you must call
job.setOutputValueClass, not job.setMapOutputValueClass.

4. Compile the code and run your MapReduce job.

For the MapReduce input, use the /user/training/weblog directory.

For the MapReduce output, specify the uncompressedsf directory.

Note: The CreateUncompressedSequenceFile.java file in the
sample_solutions directory contains the solution for the preceding part of
the exercise.

5. Examine the initial portion of the output SequenceFile using the following
command:

$ hadoop fs -cat uncompressedsf/part-m-00000 | less

Some of the data in the SequenceFile is unreadable, but parts of the
SequenceFile should be recognizable:

• The string SEQ, which appears at the beginning of a SequenceFile

• The Java classes for the keys and values

• Text from the access log file

6. Verify that the number of files created by the job is equivalent to the number of
blocks required to store the uncompressed SequenceFile.

7. Modify your MapReduce job to compress the output SequenceFile. Add
statements to your driver to configure the output SequenceFile as follows:

• Compress the output file.

• Use block compression.

• Use the Snappy compression codec.
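A hedged sketch of the driver statements for those three settings, assuming the new API (org.apache.hadoop.mapreduce, with the helpers from org.apache.hadoop.mapreduce.lib.output and the codec from org.apache.hadoop.io.compress); this is an illustrative fragment, not the complete solution:

```java
// job is your org.apache.hadoop.mapreduce.Job instance.
FileOutputFormat.setCompressOutput(job, true);                 // compress the output
SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);                   // block compression
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);  // Snappy codec
```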

8. Compile the code and run your modified MapReduce job. For the MapReduce
output, specify the compressedsf directory.

Note: The CreateCompressedSequenceFile.java file in the
sample_solutions directory contains the solution for the preceding part of
the exercise.

9. Examine the first portion of the output SequenceFile. Notice the differences
between the uncompressed and compressed SequenceFiles:

• The compressed SequenceFile specifies the
org.apache.hadoop.io.compress.SnappyCodec compression
codec in its header.

• You cannot read the log file text in the compressed file.

10. Compare the file sizes of the uncompressed and compressed SequenceFiles in
the uncompressedsf and compressedsf directories.

The compressed SequenceFiles should be smaller.

11. Write a second MapReduce job to read the compressed log file and write a text
file. This text file should have the same text data as the log file, plus keys. The
keys can contain any values you like.

12. Compile the code and run your MapReduce job.

For the MapReduce input, specify the compressedsf directory. You created
the compressed SequenceFile in this directory.

For the MapReduce output, specify the compressedsftotext directory.

Note: The ReadCompressedSequenceFile.java file in the
sample_solutions directory contains the solution for the preceding part of
the exercise.

13. Examine the first portion of the output in the compressedsftotext
directory.

You should be able to read the textual log file entries.

This is the end of the Exercise
Hands-On Exercise: Creating an
Inverted Index
In this exercise, you will write a MapReduce job that produces an inverted index.

For this lab you will use an alternate input, provided in the file
invertedIndexInput.tgz. When decompressed, this archive contains a
directory of files; each is a Shakespeare play formatted as follows:

0 HAMLET
1
2
3 DRAMATIS PERSONAE
4
5
6 CLAUDIUS king of Denmark. (KING CLAUDIUS:)
7
8 HAMLET son to the late, and nephew to the present
king.
9
10 POLONIUS lord chamberlain. (LORD POLONIUS:)
...

Each line contains:

• key: the line number
• separator: a tab character
• value: the line of text

This format can be read directly using the KeyValueTextInputFormat class
provided in the Hadoop API. This input format presents each line as one record to
your Mapper, with the part before the tab character as the key, and the part after the
tab as the value.
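The split that KeyValueTextInputFormat performs can be sketched in plain Java (the class and method names here are illustrative):

```java
public class KeyValueSplitSketch {
    // Splits a line at the first tab, as KeyValueTextInputFormat does:
    // element [0] is the key (line number), [1] is the value (line text).
    // A line with no tab becomes a key with an empty value.
    public static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[]{line, ""};
        }
        return new String[]{line.substring(0, tab), line.substring(tab + 1)};
    }
}
```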

Given a body of text in this form, your indexer should produce an index of all the
words in the text. For each word, the index should have a list of all the locations
where the word appears. For example, for the word 'honeysuckle' your output
should look like this:

honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...

The index should contain such an entry for every word in the text.

We have provided stub files in the directory
~/training_materials/developer/exercises/inverted_index

Prepare the Input Data


1. Extract the invertedIndexInput directory and upload it to HDFS:

$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput

Define the MapReduce Solution


Remember that for this program you use a special input format to suit the form of
your data, so your driver class will include a line like:

job.setInputFormatClass(KeyValueTextInputFormat.class);

Don't forget to import this class for your use.

Retrieving the File Name


Note that the exercise requires you to retrieve the file name, since that is the name
of the play. The Context object can be used to retrieve the name of the file like this:

FileSplit fileSplit = (FileSplit) context.getInputSplit();
Path path = fileSplit.getPath();
String fileName = path.getName();

Build and Test Your Solution


Test against the invertedIndexInput data you loaded in step 1 above.

Hints
You may like to complete this exercise without reading any further, or you may find
the following hints about the algorithm helpful.

The Mapper
Your Mapper should take as input a key and a line of words, and should emit as
intermediate values each word as key, and the key as value.

For example, the line of input from the file 'hamlet':

282 Have heaven and earth together

produces intermediate output:

Have hamlet@282
heaven hamlet@282
and hamlet@282
earth hamlet@282
together hamlet@282

The Reducer
Your Reducer simply aggregates the values presented to it for the same key into one
value. Use a separator like ',' between the values listed.
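The Mapper and Reducer steps described above can be sketched in plain Java (the real job uses Hadoop's Mapper and Reducer classes; the tokenization rule below and the class name are assumptions made for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class InvertedIndexSketch {
    // Map step: for each word on the line, emit (word, "file@lineNo").
    public static List<String[]> map(String fileName, String lineNo, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[]{word, fileName + "@" + lineNo});
            }
        }
        return pairs;
    }

    // Reduce step: join all locations seen for one word with commas.
    public static String reduce(List<String> locations) {
        return String.join(",", locations);
    }
}
```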

Solution
You can find a working solution to this exercise in the directory
~/training_materials/developer/exercises/inverted_index/
sample_solution.

This is the end of the Exercise

Hands-On Exercise: Calculating Word
Co-Occurrence
In this exercise, you will write an application that counts the number of times words
appear next to each other, using the same input data that you used in the "Creating
an Inverted Index" exercise.

Note that this implementation is a specialization of Word Co-Occurrence as we
describe it in the notes; in this case we are only interested in pairs of words
which appear directly next to each other.

1. Change directories to the word_co-occurrence directory within the
exercises directory.

2. Complete the Driver and Mapper stub files; you can use the standard
SumReducer from the WordCount directory as your Reducer. Your Mapper's
intermediate output should be in the form of a Text object as the key, and an
IntWritable as the value; the key will be 'word1,word2', and the value will be 1.
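The pair-emitting logic for the Mapper can be sketched in plain Java (whitespace tokenization is an assumption, and the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class AdjacentPairSketch {
    // Emits one 'word1,word2' key per pair of directly adjacent words,
    // matching the intermediate key format described in step 2.
    public static List<String> pairs(String line) {
        List<String> keys = new ArrayList<>();
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            keys.add(words[i] + "," + words[i + 1]);
        }
        return keys;
    }
}
```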

3. Extra credit: Write a further MapReduce job to sort the output from the first job
so that the list of pairs of words appears in ascending frequency.

This is the end of the Exercise


Optional Hands-On Exercise:
Implementing Word Co-Occurrence
with a Custom WritableComparable
In this Hands-On Exercise, you will create a custom WritableComparable to improve
the Word Co-Occurrence exercise from earlier in the course.

1. Change directories to the writables directory in the exercises directory.

2. Edit the stub files to solve the problem. If you completed the Word
Co-Occurrence additional exercise earlier, use those files as your starting point
for the Mapper and Reducer. If you did not complete the exercise earlier in the
course, you will find a sample solution in the word_co-occurrence directory
in the exercises directory. Copy that and modify it appropriately.

Hints
You need to create a WritableComparable object that will hold the two strings. The
stub provides an empty constructor for serialization, a standard constructor that
will be given two strings, a toString method, and the generated hashCode and
equals methods. You will need to implement the readFields, write, and
compareTo methods required by WritableComparables.
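These three methods can be sketched without Hadoop on the classpath, because Writable's serialization methods use the standard java.io DataInput/DataOutput types. A hedged sketch (the class and field names are illustrative; the real class would declare that it implements WritableComparable):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class StringPairSketch implements Comparable<StringPairSketch> {
    private String left = "";
    private String right = "";

    public StringPairSketch() { }           // empty constructor for serialization
    public StringPairSketch(String left, String right) {
        this.left = left;
        this.right = right;
    }

    // Serialize both strings in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        left = in.readUTF();
        right = in.readUTF();
    }

    // Sort on the first string, breaking ties with the second.
    @Override
    public int compareTo(StringPairSketch other) {
        int cmp = left.compareTo(other.left);
        return cmp != 0 ? cmp : right.compareTo(other.right);
    }

    @Override
    public String toString() {
        return left + "," + right;
    }
}
```

The key design point is that readFields must read the fields in exactly the order write wrote them; Hadoop gives you a raw byte stream with no field names.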

Note that Eclipse automatically generated the hashCode and equals methods in
the stub file. You can generate these two methods in Eclipse by right-clicking in the
source code and choosing 'Source' > 'Generate hashCode() and equals()'.

As always, a sample solution is available.

This is the end of the Exercise

Hands-On Exercise: Importing Data
With Sqoop
For this exercise you will import data from a relational database using Sqoop. The
data you load here will be used in a subsequent exercise.

Consider the MySQL database movielens, derived from the MovieLens project
from the University of Minnesota. (See the note at the end of this exercise.) The
database consists of several related tables, but we will import only two of these:
movie, which contains about 3,900 movies; and movierating, which has about
1,000,000 ratings of those movies.

Review the Database Tables


First, review the database tables to be loaded into Hadoop.

1. Log on to MySQL:

$ mysql --user=training --password=training movielens

2. Review the structure and contents of the movie table:

mysql> DESCRIBE movie;
. . .
mysql> SELECT * FROM movie LIMIT 5;

3. Note the column names for the table.

____________________________________________________________________________________________!

4. Review the structure and contents of the movierating table:

mysql> DESCRIBE movierating;
. . .
mysql> SELECT * FROM movierating LIMIT 5;

5. Note these column names.

____________________________________________________________________________________________!

6. Exit mysql:

mysql> quit

Import with Sqoop


You invoke Sqoop on the command line to perform several commands. With it you
can connect to your database server to list the databases (schemas) to which you
have access, and list the tables available for loading. For database access, you
provide a connect string to identify the server and, if required, your username and
password.

1. Show the commands available in Sqoop:

$ sqoop help

2. List the databases (schemas) in your database server:

$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training

(Note: Instead of entering --password training on your command line,
you may prefer to enter -P, and let Sqoop prompt you for the password, which
is then not visible when you type it.)

3. List the tables in the movielens database:

$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training

4. Import the movie table into Hadoop:

$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--table movie --fields-terminated-by '\t' \
--username training --password training

Note: separating the fields in the HDFS file with the tab character is one way to
manage compatibility with Hive and Pig, which we will use in a future exercise.

5. Verify that the command has worked:

$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000

6. Import the movierating table into Hadoop.

Repeat steps 4 and 5, but for the movierating table.

This is the end of the Exercise

Note:

This exercise uses the MovieLens data set, or subsets thereof. This data is freely
available for academic purposes, and is used and distributed by Cloudera with
the express permission of the UMN GroupLens Research Group. If you would
like to use this data for your own research purposes, you are free to do so, as
long as you cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must obtain
explicit permission. You may find the full dataset, as well as detailed license
terms, at http://www.grouplens.org/node/73

Hands-On Exercise: Using a Mahout
Recommender
In this exercise you will use Mahout to generate movie recommendations for users.

1. Ensure that you completed the Sqoop Hands-On Exercise to import the movie
and movierating data.

2. Create a list of users for whom you want to generate recommendations, by
creating and editing a file on your local disk named users. Into that file, on
separate lines, place the user IDs 6037, 6038, 6039, 6040, so that the file looks
like this:

6037
6038
6039
6040

Important: Make sure there is not a blank line at the end of the file. The
line containing the last user ID should not have a carriage return at the
end of that line. If it does, the job will fail.

3. Upload the file to HDFS:

$ hadoop fs -put users users

4. Run Mahout's item-based recommender:

$ mahout recommenditembased --input movierating \
--output recs --usersFile users \
--similarityClassname SIMILARITY_LOGLIKELIHOOD

5. This will take a long time to execute; it runs approximately 10 MapReduce jobs.
Your instructor will now continue with the notes, but when the final job is
complete, investigate the part-r-00000 file in your newly-created recs
directory.

$ hadoop fs -cat recs/part-r-00000


6037
[2010:5.0,1036:5.0,1035:5.0,3703:5.0,2076:5.0,3108:5.0
,1028:5.0,3105:5.0,3104:5.0,2064:5.0]
6038
[671:5.0,2761:5.0,745:5.0,741:5.0,720:5.0,2857:5.0,838
:5.0,3114:5.0,3044:5.0,3000:5.0]
6039
[1946:5.0,1036:5.0,1035:5.0,3703:5.0,1032:5.0,2078:5.0
,3114:5.0,1029:5.0,1028:5.0,2076:5.0]
6040
[2610:5.0,1904:5.0,3910:5.0,3811:5.0,6:5.0,3814:5.0,16
:5.0,17:5.0,1889:5.0,3794:5.0]

This is the end of the Exercise

Hands-On Exercise: Manipulating
Data With Hive
In this exercise, you will practice data processing in Hadoop using Hive.

The data sets for this exercise are the movie and movierating data imported
from MySQL into Hadoop in a previous exercise.

Review the Data


1. Review the data already loaded into HDFS:

$ hadoop fs -cat movie/part-m-00000 | head
. . .
$ hadoop fs -cat movierating/part-m-00000 | head

Prepare The Data For Hive


For Hive data sets, you create tables, which attach field names and data types to
your Hadoop data for subsequent queries. You can create external tables on the
movie and movierating data sets, without having to move the data at all.

Prepare the Hive tables for this exercise by performing the following steps:

1. Invoke the Hive shell:

$ hive

2. Create the movie table:

hive> CREATE EXTERNAL TABLE movie
    > (id INT, name STRING, year INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movie';

3. Create the movierating table:

hive> CREATE EXTERNAL TABLE movierating
    > (userid INT, movieid INT, rating INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movierating';

4. Quit the Hive shell:

hive> QUIT;

The Questions
Now that the data is imported and suitably prepared, use Hive to answer the
following questions.

Working Interactively or In Batch

You can enter Hive commands interactively in the Hive shell:

$ hive
. . .
hive> Enter interactive commands here

Or you can execute text files containing Hive commands with:

$ hive -f file_to_execute

1. What is the oldest known movie in the database? Note that movies with
unknown years have a value of 0 in the year field; these do not belong in your
answer.

2. List the name and year of all unrated movies (movies where the movie data has
no related movierating data).

3. Produce an updated copy of the movie data with two new fields:
numratings: the number of ratings for the movie
avgrating: the average rating for the movie
Unrated movies are not needed in this copy.

4. What are the 10 highest-rated movies? (Notice that your work in step 3 makes
this question easy to answer.)

This is the end of the Exercise

Hands-On Exercise: Using Pig to
Retrieve Movie Names From Our
Recommender
In this Hands-On Exercise you will use Pig to extract movie names from the
recommendations file you created earlier.

1. Create a text file called listrecommendations with the following contents:

movies = load 'movie' AS (movieid, name, year);

recs = load 'recs' AS (userid, reclist);

longlist = FOREACH recs GENERATE userid,
    FLATTEN(TOKENIZE(reclist)) AS movieandscore;

finallist = FOREACH longlist GENERATE userid,
    REGEX_EXTRACT(movieandscore, '(\\d+)', 1) AS movieid;

results = JOIN finallist BY movieid, movies BY movieid;

final = FOREACH results GENERATE userid, name;

srtd = ORDER final BY userid;

dump srtd;

2. Run the Pig script to produce a list of movie recommendations:

$ pig listrecommendations

This is the end of the Exercise

Hands-On Exercise: Running an
Oozie Workflow
In this exercise, you will inspect and run Oozie workflows.

1. Change directories to the oozie-labs directory within the exercises
directory.

2. Start the Oozie server:

$ sudo /etc/init.d/oozie start

3. Change directories to lab1-java-mapreduce/job:

$ cd lab1-java-mapreduce/job

4. Inspect the contents of the job.properties and workflow.xml files. You
will see that this is our standard WordCount job.

5. Change directories back to the main oozie-labs directory:

$ cd ../..

6. We have provided a simple shell script to submit the Oozie workflow. Inspect
run.sh:

$ cat run.sh

7. Submit the workflow to the Oozie server:

$ ./run.sh lab1-java-mapreduce

Notice that Oozie returns a job identification number.

8. Inspect the progress of the job:

$ oozie job -oozie http://localhost:11000/oozie -info <job_id>

9. When the job has completed, inspect HDFS to confirm that the output has been
produced as expected.

10. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect
workflow.xml that this workflow includes two MapReduce jobs which run
one after the other. When you inspect the output in HDFS you will see that the
second job sorts the output of the first job into descending numerical order.

This is the end of the Exercise

