Developer Exercise Instructions
General Notes ............................................................... 3
Hands-On Exercise: Using HDFS ............................................... 5
Hands-On Exercise: Running a MapReduce Job ................................. 11
Hands-On Exercise: Writing a MapReduce Program ............................. 15
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework ............ 23
Hands-On Exercise: Writing and Implementing a Combiner ..................... 24
Hands-On Exercise: Writing a Partitioner ................................... 25
Hands-On Exercise: Using Counters and a Map-Only Job ....................... 27
Hands-On Exercise: Using SequenceFiles and File Compression ................ 28
Hands-On Exercise: Creating an Inverted Index .............................. 32
Hands-On Exercise: Calculating Word Co-Occurrence .......................... 36
2. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password since the training user has unlimited sudo privileges.
3. In some command-line steps in the exercises, you will see lines like this:
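(For instance, a continued command looks something like the following; this particular command appears later in the Using HDFS exercise and is shown here purely as an illustration:)
$ gunzip -c access_log.gz \
  | hadoop fs -put - weblog/access_log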
The backslash at the end of the first line signifies that the command is not complete and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type the backslash.
2. Sample solutions are always available in the sample_solutions subdirectory of the exercise directory.
3. As the exercises progress and you gain more familiarity with Hadoop and MapReduce, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it is up to you to solve the problem! There are 'stub' files for each exercise to get you started, and you should feel free to ask your instructor for assistance at any time. We also provide some hints in many of the exercises. And, of course, you can always consult with your fellow students.
4. If you would like to have more hints than are provided in the default stub files, you can use stub files with additional hints in the stubs_with_hints subdirectory of each exercise directory.
5. If you are working in Eclipse and would prefer to use the stub files with additional hints, do the following:
a. Close Eclipse.
b. In a terminal window, change to the /home/training/scripts directory.
c. Run the following command:
./eclipse_projects.sh hint
d. Restart Eclipse.
e. Refresh each of the Eclipse projects. (To refresh a project, right-click the project and select Refresh.)
Hadoop
Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper called hadoop. If you start a terminal and run this program with no arguments, it prints a help message. To try this, run the following command:
$ hadoop
(Note: although your command prompt is more verbose, we use '$' to indicate the command prompt for brevity's sake.)
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
2. In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with this subsystem.
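3. Now enter: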
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a "home" directory under this directory, named after their username; your home directory is /user/training.
4. Try viewing the contents of the /user directory by running:
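(The command to enter here is presumably:)
$ hadoop fs -ls /user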
You will see your home directory in the directory listing.
5. Try running:
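(The command to enter here is presumably a listing of your own, currently empty, home directory:)
$ hadoop fs -ls /user/training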
There are no files, so the command silently exits. This is different than if you ran hadoop fs -ls /foo, which refers to a directory that doesn't exist and which would display an error message.
Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.
1. Change directories to the directory containing the sample data we will be using in the course.
cd ~/training_materials/developer/data
If you perform a 'regular' ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and access_log.gz.
2. Unzip shakespeare.tar.gz by running:
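(The command to enter here is presumably:)
$ tar zxvf shakespeare.tar.gz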
This creates a directory named shakespeare/ containing several files on your local filesystem.
3. Insert this directory into HDFS:
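(The command to enter here is presumably:)
$ hadoop fs -put shakespeare /user/training/shakespeare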
This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
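(The command to enter here is presumably:)
$ hadoop fs -ls /user/training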
You should see an entry for the shakespeare directory.
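5. Now try the same command without specifying a path: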
$ hadoop fs -ls
You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.
If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory.
6. We also have a Web server log file, which we will put into HDFS for use in future exercises. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
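(The command to enter here is presumably:)
$ hadoop fs -mkdir weblog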
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash ('-') in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
9. The access log file is quite large (around 500 MB). Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises.
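(One plausible way to do this, assuming you want the result in a testlog directory like the one referenced in later exercises; the exact file and directory names are up to you:)
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
  | hadoop fs -put - testlog/test_access_log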
This lists the contents of the /user/training/shakespeare directory, which consists of the files comedies, glossary, histories, poems, and tragedies.
2. The glossary file included in the tarball you began with is not strictly a work of Shakespeare, so let's remove it:
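(The command to enter here is presumably:)
$ hadoop fs -rm shakespeare/glossary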
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
3. Enter:
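(The command to enter here is presumably the following; Henry IV, Part 1 is in the histories file:)
$ hadoop fs -cat shakespeare/histories | tail -n 50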
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
Note that when you pipe the output of the fs -cat command to a local UNIX command, the full contents of the file are still extracted from HDFS and sent to your local machine. Once on your local machine, the file contents are then modified before being displayed.
Other Commands
There are several other commands associated with the FsShell subsystem, to perform most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands within FsShell. Try playing around with a few of these commands if you like.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
$ cd ~/training_materials/developer/exercises/wordcount
$ ls
This directory contains the following Java files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.
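2. Compile the three Java classes. (The elided command here is presumably:)
$ javac -classpath `hadoop classpath` *.java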
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command.
Your command includes the classpath for the Hadoop core API classes. The compiled (.class) files are placed in your local directory.
3. Collect your compiled Java files into a JAR file:
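(The command to enter here is presumably:)
$ jar cvf wc.jar *.class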
4. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
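(Based on the description below, the command is presumably:)
$ hadoop jar wc.jar WordCount shakespeare wordcounts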
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.
5. Try running this same command again without any change:
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
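6. Review the output directory of your job. (The elided command here is presumably:)
$ hadoop fs -ls wordcounts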
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
7. View the contents of the output for your job:
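(The command to enter here is presumably:)
$ hadoop fs -cat wordcounts/part-r-00000 | less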
You can page through a few screens to see words and their frequencies in the works of Shakespeare. Note that you could have specified wordcounts/* just as well in this command.
8. Try running the WordCount job against a single file:
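(One plausible command, using the poems file as input to match the pwords output directory mentioned below:)
$ hadoop jar wc.jar WordCount shakespeare/poems pwords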
When the job completes, inspect the contents of the pwords directory.
9. Clean up the output files produced by your job runs:
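(The command to enter here is presumably:)
$ hadoop fs -rm -r wordcounts pwords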
Losing the connection to the initiating process does not kill a MapReduce job. Instead, you need to tell the Hadoop JobTracker to stop the job.
2. While this job is running, open another terminal window and enter:
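(The command to enter here is presumably:)
$ hadoop job -list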
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002!
3. Copy the job id, and then kill the running job by entering:
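(The command is presumably the following, substituting your own job id:)
$ hadoop job -kill jobid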
The JobTracker kills the job, and the program running in the original terminal, reporting its progress, informs you that the job has failed.
For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for input:
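(The sample input line is elided here; one line that produces exactly the output below is:)
Now is definitely the time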
The output would be:
N 3
d 10
i 2
t 3.5
(For the initial solution, your program can be case-sensitive, which is the case for Java string processing by default.)
The Algorithm
The algorithm for this program is a simple one-pass MapReduce program:
The Mapper
The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:
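(The example input value is elided; presumably the same line as above:)
Now is definitely the time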
N 3
i 2
d 10
t 3
t 4
The Reducer
Thanks to the sort/shuffle phase built in to MapReduce, the Reducer receives the keys in sorted order, and all the values for one key appear together. So, for the Mapper output above, the Reducer (if written in Java) receives this:
N (3)
d (10)
i (2)
t (3, 4)
If you will be writing your code using Hadoop Streaming, your Reducer would receive the following:
N 3
d 10
i 2
t 3
t 4
For either type of input, the final output should be:
N 3
d 10
i 2
t 3.5
If you complete the first part of the exercise, there is a further exercise for you to try. See page 21 for instructions.
Set up Eclipse
We have created Eclipse projects for each of the Hands-On Exercises. Using Eclipse will speed up your development time.
Even if you do not plan to use Eclipse to develop your code, you still need to set up Eclipse. In the "Writing Unit Tests With the MRUnit Framework" lab, you will use Eclipse to run unit tests.
Follow these instructions to set up Eclipse by importing projects into the environment:
1. Launch Eclipse.
2. Select Import from the File menu.
3. Select General -> Existing Projects into Workspace, and click Next.
4. Specify /home/training/workspace in the Select Root Directory field. All the exercise projects will appear in the Projects field.
5. Click OK, then click Finish. That will import all projects into your workspace.
The steps to export Java code to a JAR file and run the code are described in the slides for this chapter. The following is a quick review of the steps you will need to perform to run Hadoop source code developed in Eclipse:
The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.
2. Right-click the default package entry for the Eclipse project (under the src entry).
3. Select Export.
4. Select Java > JAR file from the Export dialog box, then click Next.
5. Specify a location for the JAR file. You can place your JAR files wherever you like.
For more information about running a Hadoop job when working in Eclipse, including screen shots, refer to the slides for this chapter.
1. Define the driver
This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.
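2. Define the Mapper
(The text of this step is elided here; based on the algorithm described above, the Mapper should emit the first letter of each word as the key and the length of the word as the value.)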
Note these simple string operations in Java:
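(The elided examples are presumably along these lines, where line and word are Strings:)
line.split("\\W+")     // splits a line into an array of words
word.substring(0, 1)   // the first letter of a word
word.length()          // the length of a word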
3. Define the Reducer
In a single invocation the Reducer receives a string containing one letter along with an iterator of integers. For this call, the reducer should emit a single output of the letter and the average of the integers.
4. Test your program
Compile, jar, and test your program. You can use the entire Shakespeare dataset for your input, or you can try it with just one of the files in the dataset, or with your own test data.
Solution in Java
The directory ~/training_materials/developer/exercises/averagewordlength/sample_solution contains a set of Java class definitions that solve the problem.
1. The Mapper Script
The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:
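(The exact format is elided here; presumably key, a tab character, then the value, one pair per line, e.g.:)
N 3
i 2
d 10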
These strings should be written to stdout.
2. The Reducer Script
For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:
t 3
t 4
w 4
w 6
For this input, emit the following to stdout:
t 3.5
w 5
Observe that the reducer receives a key with each input line, and must "notice" when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted.
3. Run the streaming program
You can run your program with Hadoop Streaming via:
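(The exact command is elided; a plausible invocation looks like the following, where the streaming JAR location and the script names depend on your installation and your solution:)
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming*.jar \
    -input shakespeare -output avgwordstreaming \
    -mapper avg_mapper.py -reducer avg_reducer.py \
    -file avg_mapper.py -file avg_reducer.py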
(Remember, you may need to delete any previous output before running your program, with hadoop fs -rm -r dataToDelete.)
Solution in Python
You can find a working solution to this exercise written in Python in the directory ~/training_materials/developer/exercises/averagewordlength/python.
Additional Exercise
If you have more time, attempt the additional exercise. In the log_file_analysis directory, you will find stubs for the Mapper and Driver. (There is also a sample solution available.)
Your task is to count the number of hits made from each IP address in the sample (anonymized) Apache log file that you uploaded to the /user/training/weblog directory in HDFS when you performed the Using HDFS exercise.
Note: If you want, you can test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
1. Change directory to ~/training_materials/developer/exercises/log_file_analysis.
2. Using the stub files in that directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address.
Note: You should have set up Eclipse when you performed the "Writing a MapReduce Program" exercise. If you did not set up Eclipse, locate the section of that exercise titled "Set up Eclipse" and follow the steps to import projects into Eclipse.
1. Launch Eclipse and expand the mrunit folder.
2. Examine the TestWordCount.java file in the mrunit project. Notice that three tests have been created, one each for the Mapper, Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.
3. Run the tests by right-clicking on TestWordCount.java and choosing 'Run As' -> 'JUnit Test'.
4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.
5. Now implement the three tests. If you need hints, there is a sample solution in the sample solution directory within the mrunit directory. (A rough sketch of one such test also appears after these steps.)
6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
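For reference, a hedged sketch of what a Mapper test with MRUnit can look like; the key/value types assume the WordCount classes from the earlier exercise, and the class name and test data here are illustrative, not the stub's actual contents:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordMapper {

  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // Wrap the Mapper under test in an MRUnit MapDriver.
    mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
    mapDriver.setMapper(new WordMapper());
  }

  @Test
  public void testMapper() throws Exception {
    // Feed one input record and assert the expected (key, value) outputs in order.
    mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("dog"), new IntWritable(1))
             .runTest();
  }
}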
There are further exercises if you have more time.
Implement a Combiner
1. Change directory to ~/training_materials/developer/exercises/combiner
2. Copy the WordCount Mapper and Reducer into this directory:
$ cp ../wordcount/WordMapper.java .
$ cp ../wordcount/SumReducer.java .
3. Complete the WordCountDriver.java code and, if necessary, the SumCombiner.java code to implement a Combiner for the WordCount program.
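(In the driver, wiring in the Combiner typically comes down to one line; assuming the class names above:)
job.setCombinerClass(SumCombiner.class);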
4. Compile and test your solution.
We will be modifying the program you wrote to solve the Additional Exercise on page 21.
2. If you completed the Additional Exercise on page 21, copy the code for your Mapper, Reducer and driver to this directory. If you did not, copy the sample solution from ~/training_materials/developer/exercises/log_file_analysis/sample_solution into the current directory.
The Problem
The code you now have in the partitioner directory counts all the hits made to a Web server from each different IP address. (If you did not complete the exercise, view the code and be sure you understand how it works.) Our aim in this exercise is to modify the code such that we get a different result: we want one output file per month, each file containing the number of hits from each IP address in that month. In other words, there will be 12 output files, one for each month.
Note: we are actually breaking the standard MapReduce paradigm here. The standard paradigm says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.
1. You will need to modify your driver code to specify that you want 12 Reducers. Hint: job.setNumReduceTasks() specifies the number of Reducers for the job.
2. Change the Mapper so that instead of emitting a 1 for each value, it emits the entire line (or just the month portion if you prefer).
3. Modify the MyPartitioner.java stub file to create a Partitioner that sends the (key, value) pair to the correct Reducer based on the month. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose. (A rough sketch appears after these steps.)
4. Configure your job to use your custom Partitioner. Hint: use job.setPartitionerClass() in your driver code.
5. Compile and test your code. Hints:
a. Write unit tests for your Partitioner!
b. Test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
c. Remember that the log file may contain unexpected data, that is, lines which do not conform to the expected format. Ensure that your code copes with such lines.
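A minimal sketch of what such a Partitioner might look like, assuming Text keys and Text values and a very simple way of finding the month in the log line; the class name and parsing details are illustrative, not the stub's actual contents:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MonthPartitioner extends Partitioner<Text, Text> {

  private static final String[] MONTHS = {
      "Jan", "Feb", "Mar", "Apr", "May", "Jun",
      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    String line = value.toString();
    // Apache log dates look like [10/Sep/2013:...], so look for /Mon/ in the line.
    for (int i = 0; i < MONTHS.length; i++) {
      if (line.contains("/" + MONTHS[i] + "/")) {
        return i % numReduceTasks;
      }
    }
    return 0;  // lines that do not match the expected format
  }
}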
1. Change directory to the counters directory within the exercises directory:
$ cd ~/training_materials/developer/exercises/counters
2. Complete the stub files to provide the solution.
Hints
You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the Driver code.
For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the Using HDFS exercise.
Note: If you want, you can test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
Use a counter group called something like 'ImageCounter', with names 'gif', 'jpeg' and 'other'.
In your Driver code, retrieve the values of the counters after the job has completed and report them using System.out.println.
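A hedged sketch of the two pieces involved, using the group and counter names suggested above; where exactly these calls sit in your code is up to you:

// In the Mapper, for each .gif request found in the log line:
context.getCounter("ImageCounter", "gif").increment(1);

// In the Driver, after job.waitForCompletion(true) returns:
long gifs = job.getCounters().findCounter("ImageCounter", "gif").getValue();
System.out.println("gif requests: " + gifs);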
As always, a sample solution is available if you need more hints.
First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression.
When creating the SequenceFile, use the full access log file for input data. You uploaded the access log file to the HDFS /user/training/weblog directory when you performed the "Using HDFS" exercise.
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.
1. Determine the number of HDFS blocks occupied by the access log file:
a. In a browser window, start the Name Node Web UI. The URL is http://localhost:50070.
b. Click "Browse the filesystem."
c. Navigate to the /user/training/weblog/access_log file.
d. Scroll down to the bottom of the page. The total number of blocks occupied by the access log file appears in the browser window.
2. Change directory to the createsequencefile directory within the exercises directory.
$ cd ~/training_materials/developer/exercises/createsequencefile
Note: If you specify an output key type other than LongWritable, you must call job.setOutputKeyClass, not job.setMapOutputKeyClass. If you specify an output value type other than Text, you must call job.setOutputValueClass, not job.setMapOutputValueClass.
4. Compile the code and run your MapReduce job.
For the MapReduce input, use the /user/training/weblog directory.
For the MapReduce output, specify the uncompressedsf directory.
Note: The CreateUncompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
5. Examine the initial portion of the output SequenceFile using the following command:
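(The command is elided here; presumably something like the following, where the exact part file name depends on your job:)
$ hadoop fs -cat uncompressedsf/part-m-00000 | less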
Some of the data in the SequenceFile is unreadable, but parts of the SequenceFile should be recognizable:
• The string SEQ, which appears at the beginning of a SequenceFile
• The Java classes for the keys and values
• Text from the access log file
6. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.
7. Modify your MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output SequenceFile as follows:
• Compress the output file.
• Use block compression.
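One way to do this in the driver, assuming the new-API SequenceFileOutputFormat; this is a sketch, not necessarily the sample solution's exact code:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// In the driver, when setting up the Job:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);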
8. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressedsf directory.
Note: The CreateCompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
9. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:
• The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.
• You cannot read the log file text in the compressed file.
10. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.
11. Write a second MapReduce job to read the compressed log file and write a text file. This text file should have the same text data as the log file, plus keys. The keys can contain any values you like.
12. Compile the code and run your MapReduce job.
For the MapReduce input, specify the compressedsf directory. (You created the compressed SequenceFile in this directory.)
For the MapReduce output, specify the compressedsftotext directory.
Note: The ReadCompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
13. Examine the first portion of the output in the compressedsftotext directory. You should be able to read the textual log file entries.
For this lab you will use an alternate input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
0 HAMLET
1
2
3 DRAMATIS PERSONAE
4
5
6 CLAUDIUS king of Denmark. (KING CLAUDIUS:)
7
8 HAMLET son to the late, and nephew to the present
king.
9
10 POLONIUS lord chamberlain. (LORD POLONIUS:)
...
Each line contains:
• the line number
• a separator: a tab character
• the value: the line of text
This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...
The index should contain such an entry for every word in the text.
We have provided stub files in the directory ~/training_materials/developer/exercises/inverted_index
$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput
job.setInputFormatClass(KeyValueTextInputFormat.class);
Don't forget to import this class for your use.
Hints
You may like to complete this exercise without reading any further, or you may find the following hints about the algorithm helpful.
The Mapper
Your Mapper should take as input a key and a line of words, and should emit as intermediate values each word as key, and the key as value.
For example, the line of input from the file 'hamlet':
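(The example input line is elided; given the output below, it is presumably line 282 of the file, something like:)
282 Have heaven and earth together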
produces intermediate output:
Have hamlet@282
heaven hamlet@282
and hamlet@282
earth hamlet@282
together hamlet@282
The Reducer
Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator like ',' between the values listed.
Note that this implementation is a specialization of Word Co-Occurrence as we describe it in the notes; in this case we are only interested in pairs of words which appear directly next to each other.
1. Change directories to the word_co-occurrence directory within the exercises directory.
2. Complete the Driver and Mapper stub files; you can use the standard SumReducer from the WordCount directory as your Reducer. Your Mapper's intermediate output should be in the form of a Text object as the key, and an IntWritable as the value; the key will be 'word1,word2', and the value will be 1. (A rough Mapper sketch appears after these steps.)
3. Extra credit: Write a further MapReduce job to sort the output from the first job so that the list of pairs of words appears in ascending frequency.
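A minimal sketch of such a Mapper, assuming the new org.apache.hadoop.mapreduce API; the class name and the word-splitting details are illustrative, not the stub's actual contents:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordPairMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\W+");
    for (int i = 0; i < words.length - 1; i++) {
      if (words[i].isEmpty() || words[i + 1].isEmpty()) {
        continue;  // skip empty tokens produced by the split
      }
      // Emit the adjacent pair 'word1,word2' with a count of 1.
      context.write(new Text(words[i] + "," + words[i + 1]), ONE);
    }
  }
}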
1. Change directories to the writables directory in the exercises directory.
2. Edit the stub files to solve the problem. If you completed the Word Co-Occurrence additional exercise earlier, use those files as your starting point for the Mapper and Reducer. If you did not complete the exercise earlier in the course, you will find a sample solution in the word_co-occurrence directory in the exercises directory. Copy that and modify it appropriately.
Hints
You need to create a WritableComparable object that will hold the two strings. The stub provides an empty constructor for serialization, a standard constructor that will be given two strings, a toString method, and the generated hashCode and equals methods. You will need to implement the readFields, write, and compareTo methods required by WritableComparables.
Note that Eclipse automatically generated the hashCode and equals methods in the stub file. You can generate these two methods in Eclipse by right-clicking in the source code and choosing 'Source' > 'Generate hashCode() and equals()'.
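A hedged sketch of the three methods you must supply; the class and field names are placeholders, and the constructors, toString, hashCode, and equals methods described above are omitted here:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StringPairWritable
    implements WritableComparable<StringPairWritable> {

  private String left;
  private String right;

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize both strings in a fixed order.
    out.writeUTF(left);
    out.writeUTF(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize in the same order they were written.
    left = in.readUTF();
    right = in.readUTF();
  }

  @Override
  public int compareTo(StringPairWritable other) {
    // Sort by the first string, then by the second.
    int cmp = left.compareTo(other.left);
    return (cmp != 0) ? cmp : right.compareTo(other.right);
  }
}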
As always, a sample solution is available.
Consider the MySQL database movielens, derived from the MovieLens project from the University of Minnesota. (See the note at the end of this exercise.) The database consists of several related tables, but we will import only two of these: movie, which contains about 3,900 movies; and movierating, which has about 1,000,000 ratings of those movies.
1. Log on to MySQL:
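(The command is presumably something like the following; the training username and password appear later in this exercise:)
$ mysql --user=training --password=training movielens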
2. Review the structure and contents of the movie table:
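(The commands are elided; presumably something along these lines:)
mysql> DESCRIBE movie;
mysql> SELECT * FROM movie LIMIT 5;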
3. Note the column names for the table.
____________________________________________________________________________________________!
5. Note these column names.
____________________________________________________________________________________________!
6. Exit mysql:
mysql> quit
1. Show the commands available in Sqoop:
$ sqoop help
2. List the databases (schemas) in your database server:
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training
(Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.)
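3. List the tables in the movielens database: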
$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training
4. Import the movie table into Hadoop:
$ sqoop import \
  --connect jdbc:mysql://localhost/movielens \
  --table movie --fields-terminated-by '\t' \
  --username training --password training
Note: separating the fields in the HDFS file with the tab character is one way to manage compatibility with Hive and Pig, which we will use in a future exercise.
5. Verify that the command has worked.
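(For example, you might list and inspect the imported files; these commands are illustrative, and the exact part file names depend on your import:)
$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000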
6. Import the movierating table into Hadoop.
Repeat steps 4 and 5, but for the movierating table.
This exercise uses the MovieLens data set, or subsets thereof. This data is freely
available for academic purposes, and is used and distributed by Cloudera with
the express permission of the UMN GroupLens Research Group. If you would
like to use this data for your own research purposes, you are free to do so, as
long as you cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must obtain
explicit permission. You may find the full dataset, as well as detailed license
terms, at http://www.grouplens.org/node/73
1. Ensure that you completed the Sqoop Hands-On Exercise to import the movie and movierating data.
2. Create a list of users for whom you want to generate recommendations by creating and editing a file on your local disk named users. Into that file, on separate lines, place the user IDs 6037, 6038, 6039, 6040, so that the file looks like this:
6037
6038
6039
6040
Important: Make sure there is not a blank line at the end of the file. The line containing the last user ID should not have a carriage return at the end of that line. If it does, the job will fail.
3. Upload the file to HDFS:
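(The command is presumably:)
$ hadoop fs -put users users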
4. Run Mahout's item-based recommender:
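(The command is elided here. One plausible invocation of Mahout's item-based RecommenderJob is shown below; the exact options depend on your Mahout version, so treat this as a sketch rather than the lab's exact command:)
$ mahout recommenditembased --input movierating --output recs \
    --usersFile users --similarityClassname SIMILARITY_LOGLIKELIHOOD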
5. This will take a long time to execute; it runs approximately 10 MapReduce jobs. Your instructor will now continue with the notes, but when the final job is
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in a previous exercise.
Prepare the Hive tables for this exercise by performing the following steps:
$ hive
2. Create the movie table:
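(The statement is elided here. One plausible way to do this, assuming the column names noted earlier are id, name, and year, and that the tab-separated Sqoop import sits in /user/training/movie; the actual lab may create the table differently:)
hive> CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movie';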
3. Create the movierating table:
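(Again a plausible sketch, assuming userid, movieid, and rating columns:)
hive> CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movierating';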
4. Quit the Hive shell:
hive> QUIT;
The Questions
Now that the data is imported and suitably prepared, use Hive to answer the following questions.
Hive:
You can enter Hive commands interactively in the Hive shell:
$ hive
. . .
hive> Enter interactive commands here
Or you can execute text files containing Hive commands with:
$ hive -f file_to_execute
1. What is the oldest known movie in the database? Note that movies with unknown years have a value of 0 in the year field; these do not belong in your answer.
2. List the name and year of all unrated movies (movies where the movie data has no related movierating data).
3. Produce an updated copy of the movie data with two new fields:
   numratings - the number of ratings for the movie
   avgrating - the average rating for the movie
   Unrated movies are not needed in this copy.
4. What are the 10 highest-rated movies? (Notice that your work in step 3 makes this question easy to answer.)
1. Create a text file called listrecommendations with the following contents:
dump srtd;
2. Run the Pig script to produce a list of movie recommendations:
$ pig listrecommendations
1. Change directories to the oozie-labs directory within the exercises directory:
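(The command is presumably:)
$ cd ~/training_materials/developer/exercises/oozie-labs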
2. Start the Oozie server.
3. Change directories to lab1-java-mapreduce/job:
$ cd lab1-java-mapreduce/job
4. Inspect the contents of the job.properties and workflow.xml files. You will see that this is our standard WordCount job.
5. Change directories back to the main oozie-labs directory:
$ cd ../..
6. We have provided a simple shell script to submit the Oozie workflow. Inspect run.sh:
$ cat run.sh
7. Submit the workflow to the Oozie server:
$ ./run.sh lab1-java-mapreduce
Notice that Oozie returns a job identification number.
9. When the job has completed, inspect HDFS to confirm that the output has been produced as expected.
10. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect workflow.xml that this workflow includes two MapReduce jobs which run one after the other. When you inspect the output in HDFS you will see that the second job sorts the output of the first job into descending numerical order.