Developer Exercise Instructions
General Notes ............................................................... 3
Hands-On Exercise: Using HDFS ............................................... 5
Hands-On Exercise: Running a MapReduce Job ................................. 11
Hands-On Exercise: Writing a MapReduce Program ............................. 15
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework ............ 23
Hands-On Exercise: Writing and Implementing a Combiner ..................... 24
Hands-On Exercise: Writing a Partitioner ................................... 25
Hands-On Exercise: Using Counters and a Map-Only Job ....................... 27
Hands-On Exercise: Using SequenceFiles and File Compression ................ 28
Hands-On Exercise: Creating an Inverted Index .............................. 32
Hands-On Exercise: Calculating Word Co-Occurrence .......................... 36
2. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password since the training user has unlimited sudo privileges.
3. In some command-line steps in the exercises, you will see lines like this:
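(For instance, a continued command looks something like the following; this particular command appears later in the Using HDFS exercise and is shown here purely as an illustration:)
$ gunzip -c access_log.gz \
  | hadoop fs -put - weblog/access_log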
The backslash at the end of the first line signifies that the command is not complete and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type the backslash.
2. Sample solutions are always available in the sample_solutions subdirectory of the exercise directory.
3. As the exercises progress and you gain more familiarity with Hadoop and MapReduce, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it is up to you to solve the problem! There are 'stub' files for each exercise to get you started, and you should feel free to ask your instructor for assistance at any time. We also provide some hints in many of the exercises. And, of course, you can always consult with your fellow students.
4. If you would like to have more hints than are provided in the default stub files, you can use stub files with additional hints in the stubs_with_hints subdirectory of each exercise directory.
5. If you are working in Eclipse and would prefer to use the stub files with additional hints, do the following:
a. Close Eclipse.
b. In a terminal window, change to the /home/training/scripts directory.
c. Run the following command:
./eclipse_projects.sh hint
d. Restart Eclipse.
e. Refresh each of the Eclipse projects. (To refresh a project, right-click the project and select Refresh.)
Hadoop
Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper called hadoop. If you start a terminal and run this program with no arguments, it prints a help message. To try this, run the following command:
$ hadoop
(Note: although your command prompt is more verbose, we use '$' to indicate the command prompt for brevity's sake.)
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
2. In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with this subsystem.
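3. Now enter: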
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a "home" directory under this directory, named after their username; your home directory is /user/training.
4. Try viewing the contents of the /user directory by running:
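(The command to enter here is presumably:)
$ hadoop fs -ls /user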
You will see your home directory in the directory listing.
5. Try running:
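(The command to enter here is presumably a listing of your own, currently empty, home directory:)
$ hadoop fs -ls /user/training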
There are no files, so the command silently exits. This is different than if you ran hadoop fs -ls /foo, which refers to a directory that doesn't exist and which would display an error message.
Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.
1. Change directories to the directory containing the sample data we will be using in the course.
cd ~/training_materials/developer/data
If you perform a 'regular' ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and access_log.gz.
2. Unzip shakespeare.tar.gz by running:
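(The command to enter here is presumably:)
$ tar zxvf shakespeare.tar.gz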
This creates a directory named shakespeare/ containing several files on your local filesystem.
3. Insert this directory into HDFS:
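(The command to enter here is presumably:)
$ hadoop fs -put shakespeare /user/training/shakespeare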
This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
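(The command to enter here is presumably:)
$ hadoop fs -ls /user/training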
You should see an entry for the shakespeare directory.
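5. Now try the same command without specifying a path: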
$ hadoop fs -ls
You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.
If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory.
6. We also have a Web server log file, which we will put into HDFS for use in future exercises. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
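(The command to enter here is presumably:)
$ hadoop fs -mkdir weblog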
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash ('-') in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
9. The access log file is quite large (around 500 MB). Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises.
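(One plausible way to do this, assuming you want the result in a testlog directory like the one referenced in later exercises; the exact file and directory names are up to you:)
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
  | hadoop fs -put - testlog/test_access_log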
This lists the contents of the /user/training/shakespeare directory, which consists of the files comedies, glossary, histories, poems, and tragedies.
2. The glossary file included in the tarball you began with is not strictly a work of Shakespeare, so let's remove it:
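(The command to enter here is presumably:)
$ hadoop fs -rm shakespeare/glossary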
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
3. Enter:
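(The command to enter here is presumably the following; Henry IV, Part 1 is in the histories file:)
$ hadoop fs -cat shakespeare/histories | tail -n 50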
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
Note that when you pipe the output of the fs -cat command to a local UNIX command, the full contents of the file are still extracted from HDFS and sent to your local machine. Once on your local machine, the file contents are then modified before being displayed.
Other Commands
There are several other commands associated with the FsShell subsystem, to perform most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands within FsShell. Try playing around with a few of these commands if you like.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
$ cd ~/training_materials/developer/exercises/wordcount
$ ls
This directory contains the following Java files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.
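2. Compile the three Java classes. (The elided command here is presumably:)
$ javac -classpath `hadoop classpath` *.java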
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command.
Your command includes the classpath for the Hadoop core API classes. The compiled (.class) files are placed in your local directory.
3. Collect your compiled Java files into a JAR file:
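(The command to enter here is presumably:)
$ jar cvf wc.jar *.class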
4. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
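(Based on the description below, the command is presumably:)
$ hadoop jar wc.jar WordCount shakespeare wordcounts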
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.
5. Try running this same command again without any change:
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
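6. Review the output directory of your job. (The elided command here is presumably:)
$ hadoop fs -ls wordcounts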
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
7. View the contents of the output for your job:
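(The command to enter here is presumably:)
$ hadoop fs -cat wordcounts/part-r-00000 | less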
You can page through a few screens to see words and their frequencies in the works of Shakespeare. Note that you could have specified wordcounts/* just as well in this command.
8. Try running the WordCount job against a single file:
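(One plausible command, using the poems file as input to match the pwords output directory mentioned below:)
$ hadoop jar wc.jar WordCount shakespeare/poems pwords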
When the job completes, inspect the contents of the pwords directory.
9. Clean up the output files produced by your job runs:
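(The command to enter here is presumably:)
$ hadoop fs -rm -r wordcounts pwords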
Losing the connection to the initiating process does not kill a MapReduce job. Instead, you need to tell the Hadoop JobTracker to stop the job.
2. While this job is running, open another terminal window and enter:
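(The command to enter here is presumably:)
$ hadoop job -list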
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002!
3. Copy the job id, and then kill the running job by entering:
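(The command is presumably the following, substituting your own job id:)
$ hadoop job -kill jobid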
The JobTracker kills the job, and the program running in the original terminal, reporting its progress, informs you that the job has failed.
For any text input, the job should report the average length of words that begin with 'a', 'b', and so forth. For example, for input:
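(The sample input line is elided here; one line that produces exactly the output below is:)
Now is definitely the time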
The output would be:
N 3
d 10
i 2
t 3.5
(For the initial solution, your program can be case-sensitive, which is the case for Java string processing by default.)
The Algorithm
The algorithm for this program is a simple one-pass MapReduce program:
The Mapper
The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:
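(The example input value is elided; presumably the same line as above:)
Now is definitely the time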
N 3
i 2
d 10
t 3
t 4
The Reducer
Thanks to the sort/shuffle phase built in to MapReduce, the Reducer receives the keys in sorted order, and all the values for one key appear together. So, for the Mapper output above, the Reducer (if written in Java) receives this:
N (3)
d (10)
i (2)
t (3, 4)
If you will be writing your code using Hadoop Streaming, your Reducer would receive the following:
N 3
d 10
i 2
t 3
t 4
For either type of input, the final output should be:
N 3
d 10
i 2
t 3.5
If you complete the first part of the exercise, there is a further exercise for you to try. See page 21 for instructions.
Set up Eclipse
We have created Eclipse projects for each of the Hands-On Exercises. Using Eclipse will speed up your development time.
Even if you do not plan to use Eclipse to develop your code, you still need to set up Eclipse. In the "Writing Unit Tests With the MRUnit Framework" lab, you will use Eclipse to run unit tests.
Follow these instructions to set up Eclipse by importing projects into the environment:
1. Launch Eclipse.
2. Select Import from the File menu.
3. Select General -> Existing Projects into Workspace, and click Next.
4. Specify /home/training/workspace in the Select Root Directory field. All the exercise projects will appear in the Projects field.
5. Click OK, then click Finish. That will import all projects into your workspace.
The steps to export Java code to a JAR file and run the code are described in the slides for this chapter. The following is a quick review of the steps you will need to perform to run Hadoop source code developed in Eclipse:
The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.
2. Right-click the default package entry for the Eclipse project (under the src entry).
3. Select Export.
4. Select Java > JAR file from the Export dialog box, then click Next.
5. Specify a location for the JAR file. You can place your JAR files wherever you like.
For more information about running a Hadoop job when working in Eclipse, including screen shots, refer to the slides for this chapter.
1. Define the driver
This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.
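2. Define the Mapper
(The text of this step is elided here; based on the algorithm described above, the Mapper should emit the first letter of each word as the key and the length of the word as the value.)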
Note these simple string operations in Java:
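(The elided examples are presumably along these lines, where line and word are Strings:)
line.split("\\W+")     // splits a line into an array of words
word.substring(0, 1)   // the first letter of a word
word.length()          // the length of a word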
3. Define the Reducer
In a single invocation the Reducer receives a string containing one letter along with an iterator of integers. For this call, the reducer should emit a single output of the letter and the average of the integers.
4. Test your program
Compile, jar, and test your program. You can use the entire Shakespeare dataset for your input, or you can try it with just one of the files in the dataset, or with your own test data.
Solution in Java
The directory ~/training_materials/developer/exercises/averagewordlength/sample_solution contains a set of Java class definitions that solve the problem.
1. The Mapper Script
The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:
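(The exact format is elided here; presumably key, a tab character, then the value, one pair per line, e.g.:)
N 3
i 2
d 10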
These strings should be written to stdout.
2. The Reducer Script
For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:
t 3
t 4
w 4
w 6
For this input, emit the following to stdout:
t 3.5
w 5
Observe that the reducer receives a key with each input line, and must "notice" when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted.
3. Run the streaming program
You can run your program with Hadoop Streaming via:
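(The exact command is elided; a plausible invocation looks like the following, where the streaming JAR location and the script names depend on your installation and your solution:)
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming*.jar \
    -input shakespeare -output avgwordstreaming \
    -mapper avg_mapper.py -reducer avg_reducer.py \
    -file avg_mapper.py -file avg_reducer.py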
(Remember, you may need to delete any previous output before running your program, with hadoop fs -rm -r dataToDelete.)
Solution in Python
You can find a working solution to this exercise written in Python in the directory ~/training_materials/developer/exercises/averagewordlength/python.
Additional Exercise
If you have more time, attempt the additional exercise. In the log_file_analysis directory, you will find stubs for the Mapper and Driver. (There is also a sample solution available.)
Your task is to count the number of hits made from each IP address in the sample (anonymized) Apache log file that you uploaded to the /user/training/weblog directory in HDFS when you performed the Using HDFS exercise.
Note: If you want, you can test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
1. Change directory to ~/training_materials/developer/exercises/log_file_analysis.
2. Using the stub files in that directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address.
Note: You should have set up Eclipse when you performed the "Writing a MapReduce Program" exercise. If you did not set up Eclipse, locate the section of that exercise titled "Set up Eclipse" and follow the steps to import projects into Eclipse.
1. Launch Eclipse and expand the mrunit folder.
2. Examine the TestWordCount.java file in the mrunit project. Notice that three tests have been created, one each for the Mapper, Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.
3. Run the tests by right-clicking on TestWordCount.java and choosing 'Run As' -> 'JUnit Test'.
4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.
5. Now implement the three tests. If you need hints, there is a sample solution in the sample solution directory within the mrunit directory. (A rough sketch of one such test also appears after these steps.)
6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
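For reference, a hedged sketch of what a Mapper test with MRUnit can look like; the key/value types assume the WordCount classes from the earlier exercise, and the class name and test data here are illustrative, not the stub's actual contents:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordMapper {

  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // Wrap the Mapper under test in an MRUnit MapDriver.
    mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
    mapDriver.setMapper(new WordMapper());
  }

  @Test
  public void testMapper() throws Exception {
    // Feed one input record and assert the expected (key, value) outputs in order.
    mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("dog"), new IntWritable(1))
             .runTest();
  }
}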
There are further exercises if you have more time.
Implement a Combiner
1. Change directory to ~/training_materials/developer/exercises/combiner
2. Copy the WordCount Mapper and Reducer into this directory:
$ cp ../wordcount/WordMapper.java .
$ cp ../wordcount/SumReducer.java .
3. Complete the WordCountDriver.java code and, if necessary, the SumCombiner.java code to implement a Combiner for the WordCount program.
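(In the driver, wiring in the Combiner typically comes down to one line; assuming the class names above:)
job.setCombinerClass(SumCombiner.class);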
4. Compile and test your solution.
We will be modifying the program you wrote to solve the Additional Exercise on page 21.
2. If you completed the Additional Exercise on page 21, copy the code for your Mapper, Reducer and driver to this directory. If you did not, copy the sample solution from ~/training_materials/developer/exercises/log_file_analysis/sample_solution into the current directory.
The Problem
The code you now have in the partitioner directory counts all the hits made to a Web server from each different IP address. (If you did not complete the exercise, view the code and be sure you understand how it works.) Our aim in this exercise is to modify the code such that we get a different result: we want one output file per month, each file containing the number of hits from each IP address in that month. In other words, there will be 12 output files, one for each month.
Note: we are actually breaking the standard MapReduce paradigm here. The standard paradigm says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.
1. You will need to modify your driver code to specify that you want 12 Reducers. Hint: job.setNumReduceTasks() specifies the number of Reducers for the job.
2. Change the Mapper so that instead of emitting a 1 for each value, it emits the entire line (or just the month portion if you prefer).
3. Modify the MyPartitioner.java stub file to create a Partitioner that sends the (key, value) pair to the correct Reducer based on the month. Remember that the Partitioner receives both the key and value, so you can inspect the value to determine which Reducer to choose. (A rough sketch appears after these steps.)
4. Configure your job to use your custom Partitioner. Hint: use job.setPartitionerClass() in your driver code.
5. Compile and test your code. Hints:
a. Write unit tests for your Partitioner!
b. Test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
c. Remember that the log file may contain unexpected data, that is, lines which do not conform to the expected format. Ensure that your code copes with such lines.
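A minimal sketch of what such a Partitioner might look like, assuming Text keys and Text values and a very simple way of finding the month in the log line; the class name and parsing details are illustrative, not the stub's actual contents:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MonthPartitioner extends Partitioner<Text, Text> {

  private static final String[] MONTHS = {
      "Jan", "Feb", "Mar", "Apr", "May", "Jun",
      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    String line = value.toString();
    // Apache log dates look like [10/Sep/2013:...], so look for /Mon/ in the line.
    for (int i = 0; i < MONTHS.length; i++) {
      if (line.contains("/" + MONTHS[i] + "/")) {
        return i % numReduceTasks;
      }
    }
    return 0;  // lines that do not match the expected format
  }
}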
1. Change directory to the counters directory within the exercises directory:
$ cd ~/training_materials/developer/exercises/counters
2. Complete the stub files to provide the solution.
Hints
You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the Driver code.
For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the Using HDFS exercise.
Note: If you want, you can test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
Use a counter group called something like 'ImageCounter', with names 'gif', 'jpeg' and 'other'.
In your Driver code, retrieve the values of the counters after the job has completed and report them using System.out.println.
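A hedged sketch of the two pieces involved, using the group and counter names suggested above; where exactly these calls sit in your code is up to you:

// In the Mapper, for each .gif request found in the log line:
context.getCounter("ImageCounter", "gif").increment(1);

// In the Driver, after job.waitForCompletion(true) returns:
long gifs = job.getCounters().findCounter("ImageCounter", "gif").getValue();
System.out.println("gif requests: " + gifs);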
As always, a sample solution is available if you need more hints.
First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression.
When creating the SequenceFile, use the full access log file for input data. You uploaded the access log file to the HDFS /user/training/weblog directory when you performed the "Using HDFS" exercise.
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.
1. Determine the number of HDFS blocks occupied by the access log file:
a. In a browser window, start the Name Node Web UI. The URL is http://localhost:50070.
b. Click "Browse the filesystem."
c. Navigate to the /user/training/weblog/access_log file.
d. Scroll down to the bottom of the page. The total number of blocks occupied by the access log file appears in the browser window.
2. Change directory to the createsequencefile directory within the exercises directory.
$ cd ~/training_materials/developer/exercises/createsequencefile
Note: If you specify an output key type other than LongWritable, you must call job.setOutputKeyClass, not job.setMapOutputKeyClass. If you specify an output value type other than Text, you must call job.setOutputValueClass, not job.setMapOutputValueClass.
4. Compile the code and run your MapReduce job.
For the MapReduce input, use the /user/training/weblog directory.
For the MapReduce output, specify the uncompressedsf directory.
Note: The CreateUncompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
5. Examine the initial portion of the output SequenceFile using the following command:
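(The command is elided here; presumably something like the following, where the exact part file name depends on your job:)
$ hadoop fs -cat uncompressedsf/part-m-00000 | less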
Some of the data in the SequenceFile is unreadable, but parts of the SequenceFile should be recognizable:
• The string SEQ, which appears at the beginning of a SequenceFile
• The Java classes for the keys and values
• Text from the access log file
6. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.
7. Modify your MapReduce job to compress the output SequenceFile. Add statements to your driver to configure the output SequenceFile as follows:
• Compress the output file.
• Use block compression.
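One way to do this in the driver, assuming the new-API SequenceFileOutputFormat; this is a sketch, not necessarily the sample solution's exact code:

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// In the driver, when setting up the Job:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);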
8. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressedsf directory.
Note: The CreateCompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
9. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles:
• The compressed SequenceFile specifies the org.apache.hadoop.io.compress.SnappyCodec compression codec in its header.
• You cannot read the log file text in the compressed file.
10. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.
11. Write a second MapReduce job to read the compressed log file and write a text file. This text file should have the same text data as the log file, plus keys. The keys can contain any values you like.
12. Compile the code and run your MapReduce job.
For the MapReduce input, specify the compressedsf directory. (You created the compressed SequenceFile in this directory.)
For the MapReduce output, specify the compressedsftotext directory.
Note: The ReadCompressedSequenceFile.java file in the sample_solutions directory contains the solution for the preceding part of the exercise.
13. Examine the first portion of the output in the compressedsftotext directory. You should be able to read the textual log file entries.
For this lab you will use an alternate input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
0 HAMLET
1
2
3 DRAMATIS PERSONAE
4
5
6 CLAUDIUS king of Denmark. (KING CLAUDIUS:)
7
8 HAMLET son to the late, and nephew to the present
king.
9
10 POLONIUS lord chamberlain. (LORD POLONIUS:)
...
Each line contains:
• the line number
• a separator: a tab character
• the value: the line of text
This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
honeysuckle 2kinghenryiv@1038,midsummernightsdream@2175,...
The index should contain such an entry for every word in the text.
We have provided stub files in the directory ~/training_materials/developer/exercises/inverted_index
$ cd ~/training_materials/developer/data
$ tar zxvf invertedIndexInput.tgz
$ hadoop fs -put invertedIndexInput invertedIndexInput
job.setInputFormatClass(KeyValueTextInputFormat.class);
Don't forget to import this class for your use.
Hints
You may like to complete this exercise without reading any further, or you may find the following hints about the algorithm helpful.
The Mapper
Your Mapper should take as input a key and a line of words, and should emit as intermediate values each word as key, and the key as value.
For example, the line of input from the file 'hamlet':
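(The example input line is elided; given the output below, it is presumably line 282 of the file, something like:)
282 Have heaven and earth together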
produces intermediate output:
Have hamlet@282
heaven hamlet@282
and hamlet@282
earth hamlet@282
together hamlet@282
The Reducer
Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator like ',' between the values listed.
Note that this implementation is a specialization of Word Co-Occurrence as we describe it in the notes; in this case we are only interested in pairs of words which appear directly next to each other.
1. Change directories to the word_co-occurrence directory within the exercises directory.
2. Complete the Driver and Mapper stub files; you can use the standard SumReducer from the WordCount directory as your Reducer. Your Mapper's intermediate output should be in the form of a Text object as the key, and an IntWritable as the value; the key will be 'word1,word2', and the value will be 1. (A rough Mapper sketch appears after these steps.)
3. Extra credit: Write a further MapReduce job to sort the output from the first job so that the list of pairs of words appears in ascending frequency.
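A minimal sketch of such a Mapper, assuming the new org.apache.hadoop.mapreduce API; the class name and the word-splitting details are illustrative, not the stub's actual contents:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordPairMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\W+");
    for (int i = 0; i < words.length - 1; i++) {
      if (words[i].isEmpty() || words[i + 1].isEmpty()) {
        continue;  // skip empty tokens produced by the split
      }
      // Emit the adjacent pair 'word1,word2' with a count of 1.
      context.write(new Text(words[i] + "," + words[i + 1]), ONE);
    }
  }
}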
1. Change directories to the writables directory in the exercises directory.
2. Edit the stub files to solve the problem. If you completed the Word Co-Occurrence additional exercise earlier, use those files as your starting point for the Mapper and Reducer. If you did not complete the exercise earlier in the course, you will find a sample solution in the word_co-occurrence directory in the exercises directory. Copy that and modify it appropriately.
Hints
You need to create a WritableComparable object that will hold the two strings. The stub provides an empty constructor for serialization, a standard constructor that will be given two strings, a toString method, and the generated hashCode and equals methods. You will need to implement the readFields, write, and compareTo methods required by WritableComparables.
Note that Eclipse automatically generated the hashCode and equals methods in the stub file. You can generate these two methods in Eclipse by right-clicking in the source code and choosing 'Source' > 'Generate hashCode() and equals()'.
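A hedged sketch of the three methods you must supply; the class and field names are placeholders, and the constructors, toString, hashCode, and equals methods described above are omitted here:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StringPairWritable
    implements WritableComparable<StringPairWritable> {

  private String left;
  private String right;

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize both strings in a fixed order.
    out.writeUTF(left);
    out.writeUTF(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize in the same order they were written.
    left = in.readUTF();
    right = in.readUTF();
  }

  @Override
  public int compareTo(StringPairWritable other) {
    // Sort by the first string, then by the second.
    int cmp = left.compareTo(other.left);
    return (cmp != 0) ? cmp : right.compareTo(other.right);
  }
}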
As always, a sample solution is available.
Consider the MySQL database movielens, derived from the MovieLens project from the University of Minnesota. (See the note at the end of this exercise.) The database consists of several related tables, but we will import only two of these: movie, which contains about 3,900 movies; and movierating, which has about 1,000,000 ratings of those movies.
1. Log on to MySQL:
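(The command is presumably something like the following; the training username and password appear later in this exercise:)
$ mysql --user=training --password=training movielens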
2. Review the structure and contents of the movie table:
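(The commands are elided; presumably something along these lines:)
mysql> DESCRIBE movie;
mysql> SELECT * FROM movie LIMIT 5;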
3. Note the column names for the table.
____________________________________________________________________________________________!
5. Note these column names.
____________________________________________________________________________________________!
6. Exit mysql:
mysql> quit
1. Show the commands available in Sqoop:
$ sqoop help
2. List the databases (schemas) in your database server:
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training
(Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.)
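3. List the tables in the movielens database: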
$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training
4. Import the movie table into Hadoop:
$ sqoop import \
  --connect jdbc:mysql://localhost/movielens \
  --table movie --fields-terminated-by '\t' \
  --username training --password training
Note: separating the fields in the HDFS file with the tab character is one way to manage compatibility with Hive and Pig, which we will use in a future exercise.
5. Verify that the command has worked.
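(For example, you might list and inspect the imported files; these commands are illustrative, and the exact part file names depend on your import:)
$ hadoop fs -ls movie
$ hadoop fs -tail movie/part-m-00000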
6. Import the movierating table into Hadoop.
Repeat steps 4 and 5, but for the movierating table.
This exercise uses the MovieLens data set, or subsets thereof. This data is freely
available for academic purposes, and is used and distributed by Cloudera with
the express permission of the UMN GroupLens Research Group. If you would
like to use this data for your own research purposes, you are free to do so, as
long as you cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must obtain
explicit permission. You may find the full dataset, as well as detailed license
terms, at http://www.grouplens.org/node/73
1. Ensure that you completed the Sqoop Hands-On Exercise to import the movie and movierating data.
2. Create a list of users for whom you want to generate recommendations by creating and editing a file on your local disk named users. Into that file, on separate lines, place the user IDs 6037, 6038, 6039, 6040, so that the file looks like this:
6037
6038
6039
6040
Important: Make sure there is not a blank line at the end of the file. The line containing the last user ID should not have a carriage return at the end of that line. If it does, the job will fail.
3. Upload the file to HDFS:
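(The command is presumably:)
$ hadoop fs -put users users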
4. Run Mahout's item-based recommender:
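(The command is elided here. One plausible invocation of Mahout's item-based RecommenderJob is shown below; the exact options depend on your Mahout version, so treat this as a sketch rather than the lab's exact command:)
$ mahout recommenditembased --input movierating --output recs \
    --usersFile users --similarityClassname SIMILARITY_LOGLIKELIHOOD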
5. This will take a long time to execute; it runs approximately 10 MapReduce jobs. Your instructor will now continue with the notes, but when the final job is
The data sets for this exercise are the movie and movierating data imported from MySQL into Hadoop in a previous exercise.
Prepare the Hive tables for this exercise by performing the following steps:
$ hive
2. Create the movie table:
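(The statement is elided here. One plausible way to do this, assuming the column names noted earlier are id, name, and year, and that the tab-separated Sqoop import sits in /user/training/movie; the actual lab may create the table differently:)
hive> CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movie';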
3. Create the movierating table:
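(Again a plausible sketch, assuming userid, movieid, and rating columns:)
hive> CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/training/movierating';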
4. Quit the Hive shell:
hive> QUIT;
The Questions
Now that the data is imported and suitably prepared, use Hive to answer the following questions.
Hive:
You can enter Hive commands interactively in the Hive shell:
$ hive
. . .
hive> Enter interactive commands here
Or you can execute text files containing Hive commands with:
$ hive -f file_to_execute
1. What is the oldest known movie in the database? Note that movies with unknown years have a value of 0 in the year field; these do not belong in your answer.
2. List the name and year of all unrated movies (movies where the movie data has no related movierating data).
3. Produce an updated copy of the movie data with two new fields:
   numratings - the number of ratings for the movie
   avgrating - the average rating for the movie
   Unrated movies are not needed in this copy.
4. What are the 10 highest-rated movies? (Notice that your work in step 3 makes this question easy to answer.)
1. Create a text file called listrecommendations with the following contents:
dump srtd;
2. Run the Pig script to produce a list of movie recommendations:
$ pig listrecommendations
1. Change directories to the oozie-labs directory within the exercises directory:
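(The command is presumably:)
$ cd ~/training_materials/developer/exercises/oozie-labs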
2. Start the Oozie server.
3. Change directories to lab1-java-mapreduce/job:
$ cd lab1-java-mapreduce/job
4. Inspect the contents of the job.properties and workflow.xml files. You will see that this is our standard WordCount job.
5. Change directories back to the main oozie-labs directory:
$ cd ../..
6. We have provided a simple shell script to submit the Oozie workflow. Inspect run.sh:
$ cat run.sh
7. Submit the workflow to the Oozie server:
$ ./run.sh lab1-java-mapreduce
Notice that Oozie returns a job identification number.
9. When the job has completed, inspect HDFS to confirm that the output has been produced as expected.
10. Repeat the above procedure for lab2-sort-wordcount. Notice when you inspect workflow.xml that this workflow includes two MapReduce jobs which run one after the other. When you inspect the output in HDFS you will see that the second job sorts the output of the first job into descending numerical order.