Hadoop Python MapReduce Tutorial For Beginners
This article originally accompanied my tutorial session at the Big
Data Madison Meetup, November 2013
(http://www.meetup.com/BigDataMadison/events/149122882/).
Once you’re booted into the quickstart VM we’re going to get our
dataset. I’m going to use the play-by-play NFL data by Brian Burke
(http://www.advancednflstats.com/2010/04/play-by-play-data.html).
To start with we’re only going to use the data in his Git repository
(https://github.com/eljefe6a/nfldata).
cd ~/workspace
git clone https://github.com/eljefe6a/nfldata.git
cd nfldata
cat stadiums.csv # BAH! Everything is a single line
dos2unix -l -n stadiums.csv unixstadiums.csv
cat unixstadiums.csv # Hooray! One stadium per line
2. A Mapper Class
takes K,V inputs, writes K,V outputs
3. A Reducer Class
takes K, Iterator[V] inputs, and writes K,V outputs
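The two contracts above can be sketched in plain Python. This is only an in-memory illustration, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_local` are hypothetical, the word-count logic is just a stand-in example, and the grouping step plays the role of Hadoop's shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

KV = Tuple[str, int]


def map_fn(key: int, value: str) -> Iterator[KV]:
    """Mapper contract: takes one (K, V) pair, yields zero or more (K, V) pairs."""
    # Stand-in logic: emit each word in the line with a count of 1.
    for word in value.split():
        yield word, 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[KV]:
    """Reducer contract: takes K and an iterator of V, yields (K, V) pairs."""
    yield key, sum(values)


def run_local(lines: List[str]) -> List[KV]:
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            grouped[k].append(v)
    results: List[KV] = []
    for k in sorted(grouped):  # keys reach the reducer in sorted order
        results.extend(reduce_fn(k, iter(grouped[k])))
    return results


if __name__ == "__main__":
    print(run_local(["a b a", "b a"]))  # [('a', 3), ('b', 2)]
```

The reducer here sees all values for a key at once, which is the part that Hadoop Streaming does not give you.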
It’s just like running a normal MapReduce job, except that you need to
provide some information about what scripts you want to use.
Hadoop comes with the streaming jar in its lib directory, so just find
that to use it. The job below counts the number of lines in our
stadiums file. (This is really overkill, because there are only 32
records.)
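A streaming job is launched with `hadoop jar`, the streaming jar, and the `-input`, `-output`, `-mapper`, and `-reducer` options. As a rough sketch, the helper below only assembles such a command as an argument list; it does not run anything. The function name, the jar path, and the input and output paths are placeholders (assumptions, not taken from the VM), and using `cat` as the mapper with `wc -l` as the reducer is one plausible way to count lines.

```python
import shlex
from typing import List


def build_streaming_command(streaming_jar: str, input_path: str,
                            output_path: str, mapper: str,
                            reducer: str) -> List[str]:
    """Assemble (but do not run) a Hadoop Streaming invocation."""
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        streaming_jar="/path/to/hadoop-streaming.jar",  # placeholder: locate the real jar
        input_path="unixstadiums.csv",       # assumed to have been copied into HDFS
        output_path="line_count_output",     # hypothetical name; must not exist yet
        mapper="cat",
        reducer="wc -l",
    )
    # Print a copy-pasteable shell command with each argument quoted.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Hadoop refuses to write into an output directory that already exists, so pick a fresh output path for each run.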
A good way to make sure your job has run properly is to look at the
jobtracker dashboard. In the quickstart VM there is a link in the
bookmarks bar.
Let’s use MapReduce to find the number of stadiums with artificial
and natural playing surfaces.
IMPORTANT GOTCHA!
It will receive:
TRUE 1
TRUE 1
TRUE 1
TRUE 1
FALSE 1
FALSE 1
FALSE 1
This means you have to do a little state tracking in your reducer. This
will be demonstrated in the code below.
To follow along, check out my git repository
(https://github.com/rathboma/hadoop-framework-examples) (on the
virtual machine):
cd ~/workspace
git clone https://github.com/rathboma/hadoop-framework-examples.g
cd hadoop-framework-examples
MAPPER
import sys
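A mapper for this job only has to emit the turf flag and a count of 1, separated by a tab, for each stadium row. The following is a minimal sketch rather than the script from the repository: it assumes the TRUE/FALSE artificial-turf flag is the sixth comma-separated column (`TURF_COLUMN = 5`), which you should check against the CSV header, and `map_line` is a hypothetical helper name.

```python
#!/usr/bin/env python
import sys

# Assumption: zero-based index of the artificial-turf (TRUE/FALSE) column.
TURF_COLUMN = 5


def map_line(line, turf_column=TURF_COLUMN):
    """Turn one CSV row into a 'turf<TAB>1' record for the reducer."""
    fields = line.strip().split(",")
    return "\t".join([fields[turf_column], "1"])


if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip():  # skip blank lines
            print(map_line(line))
```

A plain `split(",")` does not understand quoted fields that contain commas, and a header row, if present, would be emitted as its own key, so treat this as a starting point only.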
REDUCER
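#!/usr/bin/env python
import sys

last_turf = None
turf_count = 0
# The loop and the key-change branch are reconstructed from the description above.
for line in sys.stdin:
    line = line.strip()
    turf, count = line.split("\t")
    count = int(count)
    # if this is the first iteration
    if not last_turf:
        last_turf = turf
    if turf == last_turf:
        turf_count += count  # same key as the previous line
    else:
        # key changed: emit the finished key, then start on the new one
        print("\t".join(str(v) for v in [last_turf, turf_count]))
        last_turf, turf_count = turf, count
# this is to catch the final count after all input lines have been read
print("\t".join(str(v) for v in [last_turf, turf_count]))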
You might notice that the reducer is significantly more complex than
the pseudocode. That is because the streaming interface is limited
and cannot really provide a way to implement the standard API.
As noted, each line read contains both the KEY and the VALUE, so it’s up
to our reducer to keep track of key changes and act accordingly.
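The key-change bookkeeping described above can also be delegated to `itertools.groupby` from the standard library, which groups consecutive items that share a key. This is an alternative sketch, not the reducer from the repository; `parse` and `reduce_sorted` are hypothetical helper names, and it relies on the input already being sorted by key, as it is when Hadoop hands lines to a streaming reducer.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, count) pairs from 'key<TAB>count' lines."""
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            key, count = line.split("\t")
            yield key, int(count)


def reduce_sorted(lines):
    """Sum counts per key; input must already be sorted by key."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


if __name__ == "__main__":
    for key, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (key, total))
```

Because `groupby` only merges adjacent items, unsorted input would produce several partial totals for the same key, which is the same constraint the hand-written state tracking has.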
chmod +x simple/mapper.py
chmod +x simple/reducer.py
TESTING
cd streaming-python
cat ~/workspace/nfldata/unixstadiums.csv | simple/mapper.py | sort | simple/reducer.py
# FALSE 15
# TRUE 17
Matthew Rathbone
Consultant Big Data Infrastructure Engineer at Rathbone Labs
(https://www.rathbonelabs.com). British. Data Nerd. Lucky husband and
father.
Copyright Matthew Rathbone 2020, All Rights Reserved. Background image from Subtle Patterns
(http://subtlepatterns.com/)