
Movies data analysis using MapReduce

Tushar B. Kute,
http://tusharkute.com
What is MapReduce?

• MapReduce is a framework with which we can write
applications to process huge amounts of data, in parallel,
on large clusters of commodity hardware in a reliable
manner.
• MapReduce is a processing technique and a programming
model for distributed computing based on Java.
• The MapReduce algorithm contains two
important tasks, namely Map and Reduce.
Map and Reduce

• Map takes a set of data and converts it into another set
of data, where individual elements are broken down into
tuples (key/value pairs).
• The reduce task takes the output from a map as input and
combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.
Map and Reduce

• The major advantage of MapReduce is that it is easy to
scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing
primitives are called mappers and reducers.
• Decomposing a data processing application into mappers
and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens
of thousands of machines in a cluster is merely a
configuration change.
• This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm

• A MapReduce program executes in three stages, namely the
map stage, the shuffle stage, and the reduce stage.
• Map stage: The map or mapper’s job is to process the
input data. Generally the input data is in the form of a file
or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function
line by line. The mapper processes the data and creates
several small chunks of data.
• Reduce stage: This stage is the combination of the
Shuffle stage and the Reduce stage. The Reducer’s job is
to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be
stored in the HDFS.
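As a hypothetical illustration (the values below are made up, not taken from the slides or the dataset), the movie-ratings example used later in this deck would flow through the three stages roughly like this:

Input lines (movie, user, rating, timestamp):
    101  1  4  ...
    101  2  5  ...
    202  1  3  ...
Map output:           (101, 4), (101, 5), (202, 3)
Shuffle/sort output:  (101, [4, 5]), (202, [3])
Reduce output:        (101, 4.5), (202, 3.0)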
The MapReduce
Inserting Data into HDFS

• The MapReduce framework operates on <key, value>
pairs, that is, the framework views the input to the job as
a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of
different types.
• The key and value classes have to be serializable by the
framework and hence need to implement the Writable
interface. Additionally, the key classes have to implement
the WritableComparable interface to facilitate sorting by
the framework.
• Input and Output types of a MapReduce job: (Input)
<k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
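As a hedged sketch of how these types appear in code (assuming the org.apache.hadoop.mapreduce API; the class names MyMapper and MyReducer are illustrative, not from the slides), the generic Mapper and Reducer classes are parameterized exactly along this <k1, v1> -> <k2, v2> -> <k3, v3> chain:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<k1, v1, k2, v2>: with the default text input format,
// k1 is the byte offset of a line and v1 is the line itself.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map(k1, v1) writes intermediate (k2, v2) pairs via context.write(...)
}

// Reducer<k2, v2, k3, v3>: each distinct k2 arrives with all of its
// v2 values grouped together; reduce emits the final (k3, v3) pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce(k2, Iterable<v2>) writes final (k3, v3) pairs via context.write(...)
}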
Data input and output
Terminologies

• Mapper - Maps the input key/value pairs to a set of
intermediate key/value pairs.
• NameNode - Node that manages the Hadoop Distributed File System
(HDFS).
• DataNode - Node where the data resides before any
processing takes place.
• MasterNode - Node where the JobTracker runs and which accepts job
requests from clients.
• SlaveNode - Node where the Map and Reduce programs run.
• JobTracker - Schedules jobs and tracks the assigned jobs with the
TaskTracker.
• TaskTracker - Tracks the tasks and reports status to the JobTracker.
• Job - An execution of a Mapper and a Reducer across a
dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.
Example:

• Use the movies dataset. Write map and reduce
methods to determine the average rating of
movies. The input consists of a series of lines, each
containing movie number, user number, rating and
timestamp. The map should emit movie number
and a list of ratings, and reduce should return the
average rating for each movie number.
The dataset: u.data
Example:

• Movies.java
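The original slide shows the Movies.java source only as an image, so it is not reproduced here. The listing below is a minimal sketch of what such a job could look like, assuming the tab-separated column order stated in the example (movie number, user number, rating, timestamp) and the org.apache.hadoop.mapreduce API shipped with hadoop-core-1.2.1; class and variable names are illustrative, not the author's. The mapper emits one (movie, rating) pair per record, and the framework's shuffle groups these into the per-movie list of ratings that the reducer averages.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Movies {

    // Map: for every input line, emit (movie number, rating).
    public static class RatingMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed column order: movie \t user \t rating \t timestamp
            String[] fields = value.toString().split("\t");
            if (fields.length >= 3) {
                int movie = Integer.parseInt(fields[0].trim());
                int rating = Integer.parseInt(fields[2].trim());
                context.write(new IntWritable(movie), new IntWritable(rating));
            }
        }
    }

    // Reduce: average all ratings seen for one movie number.
    public static class AverageReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, FloatWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (IntWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new FloatWritable((float) sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "movie average rating");
        job.setJarByClass(Movies.class);
        job.setMapperClass(RatingMapper.class);
        job.setReducerClass(AverageReducer.class);
        // Map and reduce output value types differ, so set both explicitly.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If your copy of u.data uses the standard MovieLens column order (user id, movie id, rating, timestamp), swap field indexes 0 and 1 in the mapper.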
Compilation and Execution

• Let us assume we are in the home directory of a
Hadoop user (e.g. /home/rashmi).
• Follow the steps given below to compile and
execute the above program.
• Step 1
– The following command is used to create a directory
to store the compiled Java classes.
– $ mkdir movies
Compilation and Execution

• Step 2
Download hadoop-core-1.2.1.jar, which is used to compile
and execute the MapReduce program. Visit the following
link to download the jar:
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1
Let us assume the jar is saved in the home directory,
/home/rashmi, and the source file is in /home/rashmi/movies.
• Step 3
The following commands are used for compiling the
Movies.java program and creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar movies/Movies.java
$ jar -cvf movies.jar -C movies/ .
Compilation and Execution

• Step 4
– The following command is used to create an input directory in
HDFS.
– $hadoop fs -mkdir /input
• Step 5
– The following command is used to copy the input dataset
file u.data to HDFS.
– $hadoop fs -put u.data /input
• Step 6
– The following command is used to verify the files in the input
directory.
– $hadoop fs -ls /input
Compilation and Execution

• Step 7
– The following command is used to run the Movies
application by taking the input files from the input
directory.
– $hadoop jar movies.jar Movies /input /output
– Wait for a while until the job finishes. After
execution, the output will contain the number of
input splits, the number of Map tasks, the number of
reducer tasks, etc. The output directory must not
already exist.
Compilation and Execution

• Step 8
– The following command is used to verify the
resultant files in the output folder.
– $hadoop fs -ls /output
• Step 9
– The following command is used to see the output in the
part-r-00000 file, which is produced by the reducer and
stored in HDFS.
– $hadoop fs -cat /output/part-r-00000
Compilation and Execution

• Step 10
The following command is used to copy
the output file from HDFS to the local file
system for analysis.
– $hadoop fs -get /output/part-r-00000
Output:
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.

/mITuSkillologies @mitu_group

Web Resources
http://mitu.co.in
http://tusharkute.com

tushar@tusharkute.com
contact@mitu.co.in
