
Lab Manual

Big Data Analytics Lab

S.SURYANARAYANARAJU

Department of Computer Science and Engineering


SRKR Engineering College (A), Bhimavaram, India
Course objectives:

1. Able to use tools such as the Ubuntu operating system, Java 8, and Eclipse.

2. Able to create a Hadoop environment and develop MapReduce programs for real-time applications.

Course outcomes:

1. Create a Hadoop environment.

2. Develop a solution for a given problem using MapReduce.

LIST OF PROGRAMS

S.No  TITLE OF PROGRAM
1     Setting up Hadoop on a single-node cluster (pseudo-distributed mode)
2     Setting up Hadoop on multi-node (distributed cluster environment)
3     Basic Hadoop file system commands
4     Write a map-reduce program for word count
5     Write a map-reduce program for finding maximum temperature in weather dataset
6     Write a map-reduce program for join of two records
7     Write a map-reduce program to find duplicate records in a csv file
8     Write a map-reduce program to find patent citations in patent dataset
9     Write a map-reduce program for total retail collection (retail dataset)
10    Write a map-reduce program for store-wise collection (retail dataset)
11    Write a map-reduce program for product-wise collection (retail dataset)

EXTRA PROGRAMS LIST

1  Write a map-reduce program for word count (with combiner)
2  Write a map-reduce program for word count (without case sensitivity)
3  Write a map-reduce program to sum the numbers in a text file
4  Write a map-reduce program for finding minimum temperature in weather dataset
5  Write a map-reduce program to display the highest CTC in each department (employee CTC dataset)

Note: Cost to company (CTC) is a term for the total salary package of an employee.

Experiment #1

Setting up Hadoop on Pseudo distributed mode

(Single node cluster).

Objective: Setting up Hadoop in pseudo-distributed mode

The pseudo-distributed mode is also known as a single-node cluster where both NameNode
and DataNode will reside on the same machine.

In pseudo-distributed mode, all the Hadoop daemons run on a single node. This configuration is mainly used for testing, when we do not need to think about the resources and about other users sharing them.

In this architecture, a separate JVM is spawned for every Hadoop component, and the daemons communicate with each other over network sockets, effectively producing a fully functioning and optimized mini-cluster on a single host.

PROGRAM MODULE:

Step 1: Install Java 8.

Install Java 8 (Oracle Java recommended). Hadoop requires a working Java 1.5+ installation; however, Java 8 is recommended for running Hadoop.

1.1 Install Python Software Properties

Command: sudo apt-get install python-software-properties

1.2 Add Repository

Command: sudo add-apt-repository ppa:webupd8team/java

1.3 Update the source list

Command :sudo apt-get update

1.4 Install Java

Command: sudo apt-get install oracle-java8-installer

Step 2: Configure SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local

machine if you want to use Hadoop on it.

2.1 Install Open SSH Server-Client

Command: sudo apt-get install openssh-server openssh-client

2.2 Generate KeyPairs

Command: ssh-keygen -t rsa -P ""

2.3 Configure password-less SSH

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

2.4 Check by SSH to localhost

Command: ssh localhost

Step 3: Install Hadoop

3.1 Download Hadoop

Command: wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.2.tar.gz

3.2 Untar Tar ball

Command: tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz

Step 4: Setup Configuration

4.1 Edit .bashrc

Edit the .bashrc file located in the user's home directory and add the following parameters.

Command: vi .bashrc

export HADOOP_PREFIX="/home/cse/hadoop-2.5.0-cdh5.3.2"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
Command: source .bashrc

4.2 Edit hadoop-env.sh

hadoop-env.sh contains the environment variables that are used in the scripts to run Hadoop, such as the Java home path. Edit the configuration file hadoop-env.sh (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.

Command: vi hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

4.3 Edit core-site.xml

core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of Hadoop core, such as I/O settings that are common to HDFS and MapReduce. Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi core-site.xml

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/home/cse/hdata</value>

</property>

</configuration>

Note: /home/cse/hdata is a sample location; please specify a location where you have read/write privileges.

4.4 Edit hdfs-site.xml

hdfs-site.xml contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS. Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi hdfs-site.xml

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

4.5 Edit mapred-site.xml

mapred-site.xml contains configuration settings of a MapReduce application, such as the number of JVMs that can run in parallel, the sizes of the mapper and reducer processes, CPU cores available for a process, etc. In some cases the mapred-site.xml file is not available, so it has to be created from the mapred-site.xml.template file shipped in the same directory. Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi mapred-site.xml

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

4.6 Edit yarn-site.xml

yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory management size, the operations needed on programs and algorithms, etc. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

</configuration>

Step 5: Start the Cluster

5.1 Format the name node:

Command: bin/hdfs namenode -format

5.2 Start HDFS Services

Command: sbin/start-dfs.sh

5.3 Start YARN Services

Command: sbin/start-yarn.sh

5.4 Check whether services have been started

To check that all the Hadoop services are up and running, run the below command.
Command: jps

NameNode

DataNode

ResourceManager

NodeManager

SecondaryNameNode

Step 6. Stop the Cluster

6.1 Stop HDFS Services

Command:sbin/stop-dfs.sh

6.2 Stop YARN Services

Command: sbin/stop-yarn.sh

Experiment # 2

Setting up Hadoop on multi-node

Objective: Setting up Hadoop in a distributed cluster environment

In fully distributed mode, the daemons NameNode, ResourceManager, and SecondaryNameNode (optional; it can run on a separate node) run on the master node. The daemons DataNode and NodeManager run on the slave nodes.

PROGRAM MODULE:

Step 1. Add entries in the hosts file: edit the hosts file and add the entries for the master and the slaves.

sudo vi /etc/hosts

MASTER-IP master

SLAVE01-IP slave01

SLAVE02-IP slave02

(NOTE: In place of MASTER-IP, SLAVE01-IP, SLAVE02-IP put the value of the

corresponding IP).

Example

192.168.1.190 master

192.168.1.191 slave01

192.168.1.195 slave02

Step 2. Install Java 8 (Recommended Oracle Java)

Hadoop requires a working Java 1.5+ installation. However, using Java 8 is

recommended for running Hadoop.

2.1 Install Python Software Properties

Command :sudo apt-get install python-software-properties

2.2 Add Repository

Command :sudo add-apt-repository ppa:webupd8team/java

2.3 Update the source list

Command :sudo apt-get update

2.4 Install Java

Command :sudo apt-get install oracle-java8-installer

Step 3. Configure SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your

local machine if you want to use Hadoop on it.

3.1 Install Open SSH Server-Client

Command :sudo apt-get install openssh-server openssh-client

3.2 Generate KeyPairs

Command :ssh-keygen -t rsa -P ""

3.3 Configure password-less SSH

3.3.1 Copy the generated ssh key to master node’s authorized keys.

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

3.3.2 Copy the master node’s ssh key to slave’s authorized keys.

Command:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub cse@slave01

ssh-copy-id -i $HOME/.ssh/id_rsa.pub cse@slave02

3.4 Check by SSH to all the Slaves

ssh slave01

ssh slave02

Step 4. Install Hadoop

4.1 Download Hadoop

wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.2.tar.gz

4.2 Untar Tar ball

Command : tar xzf hadoop-2.5.0-cdh5.3.2.tar.gz


Step 5. Hadoop multi-node cluster setup Configuration

1. Edit .bashrc

Edit the .bashrc file located in the user's home directory and add the following parameters.

Command: vi .bashrc

export HADOOP_PREFIX="/home/cse/hadoop-2.5.0-cdh5.3.2"

export PATH=$PATH:$HADOOP_PREFIX/bin

export PATH=$PATH:$HADOOP_PREFIX/sbin

export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}

export HADOOP_COMMON_HOME=${HADOOP_PREFIX}

export HADOOP_HDFS_HOME=${HADOOP_PREFIX}

export YARN_HOME=${HADOOP_PREFIX}

Command : source .bashrc

2. Edit hadoop-env.sh

hadoop-env.sh contains the environment variables that are used in the scripts to run Hadoop, such as the Java home path. Edit the configuration file hadoop-env.sh (located in HADOOP_HOME/etc/hadoop) and set JAVA_HOME.

Command: vi hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

3. Edit core-site.xml

core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of Hadoop core, such as I/O settings that are common to HDFS and MapReduce. Edit the configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi core-site.xml

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://master:9000</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/home/cse/hdata</value>

</property>

</configuration>

Note: /home/cse/hdata is a sample location; please specify a location where you have read/write privileges.

4. Edit hdfs-site.xml

hdfs-site.xml contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS. Edit the configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi hdfs-site.xml

<configuration>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

5. Edit mapred-site.xml

mapred-site.xml contains configuration settings of a MapReduce application, such as the number of JVMs that can run in parallel, the sizes of the mapper and reducer processes, CPU cores available for a process, etc. In some cases the mapred-site.xml file is not available, so it has to be created from the mapred-site.xml.template file shipped in the same directory. Edit the configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi mapred-site.xml

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

6. Edit yarn-site.xml

yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory management size, the operations needed on programs and algorithms, etc. Edit the configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) and add the following entries.

Command: vi yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>
7. Edit slaves

Edit the configuration file slaves (located in HADOOP_HOME/etc/hadoop) and add the following entries:

slave01

slave02

Now Hadoop is set up on the master; next, install Hadoop on all the slaves.

Step 6. Setup Prerequisites on all the slaves


Run following steps on all the slaves

1. Add Entries in hosts file

2. Install Java 8 (Recommended Oracle Java)

Step 7. Copy configured setups from master to all the slaves

7.1. Create tarball of configured setup

Command: tar czf hadoop.tar.gz hadoop-2.5.0-cdh5.3.2

(NOTE: Run this command on Master)

7.2. Copy the configured tarball on all the slaves

Command :scp hadoop.tar.gz slave01:

(NOTE: Run this command on Master)

Command :scp hadoop.tar.gz slave02:

(NOTE: Run this command on Master)

7.3. Un-tar configured Hadoop setup on all the slaves

Command :tar xvzf hadoop.tar.gz

(NOTE: Run this command on all the slaves)

Now Hadoop is set up on all the Slaves. Now Start the Cluster.

Step 8. Start the Hadoop Cluster

Let us now see how to start the Hadoop cluster.


8.1. Format the name node

Command :bin/hdfs namenode -format

(Note 1: Run this command on the Master.)

(Note 2: This should be done only once, when you install Hadoop; running it again will delete all the data from HDFS.)

8.2. Start HDFS Services

Command :sbin/start-dfs.sh

(Note: Run this command on Master)

8.3. Start YARN Services

Command :sbin/start-yarn.sh

(Note: Run this command on Master)

8.4. Check for Hadoop services

8.4.1. Check daemons on Master

Command :jps

NameNode

ResourceManager

8.4.2. Check daemons on Slaves

Command :jps

DataNode

NodeManager

Step 9. Stop The Hadoop Cluster

Let us now see how to stop the Hadoop cluster.

9.1. Stop YARN Services

Command :sbin/stop-yarn.sh

(Note: Run this command on Master)

9.2. Stop HDFS Services

Command :sbin/stop-dfs.sh

(Note: Run this command on Master)


Experiment # 3

Basic Hadoop file system commands.

Objective: Basic commands in the Hadoop file system

PROGRAM MODULE:

1. Create a directory: hdfs dfs -mkdir /foldername

example: hdfs dfs -mkdir /input

2. Create a text file (locally): vi filename.extension

example: vi sample.txt

3. Upload into HDFS: hdfs dfs -put /localpath/filename.extension /hdfsfoldername

example: hdfs dfs -put /home/cse/sample.txt /input

4. Download from HDFS to local: hdfs dfs -get /hdfspath/filename.extension /localfoldername

example: hdfs dfs -get /input/sample.txt /home/cse

5. Display the content of a file: hdfs dfs -cat /path/filename.extension

example: hdfs dfs -cat /input/sample.txt

6. Display the first few lines of a file: hdfs dfs -cat /path/filename.extension | head -number

example: hdfs dfs -cat /input/sample.txt | head -5

7. Display the last few lines of a file: hdfs dfs -cat /path/filename.extension | tail -number

example: hdfs dfs -cat /input/sample.txt | tail -10

8. Remove a file or directory: hdfs dfs -rm /path-of-file-or-directory

example: hdfs dfs -rm /input/sample.txt

9. Copy a file from source to destination (within HDFS): hdfs dfs -cp /sourcepath/filename.extension /destinationfolder

example: hdfs dfs -cp /input/sample.txt /output
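The same operations can also be performed programmatically through Hadoop's Java FileSystem API, which the MapReduce programs in the later experiments build on. The short sketch below is illustrative only; the class name HdfsCommandsDemo and the paths are assumptions, not part of the lab exercise, and it assumes fs.defaultFS points at the cluster configured in Experiment 1.

package hdfscommands;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
// Sketch only: the shell commands above, expressed with the HDFS Java API.
public class HdfsCommandsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/input"));                                  // hdfs dfs -mkdir /input
        fs.copyFromLocalFile(new Path("/home/cse/sample.txt"),
                             new Path("/input"));                       // hdfs dfs -put
        FSDataInputStream in = fs.open(new Path("/input/sample.txt"));  // hdfs dfs -cat
        IOUtils.copyBytes(in, System.out, 4096, true);
        fs.copyToLocalFile(new Path("/input/sample.txt"),
                           new Path("/home/cse"));                      // hdfs dfs -get
        fs.delete(new Path("/input/sample.txt"), false);                // hdfs dfs -rm
        fs.close();
    }
}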

Experiment # 4

Write a map-reduce program for word count.

Objective: Simple word count program in a MapReduce environment

Running the WordCount example in Hadoop MapReduce as a Java project with Eclipse

Now, let's create the WordCount Java project with the Eclipse IDE for Hadoop. Even if you are working on the Cloudera VM, creating the Java project can be applied to any environment.

Step 1 –
Let's create the Java project with the name "Sample WordCount": File > New > Project > Java Project > Next. Enter "Sample WordCount" as the project name and click "Finish".

Step 2 -
The next step is to get references to the Hadoop libraries by clicking on Add JARs.

Step 3 -
Create a new package within the project with the name wordcount.

Step 4 –
Now let's implement the WordCount example program by creating a WordCount class under the package wordcount.

Step 5 -

Create a mapper class (WordCountMapper) that extends the Hadoop Mapper class. The mapper class will contain:
1. Code to implement the "map" method.
2. The mapper-stage business logic, written within this method.

PROGRAM MODULE:

Mapper Class Code for WordCount (WordCountMapper.java)

package wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // split the input line into tokens and emit (word, 1) for each token
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
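Extra program 2 (word count without case sensitivity) needs only one change to the mapper above: lower-case each token before emitting it, so that "Bus" and "bus" are counted as the same word. A minimal sketch follows; the class name CaseInsensitiveMapper is only illustrative, and the driver would select it with job.setMapperClass(CaseInsensitiveMapper.class).

package wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Sketch for extra program 2: case-insensitive word count.
// Identical to WordCountMapper except that each token is lower-cased.
public class CaseInsensitiveMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}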

Step 6 –

Create a reducer class (WordCountReducer) that extends the Hadoop Reducer class. The reducer class for the word count example will contain:

1. Code to implement the "reduce" method.

2. The reducer-stage business logic, written within this method.

Reducer Class Code for WordCount (WordCountReducer.java)
package wordcount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // sum all counts for this word and emit (word, total)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Step 7 –
Create main() method within the WordCount class and set the following properties

1. OutputKeyClass

2. OutputValueClass

3. Mapper Class

4. Reducer Class

5. InputFormat

6. OutputFormat

7. InputFilePath

8. OutputFolderPath

Driver Class Code for WordCount (WordCount.java)
package wordcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
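Extra program 1 (word count with a combiner) differs from the driver above only in one extra line, job.setCombinerClass(...), which tells Hadoop to pre-aggregate counts on each map node before the shuffle. A minimal sketch of such a driver is shown below; the class name WordCountWithCombiner is illustrative, and it reuses WordCountMapper and WordCountReducer from this experiment.

package wordcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Sketch for extra program 1: word count with a combiner.
public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // the only change from the WordCount driver
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Summing word counts is associative, so the reducer class can safely double as the combiner; the final output is identical, only the amount of shuffled data changes.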

Step 8 –

Create the JAR file for the wordcount class –

Execute the Hadoop MapReduce WordCount program

Command1:

jps stands for Java Virtual Machine Process Status tool. It is a command used to check that all the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.) are running; it displays all Java-based processes for a particular user.

root@cse-OptiPlex-3046:/home/cse#jps

8594 Jps

6980 DataNode

7190 SecondaryNameNode

7702 NodeManager

6809 NameNode

7354 ResourceManager

Command2:

Once all services are running, create an input file using the vi command.

vi is a screen editor in Linux (the name is short for "visual"). It is a widely used default text editor for Unix-based systems and ships with virtually all versions of Unix.

root@cse-OptiPlex-3046:/home/cse# vi word.txt

1. To enter vi, type: vi filename

2. To enter insert mode, type: i

3. Type in the text: bus car train train car train bus bus

4. To leave insert mode and return to command mode, press: <Esc>

5. In command mode, save changes and exit vi by typing: :wq<Return>. You are back at the Unix prompt.

Command3:

Create a directory in hdfs

root@cse-OptiPlex-3046:/home/cse#hdfs dfs -mkdir /wordcountinput

Command4:

Upload single source, or multiple sources from local file system to the destination

filesystem.

root@cse-OptiPlex-3046:/home/cse# hdfs dfs -put '/home/cse/word.txt'

/wordcountinput

Command5:

http://localhost:50070/dfshealth.html#tab-overview

Click on Utilities and then browse file system.

Command6:

YARN commands are invoked by the bin/yarn script. The jar subcommand runs a JAR file: users can bundle their YARN code in a JAR file and execute it with this command. Run the WordCount application from the JAR file, passing the paths to the input and output directories in HDFS.

root@cse-OptiPlex-3046:/#yarn jar wordcount.jar wordcount.WordCount /wordcountinput/word.txt /wordcountoutput

Command7:

root@cse-OptiPlex-3046:/# hdfs dfs -ls /wordcountoutput

Found 2 items

-rw-r--r-- 1 cse supergroup 0 2018-11-01 15:30 /wordcountoutput/_SUCCESS

-rw-r--r-- 1 cse supergroup 20 2018-11-01 15:30 /wordcountoutput/part-r-00000

Command8:

root@cse-OptiPlex-3046:/# hdfs dfs -cat wordcountoutput/part-r-00000

bus 3

car 2

train 3
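Extra program 3 (sum of the numbers in a text file) follows the same pattern as WordCount: the mapper emits every number under a single key and the reducer adds them up. The sketch below assumes a plain text file of whitespace-separated integers; the package, class, and key names are illustrative assumptions.

package numbersum;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Sketch for extra program 3: sum of the numbers in a text file.
// Every number is emitted under the single key "sum", so one reduce call adds them all.
public class NumberSum {
    public static class SumMapper extends Mapper<Object, Text, Text, LongWritable> {
        private final static Text sumKey = new Text("sum");
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(sumKey, new LongWritable(Long.parseLong(itr.nextToken())));
            }
        }
    }
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "number sum");
        job.setJarByClass(NumberSum.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class); // summation is associative, so the reducer works as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}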

Experiment # 5

Write a map-reduce program for finding


maximum temperature in weather dataset.

Objective: Maximum temperature program in a MapReduce environment

PROGRAM MODULE:

Mapper Program (MaxTemperatureMapper.java)
package temperature;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class MaxTemperatureMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    public static final int MISSING = 9999;
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        String year = line.substring(15, 19);   // year field of the NCDC record
        int temperature;
        if (line.charAt(87) == '+')             // skip the leading '+' sign before parsing
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            output.collect(new Text(year), new IntWritable(temperature));
    }
}

Reducer Program (MaxTemperatureReducer.java)
package temperature;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class MaxTemperatureReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int max_temp = Integer.MIN_VALUE; // for the minimum-temperature variant, start from Integer.MAX_VALUE

        while (values.hasNext())
        {
            int current = values.next().get();
            if (max_temp < current) // for minimum: if (max_temp > current)
                max_temp = current;
        }
        output.collect(key, new IntWritable(max_temp / 10)); // temperatures are stored in tenths of a degree
    }
}

Driver Program (Driver.java)
package temperature;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Driver extends Configured implements Tool {
    public int run(String[] args) throws Exception
    {
        JobConf conf = new JobConf(getConf(), Driver.class);
        conf.setJobName("Driver");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);
        JobClient.runJob(conf);
        return 0;
    }
    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(res);
    }
}
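Extra program 4 (minimum temperature) reuses the same mapper; only the reducer changes, as hinted by the comments above. A minimal sketch is given below; the class name MinTemperatureReducer is illustrative, and the driver would register it with conf.setReducerClass(MinTemperatureReducer.class).

package temperature;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
// Sketch for extra program 4: minimum temperature per year.
// Same structure as MaxTemperatureReducer, but it keeps the smallest value seen.
public class MinTemperatureReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int min_temp = Integer.MAX_VALUE;
        while (values.hasNext())
        {
            int current = values.next().get();
            if (current < min_temp)
                min_temp = current;
        }
        output.collect(key, new IntWritable(min_temp / 10));
    }
}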

Execution Commands:
root@cse-OptiPlex-3046:/#hdfs dfs -mkdir /tempinput
root@cse-OptiPlex-3046:/#hdfs dfs -put '/home/cse/1901.txt' /tempinput
root@cse-OptiPlex-3046:/#yarn jar temp.jar temperature.Driver /tempinput/1901.txt /tempoutput
root@cse-OptiPlex-3046:/#hdfs dfs -ls /tempoutput
Found 2 items
-rw-r--r-- 1 cse supergroup 0 2018-11-14 15:14 /tempoutput/_SUCCESS
-rw-r--r-- 1 cse supergroup 84 2018-11-14 15:14 /tempoutput/part-r-00000
root@cse-OptiPlex-3046:/#hdfs dfs -cat /tempoutput/part-r-00000
1901 23
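Extra program 5 (highest CTC in each department) is the same max-per-key pattern as this experiment, only grouped by department instead of year. The sketch below assumes a comma-separated employee file laid out as id,name,dept,ctc; that schema and all class names are assumptions for illustration, since the lab does not specify the dataset format.

package employeectc;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Sketch for extra program 5: highest CTC in each department.
// Assumed (hypothetical) record layout: id,name,dept,ctc  e.g.  101,Gaurav,Sales,650000
public class HighestCtc {
    public static class CtcMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] tokens = value.toString().split(",");
            if (tokens.length == 4) {
                // emit (department, ctc) for every employee record
                context.write(new Text(tokens[2]), new FloatWritable(Float.parseFloat(tokens[3])));
            }
        }
    }
    public static class MaxCtcReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        public void reduce(Text key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable val : values) {
                if (val.get() > max) max = val.get();
            }
            context.write(key, new FloatWritable(max));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "highest ctc per department");
        job.setJarByClass(HighestCtc.class);
        job.setMapperClass(CtcMapper.class);
        job.setReducerClass(MaxCtcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}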

Experiment # 6

Write a map-reduce program for join of two


records.

Objective: Joining two records in a MapReduce environment

PROGRAM MODULE:

Mapper Program( FileJoinerMapper.java)


package join;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class FileJoinerMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
String[] tokens = value.toString().split(",");
if (null != tokens && tokens.length == 2) {
context.write(new Text(tokens[0]), new Text(tokens[1]));
}}}

Reducer Program( FileJoinerReducer.java)


package join;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class FileJoinerReducer extends Reducer<Text, Text, NullWritable, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
StringBuffer rec = new StringBuffer(key.toString()).append(",");
int count = 0;
for (Text val : values) {
rec.append(val.toString());
if (count < 1) {
rec.append(",");
} count++;
}
context.write(NullWritable.get(), new Text(rec.toString()));

} }

Driver Program( FileJoinerDriver.java)


package join;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class FileJoinerDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf,
args).getRemainingArgs();
if (otherArgs.length != 3) {
System.err.println("Usage: Join <in-1><in-2><out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "File Joiner");
job.setJarByClass(FileJoinerDriver.class);
job.setMapperClass(FileJoinerMapper.class);
job.setReducerClass(FileJoinerReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}}

Data sets:
empname.txt
101,Gaurav
102,Rohit
103,Karishma
104,Darshan
105,Divya
empdept.txt
101,Sales
102,Research
103,NMG
104,Admin
105,HR

Execution Commands:
root@cse-OptiPlex-3046:/#hdfs dfs -mkdir /joininput
root@cse-OptiPlex-3046:/#hdfs dfs -put '/home/cse/empname.txt' /joininput
root@cse-OptiPlex-3046:/#hdfs dfs -put '/home/cse/empdept.txt' /joininput
root@cse-OptiPlex-3046:/#yarn jar join.jar join.FileJoinerDriver
/joininput/empdept.txt /joininput/empname.txt /joinoutput
root@cse-OptiPlex-3046:/#hdfs dfs -ls /joinoutput
Found 2 items
-rw-r--r-- 1 cse supergroup 0 2018-11-14 15:14 /joinoutput/_SUCCESS
-rw-r--r-- 1 cse supergroup 84 2018-11-14 15:14 /joinoutput/part-r-00000
root@cse-OptiPlex-3046:/#hdfs dfs -cat /joinoutput/part-r-00000
101,Sales,Gaurav
102,Rohit,Research
103,NMG,Karishma
104,Darshan,Admin
105,HR,Divya

Experiment # 7

Write a map reduce program to find duplicate


record in csv file.

Objective: Finding duplicate records in a CSV file in a MapReduce environment.

PROGRAM MODULE:

Mapper:
package duplicate;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class DuplicateValueMapper
extends Mapper<LongWritable, Text, Text, IntWritable>{
private static final IntWritable one = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
//Skipping the header of the input
if (key.get() == 0 && value.toString().contains("first_name")) {
return;
}
else {
String values[] = value.toString().split(",");
context.write(new Text(values[1]), one); //Writing first_name value as a key
}
}
}
Reducer:
package duplicate;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.google.common.collect.Iterables;
/*
* This reducer will get mapper data as input and return only key that is duplicate
value.
*
*/
public class DuplicateValueReducer

extends Reducer<Text, IntWritable, Text, NullWritable>{


@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, NullWritable>.Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
//Check if the key has duplicate value
if (Iterables.size(values) > 1) {
context.write(key, NullWritable.get());
}}}

Driver:
package duplicate;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class DuplicateValueDriver extends Configured implements Tool {
    public int run(String[] arg0) throws Exception {
        Job job = Job.getInstance(getConf(), "Duplicate value");
        job.setJarByClass(getClass());
        job.setMapperClass(DuplicateValueMapper.class);
        job.setReducerClass(DuplicateValueReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        int jobStatus = ToolRunner.run(new DuplicateValueDriver(), args);
        System.out.println(jobStatus);
    }
}

Execution Commands:
cse@cse-OptiPlex-3046:~$ hdfs dfs -mkdir /duplicate
cse@cse-OptiPlex-3046:~$ hdfs dfs -put '/home/cse/duplicate.csv' /duplicate
cse@cse-OptiPlex-3046:~$ yarn jar duplicate.jar
duplicate.DuplicateValueDriver /duplicate/duplicate.csv /duplicateout
cse@cse-OptiPlex-3046:~$ hdfs dfs -cat /duplicateout/part-r-00000
Celie
Hercule

Experiment # 8

Write a map reduce program to find patent


citations in patent dataset

Objective: Patent citation program in a MapReduce environment

PROGRAM MODULE:

package com.citation;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class PatentCitation {
public static class PatentCitationMapper extends Mapper<Text, Text, Text, Text> {
public void map(Text key, Text value, Context context) throws IOException,
InterruptedException {
String[] citation = key.toString().split( "," );
Text cited = new Text(citation[1]);
Text citing = new Text(citation[0]);
context.write(cited, citing);
}}
public static class PatentCitationReducer extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
String csv = "";
for (Text value : values) {

if (csv.length() > 0) {
csv += ",";
} csv += value.toString();
} context.write(key, new Text(csv));
} }
public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();


Job job = Job.getInstance(conf, "Hadoop Patent Citation" );
job.setJarByClass(PatentCitation.class );
job.setMapperClass(PatentCitationMapper.class);
job.setCombinerClass(PatentCitationReducer.class);
job.setReducerClass(PatentCitationReducer.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1 );
}}

Execution Commands:
root@cse-OptiPlex-3046:/#hdfs dfs -mkdir /citationinp
root@cse-OptiPlex-3046:/#hdfs dfs -put '/home/cse/cite75_99.txt' /citationinp
root@cse-OptiPlex-3046:/#yarn jar citation.jar com.citation.PatentCitation
/citationinp/cite75_99.txt /citationout
root@cse-OptiPlex-3046:/#hdfs dfs -ls /citationout
Found 2 items
-rw-r--r-- 1 cse supergroup 0 2018-11-14 15:14 /citationout/_SUCCESS
-rw-r--r-- 1 cse supergroup 84 2018-11-14 15:14 /citationout/part-r-00000
root@cse-OptiPlex-3046:/#hdfs dfs -cat /citationout/part-r-00000
955948 5794647
955954 5288283,5445585
955955 4001940,4768950
955957 5203827
955959 4429622
955970 3969088,4184456

Experiment # 9

Write a Map reduce program for total retail


collection.(retail dataset)

Objective: Finding total retail collection in Map reduce environment.

PROGRAM MODULE:

Mapper.java

package retailtotal;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text,
FloatWritable> {
private FloatWritable percentVal = new FloatWritable();
private Text moKey = new Text();
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
// Date Time City Product-Category Sale-Value Payment-Mode
// 2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
try {
String valueTokens[] = value.toString().split("\t");
float saleValue ;
if (valueTokens.length > 0 && valueTokens.length == 6) {
moKey.set("All Stores ");
saleValue = Float.parseFloat(valueTokens[4]);
percentVal.set(saleValue);
context.write(moKey, percentVal);
} } catch(Exception e) {
e.printStackTrace(); } } }

Reducer.java

package retailtotal;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class RetailDataAnalysisReducer extends Reducer<Text, FloatWritable, Text,
FloatWritable> {
private FloatWritable result = new FloatWritable();
public void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
float sum = 0.0f;
int count = 0;
for (FloatWritable val : values) {
count += 1;
sum += val.get();
}
result.set(sum);
String reduceKey = "Number of sales " + String.valueOf(count) + ", Sales Value : ";
context.write(new Text(reduceKey), result);
}
}

Driver:
package retailtotal;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class RetailDataAnalysis {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf,
args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: Number Sum <in><out>");

System.exit(2);
}
Job job = Job.getInstance(conf, "Retail Data All Store Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Execution Commands:
cse@cse-OptiPlex-3046:~$ hdfs dfs -mkdir /retail
cse@cse-OptiPlex-3046:~$ hdfs dfs -put '/home/cse/Retail.txt' /retail
cse@cse-OptiPlex-3046:~$ yarn jar retailtotal.jar retailtotal.RetailDataAnalysis
/retail/Retail.txt /retailout2
cse@cse-OptiPlex-3046:~$ hdfs dfs -cat /retailout2/part-r-00000
Number of sales 200, Sales Value : 49585.363

Experiment # 10

Write a Map reduce program for store wise


collection. (retail dataset)

Objective: Perform the Map reduce program on retail dataset for

store wise collection

Program Module:

Mapper:
package retailstore;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text,
FloatWritable> {
private FloatWritable percentVal = new FloatWritable();
private Text moKey = new Text();
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
// Date Time City Product-Category Sale-Value Payment-Mode
// 2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
try {
String valueTokens[] = value.toString().split("\t");
String date, store;
float saleValue;
if (valueTokens.length > 0 && valueTokens.length == 6) {
date = valueTokens[0];
store = valueTokens[2];
moKey.set(date + "\t" + store);
saleValue = Float.parseFloat(valueTokens[4]);
percentVal.set(saleValue);
context.write(moKey, percentVal);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

Reducer:
package retailstore;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class RetailDataAnalysisReducer extends Reducer<Text, FloatWritable, Text,
FloatWritable> {
private FloatWritable result = new FloatWritable();
public void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
float sum = 0.0f;
for (FloatWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}}

Driver:
package retailstore;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class RetailDataAnalysis {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf,
args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: Number Sum <in><out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Retail Data Store Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Execution Commands:
cse@cse-OptiPlex-3046:~$ hdfs dfs -mkdir /retail
cse@cse-OptiPlex-3046:~$ hdfs dfs -put '/home/cse/Retail.txt' /retail
cse@cse-OptiPlex-3046:~$ yarn jar retailstore.jar
retailstore.RetailDataAnalysis
/retail/Retail.txt /retailout1
cse@cse-OptiPlex-3046:~$ hdfs dfs -cat /retailout1/part-r-00000
2012-01-01 Albuquerque 1074.88
2012-01-01 Anaheim 114.41
2012-01-01 Anchorage 1086.22
2012-01-01 Arlington 400.08
2012-01-01 Atlanta 254.62
2012-01-01 Aurora 117.81
2012-01-01 Austin 1787.88
2012-01-01 Bakersfield 217.79
2012-01-01 Baltimore 7.98
2012-01-01 Boise 481.08997
2012-01-01 Boston 1114.54
2012-01-01 Buffalo 483.82
2012-01-01 Chandler 1648.7699
2012-01-01 Charlotte 440.11
2012-01-01 Chesapeake 676.35
2012-01-01 Chicago 146.15
2012-01-01 Cincinnati 323.37997
2012-01-01 Cleveland 427.43
2012-01-01 Columbus 392.5
2012-01-01 Corpus Christi 25.38
2012-01-01 Dallas 273.49
2012-01-01 Denver 413.21002
2012-01-01 Detroit 134.89
2012-01-01 Durham 980.32007
2012-01-01 El Paso 103.01
2012-01-01 Fort Wayne 370.55
2012-01-01 Fort Worth 1128.1399

Experiment # 11

Write a Map reduce program for product wise


collection.(retail dataset)

Objective: Perform the MapReduce program on the retail dataset for

product wise collection

PROGRAM MODULE:
Mapper:
package retailproduct;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text,
FloatWritable> {
private FloatWritable percentVal = new FloatWritable();
private Text moKey = new Text();
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
// Date Time City Product-Category Sale-Value Payment-Mode
// 2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
try {
String valueTokens[] = value.toString().split("\t");
float saleValue;
if (valueTokens.length == 6) {
String date = valueTokens[0]; // access the fields only after the length check
String productCat = valueTokens[3];
moKey.set(date + "\t" + productCat);
saleValue = Float.parseFloat(valueTokens[4]);
percentVal.set(saleValue);
context.write(moKey, percentVal);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

Reducer:
package retailproduct;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class RetailDataAnalysisReducer extends Reducer<Text, FloatWritable, Text,
FloatWritable> {
private FloatWritable result = new FloatWritable();
public void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
float sum = 0.0f;
for (FloatWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

Driver:
package retailproduct;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class RetailDataAnalysis {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf,
args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: Number Sum <in><out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Retail Data Product Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Execution Commands:
cse@cse-OptiPlex-3046:~$ hdfs dfs -mkdir /retail
cse@cse-OptiPlex-3046:~$ hdfs dfs -put '/home/cse/Retail.txt' /retail
cse@cse-OptiPlex-3046:~$ yarn jar retailproduct.jar
retailproduct.RetailDataAnalysis /retail/Retail.txt /retailout
cse@cse-OptiPlex-3046:~$ hdfs dfs -cat /retailout/part-r-00000
2012-01-01 Baby 2034.23
2012-01-01 Books 3492.8
2012-01-01 CDs 2644.5098
2012-01-01 Cameras 2591.27
2012-01-01 Children's Clothing 2778.21
2012-01-01 Computers 2102.66
2012-01-01 Consumer Electronics 2963.59
2012-01-01 Crafts 3258.0898
2012-01-01 DVDs 2831.0
2012-01-01 Garden 1882.25
2012-01-01 Health and Beauty 2467.3198
2012-01-01 Men's Clothing 4030.89
2012-01-01 Music 2396.4
2012-01-01 Pet Supplies 2660.83
2012-01-01 Sporting Goods 1952.89
2012-01-01 Toys 3188.18
2012-01-01 Video Games 2573.3801
2012-01-01 Women's Clothing 3736.87

Department of Computer Science and Engineering
SRKR Engineering College (A), Bhimavaram, India
