BDA LAB Programs

1. Install, Configure and Run Hadoop and HDFS

i) Local (Standalone) Mode

ii) Pseudo-Distributed Mode

iii) Fully-Distributed Mode

Prerequisites
Supported Platforms

GNU/Linux is supported as a development and production platform. Hadoop has been
demonstrated on GNU/Linux clusters with 2000 nodes.

Windows is also a supported platform, but the following steps are for Linux only. To set up
Hadoop on Windows, see the Hadoop wiki page.

Required Software

Required software for Linux includes:

Java™ must be installed. Recommended Java versions are described at HadoopJavaVersions.

ssh must be installed and sshd must be running to use the Hadoop scripts that manage
remote Hadoop daemons.

Installing Software

If your cluster doesn't have the requisite software you will need to install it. For example, on
Ubuntu Linux:

$ sudo apt-get install ssh

$ sudo apt-get install rsync

Download

To get a Hadoop distribution, download a recent stable release from one of the Apache
Download Mirrors.

Prepare to Start the Hadoop Cluster

Unpack the downloaded Hadoop distribution. In the distribution, edit the file
etc/hadoop/hadoop-env.sh to define some parameters as follows:

# set to the root of your Java installation

export JAVA_HOME=/usr/java/latest

Try the following command:

$ bin/hadoop

This will display the usage documentation for the hadoop script. Now you are ready to start
your Hadoop cluster in one of the three supported modes:

Local (Standalone) Mode

Pseudo-Distributed Mode

Fully-Distributed Mode

i) Local (Standalone) Mode


By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds
and displays every match of the given regular expression. Output is written to the given
output directory.

$ mkdir input

$ cp etc/hadoop/*.xml input

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

Examine the output files:

$ cat output/*

ii) Pseudo-Distributed Mode


Part 1

1. Installing Oracle Java 8

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

$ sudo apt-get install oracle-java8-installer



This installs Java on your machine at /usr/lib/jvm/java-8-oracle. To check the Java version:

$ java -version

2. Creating a Hadoop user for accessing HDFS and MapReduce

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

3. Installing SSH

$ sudo apt-get install openssh-server

Configuring SSH

# First log in as hduser (and from now on use only the hduser account for the further steps)

$ sudo su hduser

$ ssh-keygen -t rsa -P ""

# Append the generated public key to the authorized keys so that hduser can ssh without a password

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

4. Disabling IPv6

Since Hadoop does not work on IPv6, we should disable it. Another reason is that Hadoop has been
developed and tested on IPv4 stacks, and the Hadoop nodes can communicate as long as we have an
IPv4 cluster. (Once you have disabled IPv6 on your machine, you need to reboot it for the change to
take effect. If you do not know how to reboot from the command line, use sudo reboot.)

To disable IPv6 on your Linux machine, you need to update /etc/sysctl.conf by adding the following
lines at the end of the file. Open sysctl.conf with the following command:

$ sudo gedit /etc/sysctl.conf

Copy the next 4 lines and add them to the end of the file:

# disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1
Tip: You can use nano, gedit, or vi to update all the text files needed for this configuration.

5. Download latest Apache Hadoop source from Apache mirrors

First you need to download Apache Hadoop 2.6.0 (i.e. hadoop-2.6.0.tar.gz) or a later version of the
source from the Apache download mirrors. You can also try the latest stable Hadoop release to get
recent features and bug fixes. Choose the location where you want to place your Hadoop
installation; here /usr/local/hadoop is used.

If your downloaded file is in Downloads, move to Downloads:

$ cd Downloads

## Extract the Hadoop source; run the following command in the directory where Hadoop was downloaded

$ sudo tar -xzvf hadoop-2.*.tar.gz

## Go to the parent directory of the Hadoop installation

$ cd /usr/local/

## Move hadoop-2.6.0 to the hadoop folder

$ sudo mv ~/Downloads/hadoop-2.6.0 /usr/local/hadoop

## Assign ownership of this folder to the Hadoop user

$ sudo chown hduser:hadoop -R /usr/local/hadoop

## Create Hadoop temp directories for the NameNode and DataNode

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

## Again assign ownership of this Hadoop temp folder to the Hadoop user

$ sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

6. Update Hadoop configuration files

a. Configuration file : .bashrc

$ sudo gedit .bashrc

## Update the hduser configuration file by appending the
## following environment variables at the end of this file.


# -- HADOOP ENVIRONMENT VARIABLES START -- #

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# -- HADOOP ENVIRONMENT VARIABLES END -- #

b. Configuration file : hadoop-env.sh

## To edit file, fire the below given command

$ sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

## Update JAVA_HOME variable,

JAVA_HOME=/usr/lib/jvm/java-8-oracle

c. move to /usr/local/hadoop/etc/hadoop

$ cd /usr/local/hadoop/etc/hadoop

d. Configuration file : core-site.xml

## To edit file, fire the below given command

$ sudo gedit core-site.xml

## Paste these lines into <configuration> tag


<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

e. Configuration file : hdfs-site.xml

## To edit file, fire the below given command

$ sudo gedit hdfs-site.xml

## Paste these lines into <configuration> tag

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>

</property>

f. Configuration file : yarn-site.xml

## To edit file, fire the below given command

$ sudo gedit yarn-site.xml


## Paste these lines into <configuration> tag

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

g. Configuration file : mapred-site.xml

## Copy template of mapred-site.xml.template file

$cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml

## To edit file, fire the below given command

$ sudo gedit mapred-site.xml

## Paste these lines into <configuration> tag

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

7. Format Namenode

$hadoop namenode -format

8. Start all Hadoop daemons

Start hdfs daemons

$ start-dfs.sh
Start MapReduce daemons:

$ start-yarn.sh

Instead of both of the above commands you can also use start-all.sh, but it is now deprecated and its
use is no longer recommended.

9. Track/Monitor/Verify

Verify Hadoop daemons:

$ jps

Monitor the Hadoop ResourceManager and the Hadoop NameNode

If you wish to track Hadoop MapReduce as well as HDFS, you can explore the Hadoop web views of
the ResourceManager and the NameNode, which are commonly used by Hadoop administrators.
Open your default browser and visit the following links.

For ResourceManager – http://localhost:8088

For NameNode – http://localhost:50070

If jps lists all the Hadoop daemons and the web interfaces above load correctly, then congratulations,
you have successfully installed Apache Hadoop on your Ubuntu machine.
iii) Fully-Distributed Mode

Prerequisites

1. Installation and configuration of single-node Hadoop :

Install and configure single-node Hadoop, which will be our master node.

2. Prepare your computer network (decide the number of nodes in the cluster) :

Based on parameters such as the purpose of the Hadoop multi-node cluster, the size of the
dataset to be processed and the availability of machines, you need to decide the number of
master nodes and slave nodes to be configured for the Hadoop cluster setup.

3. Basic installation and configuration :

Step 3A: Decide the hostnames of the nodes to be configured in the further steps. We will name the
master node HadoopMaster and the two slave nodes HadoopSlave1 and HadoopSlave2 respectively
in the /etc/hosts file. After deciding the hostnames of all nodes, assign them by updating the
hostnames (you can skip this step if you do not want to set up names). Add all host names to the
/etc/hosts file on all machines (master and slave nodes).

# Edit the /etc/hosts file with following command

$ sudo gedit /etc/hosts

# Add following hostname and their ip in host table

192.168.2.14 HadoopMaster

192.168.2.15 HadoopSlave1

192.168.2.16 HadoopSlave2

Step 3B: Create the hadoop group and the hduser user on all machines (if not already created).

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser


If you need to add hduser to the sudoers, fire the following command:

$ sudo usermod -a -G sudo hduser


OR

$sudo gedit /etc/sudoers

Add the following line in /etc/sudoers:

hduser ALL=(ALL:ALL) ALL

Step 3C: Install rsync for sharing the Hadoop installation with all the other machines:

$sudo apt-get install rsync

Step 3D: To make the above changes take effect, reboot all of the machines.

$sudo reboot

4. Hadoop configuration steps

☑ Applying Common Hadoop Configuration :

Although we will be configuring a master-slave architecture, we first need to apply the changes that
are common to both master and slave nodes in the Hadoop config files before we distribute these
Hadoop files over the rest of the machines/nodes. Hence, these changes are made on your single-node
Hadoop setup. From step 6 onwards we will make changes specifically for the master and slave nodes
respectively.

Changes:

1. Update core-site.xml

Update this file by changing the hostname from localhost to HadoopMaster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit core-site.xml

## Paste these lines into <configuration> tag, or just update the existing entry by
## replacing localhost with HadoopMaster

<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9000</value>
</property>

2. Update hdfs-site.xml

Update this file by changing the replication factor from 1 to 3.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit hdfs-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

3. Update yarn-site.xml

Update this file by updating the following three properties, changing the hostname from
localhost to HadoopMaster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit yarn-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8050</value>
</property>

4. Update mapred-site.xml

Update this file by updating and adding the following properties.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit mapred-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>mapreduce.job.tracker</name>
<value>HadoopMaster:5431</value>
</property>
<property>
<name>mapred.framework.name</name>
<value>yarn</value>
</property>
5. Update masters

Update the masters file with the name(s) of the master node(s) of the Hadoop cluster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit masters

## Add name of master nodes

HadoopMaster

6. Update slaves

Update the slaves file with the names of the slave nodes of the Hadoop cluster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit slaves

## Add name of slave nodes

HadoopSlave1
HadoopSlave2

☑ Copying/Sharing/Distributing the Hadoop config files to all the other master/slave nodes

Use rsync to distribute the configured Hadoop installation to the rest of the nodes over the network.

☑ Create the hadoop folder on the Hadoop slaves at /usr/local :

# On the HadoopSlave1 machine
$ sudo mkdir /usr/local/hadoop

# From the HadoopMaster machine
$ sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave1:/usr/local/hadoop/

# On the HadoopSlave2 machine
$ sudo mkdir /usr/local/hadoop

# From the HadoopMaster machine
$ sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave2:/usr/local/hadoop/

The above commands copy the files stored in the hadoop folder to the slave nodes at
/usr/local/hadoop, so you do not need to download and configure Hadoop again on the remaining
nodes. You only need Java and rsync installed on all nodes, and the JAVA_HOME path must match
the one set in the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file of your Hadoop distribution,
which we already configured in the single-node Hadoop setup.

☑ Applying Master node specific Hadoop configuration: (Only for master nodes)
These are configurations to be applied on the Hadoop master nodes (since we have only one
master node, they will be applied to just that node).

step a: Remove the existing Hadoop data folder (which was created during the single-node
Hadoop setup).

$ sudo rm -rf /usr/local/hadoop_tmp/
step b: Re-create the /usr/local/hadoop_tmp/ directory and create the NameNode directory
(/usr/local/hadoop_tmp/hdfs/namenode).
$ sudo mkdir -p /usr/local/hadoop_tmp/

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode


step c: Make hduser as owner of that directory.
$sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

☑ Applying Slave node specific Hadoop configuration : (Only for slave nodes)

Since we have two slave nodes, we will be applying the following changes on the HadoopSlave1 and
HadoopSlave2 nodes.

step a: Remove the existing Hadoop data folder (which was created during the single-node Hadoop
setup).

$ sudo rm -rf /usr/local/hadoop_tmp/hdfs/

step b: Re-create the /usr/local/hadoop_tmp/ directory and, inside it, create the DataNode
directory (/usr/local/hadoop_tmp/hdfs/datanode).

$sudo mkdir -p /usr/local/hadoop_tmp/

$sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode


step c: Make hduser as owner of that directory
$sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

☑ Copying the ssh key to set up passwordless ssh access from the master to the slave nodes :

To manage (start/stop) all nodes of the master-slave architecture, hduser (the Hadoop user of the
master node) needs to be able to log in to all slave nodes as well as the master node, which is made
possible by setting up passwordless SSH login. (If you do not set this up, you will have to provide a
password while starting and stopping daemons on the slave nodes from the master node.)

Fire the following commands to copy the public SSH key – the $HOME/.ssh/id_rsa.pub file (of the
HadoopMaster node) – into the authorized_keys file ($HOME/.ssh/authorized_keys) of
hduser@HadoopSlave1 and hduser@HadoopSlave2.

hduser@HadoopMaster:~$ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave1


hduser@HadoopMaster:~$ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave2

Format the NameNode (run on the master node) :

# Run this command from the master node

hduser@HadoopMaster:/usr/local/hadoop$ hdfs namenode -format

Starting up the Hadoop cluster daemons : (run on the master node)

Start HDFS daemons:

hduser@HadoopMaster:/usr/local/hadoop$ start-dfs.sh

Start MapReduce daemons:


hduser@HadoopMaster:/usr/local/hadoop$ start-yarn.sh
Instead of both of the above commands you can also use start-all.sh, but it is now deprecated and its
use is no longer recommended.

Track/Monitor/Verify Hadoop cluster : (Run on any Node)

Verify Hadoop daemons on Master :

hduser@HadoopMaster: jps

Verify Hadoop daemons on all slave nodes :

hduser@HadoopSlave1: jps
hduser@HadoopSlave2: jps

(The running services on HadoopSlave1 will be the same on all slave nodes configured in the Hadoop
cluster.)

Monitor the Hadoop ResourceManager and the Hadoop NameNode via their web interfaces.

If you wish to track Hadoop MapReduce as well as HDFS, you can also explore the Hadoop web
views of the ResourceManager and the NameNode, which are commonly used by Hadoop
administrators. Open your default browser and visit the following links from any of the nodes.

For ResourceManager – http://HadoopMaster:8088

For NameNode – http://HadoopMaster:50070

If jps lists the expected daemons on the master and slave nodes and the web interfaces load correctly,
then congratulations, you have successfully installed Apache Hadoop on your cluster.
3. Implement the following file management tasks in Hadoop:

i) Adding files and directories.

ii) Retrieving files.

iii) Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities.

> To know your hadoop version

$ hadoop version

> Result

Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-
2.7.3.jar

> Start Hadoop using the commands

$ start-dfs.sh
$ start-yarn.sh

> Check whether all the daemons are running by using the command:

$ jps

> If you are dealing with the filesystem, use the fs command

$ hadoop fs

The above command gives the list of options for the Hadoop filesystem:

Usage: hadoop fs [generic options]


[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]

[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir>
[<snapshotName>]] [-deleteSnapshot
<snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]

Generic options supported are


-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to
the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in
the classpath.
-archives <comma separated list of archives> specify comma separated archives to
be unarchived on the compute machines.

The general command line syntax is


bin/hadoop command [genericOptions] [commandOptions]

i) Adding files and directories.

> Create a directory in HDFS
$ hadoop fs -mkdir /path

> Add a file to HDFS
$ hadoop fs -put localfile /path

ii) Retrieving files.

> Display a file stored in HDFS
$ hadoop fs -cat /path

> Retrieve a file from HDFS to the local filesystem
$ hadoop fs -get /path /localpath

iii) Deleting files


> Delete file from HDFS
$ hadoop fs -rm /filepath

> Delete folder from HDFS


$hadoop fs -rm -R /folder_path

How to Run a Hadoop Program

> First, set up the environment variables for Java if they are not already set

HINT :: To know the Java home, type "echo $JAVA_HOME"

$ export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101
$ export PATH=${JAVA_HOME}/bin:${PATH}
$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

> Load your input data into HDFS.

Use the filesystem commands. For example, suppose your example is wordcount. Create a folder
wordcount in HDFS using the following command:

$ hadoop fs -mkdir /wordcount

Next, create an input directory for the text files you are storing:

$ hadoop fs -mkdir /wordcount/input

OR

If you want to create both directories with a single command, use the following:

$ hadoop fs -mkdir -p /wordcount/input

Copy your data into the HDFS directory:

$ hadoop fs -put <local files> <hdfs directory path>

For example, if a file named wordfile.txt is in your home directory, load it into the HDFS
directory /wordcount/input:

$ hadoop fs -put ~/wordfile.txt /wordcount/input/

> Compile your MapReduce program

$ hadoop com.sun.tools.javac.Main WordCount.java

> Create the jar file.

Syntax for creating the jar file:

$ jar cf <filename.jar> <class name1 ...>

Example:

$ jar cf wc.jar WordCount*.class

> Run the Hadoop program

Syntax:

$ hadoop jar <filename.jar> <classname> <input directory> <output directory>

Example:

$ hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
2. Run a basic Word Count Map Reduce program to understand Map
Reduce Paradigm

Solution:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiling and Running a Program

$ hadoop fs -mkdir -p /wordcount/input
$ hadoop fs -put ~/input.txt /wordcount/input

$ hadoop com.sun.tools.javac.Main WordCount.java

$ jar cf wc.jar WordCount*.class
$ hadoop jar wc.jar WordCount /wordcount/input /wordcount/output

$ hadoop fs -cat /wordcount/output/part-r-00000

Input.txt(input file)

Apache > Hadoop > Apache Hadoop 3.0.0-alpha1 Wiki | git | Last Published: 2016-08-30 |
Version: 3.0.0-alpha1 General Overview Single Node Setup Cluster Setup commands
Reference FileSystem Shell Compatibility Interface Classification FileSystem Specification
Common CLI Mini Cluster Etc ……………

Output:

(jar/executable 1
(multi-terabyte 1
(see 2
(thousands 1
A 1
Architecture 2
Distributed 1
File 1
Guide) 1
Guide). 1
HDFS 1
Hadoop 3
MRAppMaster 1
MapReduce 4
Minimally, 1
NodeManager 1
ResourceManager 1
3. Write a Map Reduce program that mines weather data.

Solution:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(0, 4);
      int airTemperature;
      if (line.charAt(5) != '-') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(6, 8));
      } else {
        airTemperature = Integer.parseInt(line.substring(5, 8));
      }
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }

  public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "maxtemprature");
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiling and Running a Program

$ hadoop fs -mkdir -p /weather/input
$ hadoop fs -put ~/input.txt /weather/input

$ hadoop com.sun.tools.javac.Main MaxTemperature.java

$ jar cf weather.jar MaxTemperature*.class
$ hadoop jar weather.jar MaxTemperature /weather/input /weather/output

Input file:
1940,+761
1940,+341
1940,-041
1940,+221
1940,+481
1940,+921
1940,+981
1940,+861
1950,-241
1950,+521
1950,+041
1950,+041
1950,+941
1950,+761
1950,+101
1950,+101
1950,+401
1950,+041
1955,+841
1955,+621
1955,-021
1955,+761
1955,+261
1955,+981
1955,+941
1955,+301
1955,+961
1955,+721
1955,+721
1955,+881
1955,+981
1955,-061
1955,+361
1955,+581
1955,+941

Output:

1940 98
1950 94
1955 98
4a. Implement Linear Regression using R

library(datasets)

InputData <- as.data.frame(state.x77)

colnames(InputData)[4] = "Life.Exp"

colnames(InputData)[6] = "HS.Grad"

InputData$Density = InputData$Population * 1000 / InputData$Area

fit1 <- lm(Life.Exp ~ ., data=InputData)

summary(fit1)

#It appears higher populations are related to increased life expectancy and

#higher murder rates are strongly related to decreased life expectancy.

#High school graduation rate is marginal. Other than that,

#we're not seeing much. Another kind of summary of the model can be obtained like this.

fit2 <- lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +

HS.Grad + Frost + Density, data = InputData)

summary(fit2)

anova(fit1, fit2)

#As you can see, removing "Area" had no significant effect on the model (p = .4205).

#Compare the p-value to that for "Area" in the first summary table above.
fit3 <- lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost + Density, data = InputData)

summary(fit3)

fit4 <- lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost , data = InputData)

summary(fit4)

confint(fit4)

par(mfrow=c(2,2))

plot(fit1, 1)

plot(fit2, 1)

plot(fit3, 1)

plot(fit4, 1)

par(mfrow=c(1,1))

#The model object is a list containing quite a lot of information.

names(fit4)

Output:

Call:

lm(formula = Life.Exp ~ ., data = InputData)

Residuals:

Min 1Q Median 3Q Max


-1.47514 -0.45887 -0.06352 0.59362 1.21823

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16 ***

Population 6.480e-05 3.001e-05 2.159 0.0367 *

Income 2.701e-04 3.087e-04 0.875 0.3867

Illiteracy 3.029e-01 4.024e-01 0.753 0.4559

Murder -3.286e-01 4.941e-02 -6.652 5.12e-08 ***

HS.Grad 4.291e-02 2.332e-02 1.840 0.0730 .

Frost -4.580e-03 3.189e-03 -1.436 0.1585

Area -1.558e-06 1.914e-06 -0.814 0.4205

Density -1.105e-03 7.312e-04 -1.511 0.1385

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7337 on 41 degrees of freedom

Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013

F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10

Call:

lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +

HS.Grad + Frost + Density, data = InputData)


Residuals:

Min 1Q Median 3Q Max

-1.50252 -0.40471 -0.06079 0.58682 1.43862

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16 ***

Population 6.249e-05 2.976e-05 2.100 0.0418 *

Income 1.485e-04 2.690e-04 0.552 0.5840

Illiteracy 1.452e-01 3.512e-01 0.413 0.6814

Murder -3.319e-01 4.904e-02 -6.768 3.12e-08 ***

HS.Grad 3.746e-02 2.225e-02 1.684 0.0996 .

Frost -5.533e-03 2.955e-03 -1.873 0.0681 .

Density -7.995e-04 6.251e-04 -1.279 0.2079

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7307 on 42 degrees of freedom

Multiple R-squared: 0.746, Adjusted R-squared: 0.7037

F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10

Analysis of Variance Table

Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +

Frost + Area + Density


Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +

Frost + Density

Res.Df RSS Df Sum of Sq F Pr(>F)

1 41 22.068

2 42 22.425 -1 -0.35639 0.6621 0.4205

Call:

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost +

Density, data = InputData)

Residuals:

Min 1Q Median 3Q Max

-1.56877 -0.40951 -0.04554 0.57362 1.54752

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16 ***

Population 6.083e-05 2.676e-05 2.273 0.02796 *

Murder -3.160e-01 3.910e-02 -8.083 3.07e-10 ***

HS.Grad 4.233e-02 1.525e-02 2.776 0.00805 **

Frost -5.999e-03 2.414e-03 -2.485 0.01682 *

Density -5.864e-04 5.178e-04 -1.132 0.26360

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7174 on 44 degrees of freedom

Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144

F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12

Call:

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,

data = InputData)

Residuals:

Min 1Q Median 3Q Max

-1.47095 -0.53464 -0.03701 0.57621 1.50683

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***

Population 5.014e-05 2.512e-05 1.996 0.05201 .

Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***

HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **

Frost -5.943e-03 2.421e-03 -2.455 0.01802 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7197 on 45 degrees of freedom

Multiple R-squared: 0.736, Adjusted R-squared: 0.7126


F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12

2.5 % 97.5 %

(Intercept) 6.910798e+01 72.9462729104

Population -4.543308e-07 0.0001007343

Murder -3.738840e-01 -0.2264135705

HS.Grad 1.671901e-02 0.0764454870

Frost -1.081918e-02 -0.0010673977
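
As an optional check (not part of the original lab script), the final model fit4 can be used to predict
life expectancy for a new observation; the predictor values below are made-up figures chosen only
for illustration.

# Predict life expectancy for a hypothetical state profile using the final model fit4
new_state <- data.frame(Population = 4000,  # population in thousands, as in state.x77
                        Murder = 7.5,       # murder rate per 100,000
                        HS.Grad = 55,       # percent high-school graduates
                        Frost = 100)        # mean days with minimum temperature below freezing
predict(fit4, newdata = new_state, interval = "confidence")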


4b) Implement Logistic regression using R

# Loading package
library(caTools)
library(ROCR)

# Splitting dataset
split <- sample.split(mtcars, SplitRatio = 0.8)
split

train_reg <- subset(mtcars, split == "TRUE")


test_reg <- subset(mtcars, split == "FALSE")

# Training model
logistic_model <- glm(vs ~ wt + disp,
data = train_reg,
family = "binomial")
logistic_model

# Summary
summary(logistic_model)

# Predict test data based on model


predict_reg <- predict(logistic_model,
test_reg, type = "response")
predict_reg

# Changing probabilities
predict_reg <- ifelse(predict_reg >0.5, 1, 0)

# Evaluating model accuracy


# using confusion matrix
table(test_reg$vs, predict_reg)

missing_classerr <- mean(predict_reg != test_reg$vs)


print(paste('Accuracy =', 1 - missing_classerr))

# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")


auc <- auc@y.values[[1]]
auc

# Plotting curve
plot(ROCPer)
plot(ROCPer, colorize = TRUE,
print.cutoffs.at = seq(0.1, by = 0.1),
main = "ROC CURVE")
abline(a = 0, b = 1)

auc <- round(auc, 4)


legend(.6, .4, auc, title = "AUC", cex = 1)

Output:

[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE

Call: glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Coefficients:

(Intercept) wt disp

2.03395 2.54026 -0.05441

Degrees of Freedom: 22 Total (i.e. Null); 20 Residual

Null Deviance: 31.49

Residual Deviance: 12.69 AIC: 18.69

Call:

glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Deviance Residuals:

Min 1Q Median 3Q Max


-1.70195 -0.16130 -0.01561 0.49586 1.80907

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.03395 3.02219 0.673 0.5009

wt 2.54026 2.10430 1.207 0.2274

disp -0.05441 0.02883 -1.887 0.0591 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 31.492 on 22 degrees of freedom

Residual deviance: 12.692 on 20 degrees of freedom

AIC: 18.692

Number of Fisher Scoring iterations: 7

Hornet 4 Drive Hornet Sportabout Merc 230 Cadillac Fleetwood

2.108528e-02 1.482411e-04 9.148477e-01 3.319776e-05

Lincoln Continental Toyota Corolla Fiat X1-9 Porsche 914-2

9.922485e-05 9.440919e-01 9.340524e-01 7.158836e-01

Maserati Bora

5.087486e-03

   predict_reg
      0  1
  0   4  1
  1   1  3
[1] "Accuracy = 0.777777777777778"

[1] 0.775
5. Implement SVM/ Decision tree techniques

1) Step 1 :Import the required packages.


library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3

2) Step 2: Import the dataset.

m<-read.csv("C:/Users/pradeep/OneDrive/datasets/students_placement_data.csv")
head(m) # Check the first 6 rows.
## Roll.No Gender Section SSC.Percentage inter_Diploma_percentage
## 1 1 M A 87.30 65.3
## 2 2 F B 89.00 92.4
## 3 3 F A 67.00 68.0
## 4 4 M A 71.00 70.4
## 5 5 M A 67.00 65.5
## 6 6 M A 81.26 68.0
## B.Tech_percentage Backlogs registered_for_.Placement_Training
## 1 40.00 18 NO
## 2 71.45 0 yes
## 3 45.26 13 yes
## 4 36.47 17 yes
## 5 42.52 17 yes
## 6 62.20 6 yes
## placement.status
## 1 Not placed
## 2 Placed
## 3 Not placed
## 4 Not placed
## 5 Not placed
## 6 Not placed
str(m) # Check the structure of the dataset
## 'data.frame': 117 obs. of 9 variables:
## $ Roll.No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ Section : Factor w/ 2 levels "A","B": 1 2 1 1 1 1 1 1 1 1 ...
## $ SSC.Percentage : num 87.3 89 67 71 67 ...
## $ inter_Diploma_percentage : num 65.3 92.4 68 70.4 65.5 68 56.5 79.3 89.6 75.5 ...
## $ B.Tech_percentage : num 40 71.5 45.3 36.5 42.5 ...
## $ Backlogs : int 18 0 13 17 17 6 20 3 10 8 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 2 2 2 2 1 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 2 1 1 1 1 1 1 1 1 ...

3) Step 3: Divide the data (117 observations) into training data and test data.

n=nrow(m) # n is total number of rows.


set.seed(101)

# We use the sample function to partition the data. Here 85 percent is training data and 15 percent is
# test data. Note that since "replace = TRUE", we may have a row sampled more than once.
data_index=sample(1:n, size = round(0.85*n),replace = TRUE)
train_data=m[data_index,]
test_data=m[-data_index,]

4) Check the structure of training and test data (Optional).

str(train_data)
## 'data.frame': 99 obs. of 9 variables:
## $ Roll.No : int 44 6 84 77 30 36 69 40 73 64 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
## $ Section : Factor w/ 2 levels "A","B": 2 1 1 1 2 2 1 2 2 1 ...
## $ SSC.Percentage : num 86 81.3 89 78 72 ...
## $ inter_Diploma_percentage : num 92.5 68 88.9 59 88.1 90 61 88.8 83.7 69.2 ...
## $ B.Tech_percentage : num 70.8 62.2 63 51.1 69.6 ...
## $ Backlogs : int 0 6 1 17 0 0 6 0 0 20 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 2 2 1 1 2 2 1 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 2 1 1 1 1 ...
str(test_data)
## 'data.frame': 49 obs. of 9 variables:
## $ Roll.No : int 1 4 7 8 11 12 14 15 17 18 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 1 ...
## $ Section : Factor w/ 2 levels "A","B": 1 1 1 1 2 1 2 1 2 2 ...
## $ SSC.Percentage : num 87.3 71 71 84.8 82.3 ...
## $ inter_Diploma_percentage : num 65.3 70.4 56.5 79.3 76.3 66 88.7 52.2 85 95.1 ...
## $ B.Tech_percentage : num 40 36.5 33.8 61 71.5 ...
## $ Backlogs : int 18 17 20 3 0 16 0 7 0 0 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 1 1 2 2 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 1 2 1 1 2 ...

5) Build a decision tree model using the "rpart" function.

- Provide the class label (placement.status) and the attributes/variables.
- Here method is "class" because we are doing classification, not numeric prediction.
- Two types of split criteria can be used (parms): Gini and entropy (information). The default split
criterion is Gini; an entropy-based call is sketched below, after the printed model.

stu_model<-rpart(formula =placement.status~
Backlogs+Gender+B.Tech_percentage+SSC.Percentage+inter_Diploma_percentage,
data=train_data,method = "class",parms = list(split="gini"))

# Print the model.


print(stu_model)
## n= 99
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 99 28 Not placed (0.71717172 0.28282828)
## 2) B.Tech_percentage< 67.135 63 2 Not placed (0.96825397 0.03174603) *
## 3) B.Tech_percentage>=67.135 36 10 Placed (0.27777778 0.72222222)
## 6) SSC.Percentage< 83.58 11 3 Not placed (0.72727273 0.27272727) *
## 7) SSC.Percentage>=83.58 25 2 Placed (0.08000000 0.92000000) *
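
As mentioned above, the same tree can be grown with the entropy (information) split criterion
instead of Gini. This is a minimal sketch; only the parms argument changes, and stu_model_info is
a name introduced here for illustration.

# Same model, but using the entropy (information) split criterion
stu_model_info <- rpart(formula = placement.status~
Backlogs+Gender+B.Tech_percentage+SSC.Percentage+inter_Diploma_percentage,
data = train_data, method = "class", parms = list(split = "information"))
print(stu_model_info)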

6) Draw a decision tree.

We use the rpart.plot function from the rpart.plot package.

- type=5 means we want to show the split variable name in the interior nodes.
- extra=2 means we want to display the classification rate at the node, expressed as the number
of correct classifications and the number of observations in the node.

rpart.plot(stu_model,type=5,extra = 2 )

7) Apply the model stu_model to the test data using the predict function.

In the predict function, give the model name stu_model and test_data as input, and specify
type="class" because we are doing classification.

p<-predict(stu_model,test_data,type="class")
print(p)
## 1 4 7 8 11 12
## Not placed Not placed Not placed Not placed Not placed Not placed
## 14 15 17 18 19 21
## Placed Not placed Placed Placed Not placed Placed
## 23 26 31 32 34 35
## Not placed Not placed Placed Not placed Not placed Not placed
## 37 41 42 43 45 55
## Not placed Placed Not placed Not placed Not placed Not placed
## 56 57 58 59 60 63
## Placed Not placed Not placed Placed Placed Not placed
## 65 66 68 71 74 75
## Placed Placed Placed Placed Not placed Not placed
## 76 85 87 88 89 93
## Not placed Not placed Not placed Not placed Not placed Not placed
## 100 101 102 105 106 114
## Not placed Not placed Not placed Not placed Not placed Not placed
## 116
## Placed
## Levels: Not placed Placed

8) Print the confusion matrix.


The "table" command is used to draw the confusion matrix. "test_data[,9]" contains the original class
labels and "p" contains the predicted class labels. The confusion matrix gives the number of correct
predictions and the number of wrong predictions.

t<-table(test_data[,9],p)
print(t)
## p
## Not placed Placed
## Not placed 29 2
## Placed 6 12

In the above table, (29 + 12) = 41 are correct predictions and (6 + 2) = 8 are wrong predictions.

9) Find the accuracy of the model.
The accuracy of the model is the number of correct predictions on the test set divided by the total
number of samples in the test set; here that is 41 / 49 ≈ 0.8367.

- Note: the diagonal elements of the matrix t are the correct predictions.

print(sum(diag(t))/sum(t))
## [1] 0.8367347

6. Implement Clustering techniques

Step 1: Import the package.

library("cluster")

Step 2: Import the dataset

m<-read.csv("C:/Users/pradeep/OneDrive/datasets/hclustdata.csv")
head(m)
## Name Gender SSC.Perc.entage inter.Diploma.perc
## 1 ARIGELA AVINASH M 87.30 65.3
## 2 BALADARI KEERTHANA F 89.00 92.4
## 3 BAVIRISETTI PRAVALIKA F 67.00 68.0
## 4 BODDU SAI BABA M 71.00 70.4
## 5 BONDAPALLISRINIVAS M 67.00 65.5
## 6 CH KANAKARAJU M 81.26 68.0
## B.Tech.perc Back.logs
## 1 40.00 18
## 2 71.45 0
## 3 45.26 13
## 4 36.47 17
## 5 42.52 17
## 6 62.20 6

Step 3a: Apply agglomerative hierarchical clustering with single link (MIN technique)

clust1<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "single")


pltree(clust1)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0

Step 3b: Apply agglomerative hierarchical clustering with complete link (MAX technique)

clust2<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "complete")


pltree(clust2)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0

Step 3c: Apply agglomerative hierarchical clustering with the group average technique

clust3<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "average")
pltree(clust3)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0
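
To go from the dendrograms to actual cluster assignments (a step not shown above), the tree can be
cut at a chosen number of groups with cutree. This is a minimal sketch; k = 3 is an arbitrary choice
used only for illustration.

# Cut the average-linkage dendrogram into k = 3 clusters and inspect the assignments
groups <- cutree(as.hclust(clust3), k = 3)
table(groups)        # cluster sizes
groups[c(14, 16)]    # check whether rows 14 and 16 fall in the same cluster
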
7. Visualize data using any plotting framework.

#Installing ggplot2

# ggplot2 is an R package dedicated to data visualization. It can greatly improve the quality and
# aesthetics of your graphics, and will make you much more efficient in creating them.

install.packages("ggplot2")

7a) Scatter plot

# load ggplot2
library(ggplot2)
library(hrbrthemes)

# The iris dataset is natively available in R

# head(iris)

# A basic scatterplot with color depending on Species


ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=6) +
theme_ipsum()

Output:
7b) Box plot

# Load ggplot2
library(ggplot2)

# The mtcars dataset is natively available


# head(mtcars)

# A really basic boxplot.


ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) +
geom_boxplot(fill="slateblue", alpha=0.2) +
xlab("cyl")

output:
7c) Bar Plot

# Load ggplot2
library(ggplot2)

# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)

# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity")
Output:
7d) Histogram:

# library
library(ggplot2)

# dataset:
data=data.frame(value=rnorm(100))

# basic histogram
p <- ggplot(data, aes(x=value)) +
geom_histogram()

#p

Output:
