BDA LAB Programs

1. Install, Configure and Run Hadoop and HDFS

i) Local (Standalone) Mode

ii) Pseudo-Distributed Mode

iii) Fully-Distributed Mode

Prerequisites
Supported Platforms

GNU/Linux is supported as a development and production platform. Hadoop has been
demonstrated on GNU/Linux clusters with 2000 nodes.

Windows is also a supported platform, but the following steps are for Linux only. To set up
Hadoop on Windows, see the Hadoop wiki page.

Required Software

Required software for Linux includes:

Java™ must be installed. Recommended Java versions are described at HadoopJavaVersions.

ssh must be installed and sshd must be running to use the Hadoop scripts that manage
remote Hadoop daemons.

Installing Software

If your cluster doesn't have the requisite software you will need to install it. For example, on
Ubuntu Linux:

$ sudo apt-get install ssh

$ sudo apt-get install rsync

Download

To get a Hadoop distribution, download a recent stable release from one of the Apache
Download Mirrors.

Prepare to Start the Hadoop Cluster

Unpack the downloaded Hadoop distribution. In the distribution, edit the file
etc/hadoop/hadoop-env.sh to define some parameters as follows:

# set to the root of your Java installation

export JAVA_HOME=/usr/java/latest

Try the following command:

$ bin/hadoop

This will display the usage documentation for the hadoop script. Now you are ready to start
your Hadoop cluster in one of the three supported modes:

Local (Standalone) Mode

Pseudo-Distributed Mode

Fully-Distributed Mode

i) Local (Standalone) Mode


By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds
and displays every match of the given regular expression. Output is written to the given
output directory.

$ mkdir input

$ cp etc/hadoop/*.xml input

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

Examine the output files:

$ cat output/*

ii) Pseudo-Distributed Mode


Part 1

1. Installing Oracle Java 8

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

$ sudo apt-get install oracle-java8-installer



This installs Java on your machine at /usr/lib/jvm/java-8-oracle. To check the Java version:

$ java -version

2. Creating a Hadoop user for accessing HDFS and MapReduce

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

3. Installing SSH

$ sudo apt-get install openssh-server

Configuring SSH

# First log in as hduser (and from now on use only the hduser account for the further steps)

$ sudo su hduser

$ ssh-keygen -t rsa -P ""

# Append the generated public key to the authorized keys so that hduser can ssh without a password

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

4. Disabling IPv6

Since Hadoop does not work on IPv6, we should disable it. Another reason is that Hadoop has been
developed and tested on IPv4 stacks, and the Hadoop nodes can communicate as long as we have an
IPv4 cluster. (Once you have disabled IPv6 on your machine, you need to reboot it for the change to
take effect. If you do not know how to reboot from the command line, use sudo reboot.)

To disable IPv6 on your Linux machine, you need to update /etc/sysctl.conf by adding the following
lines at the end of the file. Open sysctl.conf with the following command:

$ sudo gedit /etc/sysctl.conf

Copy the next 4 lines and add them to the end of the file:

# disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1
Tip: You can use nano, gedit, or vi to update all the text files needed for this configuration.

5. Download latest Apache Hadoop source from Apache mirrors

First you need to download Apache Hadoop 2.6.0 (i.e. hadoop-2.6.0.tar.gz) or a later version of the
source from the Apache download mirrors. You can also try the latest stable Hadoop release to get
recent features and bug fixes. Choose the location where you want to place your Hadoop
installation; here /usr/local/hadoop is used.

If your downloaded file is in Downloads, move to Downloads:

$ cd Downloads

## Extract the Hadoop source; run the following command in the directory where Hadoop was downloaded

$ sudo tar -xzvf hadoop-2.*.tar.gz

## Go to the parent directory of the Hadoop installation

$ cd /usr/local/

## Move hadoop-2.6.0 to the hadoop folder

$ sudo mv ~/Downloads/hadoop-2.6.0 /usr/local/hadoop

## Assign ownership of this folder to the Hadoop user

$ sudo chown hduser:hadoop -R /usr/local/hadoop

## Create Hadoop temp directories for the NameNode and DataNode

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

## Again assign ownership of this Hadoop temp folder to the Hadoop user

$ sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

6. Update Hadoop configuration files

a. Configuration file : .bashrc

$ sudo gedit .bashrc

## Update the hduser configuration file by appending the
## following environment variables at the end of this file.


# -- HADOOP ENVIRONMENT VARIABLES START -- #

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# -- HADOOP ENVIRONMENT VARIABLES END -- #

b. Configuration file : hadoop-env.sh

## To edit file, fire the below given command

$ sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

## Update JAVA_HOME variable,

JAVA_HOME=/usr/lib/jvm/java-8-oracle

c. move to /usr/local/hadoop/etc/hadoop

$ cd /usr/local/hadoop/etc/hadoop

d. Configuration file : core-site.xml

## To edit file, fire the below given command

$ sudo gedit core-site.xml

## Paste these lines into <configuration> tag


<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

e. Configuration file : hdfs-site.xml

## To edit file, fire the below given command

$ sudo gedit hdfs-site.xml

## Paste these lines into <configuration> tag

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>

</property>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>

</property>

f. Configuration file : yarn-site.xml

## To edit file, fire the below given command

$ sudo gedit yarn-site.xml


## Paste these lines into <configuration> tag

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

g. Configuration file : mapred-site.xml

## Copy template of mapred-site.xml.template file

$cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml

## To edit file, fire the below given command

$ sudo gedit mapred-site.xml

## Paste these lines into <configuration> tag

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

7. Format Namenode

$hadoop namenode -format

8. Start all Hadoop daemons

Start hdfs daemons

$ start-dfs.sh
Start MapReduce daemons:

$ start-yarn.sh

Instead of both of the above commands you can also use start-all.sh, but it is now deprecated and its
use is no longer recommended.

9. Track/Monitor/Verify

Verify Hadoop daemons:

$ jps

Monitor the Hadoop ResourceManager and the Hadoop NameNode

If you wish to track Hadoop MapReduce as well as HDFS, you can explore the Hadoop web views of
the ResourceManager and the NameNode, which are commonly used by Hadoop administrators.
Open your default browser and visit the following links.

For ResourceManager – http://localhost:8088

For NameNode – http://localhost:50070

If jps lists all the Hadoop daemons and the web interfaces above load correctly, then congratulations,
you have successfully installed Apache Hadoop on your Ubuntu machine.
iii) Fully-Distributed Mode

Prerequisites

1. Installation and configuration of single-node Hadoop :

Install and configure single-node Hadoop, which will be our master node.

2. Prepare your computer network (decide the number of nodes in the cluster) :

Based on parameters such as the purpose of the Hadoop multi-node cluster, the size of the
dataset to be processed and the availability of machines, you need to decide the number of
master nodes and slave nodes to be configured for the Hadoop cluster setup.

3. Basic installation and configuration :

Step 3A: Decide the hostnames of the nodes to be configured in the further steps. We will name the
master node HadoopMaster and the two slave nodes HadoopSlave1 and HadoopSlave2 respectively
in the /etc/hosts file. After deciding the hostnames of all nodes, assign them by updating the
hostnames (you can skip this step if you do not want to set up names). Add all host names to the
/etc/hosts file on all machines (master and slave nodes).

# Edit the /etc/hosts file with following command

$ sudo gedit /etc/hosts

# Add following hostname and their ip in host table

192.168.2.14 HadoopMaster

192.168.2.15 HadoopSlave1

192.168.2.16 HadoopSlave2

Step 3B: Create the hadoop group and the hduser user on all machines (if not already created).

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser


If you need to add hduser to the sudoers, fire the following command:

$ sudo usermod -a -G sudo hduser


OR

$sudo gedit /etc/sudoers

Add the following line in /etc/sudoers:

hduser ALL=(ALL:ALL) ALL

Step 3C: Install rsync for sharing the Hadoop installation with all the other machines:

$sudo apt-get install rsync

Step 3D: To make the above changes take effect, reboot all of the machines.

$sudo reboot

4. Hadoop configuration steps

☑ Applying Common Hadoop Configuration :

Although we will be configuring a master-slave architecture, we first need to apply the changes that
are common to both master and slave nodes in the Hadoop config files before we distribute these
Hadoop files over the rest of the machines/nodes. Hence, these changes are made on your single-node
Hadoop setup. From step 6 onwards we will make changes specifically for the master and slave nodes
respectively.

Changes:

1. Update core-site.xml

Update this file by changing the hostname from localhost to HadoopMaster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit core-site.xml

## Paste these lines into <configuration> tag, or just update the existing entry by
## replacing localhost with HadoopMaster

<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9000</value>
</property>

2. Update hdfs-site.xml

Update this file by changing the replication factor from 1 to 3.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit hdfs-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

3. Update yarn-site.xml

Update this file by updating the following three properties, changing the hostname from
localhost to HadoopMaster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit yarn-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8050</value>
</property>

4. Update mapred-site.xml

Update this file by updating and adding the following properties.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit mapred-site.xml

## Paste/Update these lines into <configuration> tag

<property>
<name>mapreduce.job.tracker</name>
<value>HadoopMaster:5431</value>
</property>
<property>
<name>mapred.framework.name</name>
<value>yarn</value>
</property>
5. Update masters

Update the masters file with the name(s) of the master node(s) of the Hadoop cluster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit masters

## Add name of master nodes

HadoopMaster

6. Update slaves

Update the slaves file with the names of the slave nodes of the Hadoop cluster.

## To edit file, fire the below given command

hduser@HadoopMaster:/usr/local/hadoop/etc/hadoop$ sudo gedit slaves

## Add name of slave nodes

HadoopSlave1
HadoopSlave2

☑ Copying/Sharing/Distributing the Hadoop config files to all the other master/slave nodes

Use rsync to distribute the configured Hadoop installation to the rest of the nodes over the network.

☑ Create the hadoop folder on the Hadoop slaves at /usr/local :

# On the HadoopSlave1 machine
$ sudo mkdir /usr/local/hadoop

# From the HadoopMaster machine
$ sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave1:/usr/local/hadoop/

# On the HadoopSlave2 machine
$ sudo mkdir /usr/local/hadoop

# From the HadoopMaster machine
$ sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave2:/usr/local/hadoop/

The above commands copy the files stored in the hadoop folder to the slave nodes at
/usr/local/hadoop, so you do not need to download and configure Hadoop again on the remaining
nodes. You only need Java and rsync installed on all nodes, and the JAVA_HOME path must match
the one set in the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file of your Hadoop distribution,
which we already configured in the single-node Hadoop setup.

☑ Applying Master node specific Hadoop configuration: (Only for master nodes)
These are configurations to be applied on the Hadoop master nodes (since we have only one
master node, they will be applied to just that node).

step a: Remove the existing Hadoop data folder (which was created during the single-node
Hadoop setup).

$ sudo rm -rf /usr/local/hadoop_tmp/
step b: Re-create the /usr/local/hadoop_tmp/ directory and create the NameNode directory
(/usr/local/hadoop_tmp/hdfs/namenode).
$ sudo mkdir -p /usr/local/hadoop_tmp/

$ sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode


step c: Make hduser as owner of that directory.
$sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

☑ Applying Slave node specific Hadoop configuration : (Only for slave nodes)

Since we have two slave nodes, we will be applying the following changes on the HadoopSlave1 and
HadoopSlave2 nodes.

step a: Remove the existing Hadoop data folder (which was created during the single-node Hadoop
setup).

$ sudo rm -rf /usr/local/hadoop_tmp/hdfs/

step b: Re-create the /usr/local/hadoop_tmp/ directory and, inside it, create the DataNode
directory (/usr/local/hadoop_tmp/hdfs/datanode).

$sudo mkdir -p /usr/local/hadoop_tmp/

$sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode


step c: Make hduser as owner of that directory
$sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

☑ Copying the ssh key to set up passwordless ssh access from the master to the slave nodes :

To manage (start/stop) all nodes of the master-slave architecture, hduser (the Hadoop user of the
master node) needs to be able to log in to all slave nodes as well as the master node, which is made
possible by setting up passwordless SSH login. (If you do not set this up, you will have to provide a
password while starting and stopping daemons on the slave nodes from the master node.)

Fire the following commands to copy the public SSH key – the $HOME/.ssh/id_rsa.pub file (of the
HadoopMaster node) – into the authorized_keys file ($HOME/.ssh/authorized_keys) of
hduser@HadoopSlave1 and hduser@HadoopSlave2.

hduser@HadoopMaster:~$ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave1


hduser@HadoopMaster:~$ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave2

Format the NameNode (run on the master node) :

# Run this command from the master node

hduser@HadoopMaster:/usr/local/hadoop$ hdfs namenode -format

Starting up the Hadoop cluster daemons : (run on the master node)

Start HDFS daemons:

hduser@HadoopMaster:/usr/local/hadoop$ start-dfs.sh

Start MapReduce daemons:


hduser@HadoopMaster:/usr/local/hadoop$ start-yarn.sh
Instead of both of the above commands you can also use start-all.sh, but it is now deprecated and its
use is no longer recommended.

Track/Monitor/Verify Hadoop cluster : (Run on any Node)

Verify Hadoop daemons on Master :

hduser@HadoopMaster: jps

Verify Hadoop daemons on all slave nodes :

hduser@HadoopSlave1: jps
hduser@HadoopSlave2: jps

(The running services on HadoopSlave1 will be the same on all slave nodes configured in the Hadoop
cluster.)

Monitor the Hadoop ResourceManager and the Hadoop NameNode via their web interfaces.

If you wish to track Hadoop MapReduce as well as HDFS, you can also explore the Hadoop web
views of the ResourceManager and the NameNode, which are commonly used by Hadoop
administrators. Open your default browser and visit the following links from any of the nodes.

For ResourceManager – http://HadoopMaster:8088

For NameNode – http://HadoopMaster:50070

If jps lists the expected daemons on the master and slave nodes and the web interfaces load correctly,
then congratulations, you have successfully installed Apache Hadoop on your cluster.
3. Implement the following file management tasks in Hadoop:

i) Adding files and directories.

ii) Retrieving files.

iii) Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities.

> To know your hadoop version

$ hadoop version

> Result

Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-
2.7.3.jar

> Start Hadoop using the commands

$ start-dfs.sh
$ start-yarn.sh

> Check whether all the daemons are running by using the command:

$ jps

> If you are dealing with the filesystem, use the fs command

$ hadoop fs

The above command gives the list of options for the Hadoop filesystem:

Usage: hadoop fs [generic options]


[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]

[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir>
[<snapshotName>]] [-deleteSnapshot
<snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]

Generic options supported are


-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to
the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in
the classpath.
-archives <comma separated list of archives> specify comma separated archives to
be unarchived on the compute machines.

The general command line syntax is


bin/hadoop command [genericOptions] [commandOptions]

i) Adding files and directories.

> Create a directory in HDFS
$ hadoop fs -mkdir /path

> Add a file to HDFS
$ hadoop fs -put localfile /path

ii) Retrieving files.

> Display a file stored in HDFS
$ hadoop fs -cat /path

> Retrieve a file from HDFS to the local filesystem
$ hadoop fs -get /path /localpath

iii) Deleting files


> Delete file from HDFS
$ hadoop fs -rm /filepath

> Delete folder from HDFS


$hadoop fs -rm -R /folder_path

How to Run a Hadoop Program

> First, set up the environment variables for Java if they are not already set

HINT :: To know the Java home, type "echo $JAVA_HOME"

$ export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101
$ export PATH=${JAVA_HOME}/bin:${PATH}
$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

> Load your input data into HDFS.

Use the filesystem commands. For example, suppose your example is wordcount. Create a folder
wordcount in HDFS using the following command:

$ hadoop fs -mkdir /wordcount

Next, create an input directory for the text files you are storing:

$ hadoop fs -mkdir /wordcount/input

OR

If you want to create both directories with a single command, use the following:

$ hadoop fs -mkdir -p /wordcount/input

Copy your data into the HDFS directory:

$ hadoop fs -put <local files> <hdfs directory path>

For example, if a file named wordfile.txt is in your home directory, load it into the HDFS
directory /wordcount/input:

$ hadoop fs -put ~/wordfile.txt /wordcount/input/

> Compile your MapReduce program

$ hadoop com.sun.tools.javac.Main WordCount.java

> Create the jar file.

Syntax for creating the jar file:

$ jar cf <filename.jar> <class name1 ...>

Example:

$ jar cf wc.jar WordCount*.class

> Run the Hadoop program

Syntax:

$ hadoop jar <filename.jar> <classname> <input directory> <output directory>

Example:

$ hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output
2. Run a basic Word Count Map Reduce program to understand Map
Reduce Paradigm

Solution:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiling and Running a Program

$ hadoop fs -mkdir -p /wordcount/input
$ hadoop fs -put ~/input.txt /wordcount/input

$ hadoop com.sun.tools.javac.Main WordCount.java

$ jar cf wc.jar WordCount*.class
$ hadoop jar wc.jar WordCount /wordcount/input /wordcount/output

$ hadoop fs -cat /wordcount/output/part-r-00000

Input.txt(input file)

Apache > Hadoop > Apache Hadoop 3.0.0-alpha1 Wiki | git | Last Published: 2016-08-30 |
Version: 3.0.0-alpha1 General Overview Single Node Setup Cluster Setup commands
Reference FileSystem Shell Compatibility Interface Classification FileSystem Specification
Common CLI Mini Cluster Etc ……………

Output:

(jar/executable 1
(multi-terabyte 1
(see 2
(thousands 1
A 1
Architecture 2
Distributed 1
File 1
Guide) 1
Guide). 1
HDFS 1
Hadoop 3
MRAppMaster 1
MapReduce 4
Minimally, 1
NodeManager 1
ResourceManager 1
3. Write a Map Reduce program that mines weather data.

Solution:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(0, 4);
      int airTemperature;
      if (line.charAt(5) != '-') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(6, 8));
      } else {
        airTemperature = Integer.parseInt(line.substring(5, 8));
      }
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }

  public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "maxtemprature");
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiling and Running a Program

$ hadoop fs -mkdir -p /weather/input
$ hadoop fs -put ~/input.txt /weather/input

$ hadoop com.sun.tools.javac.Main MaxTemperature.java

$ jar cf weather.jar MaxTemperature*.class
$ hadoop jar weather.jar MaxTemperature /weather/input /weather/output

Input file:
1940,+761
1940,+341
1940,-041
1940,+221
1940,+481
1940,+921
1940,+981
1940,+861
1950,-241
1950,+521
1950,+041
1950,+041
1950,+941
1950,+761
1950,+101
1950,+101
1950,+401
1950,+041
1955,+841
1955,+621
1955,-021
1955,+761
1955,+261
1955,+981
1955,+941
1955,+301
1955,+961
1955,+721
1955,+721
1955,+881
1955,+981
1955,-061
1955,+361
1955,+581
1955,+941

Output:

1940 98
1950 94
1955 98
4a. Implement Linear Regression using R

library(datasets)

InputData <- as.data.frame(state.x77)

colnames(InputData)[4] = "Life.Exp"

colnames(InputData)[6] = "HS.Grad"

InputData$Density = InputData$Population * 1000 / InputData$Area

fit1 <- lm(Life.Exp ~ ., data=InputData)

summary(fit1)

#It appears higher populations are related to increased life expectancy and

#higher murder rates are strongly related to decreased life expectancy.

#High school graduation rate is marginal. Other than that,

#we're not seeing much. Another kind of summary of the model can be obtained like this.

fit2 <- lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +

HS.Grad + Frost + Density, data = InputData)

summary(fit2)

anova(fit1, fit2)

#As you can see, removing "Area" had no significant effect on the model (p = .4205).

#Compare the p-value to that for "Area" in the first summary table above.
fit3 <- lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost + Density, data = InputData)

summary(fit3)

fit4 <- lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost , data = InputData)

summary(fit4)

confint(fit4)

par(mfrow=c(2,2))

plot(fit1, 1)

plot(fit2, 1)

plot(fit3, 1)

plot(fit4, 1)

par(mfrow=c(1,1))

#The model object is a list containing quite a lot of information.

names(fit4)

Output:

Call:

lm(formula = Life.Exp ~ ., data = InputData)

Residuals:

Min 1Q Median 3Q Max


-1.47514 -0.45887 -0.06352 0.59362 1.21823

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16 ***

Population 6.480e-05 3.001e-05 2.159 0.0367 *

Income 2.701e-04 3.087e-04 0.875 0.3867

Illiteracy 3.029e-01 4.024e-01 0.753 0.4559

Murder -3.286e-01 4.941e-02 -6.652 5.12e-08 ***

HS.Grad 4.291e-02 2.332e-02 1.840 0.0730 .

Frost -4.580e-03 3.189e-03 -1.436 0.1585

Area -1.558e-06 1.914e-06 -0.814 0.4205

Density -1.105e-03 7.312e-04 -1.511 0.1385

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7337 on 41 degrees of freedom

Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013

F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10

Call:

lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +

HS.Grad + Frost + Density, data = InputData)


Residuals:

Min 1Q Median 3Q Max

-1.50252 -0.40471 -0.06079 0.58682 1.43862

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16 ***

Population 6.249e-05 2.976e-05 2.100 0.0418 *

Income 1.485e-04 2.690e-04 0.552 0.5840

Illiteracy 1.452e-01 3.512e-01 0.413 0.6814

Murder -3.319e-01 4.904e-02 -6.768 3.12e-08 ***

HS.Grad 3.746e-02 2.225e-02 1.684 0.0996 .

Frost -5.533e-03 2.955e-03 -1.873 0.0681 .

Density -7.995e-04 6.251e-04 -1.279 0.2079

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7307 on 42 degrees of freedom

Multiple R-squared: 0.746, Adjusted R-squared: 0.7037

F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10

Analysis of Variance Table

Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +

Frost + Area + Density


Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +

Frost + Density

Res.Df RSS Df Sum of Sq F Pr(>F)

1 41 22.068

2 42 22.425 -1 -0.35639 0.6621 0.4205

Call:

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost +

Density, data = InputData)

Residuals:

Min 1Q Median 3Q Max

-1.56877 -0.40951 -0.04554 0.57362 1.54752

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16 ***

Population 6.083e-05 2.676e-05 2.273 0.02796 *

Murder -3.160e-01 3.910e-02 -8.083 3.07e-10 ***

HS.Grad 4.233e-02 1.525e-02 2.776 0.00805 **

Frost -5.999e-03 2.414e-03 -2.485 0.01682 *

Density -5.864e-04 5.178e-04 -1.132 0.26360

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7174 on 44 degrees of freedom

Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144

F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12

Call:

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,

data = InputData)

Residuals:

Min 1Q Median 3Q Max

-1.47095 -0.53464 -0.03701 0.57621 1.50683

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***

Population 5.014e-05 2.512e-05 1.996 0.05201 .

Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***

HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **

Frost -5.943e-03 2.421e-03 -2.455 0.01802 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7197 on 45 degrees of freedom

Multiple R-squared: 0.736, Adjusted R-squared: 0.7126


F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12

2.5 % 97.5 %

(Intercept) 6.910798e+01 72.9462729104

Population -4.543308e-07 0.0001007343

Murder -3.738840e-01 -0.2264135705

HS.Grad 1.671901e-02 0.0764454870

Frost -1.081918e-02 -0.0010673977
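
As an optional check (not part of the original lab script), the final model fit4 can be used to predict
life expectancy for a new observation; the predictor values below are made-up figures chosen only
for illustration.

# Predict life expectancy for a hypothetical state profile using the final model fit4
new_state <- data.frame(Population = 4000,  # population in thousands, as in state.x77
                        Murder = 7.5,       # murder rate per 100,000
                        HS.Grad = 55,       # percent high-school graduates
                        Frost = 100)        # mean days with minimum temperature below freezing
predict(fit4, newdata = new_state, interval = "confidence")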


4b) Implement Logistic regression using R

# Loading package
library(caTools)
library(ROCR)

# Splitting dataset
split <- sample.split(mtcars, SplitRatio = 0.8)
split

train_reg <- subset(mtcars, split == "TRUE")


test_reg <- subset(mtcars, split == "FALSE")

# Training model
logistic_model <- glm(vs ~ wt + disp,
data = train_reg,
family = "binomial")
logistic_model

# Summary
summary(logistic_model)

# Predict test data based on model


predict_reg <- predict(logistic_model,
test_reg, type = "response")
predict_reg

# Changing probabilities
predict_reg <- ifelse(predict_reg >0.5, 1, 0)

# Evaluating model accuracy


# using confusion matrix
table(test_reg$vs, predict_reg)

missing_classerr <- mean(predict_reg != test_reg$vs)


print(paste('Accuracy =', 1 - missing_classerr))

# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")


auc <- auc@y.values[[1]]
auc

# Plotting curve
plot(ROCPer)
plot(ROCPer, colorize = TRUE,
print.cutoffs.at = seq(0.1, by = 0.1),
main = "ROC CURVE")
abline(a = 0, b = 1)

auc <- round(auc, 4)


legend(.6, .4, auc, title = "AUC", cex = 1)

Output:

[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE

Call: glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Coefficients:

(Intercept) wt disp

2.03395 2.54026 -0.05441

Degrees of Freedom: 22 Total (i.e. Null); 20 Residual

Null Deviance: 31.49

Residual Deviance: 12.69 AIC: 18.69

Call:

glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Deviance Residuals:

Min 1Q Median 3Q Max


-1.70195 -0.16130 -0.01561 0.49586 1.80907

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.03395 3.02219 0.673 0.5009

wt 2.54026 2.10430 1.207 0.2274

disp -0.05441 0.02883 -1.887 0.0591 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 31.492 on 22 degrees of freedom

Residual deviance: 12.692 on 20 degrees of freedom

AIC: 18.692

Number of Fisher Scoring iterations: 7

Hornet 4 Drive Hornet Sportabout Merc 230 Cadillac Fleetwood

2.108528e-02 1.482411e-04 9.148477e-01 3.319776e-05

Lincoln Continental Toyota Corolla Fiat X1-9 Porsche 914-2

9.922485e-05 9.440919e-01 9.340524e-01 7.158836e-01

Maserati Bora

5.087486e-03

   predict_reg
      0  1
  0   4  1
  1   1  3
[1] "Accuracy = 0.777777777777778"

[1] 0.775
5. Implement SVM/ Decision tree techniques

1) Step 1 :Import the required packages.


library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3

2) Step 2: Import the dataset.

m<-read.csv("C:/Users/pradeep/OneDrive/datasets/students_placement_data.csv")
head(m) # Check the first 6 rows.
## Roll.No Gender Section SSC.Percentage inter_Diploma_percentage
## 1 1 M A 87.30 65.3
## 2 2 F B 89.00 92.4
## 3 3 F A 67.00 68.0
## 4 4 M A 71.00 70.4
## 5 5 M A 67.00 65.5
## 6 6 M A 81.26 68.0
## B.Tech_percentage Backlogs registered_for_.Placement_Training
## 1 40.00 18 NO
## 2 71.45 0 yes
## 3 45.26 13 yes
## 4 36.47 17 yes
## 5 42.52 17 yes
## 6 62.20 6 yes
## placement.status
## 1 Not placed
## 2 Placed
## 3 Not placed
## 4 Not placed
## 5 Not placed
## 6 Not placed
str(m) # Check the structure of the dataset
## 'data.frame': 117 obs. of 9 variables:
## $ Roll.No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ Section : Factor w/ 2 levels "A","B": 1 2 1 1 1 1 1 1 1 1 ...
## $ SSC.Percentage : num 87.3 89 67 71 67 ...
## $ inter_Diploma_percentage : num 65.3 92.4 68 70.4 65.5 68 56.5 79.3 89.6 75.5 ...
## $ B.Tech_percentage : num 40 71.5 45.3 36.5 42.5 ...
## $ Backlogs : int 18 0 13 17 17 6 20 3 10 8 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 2 2 2 2 1 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 2 1 1 1 1 1 1 1 1 ...

3) Step 3: Divide the data (117 observations) into training data and test data.

n=nrow(m) # n is total number of rows.


set.seed(101)

# We use the sample function to partition the data. Here 85 percent is training data and 15 percent is
# test data. Note that since "replace = TRUE", we may have a row sampled more than once.
data_index=sample(1:n, size = round(0.85*n),replace = TRUE)
train_data=m[data_index,]
test_data=m[-data_index,]

4) Check the structure of training and test data (Optional).

str(train_data)
## 'data.frame': 99 obs. of 9 variables:
## $ Roll.No : int 44 6 84 77 30 36 69 40 73 64 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
## $ Section : Factor w/ 2 levels "A","B": 2 1 1 1 2 2 1 2 2 1 ...
## $ SSC.Percentage : num 86 81.3 89 78 72 ...
## $ inter_Diploma_percentage : num 92.5 68 88.9 59 88.1 90 61 88.8 83.7 69.2 ...
## $ B.Tech_percentage : num 70.8 62.2 63 51.1 69.6 ...
## $ Backlogs : int 0 6 1 17 0 0 6 0 0 20 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 2 2 1 1 2 2 1 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 2 1 1 1 1 ...
str(test_data)
## 'data.frame': 49 obs. of 9 variables:
## $ Roll.No : int 1 4 7 8 11 12 14 15 17 18 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 1 ...
## $ Section : Factor w/ 2 levels "A","B": 1 1 1 1 2 1 2 1 2 2 ...
## $ SSC.Percentage : num 87.3 71 71 84.8 82.3 ...
## $ inter_Diploma_percentage : num 65.3 70.4 56.5 79.3 76.3 66 88.7 52.2 85 95.1 ...
## $ B.Tech_percentage : num 40 36.5 33.8 61 71.5 ...
## $ Backlogs : int 18 17 20 3 0 16 0 7 0 0 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 1 1 2 2 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 1 2 1 1 2 ...

5) Build a decision tree model using the "rpart" function.

- Provide the class label (placement.status) and the attributes/variables.
- Here method is "class" because we are doing classification, not numeric prediction.
- Two types of split criteria can be used (parms): Gini and entropy (information). The default split
criterion is Gini; an entropy-based call is sketched below, after the printed model.

stu_model<-rpart(formula =placement.status~
Backlogs+Gender+B.Tech_percentage+SSC.Percentage+inter_Diploma_percentage,
data=train_data,method = "class",parms = list(split="gini"))

# Print the model.


print(stu_model)
## n= 99
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 99 28 Not placed (0.71717172 0.28282828)
## 2) B.Tech_percentage< 67.135 63 2 Not placed (0.96825397 0.03174603) *
## 3) B.Tech_percentage>=67.135 36 10 Placed (0.27777778 0.72222222)
## 6) SSC.Percentage< 83.58 11 3 Not placed (0.72727273 0.27272727) *
## 7) SSC.Percentage>=83.58 25 2 Placed (0.08000000 0.92000000) *
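
As mentioned above, the same tree can be grown with the entropy (information) split criterion
instead of Gini. This is a minimal sketch; only the parms argument changes, and stu_model_info is
a name introduced here for illustration.

# Same model, but using the entropy (information) split criterion
stu_model_info <- rpart(formula = placement.status~
Backlogs+Gender+B.Tech_percentage+SSC.Percentage+inter_Diploma_percentage,
data = train_data, method = "class", parms = list(split = "information"))
print(stu_model_info)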

6) Draw a decision tree.

We use the rpart.plot function from the rpart.plot package.

- type=5 means we want to show the split variable name in the interior nodes.
- extra=2 means we want to display the classification rate at the node, expressed as the number
of correct classifications and the number of observations in the node.

rpart.plot(stu_model,type=5,extra = 2 )

7) Apply the model stu_model to the test data using the predict function.

In the predict function, give the model name stu_model and test_data as input, and specify
type="class" because we are doing classification.

p<-predict(stu_model,test_data,type="class")
print(p)
## 1 4 7 8 11 12
## Not placed Not placed Not placed Not placed Not placed Not placed
## 14 15 17 18 19 21
## Placed Not placed Placed Placed Not placed Placed
## 23 26 31 32 34 35
## Not placed Not placed Placed Not placed Not placed Not placed
## 37 41 42 43 45 55
## Not placed Placed Not placed Not placed Not placed Not placed
## 56 57 58 59 60 63
## Placed Not placed Not placed Placed Placed Not placed
## 65 66 68 71 74 75
## Placed Placed Placed Placed Not placed Not placed
## 76 85 87 88 89 93
## Not placed Not placed Not placed Not placed Not placed Not placed
## 100 101 102 105 106 114
## Not placed Not placed Not placed Not placed Not placed Not placed
## 116
## Placed
## Levels: Not placed Placed

8) Print the confusion matrix.


The "table" command is used to draw the confusion matrix. "test_data[,9]" contains the original class
labels and "p" contains the predicted class labels. The confusion matrix gives the number of correct
predictions and the number of wrong predictions.

t<-table(test_data[,9],p)
print(t)
## p
## Not placed Placed
## Not placed 29 2
## Placed 6 12

In the above table, (29 + 12) = 41 are correct predictions and (6 + 2) = 8 are wrong predictions.

9) Find the accuracy of the model.
The accuracy of the model is the number of correct predictions on the test set divided by the total
number of samples in the test set; here that is 41 / 49 ≈ 0.8367.

- Note: the diagonal elements of the matrix t are the correct predictions.

print(sum(diag(t))/sum(t))
## [1] 0.8367347

6. Implement Clustering techniques

Step 1: Import the package.

library("cluster")

Step 2: Import the dataset

m<-read.csv("C:/Users/pradeep/OneDrive/datasets/hclustdata.csv")
head(m)
## Name Gender SSC.Perc.entage inter.Diploma.perc
## 1 ARIGELA AVINASH M 87.30 65.3
## 2 BALADARI KEERTHANA F 89.00 92.4
## 3 BAVIRISETTI PRAVALIKA F 67.00 68.0
## 4 BODDU SAI BABA M 71.00 70.4
## 5 BONDAPALLISRINIVAS M 67.00 65.5
## 6 CH KANAKARAJU M 81.26 68.0
## B.Tech.perc Back.logs
## 1 40.00 18
## 2 71.45 0
## 3 45.26 13
## 4 36.47 17
## 5 42.52 17
## 6 62.20 6

Step 3a: Apply agglomerative hierarchical clustering with single link (MIN technique)

clust1<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "single")


pltree(clust1)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0

Step 3b: Apply agglomerative hierarchical clustering with complete link (MAX technique)

clust2<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "complete")


pltree(clust2)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0

Step 3c: Apply agglomerative hierarchical clustering with the group average technique

clust3<-agnes(x = m,stand = TRUE,metric = "euclidean",method = "average")
pltree(clust3)

m[c(14,16),] # Check whether 14 and 16 are in same cluster or not


## Name Gender SSC.Perc.entage inter.Diploma.perc B.Tech.perc
## 14 EDARA ROJA F 87.10 88.7 74.96
## 16 GADIPALLI MADHURI F 85.83 87.0 75.96
## Back.logs
## 14 0
## 16 0
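
To go from the dendrograms to actual cluster assignments (a step not shown above), the tree can be
cut at a chosen number of groups with cutree. This is a minimal sketch; k = 3 is an arbitrary choice
used only for illustration.

# Cut the average-linkage dendrogram into k = 3 clusters and inspect the assignments
groups <- cutree(as.hclust(clust3), k = 3)
table(groups)        # cluster sizes
groups[c(14, 16)]    # check whether rows 14 and 16 fall in the same cluster
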
7. Visualize data using any plotting framework.

#Installing ggplot2

# ggplot2 is an R package dedicated to data visualization. It can greatly improve the quality and
# aesthetics of your graphics, and will make you much more efficient in creating them.

install.packages("ggplot2")

7a) Scatter plot

# load ggplot2
library(ggplot2)
library(hrbrthemes)

# The iris dataset is natively available in R

# head(iris)

# A basic scatterplot with color depending on Species


ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=6) +
theme_ipsum()

Output:
7b) Box plot

# Load ggplot2
library(ggplot2)

# The mtcars dataset is natively available


# head(mtcars)

# A really basic boxplot.


ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) +
geom_boxplot(fill="slateblue", alpha=0.2) +
xlab("cyl")

output:
7c) Bar Plot

# Load ggplot2
library(ggplot2)

# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)

# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity")
Output:
7d) Histogram:

# library
library(ggplot2)

# dataset:
data=data.frame(value=rnorm(100))

# basic histogram
p <- ggplot(data, aes(x=value)) +
geom_histogram()

#p

Output:
