BDA Manual

Experiment No: - 01.

Aim: -
To Study Big-Data.

Theory: -

BIG DATA: -

Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not merely data; it has become a complete subject, which involves
various tools, technologies and frameworks.

Why Big Data?

Key enablers for the growth of “Big Data” are:


 Increase in storage capacities.
 Increase in processing power.
 Availability of data.

Google’s Solution: -
Google solved the problem of processing its huge datasets using an algorithm called
MapReduce. This algorithm divides the task into small parts, assigns those parts to many
computers connected over the network, and collects the results to form the final result dataset.

Such a cluster is built from commodity hardware, which can range from single-CPU
machines to servers with higher capacity.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The Reduce task then takes the output
from a map as its input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of machines
in a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
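As a rough, non-Hadoop sketch of this idea (the class and variable names here are only
illustrative, and the whole computation runs in memory), a word count can be written as an
explicit map, shuffle and reduce over key/value pairs:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    public static void main(String[] args) {
        String[] input = { "big data is big", "data is everywhere" };

        // Map: emit a (word, 1) pair for every word in the input.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }

        // Shuffle: group the emitted values by key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }

        // Reduce: combine each key's list of values into a single result.
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(counts);   // prints the counts, e.g. {big=2, data=2, is=2, everywhere=1}
    }
}

In Hadoop, the map and reduce steps of this sketch run as distributed tasks, and the shuffle is
performed by the framework between them.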

The Algorithm: -

 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
Map stage: -
The map or mapper's job is to process the input data. Generally, the input
data is in the form of a file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
Reduce stage : -
This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks, which reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Conclusion: -

Hence we have successfully studied Big-Data.


Experiment No: -02.
Aim: -
To study the Hadoop ecosystem.
Theory: -

HADOOP: -
Hadoop is an open-source Apache framework written in Java that allows distributed
storage and processing of large datasets across clusters of computers using simple
programming models. A Hadoop application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from a single server to thousands of machines, each offering local computation and
storage.
Hadoop Architecture: -
Hadoop framework includes following four modules:
Hadoop Common: -
These are Java libraries and utilities required by other Hadoop modules. These libraries
provide filesystem and OS-level abstractions and contain the necessary Java files and scripts
required to start Hadoop.
Hadoop YARN: -
This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): -
A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: -
This is a YARN-based system for parallel processing of large data sets.
Together, these four modules form the core of the Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned
above but also to the collection of additional software packages that can be installed on top of
or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
MapReduce: -
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
 The Map Task: -
This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
 The Reduce Task: -
This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after
the map task.
Typically both the input and the output are stored in a file system. The framework takes
care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node. The master is responsible for resource management, tracking
resource consumption/availability and scheduling the jobs component tasks on the slaves,
monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as
directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which
means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System: -
Hadoop can work directly with any mountable distributed file system such as Local FS,
HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop
Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode
that manages the file system metadata and one or more slave DataNodes that store the actual
data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a
set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes take care of read and write operations with the file system. They also take care of
block creation, deletion and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to
interact with the file system; a few common commands are shown below.
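For illustration, a few of the most commonly used HDFS shell commands (the directory and
file names here are only examples):

$ hadoop fs -mkdir /user/cloudera/demo              # create a directory in HDFS
$ hadoop fs -put data.txt /user/cloudera/demo/      # copy a local file into HDFS
$ hadoop fs -ls /user/cloudera/demo                 # list the contents of a directory
$ hadoop fs -cat /user/cloudera/demo/data.txt       # print the contents of a file
$ hadoop fs -rm /user/cloudera/demo/data.txt        # delete a file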
How Does Hadoop Work?
Stage 1: -
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required
processing by specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a JAR file, containing the implementation of the map and
reduce functions.
3. The job configuration by setting different parameters specific to the job.
Stage 2: -
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration
to the slaves, scheduling tasks and monitoring them, providing status and diagnostic
information to the job-client.
Stage 3: -
The TaskTrackers on different nodes execute the tasks as per the MapReduce
implementation, and the output of the reduce function is stored in output files on the file
system.
Hadoop framework

Advantages of Hadoop: -

 Hadoop does not rely on hardware to provide fault-tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect and
handle failures at the application layer.
 It allows the user to quickly write and test distributed systems. It is efficient, and it
automatically distributes the data and work across the machines and, in turn, utilizes
the underlying parallelism of the CPU cores.
 Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
Conclusion: -
Hence we have successfully studied the Hadoop ecosystem.
Experiment No: -03.

Aim: -

To install and configure MongoDB to execute NoSQL commands.

What is NoSQL?

NoSQL refers to non-relational database management systems, which differ from traditional
relational database management systems in some significant ways. They are designed for
distributed data stores with very large-scale data storage needs (for example, Google or
Facebook, which collect terabytes of data every day for their users). These data stores may not
require a fixed schema, avoid join operations and typically scale horizontally.

Why NoSQL?

In today's time, data is becoming easier to access and capture through third parties
such as Facebook, Google+ and others. Personal user information, social graphs, geo-location
data, user-generated content and machine logging data are just a few examples where the
data has been increasing exponentially. To provide such services properly, huge amounts of
data must be processed, something SQL databases were never designed for. NoSQL databases
evolved to handle this data properly.

Brief History of NoSQL: -

The term NoSQL was coined by Carlo Strozzi in the year 1998. He used this term to
name his open-source, lightweight database which did not have an SQL interface.

In early 2009, when Last.fm wanted to organize an event on open-source
distributed databases, Eric Evans, a Rackspace employee, reused the term to refer to databases
which are non-relational, distributed, and do not conform to atomicity, consistency,
isolation and durability (ACID), the four defining features of traditional relational database
systems.

In the same year, at the "no:sql(east)" conference held in Atlanta, USA, NoSQL was
discussed and debated extensively.

Since then, discussion and practice of NoSQL gained momentum, and NoSQL saw
unprecedented growth.

NoSQL vs. RDBMS: -

The NoSQL term can be applied to some databases that predated the relational
database management system, but it more commonly refers to the databases built in the
early 2000s for the purpose of large-scale database clustering in cloud and web applications.
In these applications, requirements for performance and scalability outweighed the need for
the immediate, rigid data consistency that the RDBMS provided to transactional enterprise
applications.

Notably, the NoSQL systems were not required to follow an established relational
schema. Large-scale web organizations such as Google and Amazon used NoSQL databases
to focus on narrow operational goals and employ relational databases as adjuncts where
high-grade data consistency is necessary.

Early NoSQL databases for web and cloud applications tended to focus on very
specific characteristics of data management. The ability to process very large volumes of
data and quickly distribute that data across computing clusters were desirable traits in web
and cloud design. Developers who implemented cloud and web systems also looked to create
flexible data schema -- or no schema at all -- to better enable fast changes to applications that
were continually updated.

Evolution of NoSQL: -

Berkeley DB was an influential system in the early evolution of NoSQL database usage.
Developed at the University of California, Berkeley, beginning in the 1990s, Berkeley DB was
widely described as an embedded database that closely supported specific applications'
storage needs. This open source software provided a simple key-value store. Berkeley DB
was commercially released by Sleepycat Software in 1999. The company was later acquired
by Oracle in 2006. Oracle has continued to support open source Berkeley DB. Other NoSQL
databases that have gained prominence include cloud-hosted NoSQL databases such as
Amazon DynamoDB and Google BigTable, as well as Apache Cassandra and MongoDB.

The basic NoSQL database classifications are only guides. Over time, vendors have
mixed and matched elements from different NoSQL database family trees to achieve more
generally useful systems. That evolution is seen, for example, in MarkLogic, which has added
a graph store and other elements to its original document databases. Couchbase Server
supports both key-value and document approaches. Cassandra has combined key-value
elements with a wide-column store and a graph database. Sometimes NoSQL elements are
mixed with SQL elements, creating a variety of databases that are referred to as multimodel
databases.

CAP Theorem (Brewer’s Theorem): -

You must understand the CAP theorem when you talk about NoSQL databases or in
fact when designing any distributed system. CAP theorem states that there are three basic
requirements which exist in a special relation when designing applications for a distributed
architecture.

Consistency: -
This means that the data in the database remains consistent after the execution of an
operation. For example after an update operation all clients see the same data.

Availability: -

This means that the system is always on (guaranteed service availability), with no
downtime.

Partition Tolerance: -

This means that the system continues to function even if the communication among
the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot
communicate with one another.

Theoretically, it is impossible to fulfill all three requirements. CAP therefore requires a
distributed system to choose two of the three requirements. Consequently, all current NoSQL
databases follow different combinations of C, A and P from the CAP theorem. Here is a brief
description of the three combinations CA, CP and AP:

CA: - Single-site cluster; therefore all nodes are always in contact.

CP: - Some data may not be accessible, but the rest is still consistent/accurate.

AP: - The system is still available under partitioning, but some of the data returned may be
inaccurate.

Advantages of NoSQL: -

 High scalability
 Distributed computing
 Lower cost
 Schema flexibility
 Semi-structured data
 No complicated relationships

Disadvantages of NoSQL: -

 No standardization
 Limited query capabilities

NoSQL products categories: -

There are four general types (most common categories) of NoSQL databases. Each of
these categories has its own specific attributes and limitations. There is no single solution that
is better than all the others; however, some databases are better suited to solving specific
problems. To classify the NoSQL databases, let us discuss the most common categories:

 Key-value stores
 Column-oriented databases
 Graph databases
 Document-oriented databases

1. Key-value Stores: -

 Key-value stores are the most basic type of NoSQL database.
 Designed to handle huge amounts of data.
 Based on Amazon's Dynamo paper.
 Key-value stores allow developers to store schema-less data.
 In key-value storage, the database stores data as a hash table where each key is unique
and the value can be a string, JSON, BLOB (Binary Large OBject), etc.
 A key may be a string, hash, list, set or sorted set, and values are stored against these
keys.
 For example, a key-value pair might consist of a key like "Name" that is associated
with a value like "Robin".
 Key-value stores can be used as collections, dictionaries, associative arrays, etc.
 Key-value stores follow the 'Availability' and 'Partition' aspects of the CAP theorem.
 Key-value stores work well for shopping cart contents, or individual values
like color schemes, a landing page URI, or a default account number.

Example of Key-value store DataBase: -

 Redis
 Dynamo
 Riak, etc.

Pictorial Presentation :
2.Column-oriented databases: -

 Column-oriented databases primarily work on columns, and every column is treated
individually.
 Values of a single column are stored contiguously.
 Column stores data in column specific files.
 In Column stores, query processors work on columns too.
 All data within each column datafile has the same type, which makes it ideal for
compression.
 Column stores can improve the performance of queries as it can access specific
column data.
 High performance on aggregation queries (e.g. COUNT, SUM, AVG, MIN, MAX).
 Used for data warehouses and business intelligence, customer relationship
management (CRM), library card catalogs, etc.

Example of Column-oriented databases: -

 BigTable
 Cassandra
 SimpleDB etc.

Pictorial Presentation: -
3.Graph Databases: -

A graph data structure consists of a finite (and possibly mutable) set of ordered pairs,
called edges or arcs, of certain entities called nodes or vertices.

The following picture presents a labeled graph of 6 vertices and 7 edges.

What is a Graph Database?

 A graph database stores data in a graph.
 It is capable of elegantly representing any kind of data in a highly accessible way.
 A graph database is a collection of nodes and edges.
 Each node represents an entity (such as a student or business) and each edge
represents a connection or relationship between two nodes.
 Every node and edge is defined by a unique identifier.
 Each node knows its adjacent nodes.
 As the number of nodes increases, the cost of a local step (or hop) remains the same.
 Indexes are used for lookups.

Example of Graph databases: -


 OrientDB
 Neo4j
 Titan, etc.

Pictorial Presentation: -
4.Document Oriented Databases: -

 A collection of documents
 Data in this model is stored inside documents.
 A document is a key-value collection where the key allows access to its value.
 Documents are not typically forced to have a schema and therefore are flexible and
easy to change.
 Documents are stored into collections in order to group different kinds of data.
 Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents.

Example of Document Oriented databases: -

 MongoDB
 CouchDB etc.

Pictorial Presentation :
SQL Terms/Concepts              MongoDB Terms/Concepts

database                        database

table                           collection

row                             document or BSON document

column                          field

index                           index

table joins                     $lookup, embedded documents

primary key                     primary key
(specify any unique column or   (in MongoDB, the primary key is automatically
column combination)             set to the _id field)

aggregation (e.g. group by)     aggregation pipeline
                                (see the SQL to Aggregation Mapping Chart)

SELECT INTO NEW_TABLE           $out
                                (see the SQL to Aggregation Mapping Chart)

MERGE INTO TABLE                $merge (available starting in MongoDB 4.2;
                                see the SQL to Aggregation Mapping Chart)

transactions                    transactions
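Once MongoDB is installed and the mongo shell is started, a few basic NoSQL commands can
be executed. The database, collection and field names below (bda_lab, students, name, roll,
marks) are purely illustrative:

> use bda_lab
> db.students.insertOne({ name: "Robin", roll: 1, marks: 85 })   // insert a document
> db.students.find({ marks: { $gt: 80 } })                       // query with a filter
> db.students.updateOne({ roll: 1 }, { $set: { marks: 90 } })    // update a field
> db.students.deleteOne({ roll: 1 })                             // delete a document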

Conclusion: -

Hence we have successfully installed and configured MongoDB and executed NoSQL
commands.
Experiment No: -04.

Aim: -
Write a Hadoop MapReduce program to compute the total number of occurrences of
each word present in a text document.

Theory: -
How does it work?
The Hadoop WordCount operation occurs in three stages:
 Mapper Phase
 Shuffle Phase
 Reducer Phase

Hadoop WordCount Example: -


Mapper Phase Execution: -
The text from the input file is tokenized into words to form key-value pairs
for all the words present in the input file. The key is the word from the input file and the
value is '1'. For instance, consider the sentence "An elephant is an animal". The
mapper phase in the WordCount example will split the string into individual tokens, i.e.
words. In this case, the entire sentence is split into 5 tokens (one for each word), each with
a value of 1, as shown below.

Key-Value pairs from Hadoop Map Phase Execution.
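For the sentence above, the mapper emits the following key-value pairs:

(An, 1)
(elephant, 1)
(is, 1)
(an, 1)
(animal, 1)

(The tokenization is case-sensitive, so "An" and "an" are emitted as distinct keys.)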

Hadoop WordCount Example: -


Shuffle Phase Execution: -
After the map phase execution is completed successfully, shuffle phase is executed
automatically wherein the key-value pairs generated in the map phase are taken as input
and then sorted in alphabetical order. After the shuffle phase is executed from the
WordCount example code, the output will look like this,
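For the example sentence, the grouped and sorted reducer input would be (the capitalized
"An" sorts first):

(An, [1])
(an, [1])
(animal, [1])
(elephant, [1])
(is, [1])

The reducer then sums each list of values to produce the final count for each word.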
Running the WordCount Example in Hadoop MapReduce using Java Project with
Eclipse: -

Now, let's create the WordCount Java project with the Eclipse IDE for Hadoop. Even if
you are working on the Cloudera VM, the steps for creating the Java project apply to any
environment.
Step 1: - Create the Java project with the name "Sample WordCount" as
shown below:
File > New > Project > Java Project > Next.
Enter "Sample WordCount" as the project name and click "Finish".
Step 2: - The next step is to add references to the Hadoop libraries by clicking on Add JARs, as
follows.
Step 3: - Create a new package within the project with the name com.code.dezyre.
Step 4: - Now let's implement the WordCount example program by creating a
WordCount class under the package com.code.dezyre.


Step 5: - Create a Mapper class within the WordCount class which extends MapReduceBase
and implements the Mapper interface. The mapper class will contain the code for the "map"
method; the business logic of the mapper stage should be written within this method.
Program: -
Mapper Class Code for WordCount Example in Hadoop MapReduce,

// Required imports: java.io.IOException, java.util.Iterator, java.util.StringTokenizer,
// org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Step 6: - Create a Reducer class within the WordCount class which extends MapReduceBase
and implements the Reducer interface.

Reducer Class Code for WordCount Example in Hadoop MapReduce:
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Step 7: - Create main() method within the WordCount class and set the following
properties using the JobConf class.

1. OutputKeyClass.
2. OutputValueClass.
3. Mapper Class.
4. Reducer Class.
5. InputFormat.
6. OutputFormat.
7. InputFilePath.
8. OutputFolderPath.

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
} // closing brace of the WordCount class

Step 8: - Create the JAR file for the wordcount class


Output: -

>> hadoop fs -mkdir /user/cloudera/Input

>> hadoop fs -put war_and_peace /user/cloudera/Input/war_and_peace

>> hadoop jar (jar file name) (className_along_with_packageName) (input file) (output
folder path)

>> hadoop jar dezyre_wordcount.jar com.code.dezyre.WordCount
/user/cloudera/Input/war_and_peace /user/cloudera/Output
Conclusion: -
Hence we have successfully implemented Hadoop MapReduce program to compute
the total number of occurrences of each single word present in a text document.
Experiment No: - 05.

Aim: -
Write a Matrix multiplication program in MapReduce.

Theory: -
MapReduce is a technique in which a huge program is subdivided into small tasks that are run
in parallel to make computation faster and save time; it is mostly used in distributed systems. It
has two important parts:

Mapper: -
It takes raw data as input and organizes it into key-value pairs. For example, in a dictionary
you search for the word "Data" and its associated meaning is "facts and statistics collected
together for reference or analysis". Here the key is "Data" and the value associated with it is
"facts and statistics collected together for reference or analysis".

Reducer: -
It is responsible for processing data in parallel and producing final output.
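The program below computes the product C of a 2x3 matrix M and a 3x2 matrix N (the
dimensions are fixed by the variables m_r, m_c, n_r and n_c). Each element of the result
follows the usual definition, shown here for reference:

C[i][k] = sum over j of ( M[i][j] * N[j][k] )

The script reads its input (the matrix elements, one per tab-separated line) from standard
input, as scripts in a Hadoop Streaming job do, multiplies each pair of elements, and
accumulates the sums to form the result matrix.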
Program: -

#!/usr/bin/env python
import sys

m_r = 2   # rows of matrix M
m_c = 3   # columns of matrix M (= rows of matrix N)
n_r = 3
n_c = 2   # columns of matrix N
matrix = []
for row in range(m_r):
    r = []
    for col in range(n_c):
        s = 0
        for el in range(m_c):
            mul = 1
            for num in range(2):
                line = sys.stdin.readline()
                n = list(map(int, line.split('\t')))[-1]   # last field is the matrix element
                mul *= n
            s += mul
        r.append(s)
    matrix.append(r)
print('\n'.join([str(x) for x in matrix]))

Output: -

$ chmod +x ~/Desktop/mr/matrix-mul/Mapper.py
$ chmod +x ~/Desktop/mr/matrix-mul/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/matrices/ \
-output /user/cloudera/mat_output \
-mapper ~/Desktop/mr/matrix-mul/Mapper.py \
-reducer ~/Desktop/mr/matrix-mul/Reducer.py

Conclusion: -
Hence we have successfully implemented the Matrix multiplication
program in MapReduce.
Experiment No: - 06.

Aim: -
Implementing sorting algorithm in Map-Reduce style.

Theory: -

The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps
in sending the Map and Reduce tasks to the appropriate servers in a cluster.

Program: -

public class SecondarySortingTemperatureMapper
        extends Mapper<LongWritable, Text, TemperaturePair, NullWritable> {

    private TemperaturePair temperaturePair = new TemperaturePair();
    private NullWritable nullValue = NullWritable.get();
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String yearMonth = line.substring(15, 21);
        int tempStartPosition = 87;
        if (line.charAt(tempStartPosition) == '+') {
            tempStartPosition += 1;
        }
        int temp = Integer.parseInt(line.substring(tempStartPosition, 92));
        if (temp != MISSING) {
            temperaturePair.setYearMonth(yearMonth);
            temperaturePair.setTemperature(temp);
            context.write(temperaturePair, nullValue);
        }
    }
}
// compareTo() of the TemperaturePair composite key: order by year-month, then temperature
@Override
public int compareTo(TemperaturePair temperaturePair) {
    int compareValue = this.yearMonth.compareTo(temperaturePair.getYearMonth());
    if (compareValue == 0) {
        compareValue = temperature.compareTo(temperaturePair.getTemperature());
    }
    return compareValue;
}

public class TemperaturePartitioner extends Partitioner<TemperaturePair, NullWritable> {
    @Override
    public int getPartition(TemperaturePair temperaturePair, NullWritable nullWritable,
            int numPartitions) {
        return temperaturePair.getYearMonth().hashCode() % numPartitions;
    }
}
public class YearMonthGroupingComparator extends WritableComparator {

    public YearMonthGroupingComparator() {
        super(TemperaturePair.class, true);
    }

    @Override
    public int compare(WritableComparable tp1, WritableComparable tp2) {
        TemperaturePair temperaturePair = (TemperaturePair) tp1;
        TemperaturePair temperaturePair2 = (TemperaturePair) tp2;
        return temperaturePair.getYearMonth().compareTo(temperaturePair2.getYearMonth());
    }
}
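These classes are tied together in the job driver. A minimal sketch of that wiring (a fragment
only; imports, the job name and the reducer/input/output configuration are omitted here and
would be set as in the earlier experiments):

Job job = Job.getInstance(new Configuration(), "secondary-sort");
job.setJarByClass(SecondarySortingTemperatureMapper.class);
job.setMapperClass(SecondarySortingTemperatureMapper.class);
job.setMapOutputKeyClass(TemperaturePair.class);
job.setMapOutputValueClass(NullWritable.class);
// Partition by year-month so that all records of a month go to the same reducer.
job.setPartitionerClass(TemperaturePartitioner.class);
// Group by year-month so that one reduce() call sees all temperatures of a month,
// already ordered by the composite key's compareTo() (year-month, then temperature).
job.setGroupingComparatorClass(YearMonthGroupingComparator.class);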
Output: -

Conclusion: -
Hence we have successfully implemented a sorting algorithm in Map-Reduce style.
Experiment No: - 07.

Aim: -
Implementing DGIM algorithm using R language.

Theory: -

DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)

The DGIM algorithm is designed to count the number of 1's in a window over a bit stream. It
uses O(log² N) bits to represent a window of N bits, and allows the number of 1's in the window
to be estimated with an error of no more than 50%.

In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. The positions
are tracked relative to the window size N (the window sizes are usually taken as a multiple of 2).
The window is divided into buckets consisting of 1's and 0's.

RULES FOR FORMING THE BUCKETS: -

The right side of the bucket should always start with 1 (if it starts with a 0, it is
neglected). E.g. 1001011 is a bucket of size 4, having four 1's and starting with 1 at its
right end.
Every bucket should have at least one 1, else no bucket can be formed.
All bucket sizes should be powers of 2.
The buckets cannot decrease in size as we move to the left (they appear in
increasing order of size towards the left).
Let us take an example to understand the algorithm: estimating the number of 1's and
counting the buckets in the given data stream.
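The Program section below reproduces the worked example as a TikZ/LaTeX figure rather
than as R code. Purely as an illustration of the bucket bookkeeping described above, here is a
small sketch in Java (the class and method names are our own, and timestamps are kept as
absolute counters instead of being stored modulo N):

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of DGIM bucket bookkeeping (not production code). */
public class Dgim {

    /** A bucket stores the timestamp of its most recent 1 and its size (a power of 2). */
    private static class Bucket {
        long timestamp;
        int size;
        Bucket(long timestamp, int size) { this.timestamp = timestamp; this.size = size; }
    }

    private final int windowSize;                            // N
    private long time = 0;                                    // timestamp of the latest bit
    private final List<Bucket> buckets = new ArrayList<>();   // index 0 = newest bucket

    public Dgim(int windowSize) { this.windowSize = windowSize; }

    /** Process the next bit of the stream. */
    public void addBit(int bit) {
        time++;
        // Drop the oldest bucket once its timestamp falls outside the window.
        if (!buckets.isEmpty()
                && buckets.get(buckets.size() - 1).timestamp <= time - windowSize) {
            buckets.remove(buckets.size() - 1);
        }
        if (bit == 0) {
            return;                                           // 0's never start a bucket
        }
        buckets.add(0, new Bucket(time, 1));                  // new bucket of size 1
        // If three buckets of the same size exist, merge the two oldest of that size into
        // one bucket of twice the size; its timestamp is that of the later (newer) of the two.
        int i = 0;
        while (i + 2 < buckets.size()) {
            if (buckets.get(i).size == buckets.get(i + 2).size) {
                buckets.get(i + 1).size *= 2;
                buckets.remove(i + 2);
            } else {
                i++;
            }
        }
    }

    /** Estimate the 1's in the window: all bucket sizes, but only half of the oldest bucket. */
    public int countOnes() {
        int estimate = 0;
        for (Bucket b : buckets) {
            estimate += b.size;
        }
        if (!buckets.isEmpty()) {
            estimate -= buckets.get(buckets.size() - 1).size / 2;
        }
        return estimate;
    }
}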

Program: -

%!tikz editor 1.0


\documentclass{article}
\usepackage{tikz}
\usepackage[graphics, active, tightpage]{preview}
\PreviewEnvironment{tikzpicture}
%!tikz preamble begin

%!tikz preamble end


\begin{document}
%!tikz source begin
\begin{tikzpicture} [font=\sffamily\tiny]
\usetikzlibrary{matrix,shapes,arrows,fit}
\usetikzlibrary{matrix,positioning,fit,calc}
\pgfarrowsdeclare{arcs}{arcs}{...}
{
\pgfsetdash{}{0pt} % do not dash
\pgfsetroundjoin % fix join
\pgfsetroundcap % fix cap
\pgfpathmoveto{\pgfpoint{-10pt}{10pt}}
\pgfpatharc{180}{270}{10pt}
\pgfpatharc{90}{180}{10pt}
\pgfusepathqstroke
}
\draw[-arcs,line width=1pt] (-1.4,5.9) -- (-1.4,5.7);
\draw [line width=3pt, line cap=round, dash pattern=on 0pt off 2\pgflinewidth] (-2,6) --
(2.7,6);
\node[draw] at (-0.2,6.3) {Window Size = 40};
\draw [-to,shorten >=-1pt,gray,ultra thick] (4,-1) -- (3.5,-1.7);
\node[font=\fontsize{5}{5}\selectfont,anchor=north,text width=6cm] (note1) at (2,5.6)
{
Items enter the stream here
.};
\node[font=\fontsize{5}{5}\selectfont,anchor=north,text width=6cm] (note1) at (7,0) {
The difference between the latest\\ timestamp(105) and the oldest(65)\\ equals the
windows size(40). So\\ the oldest bucket is dropped
.};
\node[font=\fontsize{5}{5}\selectfont,anchor=north,text width=6cm] (note1) at (7,3) {
Combine any two adjacent buckets of\\ the same size, replace them by one\\ bucket of
twice the size. The timestamp\\ of the new bucket is the\\
timestamp of the rightmost (later in time)\\ of the two buckets.
.};
\matrix [matrix of nodes] (m)
{
End Time &100 & 98 & 95 & 92 &87 & 80 & 65\\ Size & 1 & 1 & 2 & 2 & 4 & 8 & 8\\
End Time &101 & 100 & 98 &95 & 92 &87 & 80 & 65\\ Size & 1 & 1 &1&
2 & 2 & 4 & 8 & 8\\
End Time &101 &100 & 95 & 92 & 87 & 80 &65\\
Size & 1 & 2 &2 & 2 & 4 & 8
& 8\\ End Time & 101 &100 &95
&87 & 80 &65\\ Size & 1 & 2
& 4 & 4 &8 & 8\\
End Time & 102 & 101 &100 & 95 &87 & 80 & 65\\ Size & 1 & 1&
2 & 4 & 4 & 8 & 8\\
End Time & 103 & 102 &101 &100 & 95 & 87 & 80 &65\\
Size & 1 & 1 & 1 & 2 & 4 & 4 & 8 & 8\\
End Time & 103 &102 & 100 & 95 & 87 & 80 & 65\\
Size & 1 & 2 & 2 & 4 & 4 & 8 & 8\\
End Time & 104 &103 & 102 &100 & 95 &87 &80 &65\\ Size & 1&
1 & 2 & 2 & 4 & 4 & 8 & 8\\
End Time & 105 &104 &103 &102 &100 & 95 & 87 & 80 & 65\\ Size &
1 & 1 & 1 & 2 & 2 & 4 & 4 & 8 & 8\\
End Time & 105 &104 &102 &100 & 95 & 87 &80\\ Size & 1 &2&
2 & 2 & 4 & 4 & 8 \\ End Time& 105 &104 &102 & 95 &87 & 80\\ Size &
1 & 2 & 4 & 4 & 4 & 8 \\
End Time & 105 & 104 &102 & 95 & 80\\ Size & 1 & 2& 4 &
8 & 8 \\
\\};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-1-2) (m-2-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-3-2) (m-4-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-5-2) (m-6-2)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-5-4) (m-6-4)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-6-4) (m-6-5)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-11-3) (m-12-3)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-12-3) (m-12-4)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-17-3) (m-18-3)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-18-3) (m-18-4)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-19-4) (m-20-4)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-20-4) (m-20-5)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-21-4) (m-22-4)
]{};
\node[draw=violet!99,font=\sffamily\footnotesize,inner sep=0pt,thick,rounded
corners,
fit=(m-22-4) (m-22-5)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-7-2) (m-8-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-9-2) (m-10-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-11-2) (m-12-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-13-2) (m-14-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-15-2) (m-16-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-17-2) (m-18-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-19-2) (m-20-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-21-2) (m-22-2)
]{};
\node[draw=cyan,font=\sffamily\footnotesize,inner sep=0pt,ultra
thick,dashed,rounded corners,
fit=(m-23-2) (m-24-2)
]{};

\node[draw=blue,font=\sffamily\footnotesize,inner sep=0pt,ultra thick,dashed,rounded


corners,
fit=(m-17-10) (m-18-10)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-1-3) (m-2-8)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-3-3) (m-4-9)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-5-3) (m-6-8)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-7-3) (m-8-7)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-9-3) (m-10-8)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-11-3) (m-12-9)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-13-3) (m-14-8)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-15-3) (m-16-9)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-17-3) (m-18-9)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-19-3) (m-20-8)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-21-3) (m-22-7)
]{};
\node[draw=orange,inner sep=0pt,rounded corners, fit=(m-23-3) (m-24-6)
]{};

\end{tikzpicture}
%!tikz source end

\end{document}
Output: -

Conclusion: -
Hence we have successfully implemented DGIM algorithm using R language.
Experiment No: - 08.

Aim: -
Implementing K means Clustering algorithm using Map-Reduce.

Theory: -

K Means algorithm

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K
pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs
to only one group. It tries to make the intra-cluster data points as similar as possible while
also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster
such that the sum of the squared distances between the data points and the cluster's
centroid (the arithmetic mean of all the data points that belong to that cluster) is at the
minimum. The less variation we have within clusters, the more homogeneous (similar) the
data points are within the same cluster.

The way the K-means algorithm works is as follows: -

Specify the number of clusters K.
Initialize centroids by first shuffling the dataset and then randomly selecting K data
points for the centroids without replacement.
Keep iterating until there is no change to the centroids, i.e. the assignment
of data points to clusters isn't changing.

Program: -

Mapper Function: -
// Map function: find the nearest center for each point
// Map phase input: <k1, v1>
//   k1 - line number
//   v1 - point (coordinates)
// Find the minimum-distance center for the point:
For each center
    Find the distance from the center to the point and keep the minimum
End for
// k2 - nearest center, v2 - point
Output: <k2, v2>

Reducer Function: -
// Compute new cluster centers
// Reduce phase input: <k2, List<v2>>
Calculate the mean value of the v2 points
new center point = mean value
// k3 - new center point, v3 - points
Output: <k3, v3>

Continue the process till the clusters have converged.
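As a complementary sketch of the same logic (plain in-memory Java rather than an actual
Hadoop job; the array layout and method name are our own choice), one iteration of the
assign-then-recompute step looks like this:

// One K-means iteration: assign each point to its nearest center (the "map" logic),
// then recompute each center as the mean of its assigned points (the "reduce" logic).
static double[][] kmeansStep(double[][] points, double[][] centers) {
    int k = centers.length, dim = centers[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    for (double[] p : points) {
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
            double dist = 0;
            for (int d = 0; d < dim; d++) {
                dist += (p[d] - centers[c][d]) * (p[d] - centers[c][d]);
            }
            if (dist < best) { best = dist; nearest = c; }
        }
        counts[nearest]++;
        for (int d = 0; d < dim; d++) {
            sums[nearest][d] += p[d];
        }
    }

    double[][] newCenters = new double[k][dim];
    for (int c = 0; c < k; c++) {
        for (int d = 0; d < dim; d++) {
            // Keep the old center if no point was assigned to it.
            newCenters[c][d] = (counts[c] == 0) ? centers[c][d] : sums[c][d] / counts[c];
        }
    }
    return newCenters;   // repeat until the centers stop changing
}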
Output: -
Conclusion: - Hence we have successfully implemented K means Clustering algorithm
using Map-Reduce.
Experiment No: -09.

Aim: -
Implementing Page Rank using Map-Reduce.

Theory: -

PageRank (PR) is an algorithm used by Google Search to rank websites in their search
engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank
is a way of measuring the importance of website pages. According to Google: 'PageRank works
by counting the number and quality of links to a page to determine a rough estimate of how
important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.'

Algorithm: -

The PageRank algorithm outputs a probability distribution used to represent the
likelihood that a person randomly clicking on links will arrive at any particular page. PageRank
can be calculated for collections of documents of any size. It is assumed in several research
papers that the distribution is evenly divided among all documents in the collection at the
beginning of the computational process. The PageRank computations require several passes,
called "iterations", through the collection to adjust approximate PageRank values to more
closely reflect the theoretical true value.

Simplified algorithm: -

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or
multiple outbound links from one single page to another single page, are ignored. PageRank is
initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank
over all pages was the total number of pages on the web at that time, so each page in this
example would have an initial value of 1. However, later versions of PageRank, and the
remainder of this section, assume a probability distribution between 0 and 1. Hence the initial
value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the
next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25
PageRank to A upon the next iteration, for a total of 0.75.
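Concretely, if pages B, C and D each have A as their only outbound link, then after one
iteration:

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) = 0.25/1 + 0.25/1 + 0.25/1 = 0.75

where L(p) denotes the number of outbound links of page p.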
Program: -

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import filepreprocess.HadoopDFSFileReadWrite;
import finalpagerank.FinalPageRankCalculator;

public class PageRank {


private static final transient Logger LOG = LoggerFactory.getLogger(PageRank.class);
private static volatile int roundNumber = 0;

// Consider there are only 5 nodes


public static final int TOTALNODES = 5;

// Counter
public enum MyCounter {
COUNTER;
}

// Some Pre - Processing of the Input file


static {
HadoopDFSFileReadWrite preprocessor = new HadoopDFSFileReadWrite();
String originalInputFile = "/pageRank/input/originalinput.txt";
String newInputFile = "/pageRank/input/pagerankinput.txt";
try{
preprocessor.preprocess(originalInputFile, newInputFile);
} catch(Exception e) {
LOG.info("Some Error In Reading the Input File");
LOG.info(e.getMessage());
System.exit(0);
} finally {
LOG.info("No Error In Reading the Input File");
// Proceed to the Map Reduce Job
}
}

public static void main(String[] args) throws Exception {


String inputPath = "/pageRank/input/pagerankinput.txt";
String outputPath = "/pageRank/outputs/output";
String finalPath = "/pageRank/finalOutput/finalOutput.txt";
Counter counter;

do {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);

deleteFolder(conf, outputPath + roundNumber);

LOG.info("Input : " + inputPath + " :: Output : " + outputPath);

myMapReduceTask(job, inputPath, outputPath + roundNumber);


inputPath = outputPath + roundNumber + "/part-r-00000";
roundNumber++;

// Configure the Counter


counter = job.getCounters().findCounter(MyCounter.COUNTER);

LOG.info("Counter Value : " + counter.getValue());


}
while(counter.getValue() > 0);
// The above loop executes til the time the Page ranks Stabilize

// Now calculate the sum of In Links for each node


FinalPageRankCalculator finalPageRankCalculator = new FinalPageRankCalculator();
finalPageRankCalculator.getFinalPageRank(
outputPath + (roundNumber - 1) + "/part-r-00000", finalPath);

LOG.info("Final Page Rank File Created");


LOG.info("Check the Final Output in the path /pageRank/finalOutput/finalOutput.txt");
}

private static void myMapReduceTask(Job job, String inputPath, String outputPath)


throws IOException, ClassNotFoundException, InterruptedException {
job.setJarByClass(PageRank.class);

// Set the Mapper Class


job.setMapperClass(PageRankMapper.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);

// Set the Reducer Class


job.setReducerClass(PageRankReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);

// Specify input and output Directories


FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));

// Condition to wait for the completion of MR Job

while(!job.waitForCompletion(true)) {}

return;
}

private static void deleteFolder(Configuration conf, String folderPath ) throws IOException {


// Delete the Folder
FileSystem fs = FileSystem.get(conf);
Path path = new Path(folderPath);
if(fs.exists(path)) {
fs.delete(path,true);
}
}
}

Input: -

1 0.0:0.06666666666666667:0.06666666666666667:0.06666666666666667:0.0:
2 0.0:0.0:0.0:0.0:0.2:
3 0.0:0.1:0.0:0.0:0.1:
4 0.1:0.0:0.1:0.0:0.0:
5 0.0:0.0:0.1:0.1:0.0:

Output: -

0 0.08951327267661183
1 0.16418882351680386
2 0.26865107113768866
3 0.17913779846107686
4 0.298509034207819

Conclusion: -

Page Rank using map-reduce has been implemented successfully.


Experiment No: -10.
Aim: -
Implementing Logistic regression analytics technique using Scilab.
Theory: -
Logistic regression is a predictive modelling algorithm that is used when the Y
variable is binary categorical. That is, it can take only two values like 1 or 0. The goal is
to determine a mathematical equation that can be used to predict the probability of
event 1. Once the equation is established, it can be used to predict the Y when only the
X’s are known.
In linear regression the Y variable is always a continuous variable. If, however, the
Y variable is categorical, you cannot use linear regression to model it. So what would
you do when Y is a categorical variable with 2 classes? Logistic regression can be
used to model and solve such problems, also called binary classification problems. A
key point to note here is that Y can have 2 classes only and not more than that. If Y has
more than 2 classes, it would become a multi-class classification and you can no longer
use vanilla logistic regression for that. Yet, logistic regression is a classic predictive
modelling technique and still remains a popular choice for modelling binary categorical
variables. Another advantage of logistic regression is that it computes a prediction
probability score of an event. More on that when you actually start building the models.
Here are some examples of binary classification problems:
 Spam Detection: Predicting if an email is spam or not.
 Credit Card Fraud: Predicting if a given credit card transaction is fraudulent or not.
 Health: Predicting if a given mass of tissue is benign or malignant.
 Marketing: Predicting if a given user will buy an insurance product or not.
 Banking: Predicting if a customer will default on a loan.

Logistic Regression Model: -


The logistic regression model takes real-valued inputs and makes a prediction as
to the probability of the input belonging to the default class (class 0). If the probability is
> 0.5 we can take the output as a prediction for the default class (class 0), otherwise the
prediction is for the other class (class 1). For this dataset, the logistic regression has
three coefficients just like linear regression, for example:
output = b0 + b1*x1 + b2*x2
The job of the learning algorithm will be to discover the best values for the
coefficients (b0, b1 and b2) based on the training data. Unlike linear regression, the
output is transformed into a probability using the logistic function:
p(class=0) = 1 / (1 + e^(-output))
This can also be written as: p(class=0) = 1 / (1 + EXP(-output))
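As a purely illustrative calculation with hypothetical coefficients (not values learned from the
dataset used below): if b0 = -10, b1 = 0.5 and b2 = 0.5, then for an input x1 = 10, x2 = 12,

output = -10 + 0.5*10 + 0.5*12 = 1
p(class=0) = 1 / (1 + e^(-1)) ≈ 0.73

Since 0.73 > 0.5, the prediction for this input would be the default class (class 0).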

Program: -

Install Scilab (on Ubuntu):

$ sudo apt-get install scilab

b0 = 10;
t = b0 * rand(100,2);
t = [t 0.5+0.5*sign(t(:,2)+t(:,1)-b0)];
b = 1;
flip = find(abs(t(:,2)+t(:,1)-b0)<b);
t(flip,$)=grand(length(t(flip,$)),1,"uin",0,1);
t0 = t(find(t(:,$)==0),:);
t1 = t(find(t(:,$)==1),:);
clf(0);scf(0);
plot(t0(:,1),t0(:,2),'bo')
plot(t1(:,1),t1(:,2),'rx')
x = t(:, 1:$-1); y = t(:, $);
[m, n] = size(x);

x = [ones(m, 1) x];          // add the intercept column
// Initialize fitting parameters
theta = zeros(n + 1, 1);
// Learning rate and number of iterations
a = 0.01;
n_iter = 10000;
// Batch gradient descent on the logistic (cross-entropy) cost
for iter = 1:n_iter do
    z = x * theta;
    h = ones(z) ./ (1+exp(-z));                      // sigmoid: predicted probabilities
    theta = theta - a * x' *(h-y) / m;               // gradient step
    J(iter) = (-y' * log(h) - (1-y)' * log(1-h))/m;  // cost at this iteration
end
disp(theta)
u = linspace(min(x(:,2)),max(x(:,2)));
clf(1);scf(1);
plot(t0(:,1),t0(:,2),'bo')
plot(t1(:,1),t1(:,2),'rx')
plot(u,-(theta(1)+theta(2)*u)/theta(3),'-g')

Output: -
Conclusion: -
Hence we have successfully implemented the Logistic regression analytics
technique using Scilab.
