BDA Manual
Aim: -
To study Big Data.
Theory: -
BIG DATA: -
Big data, as the name suggests, is a collection of large datasets that cannot be
processed using traditional computing techniques. Big data is not merely data; rather, it has
become a complete subject, which involves various tools, technologies and frameworks.
Google’s Solution: -
Google solved this problem using an algorithm called MapReduce. This algorithm divides
the task into small parts and assigns those parts to many computers connected over the
network, and collects the results to form the final result dataset.
The diagram above shows various commodity hardware machines, which could be single-CPU
machines or servers with higher capacity.
What is MapReduce?
The Algorithm: -
Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage.
Map stage : -
The map or mapper's job is to process the input data. Generally, the input
data is in the form of a file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
Reduce stage : -
This stage is the combination of the shuffle stage and the reduce stage.
The reducer's job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with the data on local disks, which reduces
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
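For instance, a word-count job over the input line "to be or not to be" would pass through
the three stages roughly as follows (an illustrative trace, not taken from this manual):
Map stage output:     (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
Shuffle stage output: (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
Reduce stage output:  (be,2) (not,1) (or,1) (to,2)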
Conclusion: -
HADOOP: -
Hadoop is an Apache open-source framework, written in Java, that allows big data to be stored
and processed in a distributed environment across clusters of computers using simple
programming models. A Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale
up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture: -
The Hadoop framework includes the following four modules:
Hadoop Common: -
These are the Java libraries and utilities required by other Hadoop modules. These libraries
provide filesystem- and OS-level abstractions and contain the necessary Java files and scripts
required to start Hadoop.
Hadoop YARN: -
This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): -
A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: -
This is a YARN-based system for parallel processing of large datasets.
We can use the following diagram to depict these four components of the Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned
above but also to the collection of additional software packages that can be installed on top of
or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
MapReduce: -
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
The Map Task: -
This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: -
This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after
the map task.
Typically, both the input and the output are stored in a file system. The framework takes
care of scheduling tasks, monitoring them and re-executing failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster node. The master is responsible for resource management, tracking
resource consumption/availability, and scheduling the job's component tasks on the slaves,
monitoring them and re-executing failed tasks. The slave TaskTrackers execute the tasks as
directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which
means that if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System: -
Hadoop can work directly with any mountable distributed file system such as Local FS,
HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop
Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture, where the master consists of a single NameNode
that manages the file system metadata, and one or more slave DataNodes store the actual
data.
A file in an HDFS namespace is split into several blocks, and those blocks are stored in a
set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes take care of read and write operations with the file system. They also take care of
block creation, deletion and replication based on the instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to
interact with the file system. These shell commands will be covered in a separate chapter along
with appropriate examples.
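As an illustration of how an application can also interact with HDFS programmatically, here is
a minimal sketch using the Hadoop FileSystem API (this sketch is not part of the manual; it
assumes the cluster configuration, i.e. fs.defaultFS, is available on the classpath, and the
paths used are only examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create a directory and write a small file (the NameNode records the
        // metadata; the file's blocks are stored on the DataNodes).
        Path dir = new Path("/user/demo");
        fs.mkdirs(dir);
        FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();

        // List the directory contents.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        }
        fs.close();
    }
}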
How Does Hadoop Work?
Stage 1: -
A user/application can submit a job to Hadoop (via the Hadoop job client) for the required
processing by specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a JAR file, containing the implementation of the map and
reduce functions.
3. The job configuration by setting different parameters specific to the job.
Stage 2: -
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the
JobTracker, which then assumes the responsibility of distributing the software/configuration
to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic
information to the job client.
Stage 3: -
The TaskTrackers on different nodes execute the tasks as per the MapReduce
implementation, and the output of the reduce function is stored in output files on the file
system.
Diagram: Hadoop framework
Advantages of Hadoop: -
Hadoop allows the user to quickly write and test distributed systems. It automatically
distributes the data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and high availability; the Hadoop
library itself is designed to detect and handle failures at the application layer.
Servers can be added to or removed from the cluster dynamically, and Hadoop continues to
operate without interruption.
Apart from being open source, Hadoop is compatible with all platforms since it is Java based.
Aim: -
To study NoSQL databases.
Theory: -
What is NoSQL?
NoSQL (often read as "not only SQL") refers to non-relational databases that provide a
mechanism for the storage and retrieval of data modeled in forms other than the tabular
relations used in relational databases.
Why NoSQL?
In today's time, data is becoming easier to access and capture through third parties
such as Facebook, Google+ and others. Personal user information, social graphs, geolocation
data, user-generated content and machine logging data are just a few examples where data
has been increasing exponentially. Serving such applications properly requires the processing
of huge amounts of data, which SQL databases were never designed for. NoSQL databases
evolved to handle this huge data properly.
The term NoSQL was coined by Carlo Strozzi in the year 1998. He used this term to
name his open-source, lightweight database which did not have an SQL interface.
Later, in 2009, at the "no:sql(east)" conference held in Atlanta, USA, NoSQL was
discussed and debated extensively.
From then on, the discussion and practice of NoSQL gained momentum, and NoSQL saw
unprecedented growth.
The NoSQL term can be applied to some databases that predated the relational
database management system, but it more commonly refers to the databases built in the
early 2000s for the purpose of large-scale database clustering in cloud and web applications.
In these applications, requirements for performance and scalability outweighed the need for
the immediate, rigid data consistency that the RDBMS provided to transactional enterprise
applications.
Notably, the NoSQL systems were not required to follow an established relational
schema. Large-scale web organizations such as Google and Amazon used NoSQL databases
to focus on narrow operational goals and employ relational databases as adjuncts where
high-grade data consistency is necessary.
Early NoSQL databases for web and cloud applications tended to focus on very
specific characteristics of data management. The ability to process very large volumes of
data and quickly distribute that data across computing clusters were desirable traits in web
and cloud design. Developers who implemented cloud and web systems also looked to create
flexible data schema -- or no schema at all -- to better enable fast changes to applications that
were continually updated.
Evolution of NoSQL: -
Berkeley DB was an influential system in the early evolution of NoSQL database usage.
Developed at the University of California, Berkeley, beginning in the 1990s, Berkeley DB was
widely described as an embedded database that closely supported specific applications'
storage needs. This open source software provided a simple key-value store. Berkeley DB
was commercially released by Sleepycat Software in 1999. The company was later acquired
by Oracle in 2006. Oracle has continued to support open source Berkeley DB. Other NoSQL
databases that have gained prominence include cloud-hosted NoSQL databases such as
Amazon DynamoDB and Google BigTable, as well as Apache Cassandra and MongoDB.
The basic NoSQL database classifications are only guides. Over time, vendors have
mixed and matched elements from different NoSQL database family trees to achieve more
generally useful systems. That evolution is seen, for example, in MarkLogic, which has added
a graph store and other elements to its original document databases. Couchbase Server
supports both key-value and document approaches. Cassandra has combined key-value
elements with a wide-column store and a graph database. Sometimes NoSQL elements are
mixed with SQL elements, creating a variety of databases that are referred to as multimodel
databases.
You must understand the CAP theorem when you talk about NoSQL databases, or in
fact when designing any distributed system. The CAP theorem states that there are three basic
requirements which exist in a special relation when designing applications for a distributed
architecture.
Consistency: -
This means that the data in the database remains consistent after the execution of an
operation. For example after an update operation all clients see the same data.
Availability: -
This means that the system is always on (service guarantee availability), with no
downtime: every request receives a response, whether it succeeded or failed.
Partition Tolerance: -
This means that the system continues to function even if the communication among
the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot
communicate with one another.
CA: - Single site cluster, therefore all nodes are always in contact.
CP: -Some data may not be accessible, but the rest is still consistent/accurate.
AP: - System is still available under partitioning, but some of the data returned may be
inaccurate.
Advantages of NoSQL: -
High scalability
Distributed computing
Lower cost
Schema flexibility
Semi-structured data
No complicated relationships
Disadvantages of NoSQL: -
No standardization
Limited query capabilities
There are four general types (most common categories) of NoSQL databases. Each of
these categories has its own specific attributes and limitations. There is not a single solution
that is better than all the others; however, there are some databases that are better suited to
solving specific problems.
To clarify the NoSQL databases, let's discuss the most common categories:
Key-value stores
Column-oriented databases
Graph databases
Document-oriented databases
1. Key-value Stores: -
In a key-value store, data is stored as a collection of key-value pairs, and a unique key is
used to store and retrieve its associated value. Examples: Redis, Dynamo, Riak, etc.
Pictorial Presentation: -
2. Column-oriented Databases: -
Column-oriented databases store data in column families rather than in rows, which makes
reading and aggregating large volumes of a particular column very efficient. Examples:
BigTable, Cassandra, SimpleDB, etc.
Pictorial Presentation: -
3. Graph Databases: -
A graph data structure consists of a finite (and possibly mutable) set of ordered pairs,
called edges or arcs, of certain entities called nodes or vertices.
Pictorial Presentation: -
4. Document-Oriented Databases: -
A collection of documents
Data in this model is stored inside documents.
A document is a key value collection where the key allows access to its value.
Documents are not typically forced to have a schema and therefore are flexible and
easy to change.
Documents are stored into collections in order to group different kinds of data.
Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents.
MongoDB
CouchDB etc.
Pictorial Presentation :
SQL Terms/Concepts                              MongoDB Terms/Concepts
Database                                        database
Table                                           collection
Column                                          field
Index                                           index
Primary key (specify any unique column          In MongoDB, the primary key is automatically
or column combination as the primary key)       set to the _id field.
aggregation (e.g. group by)                     aggregation pipeline (see the SQL to
                                                Aggregation Mapping Chart)
SELECT INTO NEW_TABLE                           $out
Transactions                                    transactions
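As a small illustration of this mapping (a hedged sketch, not part of the manual: it assumes
the MongoDB Java driver is on the classpath and a mongod instance is running locally on the
default port), the SQL database/table/column concepts appear as database, collection and
fields, and MongoDB fills in the _id primary key automatically:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoMappingExample {
    public static void main(String[] args) {
        // Connect to a local mongod (illustrative connection string).
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("test");               // SQL: database
        MongoCollection<Document> users = db.getCollection("users"); // SQL: table
        Document doc = new Document("name", "Alice")                 // SQL: columns -> fields
                .append("age", 30);
        users.insertOne(doc);                                        // SQL: INSERT
        // MongoDB adds the primary key automatically as the _id field.
        System.out.println("Inserted _id: " + doc.getObjectId("_id"));
        client.close();
    }
}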
Conclusion: -
Aim: -
Write a Hadoop MapReduce program to compute the total number of occurrences of
each word present in a text document.
Theory: -
How does it work?
The Hadoop WordCount operation occurs in three stages:
Mapper Phase
Shuffle Phase
Reducer Phase
Now, let's create the WordCount Java project with the Eclipse IDE for Hadoop. Even if
you are working on the Cloudera VM, creating the Java project can be applied to any
environment.
Step 1: - Let's create the Java project with the name "Sample WordCount" as
shown below,
File > New > Project > Java Project > Next.
Enter "Sample WordCount" as the project name and click "Finish".
Step 2: - The next step is to get references to the Hadoop libraries by clicking on "Add JARs"
as follows,
Step 3: - Create a new package within the project with the name com.code.dezyre.
Step 4: - Now let's implement the WordCount example program by creating the Mapper class
as follows:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
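The driver shown further below references a Reduce class whose listing is not reproduced in
this manual; a minimal sketch, consistent with the old mapred API used above, could be:

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            // Add up the 1s emitted by the mapper for this word.
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }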
Next, create the driver (main) method, which configures and submits the job by specifying the
following:
1. OutputKeyClass
2. OutputValueClass
3. Mapper Class
4. Reducer Class
5. InputFormat
6. OutputFormat
7. InputFilePath
8. OutputFolderPath
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        // conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Input file path and output folder path (items 7 and 8 above) are
        // taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
>> hadoop jar <jar file name> <class name along with package name> <input file path> <output
folder path>
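For example, assuming the project has been exported as WordCount.jar and the input file has
already been copied into HDFS (the jar name and paths below are only illustrative):
>> hadoop jar WordCount.jar com.code.dezyre.WordCount /user/cloudera/wordcount/input.txt /user/cloudera/wordcount/output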
Aim: -
Write a Matrix multiplication program in MapReduce.
Theory: -
MapReduce is a technique in which a huge program is subdivided into small tasks that run in
parallel, making computation faster and saving time; it is mostly used in distributed systems. It
has two important parts:
Mapper: -
It takes the raw input data and organizes it into key/value pairs. For example, in a dictionary
you search for the word "Data" and its associated meaning is "facts and statistics collected
together for reference or analysis". Here the key is "Data" and the value associated with it is
"facts and statistics collected together for reference or analysis".
Reducer: -
It is responsible for processing data in parallel and producing final output.
Program: -
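The program listing is not reproduced in this copy of the manual; the following is a minimal
illustrative sketch of the standard one-step MapReduce matrix multiplication. It assumes input
lines of the form "A,i,k,value" or "B,k,j,value", and that the matrix dimensions (A is m x n,
B is n x p) are passed through the job Configuration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

    public static class MatrixMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int p = Integer.parseInt(conf.get("p")); // columns of B
            int m = Integer.parseInt(conf.get("m")); // rows of A
            String[] t = value.toString().split(",");
            if (t[0].equals("A")) {
                // A[i][k] is needed for every output cell (i, j)
                for (int j = 0; j < p; j++) {
                    context.write(new Text(t[1] + "," + j),
                                  new Text("A," + t[2] + "," + t[3]));
                }
            } else {
                // B[k][j] is needed for every output cell (i, j)
                for (int i = 0; i < m; i++) {
                    context.write(new Text(i + "," + t[2]),
                                  new Text("B," + t[1] + "," + t[3]));
                }
            }
        }
    }

    public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int n = Integer.parseInt(conf.get("n")); // columns of A = rows of B
            double[] a = new double[n];
            double[] b = new double[n];
            for (Text v : values) {
                String[] t = v.toString().split(",");
                int k = Integer.parseInt(t[1]);
                if (t[0].equals("A")) a[k] = Double.parseDouble(t[2]);
                else                  b[k] = Double.parseDouble(t[2]);
            }
            double sum = 0;
            for (int k = 0; k < n; k++) sum += a[k] * b[k];
            context.write(key, new Text(Double.toString(sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Matrix sizes: A is m x n, B is n x p (illustrative values).
        conf.set("m", "2");
        conf.set("n", "2");
        conf.set("p", "2");
        Job job = Job.getInstance(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each reducer key (i,j) receives the row of A and the column of B that it needs, multiplies the
matching k entries, and sums them to produce one cell of the result matrix.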
Output: -
Conclusion: -
Hence we have successfully implemented the Matrix multiplication
program in MapReduce.
Experiment No: - 06.
Aim: -
Implementing sorting algorithm in Map-Reduce style.
Theory: -
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.
Program: -
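The program listing is not reproduced in this copy of the manual; a minimal illustrative sketch
of MapReduce-style sorting is given below. It emits each input number as a key and relies on
the shuffle phase, which delivers keys to the reducer in sorted order (with the default single
reducer the output file is therefore globally sorted):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {

    // Mapper: parse each input line as an integer and emit it as the key.
    public static class SortMapper
            extends Mapper<Object, Text, IntWritable, NullWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            int number = Integer.parseInt(value.toString().trim());
            context.write(new IntWritable(number), NullWritable.get());
        }
    }

    // Reducer: keys arrive already sorted by the shuffle, so just write them out.
    public static class SortReducer
            extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
        @Override
        public void reduce(IntWritable key, Iterable<NullWritable> values,
                Context context) throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}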
Conclusion: -
Hence we have successfully implemented sorting algorithm in Map-Reduce style.
Experiment No: - 07.
Aim: -
Implementing DGIM algorithm using R language.
Theory: -
The DGIM algorithm is designed to count the number of 1's in a data stream. It uses O(log² N)
bits to represent a window of N bits, and allows the number of 1's in the window to be
estimated with an error of no more than 50%.
In the DGIM algorithm, each bit that arrives has a timestamp for the position at which it
arrives: the first bit has timestamp 1, the second bit has timestamp 2, and so on. The positions
are tracked relative to the window size N (window sizes are usually taken as a multiple of 2).
The window is divided into buckets consisting of 1's and 0's.
The right side of a bucket should always start with a 1 (if it starts with a 0, it is
neglected). E.g. 1001011 → a bucket of size 4, having four 1's and starting with a 1 at its
right end.
Every bucket should have at least one 1, else no bucket can be formed.
All bucket sizes should be a power of 2.
Bucket sizes cannot decrease as we move to the left (they are in non-decreasing order
towards the left, i.e. towards the older end of the window).
Let us take an example to understand the algorithm: estimating the number of 1's and
counting the buckets in a given data stream.
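Since the pictorial worked example is not reproduced in this copy, the following short sketch
illustrates the bucket rules and the resulting estimate. It is written in Java for consistency
with the other experiments (the aim specifies R, so treat this purely as an illustration of the
algorithm, not as the manual's listing):

import java.util.ArrayList;
import java.util.List;

public class Dgim {
    // Each bucket stores the timestamp of its most recent 1 and its size (a power of 2).
    private static class Bucket {
        long timestamp; long size;
        Bucket(long t, long s) { timestamp = t; size = s; }
    }

    private final int windowSize;                            // N
    private long time = 0;
    private final List<Bucket> buckets = new ArrayList<>();  // index 0 = newest

    public Dgim(int windowSize) { this.windowSize = windowSize; }

    public void addBit(int bit) {
        time++;
        // Discard the oldest bucket once its timestamp leaves the window.
        if (!buckets.isEmpty()
                && buckets.get(buckets.size() - 1).timestamp <= time - windowSize) {
            buckets.remove(buckets.size() - 1);
        }
        if (bit == 0) return;
        buckets.add(0, new Bucket(time, 1));                 // new bucket of size 1
        // Enforce the rule: at most two buckets of each size are allowed.
        int i = 0;
        while (i + 2 < buckets.size()
                && buckets.get(i).size == buckets.get(i + 2).size) {
            // Three buckets of the same size: merge the two OLDEST of them into
            // one bucket of double size, keeping the more recent timestamp.
            buckets.get(i + 1).size += buckets.get(i + 2).size;
            buckets.remove(i + 2);
            i++;                                             // the merge may cascade
        }
    }

    // Estimate: the sizes of all buckets except the oldest, plus half of the oldest.
    public long countOnes() {
        long total = 0;
        for (Bucket b : buckets) total += b.size;
        if (!buckets.isEmpty()) total -= buckets.get(buckets.size() - 1).size / 2;
        return total;
    }

    public static void main(String[] args) {
        Dgim dgim = new Dgim(16);
        int[] stream = {1,0,1,1,0,1,1,1,0,1,0,1,1,0,1,1}; // exact count of 1's is 11
        for (int b : stream) dgim.addBit(b);
        System.out.println("Estimated 1's in the last 16 bits: " + dgim.countOnes());
    }
}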
Program: -
Output: -
Conclusion: -
Hence we have successfully implemented DGIM algorithm using R language.
Experiment No: - 08.
Aim: -
Implementing K means Clustering algorithm using Map-Reduce.
Theory: -
K Means algorithm
The K-means algorithm is an iterative algorithm that tries to partition the dataset into
K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs
to only one group. It tries to make the intra-cluster data points as similar as possible while
also keeping the clusters as different (far) as possible. It assigns data points to a cluster
such that the sum of the squared distance between the data points and the cluster’s
centroid (arithmetic mean of all the data points that belong to that cluster) is at the
minimum. The less variation we have within clusters, the more homogeneous (similar) the
data points are within the same cluster.
Program: -
Mapper function: -
// Map phase: find the nearest center for each input point.
Input: <k1, v1>
    k1 - line number
    v1 - point (coordinates)
Output: <k2, v2>
    k2 - nearest center
    v2 - point (coordinates)
Reducer function: -
// Reduce phase: compute the new cluster centers.
Input: <k2, List<v2>>
    Calculate the mean value of the v2 points; the mean becomes the new center point.
Output: <k3, v3>
    k3 - new center point
    v3 - points assigned to that center
Continue the process till the clusters converge.
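A minimal illustrative sketch of one such iteration in Hadoop MapReduce is given below (this is
not the manual's original listing; it assumes input lines of the form "x,y" and that the current
centers are passed through the Configuration as "x1,y1;x2,y2;..."):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIteration {

    // Map: assign each point to the nearest current center.
    public static class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
        private double[][] centers;

        @Override
        protected void setup(Context context) {
            String[] parts = context.getConfiguration().get("centers").split(";");
            centers = new double[parts.length][];
            for (int i = 0; i < parts.length; i++) {
                String[] xy = parts[i].split(",");
                centers[i] = new double[]{Double.parseDouble(xy[0]),
                                          Double.parseDouble(xy[1])};
            }
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] xy = value.toString().split(",");
            double x = Double.parseDouble(xy[0]);
            double y = Double.parseDouble(xy[1]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centers.length; i++) {
                double dx = x - centers[i][0], dy = y - centers[i][1];
                double d = dx * dx + dy * dy;          // squared distance
                if (d < bestDist) { bestDist = d; best = i; }
            }
            context.write(new IntWritable(best), value);
        }
    }

    // Reduce: the mean of the points assigned to a center becomes the new center.
    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sumX = 0, sumY = 0;
            long count = 0;
            for (Text v : values) {
                String[] xy = v.toString().split(",");
                sumX += Double.parseDouble(xy[0]);
                sumY += Double.parseDouble(xy[1]);
                count++;
            }
            context.write(key, new Text((sumX / count) + "," + (sumY / count)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Initial centers (illustrative values); in practice these come from
        // the previous iteration's output, and the job is rerun until the
        // centers stop changing.
        conf.set("centers", "1.0,1.0;5.0,5.0");
        Job job = Job.getInstance(conf, "KMeansIteration");
        job.setJarByClass(KMeansIteration.class);
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}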
Output: -
Conclusion: - Hence we have successfully implemented K means Clustering algorithm
using Map-Reduce.
Experiment No: -09.
Aim: -
Implementing Page Rank using Map-Reduce.
Theory: -
PageRank (PR) is an algorithm used by Google Search to rank websites in their search
engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank
is a way of measuring the importance of website pages. According to Google:'PageRank works
by counting the number and quality of links to a page to determine a rough estimate of how
important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.'
Algorithm: -
Simplified algorithm: -
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or
multiple outbound links from one single page to another single page, are ignored. PageRank is
initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank
over all pages was the total number of pages on the web at that time, so each page in this
example would have an initial value of 1. However, later versions of PageRank, and the
remainder of this section, assume a probability distribution between 0 and 1. Hence the initial
value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the
next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25
PageRank to A upon the next iteration, for a total of 0.75.
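A quick numerical check of this example (an illustrative snippet, not part of the manual's
program):

public class SimplePageRankExample {
    public static void main(String[] args) {
        // Initial ranks of pages B, C and D (probability distribution, 0.25 each).
        double prB = 0.25, prC = 0.25, prD = 0.25;
        // B, C and D each have a single outbound link, pointing to A, so each
        // transfers its full PageRank to A on the next iteration.
        double prA = prB / 1 + prC / 1 + prD / 1;
        System.out.println("PR(A) after one iteration = " + prA); // prints 0.75
    }
}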
Program: -
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
// Project-specific helper classes (their sources are not included in this listing).
import filepreprocess.HadoopDFSFileReadWrite;
import finalpagerank.FinalPageRankCalculator;
// Counter
public enum MyCounter {
COUNTER;
}
// Excerpt from the driver: a MapReduce job is launched for each PageRank iteration
// (the enclosing method, the job setup and the loop's convergence condition are
// not reproduced in this listing).
do {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    // ... job setup (not shown in this excerpt) ...
    while (!job.waitForCompletion(true)) {}
    return;
}
// while (the ranks have not converged);
Input: -
1 0.0:0.06666666666666667:0.06666666666666667:0.06666666666666667:0.0:
2 0.0:0.0:0.0:0.0:0.2:
3 0.0:0.1:0.0:0.0:0.1:
4 0.1:0.0:0.1:0.0:0.0:
5 0.0:0.0:0.1:0.1:0.0:
Output: -
0 0.08951327267661183
1 0.16418882351680386
2 0.26865107113768866
3 0.17913779846107686
4 0.298509034207819
Conclusion: -
Hence we have successfully implemented Page Rank using Map-Reduce.
Aim: -
Implementing Logistic Regression analytics technique using Scilab.
Program: -
// Generate 100 random 2-D points uniformly in the square [0, b0] x [0, b0]
b0 = 10;
t = b0 * rand(100,2);
// Label each point 1 if it lies above the line x1 + x2 = b0, otherwise 0
t = [t 0.5+0.5*sign(t(:,2)+t(:,1)-b0)];
// Add noise: points close to the boundary (|x1 + x2 - b0| < b) get a random label
b = 1;
flip = find(abs(t(:,2)+t(:,1)-b0)<b);
t(flip,$)=grand(length(t(flip,$)),1,"uin",0,1);
// Split the points by class and plot them
t0 = t(find(t(:,$)==0),:);
t1 = t(find(t(:,$)==1),:);
clf(0);scf(0);
plot(t0(:,1),t0(:,2),'bo')
plot(t1(:,1),t1(:,2),'rx')
// Separate the features and labels, and add an intercept column of ones
x = t(:, 1:$-1); y = t(:, $);
[m, n] = size(x);
x = [ones(m, 1) x];
// Initialize fitting parameters
theta = zeros(n + 1, 1);
// Learning rate and number of iterations
a = 0.01;
n_iter = 10000;
// Batch gradient descent
for iter = 1:n_iter do
    z = x * theta;
    // Sigmoid hypothesis h = 1 / (1 + exp(-z))
    h = ones(z) ./ (1+exp(-z));
    // Gradient step
    theta = theta - a * x' *(h-y) / m;
    // Cross-entropy cost, stored to monitor convergence
    J(iter) = (-y' * log(h) - (1-y)' * log(1-h))/m;
end
disp(theta)
// Plot the data together with the fitted decision boundary
// theta(1) + theta(2)*x1 + theta(3)*x2 = 0
u = linspace(min(x(:,2)),max(x(:,2)));
clf(1);scf(1);
plot(t0(:,1),t0(:,2),'bo')
plot(t1(:,1),t1(:,2),'rx')
plot(u,-(theta(1)+theta(2)*u)/theta(3),'-g')
Output: -
Conclusion: -
Hence we have successfully implemented the Logistic Regression analytics technique
using Scilab.