Big Data Analytics
The volume of data that organizations have to deal with has exploded in
the past decade, and at the same time the price of data storage
has steadily fallen. Private companies and research institutions
capture terabytes of data about their users' interactions, business, social
media, and also from sensors in devices such as mobile phones and
automobiles. The challenge of this era is to make sense of this sea of data.
This is where big data analytics comes into the picture.
Big Data Analytics largely involves collecting data from different sources,
munging it so that it can be consumed by analysts,
and finally delivering data products that are useful to the organization's business.
Evaluation − At this stage in the project, you have built a model (or models)
that appears to have high quality, from a data analysis perspective. Before
proceeding to final deployment of the model, it is important to evaluate the
model thoroughly and review the steps executed to construct the model, to be
certain it properly achieves the business objectives.
Deployment − Creation of the model is generally not the end of the project.
Even if the purpose of the model is to increase knowledge of the data, the
knowledge gained will need to be organized and presented in a way that is
useful to the customer.
Diagnostic analytics:
It determines what has happened in the past and why. It can be used to assess
the number of posts, likes, and reviews in a social media marketing campaign.
Predictive analytics:
It works by identifying patterns in historical data and then using statistics
to make inferences about the future.
Prescriptive analytics:
It doesn’t only say what is going on, but also what might happen and most
importantly what to do about it.
Descriptive analytics:
It uses existing information from the past to understand decisions in the present and
helps decide on an effective course of action in the future.
When analyzing data, you can categorize big data analytics as follows:
1 Basic analytics
2 Advanced analytics
3 Operational analytics
4 Monetized analytics
Advanced analytics:
Operational analytics:
Monetized analytics:
It helps businesses make important and better decisions and earn revenue.
Ex: credit card providers and telecommunication companies can use this data to
offer value-added products.
Importance of Big data
Big Data is defined as conforming to the volume, velocity, and variety (V3)
attributes that characterize it. Big Data solutions aren’t a replacement for existing
warehouse solutions. Big Data solutions are ideal for analyzing not only raw
structured data, but also semistructured and unstructured data from a wide variety of
sources. They are ideal when all, or most, of the data needs to be
analyzed versus a sample of the data, or when a sampling of data isn’t nearly as
effective as a larger set of data from which to derive analysis. They
are also ideal for iterative and exploratory analysis when business measures on data
are not predetermined.
When it comes to solving information management
challenges using Big Data technologies, we suggest you consider the following:
• A Big Data solution is not only going to leverage data not typically suitable for a
traditional warehouse environment, and in massive volumes, but it’s
going to give up some of the formalities and “strictness” of the data. The benefit is
that you can preserve the fidelity of the data and gain access to mountains of
information for exploration and discovery of business insights before running it
through the due diligence that you’re accustomed to. The data can then be included
as a participant in a cyclic system, enriching the models in the warehouse.
• Big Data is well suited for solving information challenges that don’t natively fit within
a traditional relational database approach for handling the problem at hand.
It’s important to understand that conventional database technologies are an
important, and relevant, part of an overall analytic solution. In fact, they become
even more vital when used in conjunction with your Big Data platform. A good
analogy here is your left and right hands; each offers individual strengths and
optimizations for a task at hand.
There exists a class of problems that don’t
natively belong in traditional databases, at least not at first. And there’s data that
we’re not sure we want in the warehouse, perhaps because we don’t know if it’s
rich in value, it’s unstructured, or it’s too voluminous. In many cases, we can’t find
out the value per byte of the data until after we spend the effort and money to put it
into the warehouse; but we want to be sure that the data is worth saving and has a high
value per byte before investing in it.
Matrix-vector multiplication with MapReduce
MapReduce is a high-level programming model for processing large data sets in parallel, originally
developed by Google and adapted from functional programming. The model is suitable for a range of
problems such as matrix operations, relational algebra, and statistical frequency counting.
The MapReduce implementation in this example differs from the school-book
multiplication that I just introduced. A single map function will be processing only a single
matrix element rather than the whole row.
There are different implementations of matrix vector multiplication depending on whether the vector
fits into the main memory or not. This example demonstrates a basic version of matrix-vector
multiplication in which the vector fits into the main memory.
We store the sparse matrix (sparse matrices often appear in scientific and engineering
problems) as triples with explicit coordinates (i, j, aij). E.g. the value of the first entry of the matrix,
at position (0,0), is 3. Similarly, the value of the second entry, at position (0,1), is 2. We do not store
the zero entries of the matrix. This is the sparse matrix representation of the matrix from the figure above:
i, j, aij
We store the vector in a dense format without explicit coordinates. You can see this below:
In our case, the map function takes a single element of the sparse matrix (not the whole
row!), multiplies it with the corresponding entry of the vector and produces an intermediate
key-value pair (i, aij*vj). This is sufficient because, in order to perform the summation (i.e. the reduce
step), we only need to know the matrix row; we do not need to know the matrix column. E.g. one of
the map functions takes the first element of the matrix (0,0,3), multiplies it with the first element of
the vector (4) and produces an intermediate key-value pair (0,12). The key (0), i.e. the row position
of the element in the sparse matrix, associates the value (12) with its position in the matrix-vector
product. Another map function takes the second element (0,1,2), multiplies it with the second element of
the vector (3), producing an intermediate key-value pair (0,6), and so on.
The reduce function performs a summary operation on the intermediate values that share a key. E.g. all
intermediate values with the key "0" are summed up under the final index "0".
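The map and reduce steps described above can be sketched in plain Python. This is a minimal single-process illustration, not a distributed implementation: the function names (map_entry, reduce_by_row, matrix_vector_multiply) are assumptions made for this sketch, and the data contains only the two matrix entries and two vector entries quoted in the text, since the rest of the matrix is not reproduced here.

```python
from collections import defaultdict


def map_entry(entry, vector):
    """Map step: one sparse element (i, j, a_ij) -> pair (i, a_ij * v_j)."""
    i, j, a_ij = entry
    return (i, a_ij * vector[j])


def reduce_by_row(pairs):
    """Reduce step: sum all intermediate values that share a row key."""
    result = defaultdict(int)
    for i, value in pairs:
        result[i] += value
    return dict(result)


def matrix_vector_multiply(sparse_matrix, vector):
    """Apply the map step to every stored element, then reduce by row."""
    intermediate = [map_entry(entry, vector) for entry in sparse_matrix]
    return reduce_by_row(intermediate)


# Only the entries quoted in the text: (0,0,3), (0,1,2) and vector (4, 3).
sparse_matrix = [(0, 0, 3), (0, 1, 2)]
vector = [4, 3]  # dense format, small enough to fit in main memory

print(matrix_vector_multiply(sparse_matrix, vector))  # {0: 18}
```

The intermediate pairs are (0,12) and (0,6), which the reduce step sums to 18 under row index 0. Because each map call needs only one matrix element and the vector, the calls are independent and could run in parallel.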
Sorting is one of the basic MapReduce algorithms used to process and analyze data.
MapReduce implements a sorting algorithm that automatically sorts the output
key-value pairs from the mapper by their keys.
In the Shuffle and Sort phase, after the values are tokenized in the mapper
class, the Context class (provided by the framework) collects the values that
share the same key as a collection.
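The effect of the Shuffle and Sort phase can be imitated in a few lines of Python. This is only a sketch of the idea (sort the mapper output by key, then collect the values of equal keys), not the framework's actual mechanism; the function name shuffle_and_sort and the sample word-count pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_output):
    """Sort (key, value) pairs by key and group the values of equal keys."""
    ordered = sorted(mapper_output, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical mapper output, for illustration only.
mapper_output = [("data", 1), ("big", 1), ("data", 1), ("analytics", 1)]

print(shuffle_and_sort(mapper_output))
# [('analytics', [1]), ('big', [1]), ('data', [1, 1])]
```

Each reducer would then receive one key together with its list of values.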
The following example shows how MapReduce employs a searching algorithm to find
the details of the employee who draws the highest salary in a given employee dataset.
The Map phase processes each input file and provides the employee data as
key-value pairs (<k, v> : <emp name, salary>).
The combiner phase (searching technique) accepts the input from the
Map phase as key-value pairs of employee name and salary. Using the
searching technique, the combiner checks every employee's salary to find
the highest-salaried employee in each file. See the following snippet.
Max = salary of the first employee;
if (v(salary) > Max) Max = v(salary);
else Continue checking;
Reducer phase − From each file, you will get the highest-salaried
employee. To avoid redundancy, check all the <k, v> pairs and eliminate
duplicate entries, if any. The same algorithm is then applied to the four <k,
v> pairs coming from the four input files. The final output should be
as follows −
<gopal, 50000>
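The three phases of this example can be sketched in Python as follows. This is a single-process illustration under stated assumptions: the function names are invented for the sketch, and the contents of the four input files are hypothetical, since the text gives only the final result <gopal, 50000>.

```python
def map_phase(records):
    """Map: emit one (employee name, salary) pair per input record."""
    return [(name, salary) for name, salary in records]


def combine_phase(pairs):
    """Combiner: search one file's pairs for the highest salary."""
    best = pairs[0]                  # Max = salary of the first employee
    for name, salary in pairs[1:]:
        if salary > best[1]:         # v(salary) > Max
            best = (name, salary)
    return best                      # otherwise, continue checking


def reduce_phase(candidates):
    """Reducer: drop duplicate pairs, then run the same search across files."""
    unique = list(dict.fromkeys(candidates))
    return combine_phase(unique)


# Hypothetical file contents; only <gopal, 50000> comes from the text.
input_files = [
    [("emp_a", 26000), ("gopal", 50000)],
    [("emp_b", 25000), ("emp_c", 45000)],
    [("emp_d", 45000), ("gopal", 50000)],
    [("emp_e", 23000), ("emp_f", 30000)],
]

candidates = [combine_phase(map_phase(f)) for f in input_files]
print(reduce_phase(candidates))  # ('gopal', 50000)
```

Running the combiner per file means only one pair per file reaches the reducer, so the final search works on four pairs rather than on every record.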