Big Data Analytics

BIG DATA ANALYTICS

The volume of data that one has to deal with has exploded to unimaginable
levels in the past decade, and at the same time, the price of data storage
has steadily fallen. Private companies and research institutions capture
terabytes of data about their users’ interactions, business, and social
media, as well as data from sensors in devices such as mobile phones and
automobiles. The challenge of this era is to make sense of this sea of data.
This is where big data analytics comes into the picture.

Big Data Analytics largely involves collecting data from different sources,
munging it so that it becomes available for consumption by analysts, and
finally delivering data products useful to the organization’s business.

Big Data analytics in different domains.

The process of converting large amounts of unstructured raw data, retrieved
from different sources, to a data product useful for organizations forms the
core of Big Data Analytics.

The CRISP-DM process consists of the following phases:

• Business Understanding − This initial phase focuses on understanding the
project objectives and requirements from a business perspective, and then
converting this knowledge into a data mining problem definition. A preliminary
plan is designed to achieve the objectives. A decision model, especially one
built using the Decision Model and Notation standard, can be used.

• Data Understanding − The data understanding phase starts with an initial
data collection and proceeds with activities in order to get familiar with the
data, to identify data quality problems, to discover first insights into the
data, or to detect interesting subsets to form hypotheses for hidden
information.

• Data Preparation − The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the modeling tool(s))
from the initial raw data. Data preparation tasks are likely to be performed
multiple times, and not in any prescribed order. Tasks include table, record,
and attribute selection as well as transformation and cleaning of data for
modeling tools.

• Modeling − In this phase, various modeling techniques are selected and
applied, and their parameters are calibrated to optimal values. Typically,
there are several techniques for the same data mining problem type. Some
techniques have specific requirements on the form of data. Therefore, it is
often required to step back to the data preparation phase.

• Evaluation − At this stage in the project, you have built a model (or models)
that appears to have high quality, from a data analysis perspective. Before
proceeding to final deployment of the model, it is important to evaluate the
model thoroughly and review the steps executed to construct the model, to be
certain it properly achieves the business objectives.

A key objective is to determine if there is some important business issue that
has not been sufficiently considered. At the end of this phase, a decision on
the use of the data mining results should be reached.

• Deployment − Creation of the model is generally not the end of the project.
Even if the purpose of the model is to increase knowledge of the data, the
knowledge gained will need to be organized and presented in a way that is
useful to the customer.

Depending on the requirements, the deployment phase can be as simple as
generating a report or as complex as implementing a repeatable data scoring
(e.g. segment allocation) or data mining process.

TYPES OF ANALYTICS

There are 4 types of big data analytics:

1) Diagnostic 2) Predictive 3) Prescriptive 4) Descriptive

Diagnostic analytics:

It determines what has happened in the past and why. It can be used, for
example, to assess the number of posts, likes, and reviews in a social media
marketing campaign.

Predictive analytics:

It works by identifying patterns in historical data and then using statistics
to make inferences about the future.
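
As a rough illustration of this idea, the sketch below fits a straight line to past observations by ordinary least squares and extends it one step ahead. The monthly sales figures and the helper name `fit_line` are assumptions made up for this example; real predictive models are usually far richer than a single trend line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: sales observed in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # inference about month 6
print(forecast)
```

The pattern found in the past (here, a constant monthly increase) is what drives the estimate for the future month.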

Prescriptive analytics:
It says not only what is going on, but also what might happen and, most
importantly, what to do about it.

Descriptive analytics:

It uses existing information from the past to understand decisions in the
present and helps decide on an effective course of action in the future.

Categories:

When analyzing data, you can categorize big data analytics as follows:

1) Basic analytics

2) Advanced analytics

3) Operational analytics

4) Monetized analytics

Basic analytics: It involves visualization of simple statistics. This type of
analytics is used especially when a lot of disparate data needs to be analyzed.
The process of basic analytics includes investigating what happened, when it
happened, and its impact.

Ex: basic monitoring of data and anomaly identification.
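
One simple way to picture anomaly identification is to flag readings that lie far from the mean. The sketch below does this with the standard library only. The daily order counts, the function name `find_anomalies`, and the cut-off of two standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical monitoring data: one day is clearly out of line.
daily_orders = [102, 98, 101, 97, 103, 99, 100, 480, 101, 98]
print(find_anomalies(daily_orders))
```

With this data only the value 480 is reported. A fixed threshold like this is a crude rule, so it should be read as a starting point for monitoring rather than a robust detector.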

Advanced analytics:

It performs complex analysis of both structured and unstructured data by using
algorithms. It also uses data mining techniques, neural networks, machine
learning, text analysis, and statistical models.

Operational analytics:

It means making analytics an important part of a business process. For
instance, an insurance company can use a model to predict the probability of a
claim being fraudulent.

Monetized analytics:

It helps businesses make important and better decisions and earn revenue.

Ex: credit card providers and telecommunication companies can use this data to
offer value-added products.
Importance of Big data
Big Data is defined as data conforming to the volume, velocity, and variety (V3)
attributes that characterize it. Big Data solutions aren’t a replacement for
existing warehouse solutions. They are ideal in the following situations:

• Analyzing not only raw structured data, but also semistructured and
unstructured data from a wide variety of sources.
• When all, or most, of the data needs to be analyzed versus a sample of the
data, or when a sampling of data isn’t nearly as effective as a larger set of
data from which to derive analysis.
• Iterative and exploratory analysis when business measures on data are not
predetermined.

When it comes to solving information management challenges using Big Data
technologies, we suggest you consider the following:

• A Big Data solution is not only going to leverage data not typically suitable
for a traditional warehouse environment, and in massive volumes, but it’s also
going to give up some of the formalities and “strictness” of the data. The
benefit is that you can preserve the fidelity of the data and gain access to
mountains of information for exploration and discovery of business insights
before running it through the due diligence that you’re accustomed to. That
data can then be included as a participant in a cyclic system, enriching the
models in the warehouse.
• Big Data is well suited for solving information challenges that don’t
natively fit within a traditional relational database approach for handling
the problem at hand.

It’s important to understand that conventional database technologies are an
important, and relevant, part of an overall analytic solution. In fact, they
become even more vital when used in conjunction with your Big Data platform. A
good analogy here is your left and right hands; each offers individual
strengths and optimizations for a task at hand.

There exists some class of problems that don’t natively belong in traditional
databases, at least not at first. And there’s data that we’re not sure we want
in the warehouse, because perhaps we don’t know if it’s rich in value, it’s
unstructured, or it’s too voluminous. In many cases, we can’t find out the
value per byte of the data until after we spend the effort and money to put it
into the warehouse; but we want to be sure that data is worth saving and has a
high value per byte before investing in it.

Big data Applications:

The IT industry and many other organizations need to store and process huge
amounts of data. Existing systems and algorithms cannot meet the requirements
of distributed and parallel processing of such bulk data. Big data producers
range from individuals to multinational companies (MNCs). At the individual
level, social media applications such as Facebook and Twitter, as well as
online shopping, generate huge amounts of data. Companies such as ONGC, and
domains such as medicine, banking, insurance, and retail, also need to manage
and process huge amounts of data. The following are some of the applications
of big data.

• Medical domain: predicting a patient’s disease based on the recorded
symptoms.
• Government agencies: forecasting the weather, such as predicting floods and
cyclones.
• Oil and gas industry: applications need to hold and process huge amounts of
data in order to mine the materials.

1. Understanding and Optimizing Business Processes
2. Personal Quantification and Performance Optimization
3. Improving Healthcare and Public Health
4. Improving Sports Performance
5. Improving Science and Research
6. Optimizing Machine and Device Performance
7. Improving Security and Law Enforcement
8. Improving and Optimizing Cities and Countries

Matrix-vector multiplication::mappedReduced()
MapReduce is a high-level programming model for processing large data sets in parallel, originally
developed by Google and adapted from functional programming. The model is suitable for a range of
problems such as matrix operations, relational algebra, and statistical frequency counting.
The MapReduce implementation in this example differs from textbook matrix-vector
multiplication, which works through the matrix one whole row at a time. Here, a single map
function processes only a single matrix element rather than the whole row.

There are different implementations of matrix-vector multiplication depending on whether the vector
fits into main memory or not. This example demonstrates a basic version of matrix-vector
multiplication in which the vector fits into main memory.
We store the sparse matrix (sparse matrices often appear in scientific and engineering
problems) as triples with explicit coordinates (i, j, aij). E.g. the value of the entry at (0,0)
is 3. Similarly, the value of the entry at (0,1) is 2. We do not store the zero entries of the
matrix. This is the sparse representation of the example matrix:
i, j, aij
0,0,3
0,1,2
1,1,4
1,2,1
2,0,2
2,2,1
We store the vector in a dense format without explicit coordinates. You can see this below:
4
3
1
In our case, the map function takes a single element of the sparse matrix (not the whole
row!), multiplies it by the corresponding entry of the vector, and produces an intermediate
key-value pair (i, aij*vj). This is sufficient because, in order to perform the summation (i.e. the
reduce step), we only need to know the matrix row; we do not need to know the matrix column. E.g. one
of the map functions takes the first element of the matrix (0,0,3), multiplies it by the first
element of the vector (4), and produces an intermediate key-value pair (0,12). The key (0), i.e. the
row position of the element in the sparse matrix, associates the value (12) with its position in the
matrix-vector product. Another map function takes the second element (0,1,2), multiplies it by the
second element of the vector (3), and produces an intermediate key-value pair (0,6), and so on.

The reduce function performs a summary operation on the intermediate values that share a key. E.g.
the intermediate values with the key 0 (12 and 6) are summed up under the final index 0, giving 18.
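
The map and reduce steps described here can be sketched in plain Python using the matrix triples and vector from the text. This is only a single-process illustration of the idea, not a distributed MapReduce job; the function names `map_element` and `reduce_by_row` are made up for this sketch.

```python
from collections import defaultdict

# Sparse matrix as (i, j, a_ij) triples and the dense vector, as given in the text.
matrix = [(0, 0, 3), (0, 1, 2), (1, 1, 4), (1, 2, 1), (2, 0, 2), (2, 2, 1)]
vector = [4, 3, 1]

def map_element(triple, v):
    """Map step: one matrix element becomes the pair (row, a_ij * v_j)."""
    i, j, a_ij = triple
    return i, a_ij * v[j]

def reduce_by_row(pairs):
    """Reduce step: sum the intermediate values that share a row key."""
    sums = defaultdict(int)
    for row, value in pairs:
        sums[row] += value
    return dict(sums)

intermediate = [map_element(t, vector) for t in matrix]
result = reduce_by_row(intermediate)
print(intermediate)  # [(0, 12), (0, 6), (1, 12), (1, 1), (2, 8), (2, 1)]
print(result)        # {0: 18, 1: 13, 2: 9}
```

Because each call to `map_element` needs only one triple and the vector, the map calls are independent of each other, which is what allows a real framework to run them in parallel.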

Explain the MapReduce algorithm

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The map task is done by means of the Mapper class.
• The reduce task is done by means of the Reducer class.

The Mapper class takes the input, tokenizes it, maps it, and sorts it. The
output of the Mapper class is used as input by the Reducer class, which in turn
searches for matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems. In technical terms, the
MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate
servers in a cluster.

These mathematical algorithms may include the following −

• Sorting
• Searching

Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze
data. MapReduce implements a sorting algorithm to automatically sort the output
key-value pairs from the mapper by their keys.

• Sorting methods are implemented in the mapper class itself.
• In the Shuffle and Sort phase, after tokenizing the values in the mapper
class, the Context class (provided by the framework) collects the matching
valued keys as a collection.
• To collect similar key-value pairs (intermediate keys), the Mapper class
takes the help of the RawComparator class to sort the key-value pairs.
• The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
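
To make the (K2, {V2, V2, …}) grouping concrete, the sketch below imitates the effect of the Shuffle and Sort phase in plain Python. It does not use the Hadoop API; the function name `shuffle_and_sort` and the word-count style input are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapped_pairs):
    """Sort (key, value) pairs by key and group them as (K2, [V2, V2, ...])."""
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Hypothetical mapper output: one (word, 1) pair per token.
mapper_output = [("data", 1), ("big", 1), ("data", 1), ("analytics", 1), ("big", 1)]
grouped = shuffle_and_sort(mapper_output)
print(grouped)  # [('analytics', [1]), ('big', [1, 1]), ('data', [1, 1])]
```

Each tuple in `grouped` corresponds to what a single reduce call would receive: one key together with all of the values emitted for it.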
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the
combiner phase (optional) and in the Reducer phase. Let us try to understand
how searching works with the help of an example.

Example
The following example shows how MapReduce employs a searching algorithm to find
the details of the employee who draws the highest salary in a given employee
dataset.

• Let us assume we have employee data in four different files: A, B, C, and D.
Let us also assume there are duplicate employee records in all four files
because the employee data was imported from all database tables repeatedly.

• The Map phase processes each input file and provides the employee data in
key-value pairs (<k, v> : <emp name, salary>).

• The combiner phase (searching technique) accepts the input from the Map phase
as key-value pairs of employee name and salary. Using the searching technique,
the combiner checks all the employee salaries to find the highest-salaried
employee in each file. See the following snippet.

<k: employee name, v: salary>

Max = salary of the first employee;   // treated as the max salary so far

if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
} else {
    continue checking;
}

The expected result is as follows −

<satish, 26000>   <gopal, 50000>   <kiran, 45000>   <manisha, 45000>

• Reducer phase − From each file, you will find the highest-salaried
employee. To avoid redundancy, check all the <k, v> pairs and eliminate
duplicate entries, if any. The same algorithm is then applied to the four
<k, v> pairs coming from the four input files. The final output should be
as follows −

<gopal, 50000>
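
The combiner and reducer steps of this example can be sketched in plain Python as follows. The contents of files A to D are invented for illustration (only the per-file maxima and the final result come from the text), and the function names `combine_max` and `reduce_max` are hypothetical.

```python
def combine_max(records):
    """Combiner: return the (name, salary) pair with the highest salary in one file."""
    best = records[0]  # treat the first employee as the max so far
    for name, salary in records[1:]:
        if salary > best[1]:
            best = (name, salary)
    return best

def reduce_max(candidates):
    """Reducer: drop duplicate pairs, then apply the same search across files."""
    unique = list(dict.fromkeys(candidates))
    return combine_max(unique)

# Hypothetical file contents, including duplicate records.
files = {
    "A": [("satish", 26000), ("krishna", 25000), ("satish", 26000)],
    "B": [("gopal", 50000), ("krishna", 25000), ("kiran", 45000)],
    "C": [("kiran", 45000), ("satish", 26000), ("krishna", 25000)],
    "D": [("manisha", 45000), ("krishna", 25000), ("satish", 26000)],
}

per_file = [combine_max(records) for records in files.values()]
winner = reduce_max(per_file)
print(per_file)
print(winner)  # ('gopal', 50000)
```

Running the combiner per file first means the reducer only has to compare four candidate pairs instead of every record, which is the main reason the optional combiner phase is used.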
