CHAPTER 02: Big Data Analytics
Big Data Analytics (BDA)
• Big data analytics is the process of examining large and varied data sets
(i.e., big data) to uncover hidden patterns, unknown correlations,
market trends, customer preferences, and other useful insights that
support faster, better-informed decisions.
• Big data encompasses a mix of structured, semi-structured, and
unstructured data: for example, growing volumes of structured
transaction data, internet clickstream data, web server logs, social
media content, text from customer emails and survey responses, mobile
phone records, and machine data captured by sensors connected to the
Internet of Things.
• Data analytics is the science of extracting patterns,
trends, and actionable information from large sets of
data generated by hundreds of millions of devices, such
as wearable tech, smartphones, and anything else that is
part of the Internet of Things (IoT).
• Data analytics helps to slice and dice the data to extract
insights that an organization can leverage for a
competitive advantage.
• BDA supports a 360-degree view of the customer, for
example by drawing on clickstream data, which is unstructured.
• Businesses can use advanced analytics techniques such
as text analytics, machine learning, predictive analytics,
data mining, statistics and natural language processing to
gain new insights from previously untapped data sources
independently or together with existing enterprise data.
• BDA is technology-enabled analytics for processing and
analyzing big data.
• BDA is about gaining a meaningful, deeper, and
richer insight into the business to steer it in the right
direction: understanding customers' demographics
to cross-sell and up-sell to them, better leveraging
the services of vendors and suppliers, etc.
• BDA is about gaining a competitive edge by
enabling quicker and better decision-making.
• BDA is a tight handshake between three
communities: IT, business users, and data scientists.
• BDA is working with data sets whose volume and
variety exceed the current storage and processing
capabilities and infrastructure of an enterprise.
• BDA is about moving code to data, because programs
for distributed processing are tiny (a few KB) compared
to the data (TB, PB, EB, ZB, and YB).
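The "moving code to data" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the node names, the partition contents, and the helper functions `local_summary`, `run_on_nodes`, and `combine` are all hypothetical, and the "nodes" are plain dictionary entries rather than real machines. The point it tries to show is that a small function travels to each partition and only a tiny partial result travels back.

```python
# Hypothetical illustration: each "node" holds one partition of the data.
partitions = {
    "node-1": [3, 8, 15, 4],
    "node-2": [22, 7, 1],
    "node-3": [9, 9, 30, 2, 6],
}


def local_summary(values):
    """Small piece of code shipped to every node; returns a tiny partial result."""
    return {"count": len(values), "total": sum(values)}


def run_on_nodes(task, data_by_node):
    """Simulate running `task` where each partition lives (no data is moved)."""
    return {node: task(values) for node, values in data_by_node.items()}


def combine(partials):
    """Merge the small per-node results into one global answer."""
    count = sum(p["count"] for p in partials.values())
    total = sum(p["total"] for p in partials.values())
    return {"count": count, "total": total, "mean": total / count}


partials = run_on_nodes(local_summary, partitions)
result = combine(partials)
print(result)
```

In a real cluster the partitions would be terabytes or more, while the shipped function and the returned partial results stay in the kilobyte range, which is why the code moves and the data does not.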
[Figure: Types of USD available for analysis (social media infographic)]
Difference between Data Analysis and Data Science
• Data Analysis covers descriptive analytics and, to a
certain extent, prediction.
• Data Science, on the other hand, is more about
predictive analytics and machine learning.
• Data Science is a more forward-looking, exploratory
approach: it analyzes past and current data to predict
future outcomes, with the aim of making informed
decisions.
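The contrast between describing the past and predicting the future can be sketched in a few lines. The monthly sales figures are made up for illustration, and `fit_line` is a hypothetical helper that fits a straight line by ordinary least squares; a real project would likely use a statistics or machine learning library instead.

```python
from statistics import mean

# Hypothetical monthly sales figures: month index and units sold.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive analytics: summarize what has already happened.
summary = {"mean": mean(sales), "min": min(sales), "max": max(sales)}


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Predictive analytics: use the fitted trend to estimate an unseen month.
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print(summary)
print(forecast_month_7)
```

The summary only restates the observed data, while the forecast extrapolates beyond it, and is only as good as the assumption that the linear trend continues.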
Use cases for Data Science
• Internet search
• Digital Advertisements
• Recommender Systems
• Image Recognition
• Speech Recognition
• Gaming
• Price Comparison Websites
• Airline Route Planning
• Fraud and Risk Detection
• Medical diagnosis, etc.
Data Science is multi-disciplinary
Data Science process
1. Collecting raw data from multiple disparate data
sources.
2. Processing the data.
3. Integrating the data and preparing clean datasets.
4. Engaging in exploratory data analysis using models (e.g.,
ML models) and algorithms.
5. Preparing presentations using data visualization.
6. Communicating the findings to all stakeholders.
7. Making faster and better decisions.
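The steps above can be sketched as a tiny end-to-end pipeline. Everything here is an illustrative assumption: the two inline sources (`CSV_SOURCE`, `JSON_SOURCE`), the customer names and amounts, and the function names are hypothetical, and a plain-text bar chart stands in for a real visualization tool.

```python
import csv
import io
import json
from collections import defaultdict

# Step 1: collect raw data from disparate sources (hypothetical inline samples).
CSV_SOURCE = "customer,amount\nalice,120\nbob,\ncarol,80\n"
JSON_SOURCE = '[{"customer": "dave", "amount": "200"}, {"customer": "alice", "amount": "30"}]'


def collect():
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows += json.loads(JSON_SOURCE)
    return rows


# Steps 2-3: process, integrate, and clean (rows with a missing amount are dropped).
def clean(rows):
    cleaned = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue
        cleaned.append(
            {"customer": row["customer"].strip().lower(), "amount": float(amount)}
        )
    return cleaned


# Step 4: explore the data (here, total spend per customer).
def explore(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Steps 5-6: present the findings as a plain-text bar chart, largest first.
def present(totals):
    lines = []
    for customer, total in sorted(totals.items(), key=lambda kv: -kv[1]):
        lines.append(f"{customer:<6} {'#' * int(total // 10)} {total:.0f}")
    return "\n".join(lines)


totals = explore(clean(collect()))
report = present(totals)
print(report)
```

Step 7, making the decision, stays with the stakeholders who read the report.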
Business Acumen (wisdom) skills of Data Scientist
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Thirst for knowledge
Technology Expertise of Data Scientist
• Good knowledge of relational databases (RDBMS)
• Good knowledge of NoSQL databases such as MongoDB,
Cassandra, HBase, etc.
• Languages such as Java, Python, R, C++, etc.
• Open-source tools such as Hadoop
• Data warehousing
• Data mining
• Excellent understanding of machine learning techniques
and algorithms, such as K-means, regression, kNN, Naive
Bayes, SVM, PCA, decision trees, text analytics, etc.
• Visualization tools such as Tableau, Flare, and Google
visualization APIs
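As one concrete example from the list above, K-means can be sketched in plain Python. This is a minimal illustration, not a production implementation: the `kmeans` function, the sample points, and the choice to seed the centroids with the first k points (instead of a random or k-means++ initialization) are assumptions made to keep the run deterministic.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl
            else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical sample: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

In practice a library implementation would normally be preferred; the sketch is only meant to show the assign-then-update loop that defines the algorithm.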
Mathematics Expertise of Data Scientist
• Mathematics
• Statistics
• AI/Machine learning/DL
• Algorithms
• Pattern recognition
• NLP
Data Scientist
• A data scientist is a professional capable of
gathering large amounts of data, analyzing it, and
synthesizing the information into actionable plans for
companies and other organizations.
• A data scientist is responsible for collecting,
analyzing, and interpreting large amounts of data to
identify ways to help a business improve operations
and gain a competitive edge.
• They are part mathematician, part computer scientist,
and part trend-spotter.
Responsibilities of Data Scientist
1. Prepare and integrate large and varied datasets and develop
relevant data sets for analysis.
2. Thoroughly clean and prune data to discard irrelevant
information.
3. Apply business/domain knowledge to provide context.
4. Employ a blend of analytical techniques to develop models
and algorithms to understand the data, interpret relationships,
spot trends, and unveil patterns.
5. Communicate or present findings in the business context, in
language that the different business stakeholders
understand.
6. Invent new algorithms to solve problems and build new tools to
automate work.
7. Employ sophisticated analytics programs, machine
learning, and statistical methods to prepare data for use in
predictive and prescriptive modeling.
8. Explore and examine data from a variety of angles to
uncover hidden weaknesses, trends, and/or
opportunities.
Terminologies/Technologies used in big data environment
1. In-memory analytics
2. In-database processing
3. Symmetric multiprocessor system (SMP)
4. Massively parallel processing
5. Parallel and distributed systems
6. Shared nothing architecture
In-memory analytics