Big Data Engineer 2021-Ecosystem-Course Guide (2) - 21-30
Unit 1. Introduction to big data
Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure
Big data is a term that is used to describe large collections of data (also known as data sets). Big
data might be unstructured and can grow so large and so quickly that it is difficult to manage with
regular database or statistics tools.
We are witnessing a tsunami of data: huge volumes, in different types and formats, that make
managing, processing, storing, safeguarding, securing, and transporting data a real challenge.
“Big data refers to non-conventional strategies and innovative technologies that are used by
businesses and organizations to capture, manage, process, and make sense of a large volume of
data.” (Source: Reed, J., Data Analytics: Applicable Data to Advance Any Business. Seattle, WA,
CreateSpace Independent Publishing Platform, 2017. 1544916507.)
The analogies:
• Elephant (hence the logo of Hadoop)
• Humongous (the word behind the name MongoDB)
• Streams, data lakes, and oceans of data
A great deal of data exists, both historical and new: data that is generated by social media apps,
science, and medical research, stream data from web applications, and IoT sensor data. The amount
of data is larger than ever, is growing exponentially, and comes in many different formats.
The business value in the data comes from the meaning that you can harvest from it. Deriving
business value from all that data is a significant problem.
[Figure labels: Variety (different forms of data), Velocity (analysis of streaming data), Veracity
(uncertainty of data), Value]
Figure 1-8. The four classic dimensions of big data (the four Vs)
• Data variety
More sources of data mean more varieties of data in different formats: from traditional
documents and databases, to semi-structured and unstructured data from click streams, GPS
location data, social media apps, and IoT (to name a few). Different data formats mean that it is
tougher to derive value (meaning) from the data because it must all be extracted for processing
in different ways. Traditional computing methods do not work on all these different varieties of
data.
• Data veracity
Data usually contains noise, biases, and abnormalities, so a huge amount of data is likely to
carry some uncertainty. After the data is gathered, it must be curated, sanitized, and
cleansed.
Often, this process is seen as the thankless job of being a data janitor, and it can take more
than 85% of a data analyst’s or data scientist’s time. Veracity in data analysis is considered the
biggest challenge when compared to volume, velocity, and variety. Large volume, wide
variety, and high velocity, along with high-end technology, have no significance if the data that is
collected or reported is incorrect. Data trustworthiness (in other words, the quality of data) is of
the highest importance in the big data world.
• Data value
The business value in the data comes from the meaning that we can harvest from it. The value
comes from converting a large volume of data into actionable insights that are generated by
analyzing information, which leads to smarter decision making.
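The variety and veracity points above can be illustrated with a small sketch. It is not part of the course material: the sample inputs, the field names (sensor_id, temperature_c, sensor, temp), and the plausible temperature range are hypothetical and chosen only for illustration. The sketch reads the same kind of sensor reading from two formats (CSV and JSON), maps both to one common record shape, and then cleanses the result by dropping missing values, out-of-range values, and exact duplicates.

```python
import csv
import io
import json

# Hypothetical sample inputs: the same kind of reading in two formats.
CSV_SOURCE = """sensor_id,temperature_c
s1,21.5
s2,
s1,21.5
s3,22.1
"""
JSON_SOURCE = '[{"sensor": "s4", "temp": 20.9}, {"sensor": "s5", "temp": 999.0}]'


def read_csv_records(text):
    """Variety: map CSV rows to a common record shape."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["temperature_c"].strip()
        records.append({
            "sensor_id": row["sensor_id"],
            # An empty field becomes None so that cleansing can detect it.
            "temperature_c": float(value) if value else None,
        })
    return records


def read_json_records(text):
    """Variety: map JSON objects (different field names) to the same shape."""
    return [
        {"sensor_id": item["sensor"], "temperature_c": item.get("temp")}
        for item in json.loads(text)
    ]


def cleanse(records, low=-50.0, high=60.0):
    """Veracity: drop missing values, implausible values, and duplicates.

    The low/high bounds are an assumed plausible range, not a standard.
    """
    seen = set()
    clean = []
    for rec in records:
        temp = rec["temperature_c"]
        if temp is None or not (low <= temp <= high):
            continue
        key = (rec["sensor_id"], temp)
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean


raw = read_csv_records(CSV_SOURCE) + read_json_records(JSON_SOURCE)
clean = cleanse(raw)
print(len(raw), "raw records,", len(clean), "after cleansing")
```

With these sample inputs, six raw records are reduced to three. Real pipelines apply the same idea at much larger scale and with rules that depend on the domain, which is part of why cleansing consumes so much analyst time.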
References:
• What is big data? More than volume, velocity and variety:
https://developer.ibm.com/blogs/what-is-big-data-more-than-volume-velocity-and-variety/
• The Four Vs of Big Data:
https://www.ibmbigdatahub.com/infographic/four-vs-big-data
• Big Data Analytics:
ftp://ftp.software.ibm.com/software/tw/Defining_Big_Data_through_3V_v.pdf
• The 5 Vs of big data:
https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/
• The 4 Vs of Big Data for Yielding Invaluable Gems of Information:
https://www.promptcloud.com/blog/The-4-Vs-of-Big-Data-for-Yielding-Invaluable-Gems-of-Information
[Slide: Data science shown at the center of related techniques and disciplines: domain knowledge,
statistics, visualizations, machine learning, pattern recognition, business analysis, presentation,
KDD, AI, and databases and data processing]
Big data analytics is the use of advanced analytic techniques against large, diverse data sets that
come from different sources and range in size from terabytes to zettabytes. There are several specialized
techniques and technologies that are involved. The slide shows some of the big data analytics
techniques and the relationship between them. This list is not exhaustive, but it helps you
understand the complexity of the problem domain.
For more information, see the articles that are listed under References.
References:
• An Insight into 26 Big Data Analytic Techniques: Part 1:
https://blogs.systweak.com/an-insight-into-26-big-data-analytic-techniques-part-1/
• An Insight into 26 Big Data Analytic Techniques: Part 2:
https://blogs.systweak.com/an-insight-into-26-big-data-analytic-techniques-part-2/
• Big data analytics:
https://www.ibm.com/analytics/hadoop/big-data-analytics
• A Beginner’s Guide to Big Data Analytics:
https://blogs.systweak.com/a-beginners-guide-to-big-data-analytics/
1.2. Big data use cases