Lec 1 - Introduction to Big Data

The document outlines a course on Big Data, focusing on its storage, processing, and core NoSQL concepts. It discusses the characteristics of Big Data, including volume, velocity, variety, veracity, and value, along with the classification of data types. Additionally, it highlights enterprise technologies related to Big Data and Business Intelligence, emphasizing the differences between traditional BI and Big Data methodologies.

Understanding Big Data - I
Lecture 1

Course Objectives

▪ Understand the fundamentals of storage and processing of Big Data
▪ Illustrate the core concepts of NoSQL for Big Data
Prerequisites

▪ To be familiar with the following database concepts:
  • Database model and schema
  • Different database models
  • Normalization
  • Standard SQL
  • Distributed databases
  • Transaction processing and concurrency control

Lecture Outlines

▪ Why Big Data?

▪ The Definition of Big Data

▪ Characteristics/Challenges of Big Data

▪ Classification of Big Data

▪ Applications of Big Data

▪ Enterprise Technologies for Big Data and Business Intelligence

Why Big Data?
The model of generating and consuming data has changed.

▪ 2.5 quintillion (10^18) bytes of data are generated every day!
  • Social media sites
  • Sensors
  • Digital photos
  • Business transactions
  • Location-based data
  • Generative AI tools
  [Figure: data sources feeding Big Data - websites, social media, billing, ERP, CRM, network switches, RFID]

Source: IBM, http://www-01.ibm.com/software/data/bigdata/
Glen Mules - Big Data University
Why Big Data?

▪ Big data itself isn't new: it has been here for a while and is growing exponentially.
▪ What is new is the technology to process and analyze it:
  • Increase of storage capacities
  • Increase of processing power
  • Availability of data

It is all about deriving new insight for the business: available technology can cost-effectively manage and analyze all available data in its native form, whether unstructured, structured, or streaming.
What is Big Data?

▪ Wikipedia
  • Big data is a term for datasets that are so large or complex that traditional data processing applications are inadequate to deal with them.
  • Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.

▪ Gartner
  • Big data is a popular term used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of tomorrow.

▪ Academia
  • Big Data is any data that is expensive to manage and hard to extract value from.

▪ DeepSeek
  • Big Data refers to extremely large and complex datasets that traditional data processing tools and methods are unable to efficiently manage, analyse, or interpret.

Key idea: "Big" is relative! "Difficult Data" is perhaps more apt! (Bill Howe, UW)
Characteristics of Big Data: 3 Vs

Volume: super-exponential growth in data volume

▪ Refers to the vast amount of data generated every second.
▪ It is not only terabytes, but zettabytes or brontobytes.
▪ This makes most datasets too large to store and analyse using traditional database technology.
▪ Big data tools use distributed systems to store and analyse data that are dotted around anywhere in the world.

https://www.virtualb.it/en/blog/big-data-or-small-data-the-value-its-not-about-quantity-but-about-quality/
▪ High data volumes impose:
  • Distinct data storage and processing demands.
  • Additional data preparation, curation and management processes.

https://medium.com/analytics-vidhya/the-5-vs-of-big-data-2758bfcc51d
Velocity: data can arrive at fast speeds

▪ The most challenging V to conquer, since it has a compounding effect on the other Vs.
▪ The velocity of data translates into the amount of time it takes for the data to be processed once it arrives.
▪ Coping with the fast inflow of data requires designing highly elastic and available data processing solutions and corresponding data storage capabilities.

"The 3 V's of Big Data: Velocity Remains A Challenge for Many." Dennis Duckworth, Jan 4, 2023
▪ It is a challenge to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

https://www.researchgate.net/figure/Examples-of-big-data-velocity_fig3_313400371
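To make the velocity challenge concrete, here is a minimal sketch of windowed, as-it-arrives processing in Python. The event source is simulated and the window size is an arbitrary assumption; real high-velocity pipelines would use a streaming engine, but the idea is the same: aggregate incrementally instead of storing everything first and querying later.

```python
# Minimal sketch: tumbling-window aggregation over a simulated event stream.
# The random event generator stands in for a real sensor/clickstream feed.
import random
import time
from collections import defaultdict

WINDOW_SECONDS = 2  # hypothetical window size

def event_stream(n_events=50):
    """Simulate events arriving quickly: (timestamp, sensor_id, value)."""
    for _ in range(n_events):
        yield time.time(), f"sensor-{random.randint(1, 3)}", random.random()
        time.sleep(0.05)

def run():
    window_start = time.time()
    counts, sums = defaultdict(int), defaultdict(float)
    for ts, sensor, value in event_stream():
        if ts - window_start >= WINDOW_SECONDS:
            # Emit the aggregate for the closed window, then reset.
            for s in sorted(counts):
                print(f"window avg {s}: {sums[s] / counts[s]:.3f} ({counts[s]} events)")
            window_start, counts, sums = ts, defaultdict(int), defaultdict(float)
        counts[sensor] += 1
        sums[sensor] += value

if __name__ == "__main__":
    run()
```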
Variety: multiple formats and types of data

▪ Refers to the different types of data we need to use. In fact, 80% of the world's data is unstructured.
▪ Data variety brings challenges in terms of data integration, transformation, processing, and storage.
  • Structured data: financial transactions, students' records, etc.
  • Documents: unstructured text data (Web); semi-structured data (XML, RDF triples, etc.)
  • Graphs: social networks, Semantic Web (RDF graphs), road networks, etc.
  • Data streams: sensor data, RFID data, network data, trajectory data, etc.
  • Time series data: stock exchange data, video/audio data, trajectory, EEG data, etc.
  • Multimedia data: audio, video, image, etc.

They could also be 4 V's. © 2014 IBM Corporation
Veracity: how accurate or truthful a data set may be

▪ Veracity is the degree to which data is accurate, precise, and trustworthy, given the bias, noise, and abnormality in the data.
▪ It also refers to incomplete data or the presence of errors, outliers, and missing values.
▪ Converting this type of data into a consistent, consolidated, and unified source of information creates a big challenge for the enterprise (see the cleansing sketch below).

https://www.researchgate.net/figure/Conceptualization-of-the-Components-of-Big-Data-Veracity_fig3_260178341

"Veracity: The Most Important 'V' of Big Data", Aug 29, 2019
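As a toy illustration of the veracity problem, the sketch below applies two very simple quality rules to made-up sensor records: drop missing values, then flag outliers with a median-based deviation test. Real cleansing pipelines are far more elaborate; the records and thresholds here are purely hypothetical.

```python
# Minimal sketch: basic veracity checks (missing values, outliers) with the stdlib.
import statistics

records = [
    {"id": 1, "temp": 21.4},
    {"id": 2, "temp": None},    # missing value
    {"id": 3, "temp": 22.1},
    {"id": 4, "temp": 250.0},   # likely a sensor error (outlier)
    {"id": 5, "temp": 20.9},
]

# 1. Drop records with missing values.
complete = [r for r in records if r["temp"] is not None]

# 2. Flag outliers far from the median (median absolute deviation rule).
values = [r["temp"] for r in complete]
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)
clean = [r for r in complete if mad == 0 or abs(r["temp"] - median) <= 3 * mad]

print(f"kept {len(clean)} of {len(records)} records:", clean)
```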
Or 5 V's. © 2014 IBM Corporation

Value: the usefulness of data

▪ The value characteristic is intuitively related to the veracity characteristic in that the higher the data fidelity, the more value it holds for the business.
▪ Value is also dependent on how long data processing takes, because analytics results have a shelf-life; e.g., a 20-minute delayed stock quote has little to no value for making a trade compared to a quote that is 20 milliseconds old.
▪ Value and time are inversely related: the longer it takes for data to be turned into meaningful information, the less value it has for a business.
▪ Apart from veracity and time, value is also impacted by the following lifecycle-related concerns:
  • How well has the data been stored?
  • Were valuable attributes of the data removed during data cleansing?
  • Are the right types of questions being asked during data analysis?
  • Are the results of the analysis being accurately communicated to the appropriate decision-makers?
And 10 V's have also been proposed.

Classification of Big Data

▪ The data processed by Big Data solutions can be human-generated or machine-generated, although it is ultimately the responsibility of machines to generate the analytic results.
▪ Human-generated data is the result of human interaction with systems:
  • Online services and digital devices.
▪ Machine-generated data is generated by software programs and hardware devices in response to real-world events:
  • A point-of-sale system, or information conveyed from the numerous sensors in a cell phone.
▪ The primary types of data are:
  • Structured data
  • Unstructured data
  • Semi-structured data
▪ Apart from these three fundamental data types, another important type of data in Big Data environments is metadata.
Classification of Big Data: Structured data

▪ Structured data is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data.
▪ Data stored in databases is an example of structured data.
▪ It is easy to work with structured data with respect to the following (see the sketch after this list):
  • CRUD operations – SQL
  • Indexing
  • Security
  • Scalability – scale up
  • Transaction processing – ACID properties
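A minimal sketch of why structured data is easy to work with, using Python's built-in sqlite3 module as a stand-in for any relational database; the students table and its rows are hypothetical. SQL-based CRUD and atomic (ACID) commits come almost for free once the data fits rows and columns.

```python
# Minimal sketch: CRUD on structured (row/column) data with sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

with conn:  # the with-block commits as one atomic transaction (or rolls back on error)
    conn.execute("INSERT INTO students VALUES (1, 'Aya', 3.4)")   # Create
    conn.execute("INSERT INTO students VALUES (2, 'Omar', 3.1)")
    conn.execute("UPDATE students SET gpa = 3.6 WHERE id = 2")    # Update
    conn.execute("DELETE FROM students WHERE id = 1")             # Delete

for row in conn.execute("SELECT id, name, gpa FROM students"):    # Read
    print(row)

conn.close()
```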


Classification of Big Data: Semi-structured data

▪ Semi-structured data has a defined level of structure and consistency but is not relational in nature. Instead, semi-structured data is hierarchical or graph-based.
  • XML, markup languages like HTML, etc.
▪ There is no separation between the data and the schema.
▪ Semi-structured data often has special pre-processing and storage requirements, especially if the underlying format is not text-based.
  • For example, validating an XML file to ensure that it conforms to its schema definition.
▪ Metadata for this data is available but is not sufficient.
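The sketch below parses a hypothetical semi-structured XML record with the standard-library ElementTree: it confirms the document is well-formed and walks its hierarchy. Full XSD schema validation typically requires a third-party library (e.g. lxml), so only a basic required-element check is shown here.

```python
# Minimal sketch: parsing and lightly checking a hypothetical XML record.
import xml.etree.ElementTree as ET

doc = """
<order id="42">
  <customer>Aya</customer>
  <items>
    <item sku="A-1" qty="2"/>
    <item sku="B-7" qty="1"/>
  </items>
</order>
"""

root = ET.fromstring(doc)           # raises ParseError if the XML is malformed
assert root.find("customer") is not None, "required element missing"

print("order", root.get("id"), "for", root.findtext("customer"))
for item in root.iter("item"):
    print("  item", item.get("sku"), "x", item.get("qty"))
```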
Classification of Big Data: Unstructured data

▪ Unstructured data is data which does not conform to a data model or is not in a form which can be easily used by a computer program.
▪ It is estimated that unstructured data makes up 80% of the data within any given enterprise.
  • e.g., memos, chat rooms, presentations, the body of an email, images, videos, letters, research papers, white papers, etc.
▪ Unstructured data has a faster growth rate than structured data.
▪ Unstructured data cannot be directly processed or queried using SQL. If it is required to be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).
▪ Alternatively, a NoSQL database can be used to store unstructured data alongside structured data.
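As a small illustration of the BLOB approach, the sketch below stores an opaque binary payload in a relational table with sqlite3; the table name and the placeholder bytes are made up. SQL can describe the row (name, size) but cannot query inside the payload itself.

```python
# Minimal sketch: storing an unstructured payload as a BLOB in a relational table.
import sqlite3

payload = b"\x89PNG...pretend-this-is-image-bytes..."  # placeholder binary content

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")
conn.execute("INSERT INTO documents (name, content) VALUES (?, ?)",
             ("scan.png", sqlite3.Binary(payload)))

name, size = conn.execute(
    "SELECT name, length(content) FROM documents WHERE id = 1").fetchone()
print(f"stored {name}: {size} bytes (content itself is opaque to SQL)")
conn.close()
```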
Classification of Big Data: Metadata

▪ Metadata provides information about a dataset's characteristics and structure.
▪ It is mostly machine-generated and can be appended to data.
▪ The tracking of metadata is crucial to Big Data processing, storage and analysis because it provides information about the pedigree of the data and its provenance during processing.
▪ Examples of metadata include:
  • XML tags providing the author and creation date of a document
  • Attributes providing the file size and resolution of a digital photograph
▪ Big Data solutions rely on metadata, particularly when processing semi-structured and unstructured data.
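A minimal sketch of capturing file-level metadata next to a dataset so its provenance can be tracked later. The path, field names and sample file are hypothetical; richer metadata such as image resolution or document author would need format-specific readers.

```python
# Minimal sketch: recording basic descriptive metadata for a data file.
import json
import os
import time

def collect_metadata(path: str) -> dict:
    """Return basic descriptive metadata (name, size, modification time) for one file."""
    st = os.stat(path)
    return {
        "file": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(st.st_mtime)),
    }

if __name__ == "__main__":
    # Create a small sample file so the example is self-contained.
    with open("sample_data.csv", "w") as f:
        f.write("id,value\n1,10\n2,20\n")
    print(json.dumps(collect_metadata("sample_data.csv"), indent=2))
```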
Applications of Big Data

▪ Hurricane moving-path prediction
▪ Protein-to-protein interaction networks
▪ Satellite imagery, mobile stations, distributed sensor networks, geographical plotting, …
▪ Web data
▪ Volcano monitoring
▪ Digital health care
▪ Intelligent transportation


Self-learning

▪ Kamal, Raj, and Preeti Saxena. Big Data Analytics: Introduction to Hadoop, Spark, and Machine-Learning. McGraw-Hill Education, 2019. Chapter 1, Section 1.7: Big Data Analytics Applications and Case Studies.
▪ Bahga, Arshdeep, and Vijay Madisetti. Big Data Analytics: A Hands-On Approach, 2020. Chapter 1, Section 1.4: Domain Specific Examples of Big Data.
Enterprise Technologies for Big Data and Business Intelligence

▪ Big Data has ties to business architecture at each of the organizational layers.
▪ In an enterprise executed as a layered system, the strategic layer constrains the tactical layer, which directs the operational layer.
▪ The transformation of data into information, information into knowledge and knowledge into wisdom requires an understanding of the following concepts:
  • Online Transaction Processing (OLTP)
  • Online Analytical Processing (OLAP)
  • Extract Transform Load (ETL)
  • Data Warehouses
  • Data Marts
  • Traditional BI
  • Big Data BI
Online Transaction Processing (OLTP)

▪ OLTP is a software system that processes transaction-oriented data.
▪ The term "online transaction" refers to the completion of an activity in real time, not batch processing.
▪ OLTP systems store operational data that is normalized. This data is a common source of structured data and serves as input to many analytic processes.
  • Examples of OLTP systems: ticket reservation systems, banking and point-of-sale systems.
▪ OLTP queries comprise simple insert, delete and update operations with sub-second response times.
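To ground this, here is a minimal sketch of an OLTP-style operation with sqlite3: a short transfer between two rows of a hypothetical accounts table, executed as one small transaction that either commits fully or rolls back. The table and balances are invented for illustration.

```python
# Minimal sketch: a short OLTP-style transaction (transfer between two accounts).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 120.0)])

def transfer(src: int, dst: int, amount: float) -> None:
    """Move money between accounts inside a single atomic transaction."""
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")  # aborts; nothing is committed
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))

transfer(1, 2, 75.0)
print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 425.0), (2, 195.0)]
```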
Online Analytical Processing (OLAP)

▪ OLAP systems are used for processing data analysis queries. They form an integral part of business intelligence, data mining and machine learning processes.
▪ They are relevant to Big Data in that they can serve both as a data source and as a data sink that is capable of receiving data.
▪ They are used in diagnostic, predictive and prescriptive analytics.
▪ OLAP systems perform long-running, complex queries against a multidimensional database whose structure is optimized for performing advanced analytics.
▪ OLAP systems store historical data that is aggregated and denormalized to support fast reporting capability.
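For contrast with the OLTP sketch above, here is a minimal OLAP-style aggregate query over a small, denormalized sales fact table in sqlite3. The table, rows and dimensions (month, region) are hypothetical; a real OLAP cube would pre-aggregate across many more dimensions.

```python
# Minimal sketch: an analytical roll-up query over a denormalized fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_fact (
    sale_date TEXT, region TEXT, product TEXT, quantity INTEGER, revenue REAL)""")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    ("2024-01-05", "EMEA", "widget", 10, 150.0),
    ("2024-01-07", "EMEA", "gadget",  4, 260.0),
    ("2024-02-02", "APAC", "widget",  7, 105.0),
    ("2024-02-11", "EMEA", "widget",  3,  45.0),
])

# Roll-up: revenue by month and region (a typical reporting slice).
query = """
SELECT substr(sale_date, 1, 7) AS month, region,
       SUM(revenue) AS total_revenue, SUM(quantity) AS units
FROM sales_fact
GROUP BY month, region
ORDER BY month, region
"""
for row in conn.execute(query):
    print(row)
conn.close()
```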
Extract Transform Load (ETL)

▪ ETL is a process of loading data from a source system into a target system. It represents the main operation through which data warehouses are fed data.
▪ The required data is first obtained or extracted from the sources, after which the extracts are modified or transformed by the application of rules. Finally, the data is inserted or loaded into the target system.
▪ The source system and the target system can be a database, a flat file, or an application.
▪ A Big Data solution encompasses the ETL feature-set for converting data of different types.
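The three steps can be shown end to end in a tiny sketch: extract from a flat file, transform by applying simple rules, load into a target database. The file name, rules and target table are invented for illustration; production ETL would use dedicated tooling and far richer rules.

```python
# Minimal sketch of Extract-Transform-Load with the standard library.
import csv
import sqlite3

# Create a small source file so the example is self-contained.
with open("orders.csv", "w", newline="") as f:
    f.write("order_id,amount_usd,country\n1001,19.99,eg\n1002,-5.00,de\n1003,42.50,eg\n")

# Extract: read raw rows from the source (a flat file here).
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: drop invalid amounts, normalize country codes, convert types.
clean = [
    {"order_id": int(r["order_id"]),
     "amount_usd": float(r["amount_usd"]),
     "country": r["country"].upper()}
    for r in rows if float(r["amount_usd"]) > 0
]

# Load: insert the transformed records into the target system.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (order_id INTEGER, amount_usd REAL, country TEXT)")
target.executemany("INSERT INTO orders VALUES (:order_id, :amount_usd, :country)", clean)
print(target.execute("SELECT * FROM orders").fetchall())
```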
Data Warehouses

▪ A data warehouse is a central, enterprise-wide repository consisting of historical and current data.
▪ Data warehouses are heavily used by BI to run various analytical queries.
▪ They usually interface with an OLAP system to support multi-dimensional analytical queries.
▪ Data pertaining to multiple business entities from different operational systems is periodically extracted, validated, transformed and consolidated into a single denormalized database.
Data Mart

▪ A data mart is a subset of the data stored in a data warehouse that typically belongs to a department, division, or specific line of business. Data warehouses can have multiple data marts.
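One simple way to picture this is a department-scoped view over a warehouse table, as in the sketch below; the warehouse table, the "retail" department and the view are all hypothetical, and real data marts are often separate physical stores rather than views.

```python
# Minimal sketch: a data mart modeled as a department-scoped view over a warehouse table.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_sales (region TEXT, department TEXT, revenue REAL)")
dw.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", [
    ("EMEA", "retail",    1200.0),
    ("EMEA", "wholesale",  900.0),
    ("APAC", "retail",     700.0),
])

# The "retail data mart": only the slice that belongs to the retail department.
dw.execute("CREATE VIEW retail_mart AS "
           "SELECT region, revenue FROM dw_sales WHERE department = 'retail'")

print(dw.execute("SELECT region, SUM(revenue) FROM retail_mart GROUP BY region").fetchall())
dw.close()
```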
Business Intelligence

Traditional BI
▪ Utilizes descriptive and diagnostic analytics to provide information on historical and current events.
▪ It is not "intelligent" because it only provides answers to correctly formulated questions.
  • Ad-hoc reports – custom-made reports on a specific area of the business
  • Dashboards – a holistic view of key business areas at periodic intervals, in real time or near real time

Big Data BI
▪ Builds upon traditional BI by acting on the cleansed, consolidated enterprise-wide data in the data warehouse and combining it with semi-structured and unstructured data sources.
▪ It comprises both predictive and prescriptive analytics to facilitate the development of an enterprise-wide understanding of business performance.
Business Intelligence vs Big Data

▪ Although Big Data and Business Intelligence are two technologies used to analyse data to help companies in the decision-making process, there are differences between them. They differ in the way they work as much as in the type of data they analyse.
▪ Traditional BI methodology is based on the principle of grouping all business data into a central Data Warehouse and analysing it in offline mode. The data is structured in a conventional relational database with an additional set of indexes and forms of access to the tables (multidimensional cubes).
▪ These are the main differences between Big Data and Business Intelligence:
  • In a Big Data environment, information is stored on a distributed file system, rather than on a central server. It is a much safer and more flexible space.
  • Big Data solutions carry the processing functions to the data, rather than the data to the functions. As the analysis is centered on the information, it is easier to handle larger amounts of information in a more agile way.
  • Big Data can analyse data in different formats, both structured and unstructured. The volume of unstructured data is growing at levels much higher than that of structured data. Nevertheless, its analysis carries different challenges, which Big Data solutions address by allowing a global analysis of various sources of information.
  • Data processed by Big Data solutions can be historical or come from real-time sources. Thus, companies can make decisions that affect their business in an agile and efficient way.
  • Big Data technology uses massively parallel processing (MPP) concepts, which improves the speed of analysis. With MPP, many instructions are executed simultaneously: the jobs are divided into several parallel execution parts, and at the end the partial results are recombined and presented. This allows large volumes of information to be analysed quickly (a toy illustration follows below).
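The sketch below is a toy, single-machine illustration of the MPP idea, not a real MPP engine: a synthetic dataset is split into partitions, each partition is aggregated in a separate worker process, and the partial results are recombined at the end.

```python
# Toy illustration of partitioned, parallel aggregation (the core MPP idea).
from multiprocessing import Pool

def partial_sum(partition):
    """Aggregate one partition independently ('processing goes to the data')."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))                       # synthetic dataset
    n_workers = 4
    chunk = len(data) // n_workers
    partitions = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers - 1)]
    partitions.append(data[(n_workers - 1) * chunk:])   # last partition takes the remainder

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, partitions)    # executed in parallel

    print("total =", sum(partials))                     # recombine the partial results
```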
Case Study

▪ You are required to go through the case study at the end of Chapter 1: "Ensure to Insure (ETI)".
▪ Erl T, Khattak W, Buhler P. Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall Press; 2016 Jan.
