Lec 1 - Introduction to Big Data
Lecture 1
Course Objectives
• Normalization
• Standard SQL
• Distributed Databases
Why Big Data? The Model of Generating/Consuming Data has Changed
Why Big Data?
• Generative AI tools
• Availability of data from many enterprise sources (e.g., Billing, ERP, CRM, Network Switches, RFID)
It is all about deriving new insight for the business.
Available technology can cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming.
What is Big Data?
▪ Wikipedia
• Big data is a term for datasets that are so large or complex that traditional data processing applications are inadequate to deal with them.
• Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
▪ Gartner
• Big data is a popular term used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of tomorrow.
▪ Academia
• Big Data is any data that is expensive to manage and hard to extract value from. (Bill Howe, UW)
▪ DeepSeek
• Big Data refers to extremely large and complex datasets that traditional data processing tools and methods are unable to efficiently manage, analyse, or interpret.
Key idea: “Big” is relative! “Difficult Data” is perhaps more apt!
Characteristics of Big Data: 3 Vs
Volume: Super-exponential growth in data volume
https://www.virtualb.it/en/blog/big-data-or-small-data-the-value-its-not-about-quantity-but-about-quality/
Velocity: Data can arrive at fast speeds
https://www.researchgate.net/figure/Examples-of-big-data-velocity_fig3_313400371
Variety: Multiple formats and types of data
▪ Refers to the different types of data we need to use. In fact, 80% of the world’s data is unstructured.
▪ Data variety brings challenges in terms of data integration, transformation, processing, and storage (see the sketch below).
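As a small illustration of the variety challenge (an assumed example, not part of the original slides), the following Python sketch reads the same kind of measurement from a CSV source (structured) and a JSON source (semi-structured) and normalizes both into one common record structure; the field names are hypothetical.

import csv
import io
import json

# Structured source: tabular CSV with fixed columns.
csv_source = io.StringIO("id,temperature\n1,21.5\n2,19.0\n")
csv_records = [
    {"id": int(row["id"]), "temperature": float(row["temperature"])}
    for row in csv.DictReader(csv_source)
]

# Semi-structured source: JSON documents with nested fields.
json_source = '[{"id": 3, "reading": {"temperature": 22.1}}]'
json_records = [
    {"id": doc["id"], "temperature": doc["reading"]["temperature"]}
    for doc in json.loads(json_source)
]

# Integration: both varieties are normalized into one common structure.
all_records = csv_records + json_records
print(all_records)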
OR 5 V’s
Value: The usefulness of data
▪ Structured data is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program. Relationships exist between entities of data.
▪ Data stored in databases is an example of structured data (see the sketch below).
▪ Indexing
▪ Security
▪ Scalability – Scale up
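To make structured data concrete, here is a minimal sketch (not from the slides) using Python’s built-in sqlite3 module: rows and columns in a table, an index to speed up lookups, and a standard SQL query. The table and column names are made up for illustration.

import sqlite3

# Create an in-memory relational database (structured data: rows and columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table defines the structure: each row is a customer, each column an attribute.
cur.execute("""
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        country TEXT
    )
""")

# Insert a few structured records.
cur.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "DE"), ("Bob", "US"), ("Carol", "US")],
)

# Indexing: speeds up lookups on a frequently queried column.
cur.execute("CREATE INDEX idx_customers_country ON customers (country)")

# Standard SQL query over the structured data.
for row in cur.execute("SELECT name FROM customers WHERE country = ?", ("US",)):
    print(row[0])

conn.close()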
Example application domains:
▪ Web data
▪ Volcano monitoring
▪ Digital health care
▪ Intelligent transportation
▪ Big Data has ties to business architecture at each of the organizational layers.
▪ Data Warehouses
▪ Data Marts
▪ Traditional BI
▪ Big Data BI
Online Transaction Processing (OLTP)
▪ OLTP queries consist of simple insert, delete and update operations with sub-second response times.
Online Analytical Processing (OLAP)
▪ OLAP systems are used for processing data analysis queries. They form an
integral part of business intelligence, data mining and machine learning
processes.
▪ They are relevant to Big Data in that they can serve both as a data source and as a data sink that is capable of receiving data.
▪ They are used in diagnostic, predictive and prescriptive analytics.
▪ OLAP systems perform long-running, complex queries against a
multidimensional database whose structure is optimized for performing
advanced analytics.
▪ OLAP systems store historical data that is aggregated and denormalized to
support fast reporting capability.
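A minimal sketch (illustrative only, not from the slides) contrasting the two workload styles with Python’s sqlite3 module; the sales table and its columns are hypothetical. The OLTP part issues small single-row writes, while the OLAP part runs an aggregation over the accumulated history across dimensions.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized sales history, as an OLAP-style store might keep it.
cur.execute("""
    CREATE TABLE sales (
        sale_id  INTEGER PRIMARY KEY,
        region   TEXT,
        product  TEXT,
        year     INTEGER,
        amount   REAL
    )
""")

# OLTP-style workload: many small, simple inserts/updates, each touching one row.
cur.execute("INSERT INTO sales (region, product, year, amount) VALUES ('EU', 'A', 2023, 120.0)")
cur.execute("INSERT INTO sales (region, product, year, amount) VALUES ('US', 'A', 2023, 200.0)")
cur.execute("INSERT INTO sales (region, product, year, amount) VALUES ('US', 'B', 2024, 310.0)")
cur.execute("UPDATE sales SET amount = 130.0 WHERE sale_id = 1")

# OLAP-style workload: an analytical query that aggregates history
# across dimensions (region, year) to support reporting and analytics.
for region, year, total in cur.execute("""
    SELECT region, year, SUM(amount)
    FROM sales
    GROUP BY region, year
    ORDER BY region, year
"""):
    print(region, year, total)

conn.close()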
Extract Transform Load (ETL)
▪ ETL is a process of loading data from a source system into a target system. It
represents the main operation through which data warehouses are fed data.
▪ The required data is first obtained or extracted from the sources, after which the
extracts are modified or transformed by the application of rules. Finally, the data is
inserted or loaded into the target system.
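A minimal ETL sketch (illustrative, not from the slides): records are extracted from a hypothetical CSV source, transformed by applying a simple rule (an assumed currency conversion), and loaded into a target warehouse table.

import csv
import io
import sqlite3

# Extract: read records from a source system (here, a hypothetical CSV extract).
source_csv = io.StringIO("order_id,amount_eur\n1,100.0\n2,250.5\n3,80.0\n")
rows = list(csv.DictReader(source_csv))

# Transform: apply rules to the extracted data (assumed EUR -> USD conversion).
EUR_TO_USD = 1.1  # illustrative rate only
transformed = [
    (int(r["order_id"]), round(float(r["amount_eur"]) * EUR_TO_USD, 2))
    for r in rows
]

# Load: insert the transformed records into the target system (a warehouse table).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", transformed)

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())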
▪ A data mart is a subset of the data stored in a data warehouse that typically
belongs to a department, division, or specific line of business. Data
warehouses can have multiple data marts.
Business Intelligence
Traditional BI
▪ Utilizes descriptive and diagnostic analytics to provide
information on historical and current events.
▪ It is not “intelligent” because it only provides answers to
correctly formulated questions.
• Ad-hoc reports – custom-made reports on a specific area of
the business
• Dashboards – holistic view of key business areas at periodic intervals, in real time or near real time
Big Data BI
▪ Builds upon traditional BI by acting on the cleansed, consolidated enterprise-wide data in the data warehouse and combining it with semi-structured and unstructured data sources.
▪ It comprises both predictive and prescriptive analytics to
facilitate the development of an enterprise-wide
understanding of business performance.
Business Intelligence vs Big Data
▪ Although Big Data and Business Intelligence are two technologies used to analyse data to help companies in the decision-making process, there are differences between them. They differ in the way they work as much as in the type of data they analyse.
▪ Traditional BI methodology is based on the principle of grouping all business data into a central Data
Warehouse and analysing it in offline mode. The data is structured in a conventional relational database
with an additional set of indexes and forms of access to the tables (multidimensional cubes).
▪ These are the main differences between Big Data and Business Intelligence:
▪ In a Big Data environment, information is stored on a distributed file system, rather than on a central server. It is a much
safer and more flexible space.
▪ Big Data solutions carry the processing functions to the data, rather than the data to the functions. As the analysis is centered on the information, it's easier to handle larger amounts of information in a more agile way.
▪ Big Data can analyse data in different formats, both structured and unstructured. The volume of unstructured data is
growing at levels much higher than the structured data. Nevertheless, its analysis carries different challenges. Big Data
solutions solve them by allowing a global analysis of various sources of information.
▪ Data processed by Big Data solutions can be historical or come from real-time sources. Thus, companies can make
decisions that affect their business in an agile and efficient way.
▪ Big Data technology uses massively parallel processing (MPP) concepts, which improves the speed of analysis. With MPP many instructions are executed simultaneously: a job is divided into several parts that run in parallel, and at the end the partial results are recombined and presented. This allows large volumes of information to be analysed quickly (see the sketch after this list).
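To illustrate the divide-and-recombine idea behind MPP (a simplified single-machine analogy using Python’s multiprocessing module, not an actual MPP engine), the sketch below splits a dataset into chunks, processes the chunks in parallel worker processes, and then recombines the partial results.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes its own chunk of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))          # the full dataset
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Many instructions execute simultaneously: one worker per chunk.
    with Pool(n_workers) as pool:
        partial_results = pool.map(partial_sum, chunks)

    # The partial results are recombined into the final answer.
    total = sum(partial_results)
    print(total)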
Case Study
You are required to go through the case study at the end of Chapter 1.