Intro To Data Engineering!

LECTURE 01

Welcome to Data Engineering!
April 10, 2025

Why Learn Data Engineering?

Data engineering is an essential ingredient of real-world data science projects.

● A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● The backbone, plumbing, or infrastructure that supports data science.

Data engineering is as essential as plumbing!

● When it works well, you don't realize it exists.
● When it doesn't, you'll really know.
Data Science vs. Data Engineering

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean "rectangular" shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by "popular" Machine Learning, e.g., leaderboards, Kaggle, …
A lot of data engineering must happen to support the conventional view!

Nowadays, Data Science also involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Messy (often non-rectangular), dynamic, and large datasets
● One team generates the data, another team consumes it
● Unclear and ill-defined objectives
● Necessary precursor to real-world data science & ML
● etc.
[1/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.

● Most of the time spent in real-world data science projects involves data engineering:
  ○ cleaning, moving, restructuring, processing, …
● Often underappreciated compared to other activities (e.g., ML).
[2/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.

“… 70% more open roles at


companies in data engineering as
compared to data science.ˮ
Mihail Eric, Jan 2021.[blog]

A new specialized job category:


● Data scientist: Use various techniques
in statistics & ML to process & analyze
data.
● Data engineer: Develops a robust and
(we wonʼt be dogmatic about
scalable set of data processing these industry-level role
tools/platforms. distinctions…)
8
[3/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

"ML code" is not only a small fraction of the system; it is also often simple: calls to standard libraries (sklearn, pytorch, etc.).

Sculley et al., SE4ML 2014 [google research].


Data Engineering is Essential in ML/AI

"Under the strong influence of the current AI hype, people try to plug in data that's dirty & full of gaps, that spans years while changing in format and meaning, that's not understood yet, that's structured in ways that don't make sense, and expect those tools to magically handle it."

Monica Rogati, 2017 [blog].
NEW Machine Learning Engineer

"ML Engineer": a specialization of data engineer focused on operationalizing ML.

"A need for a person that would reunite two warring parties… One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running… ...taking data scientists' code and making it more effective and scalable. …"

Tomasz Dudek, 2018 [blog].

Scalability will be an important focus for us!
[4/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms… but you are likely less familiar with systems.

● You will learn systems and the infrastructure that enables these techniques.
● You'll start thinking about efficiency and scalability, esp. on large datasets.
● Various "plumbing analogies": data pipelines, data flows, …
All these Data Systems!!!

[2024 MAD (ML/AI/Data) Landscape diagram]

So…what is Data 101 about?

Data systems is a difficult subject! There are many, many data systems – too many for us to cover.
● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use what type of system.

2023 MAD (ML/AI/Data) Landscape: blog, interactive


Demystifying Industry Jargon

Data systems are tools that support data engineering.

(the same VC who made the MAD Landscape diagram)
What you mostly learn at Berkeley (e.g., DATA 100)

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific (Fit for purpose, Self-Service)

Data preparation example: Research experiments

"Experts are close to the data and should be the ones extracting / analyzing"
Alternative picture, but more traditional enterprise

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Integration → Source of Truth (Governed, Secure, Audited, Managed) → Data Preparation → Use-Case-Specific (Fit for purpose, Self-Service)

Data Integration example: UC Berkeley data (contracts, student info, grants, etc…) – must be centrally managed

"Compute is expensive, data is precious"
How does this actually happen? E, T, and L

Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, …
Transform: Apply a series of rules or functions; wrangle data into schema(s)/format(s)
Load: Load data into a data storage solution
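To make E, T, and L concrete, here is a minimal sketch in Python. The CSV source, column names, and SQLite target are hypothetical stand-ins for real source systems and a real warehouse:

```python
import csv
import sqlite3

RAW_LOG = "sensor_log.csv"   # assumed raw source file
WAREHOUSE = "warehouse.db"   # SQLite standing in for a warehouse

def extract(path):
    """Extract: scrape raw rows out of a source system (here, a CSV log)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply rules/functions to wrangle rows into a target schema."""
    return [{"sensor_id": int(r["sensor_id"]),
             "reading": float(r["reading"]),
             "ts": r["timestamp"].strip()} for r in rows]

def load(rows, db_path):
    """Load: write the transformed rows into a storage solution."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS readings "
                "(sensor_id INTEGER, reading REAL, ts TEXT)")
    con.executemany("INSERT INTO readings VALUES (:sensor_id, :reading, :ts)",
                    rows)
    con.commit()
    con.close()

load(transform(extract(RAW_LOG)), WAREHOUSE)
```

Real pipelines have the same shape; they differ in scale – many sources, parallel loads, and orchestration around every step.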
Traditional Single Source of Truth: Data Warehouses – through ETL

Entire organizations centered around this ETL process!

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Integration (Transform) → Load → Data Warehouse: Source of Truth (Governed, Secure, Audited, Managed)

Extract or scrape from an API or log file, transform into a common schema/format, load in parallel to the "data warehouse".
ELT for Data Warehouses: A Newer Picture (e.g., Original Snowflake)

Load without doing a lot of transformation, with transformations done in SQL. Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Load → Data Warehouse: Source of Truth (Governed, Secure, Audited, Managed), with Transform happening inside the warehouse
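For contrast, a sketch of the same toy pipeline reordered as ELT, with SQLite again standing in for a cloud warehouse like Snowflake (file and column names are hypothetical): raw rows are loaded untouched, and the transform runs later as SQL inside the warehouse.

```python
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")   # stand-in for a cloud warehouse

# Load first: dump raw rows as-is into a staging table, no up-front wrangling.
con.execute("CREATE TABLE IF NOT EXISTS raw_readings "
            "(sensor_id TEXT, reading TEXT, ts TEXT)")
with open("sensor_log.csv", newline="") as f:
    rows = [(r["sensor_id"], r["reading"], r["timestamp"])
            for r in csv.DictReader(f)]
con.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", rows)

# Transform later, in SQL, inside the warehouse itself.
con.execute("""
    CREATE TABLE IF NOT EXISTS readings AS
    SELECT CAST(sensor_id AS INTEGER) AS sensor_id,
           CAST(reading AS REAL)      AS reading,
           TRIM(ts)                   AS ts
    FROM raw_readings
""")
con.commit()
```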
From Warehouses → … Lakes??? 💦

Data Warehouses are expensive
● Warehouses expect some degree of structure
● Transformation is costly – not necessarily just compute time, but engineering time

What about skipping the "data warehouse" entirely?
No Loading! Just "dump" the data in
Let's be sloppy!
Let's be …agile…
Enter the data lake

[Editorial note: Understand, but do not try to make too much sense of why these terms came to be. Often just marketing…]
ET? For Data Lakes? (joke)

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake → Transform (Data Preparation) → Use-Case-Specific (Fit for purpose, Self-Service)

● No need to "load/manage" data
● Data is dumped in cheaply and massaged as needed for various use cases
● Usually code-centric (Spark)
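A sketch of that code-centric style using PySpark; the lake paths, columns, and aggregation are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Data was "dumped" into the lake as-is; read it only when a use case needs it.
events = spark.read.json("s3://my-lake/raw/events/")   # hypothetical lake path

# Massage on demand: each use case applies its own transform over the raw files.
daily = (events
         .filter(F.col("event_type") == "purchase")
         .groupBy("user_id")
         .count())

# Results are written back to the lake as new files, cataloged for reuse.
daily.write.mode("overwrite").parquet("s3://my-lake/derived/daily_purchases/")
```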
Why go through all this trouble?

Once data is "lost" (i.e., not saved, deleted, etc.) it cannot be recovered. So record everything.

Recreating history is exceptionally hard. Can't predict when a particular measurement will be crucial to understanding some situation.

How do you know something improved if you cannot measure change?
The Two Extremes

Data Warehouse, 1990s Data Lake, 2010s


● “Single source of truthˮ: A central, ● Emerged during Hadoop/Spark
organized repository of data used revolution
for analytics throughout an ● “Landing zoneˮ: unconstrained
enterprise. storage for any and all data
● Design the uber-schema up-front ● Data is then analyzed on demand
of all of the rectangular tables ● Extract into files/storage
youʼd ever want. ● Load into storage (easy!)
● Extract from trusted sources ● Transform on demand for any use.
● Transform to warehouse schema ○ Create new files in the lake,
using custom tools catalog files as they go for
● Load data warehouse reuse
● Old school ETL solution: ○ Often code-centric
Informatica
Modern solution is likely Many-to-Many: ETLT

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake and/or Data Warehouse, with Transform steps before, between, and after, feeding Data Preparation / Integration → Use-Case-Specific (Fit for purpose, Self-Service) and Source of Truth (Governed, Secure, Audited, Managed)

● Some datasets may directly be loaded into a data warehouse
● Sometimes start with a data lake
● Empower data scientists to work on ad-hoc use cases
● Allow for datasets that "graduate" to a carefully managed warehouse

This class will focus a lot on T: Transform
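A small sketch of that "graduation" path under assumed names: ad-hoc transforms over a lake file with pandas, then one more transform and a load into a managed store (SQLite as a stand-in warehouse) once the dataset proves useful.

```python
import sqlite3
import pandas as pd

# Ad-hoc work starts in the lake: read a raw file and explore/transform freely.
raw = pd.read_csv("lake/experiments/run42.csv")        # hypothetical lake file
cleaned = (raw.dropna(subset=["measurement"])
              .rename(columns={"measurement": "value"}))

# When the dataset proves useful, it "graduates": transform once more into the
# warehouse's governed schema and load it there.
con = sqlite3.connect("warehouse.db")
cleaned.to_sql("experiment_run42", con, if_exists="replace", index=False)
```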
A Modern Buzzword for the Modern Solution: A Data Lakehouse (2020s)
…but that was just the beginning….

As we move away from a "managed" data warehouse, there are other considerations we need to worry about …
Really, really important considerations

[Diagram] The same Raw Data → Data Lake / Data Warehouse pipeline, now with two added components: Data Discovery & Assessment and Data Quality & Integrity
Important considerations
Data Discovery, Data Assessment
● Ad-Hoc: End-users land data, explore it, label it
● Systematic: Crawl/index the data lake for files (see the sketch below)
  ○ E.g., for CSV/JSON
● Very content-centric: really a form of analytics/prediction
  ○ Try to figure out what type of data you have.
● AI + People!
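A sketch of the systematic flavor, assuming a local directory stands in for the lake: crawl it and build a rough catalog by sniffing CSV headers and JSON keys.

```python
import json
import pathlib

def crawl(lake_root):
    """Systematic discovery: index lake files and guess what each one holds."""
    catalog = []
    for path in pathlib.Path(lake_root).rglob("*"):
        if path.suffix == ".csv":
            with path.open() as f:
                header = f.readline().strip().split(",")
            catalog.append({"file": str(path), "kind": "csv",
                            "columns": header})
        elif path.suffix == ".json":
            with path.open() as f:
                first_line = f.readline()
            try:
                # Assume JSON-lines; record the keys of the first record.
                record = json.loads(first_line)
                keys = sorted(record) if isinstance(record, dict) else None
            except json.JSONDecodeError:
                keys = None
            catalog.append({"file": str(path), "kind": "json", "keys": keys})
    return catalog

# catalog = crawl("lake/")   # hypothetical lake root
```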
Data Quality & Integrity (sketched below)
● Boolean integrity checks
● Often specified by people, also "mined" by AI
● Data changes ALL the time, especially from clients.
● Enforced: can "reject" or "sequester" data that violates a check
  ○ e.g., no two products may have the same product ID!
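A minimal sketch of enforcing that last rule with pandas (the products table is invented): rows failing the boolean check are sequestered rather than silently loaded.

```python
import pandas as pd

# Invented products feed; the uniqueness rule comes from the slide's example.
products = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "name": ["mug", "pen", "pen (duplicate)", "notebook"],
})

# Boolean integrity check: no two products may share a product ID.
violates = products["product_id"].duplicated(keep=False)

accepted = products[~violates]      # loaded normally
sequestered = products[violates]    # held back ("rejected") for human review
```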
Don't forget about Metadata!

Data alone is not enough; we also need context. So we also need to capture metadata!

Application Metadata:
● Data entities (e.g. students, courses, employees for a university)
● Relationships between data
● Constraints
Behavioral Metadata:
● Data Lineage – where did it come from?
● Audit Trails of Usage – who ran this job, and what did it do?
Change Metadata:
● Version info for all the above
● Timestamps
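One way to picture all three kinds of metadata is a single record per dataset. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    # Application metadata: what the data is about.
    entity: str                                        # e.g., "students"
    constraints: list = field(default_factory=list)

    # Behavioral metadata: lineage and audit trail.
    derived_from: list = field(default_factory=list)   # upstream datasets
    audit_log: list = field(default_factory=list)      # (who, job, what)

    # Change metadata: versions and timestamps.
    version: int = 1
    updated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

meta = DatasetMetadata(entity="students",
                       constraints=["student_id is unique"],
                       derived_from=["raw/enrollments.csv"])
```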
Modern solutions

[Diagram] The same pipeline, now also with a Metadata Store alongside Data Discovery & Assessment and Data Quality & Integrity
Making things Dynamic: Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
● When do jobs kick off, and what do they do?
● How are tests registered, exceptions handled, people alerted?
● How do experiments "graduate" into processes?
Feedback: Every data "product" is of interest!
● Some are datasets in their own right. If you produce a table, that's also data!
● Many are new processes that generate new data feeds!
  ○ ML models: Constantly yielding predictions.
    ■ Compare old predictions to new predictions?
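A toy sketch of the operational loop: a job kicks off on a schedule, exceptions are handled, and a person is alerted on failure. Real deployments delegate this to a workflow orchestrator, but the shape is the same (all names here are hypothetical).

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def alert(message):
    """Stand-in for paging/alerting; a real system would notify a person."""
    logging.error("ALERT: %s", message)

def run_feed(job, name, interval_seconds=3600):
    """Kick off a job on a schedule; handle exceptions and alert on failure."""
    while True:
        try:
            job()
            logging.info("%s succeeded", name)
        except Exception as exc:     # a registered exception-handling path
            alert(f"{name} failed: {exc}")
        time.sleep(interval_seconds)

# run_feed(nightly_etl, "nightly-etl")   # nightly_etl is a hypothetical job
```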
Real life is messy

[Diagram] The full pipeline once more – Raw Data → Data Lake / Data Warehouse, with Data Discovery & Assessment, Data Quality & Integrity, and a Metadata Store all in the loop
