Intro To Data Engineering!
Welcome to Data Engineering!
April 10, 2025
Why Learn Data Engineering?
[2/4] Why Learn Data Engineering?
● You will learn the systems and infrastructure that enable these techniques.
● You’ll start thinking about efficiency and scalability, especially on large datasets.
● Various “plumbing analogies”: data pipelines, data flows, …
All these Data Systems!!!
2024 MAD (ML/AI/Data) Landscape
So…what is Data 101 about?
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific data (fit for purpose, self-service)]
Traditional Single Source of Truth: Data Warehouses - through ETL
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform (Data Integration) → Load → Data Warehouse as the Source of Truth (Governed, Secure, Audited, Managed)]
Entire organizations are centered around this ETL process!
Extract (or scrape) from an API or log file, transform into a common schema/format, and load in parallel into the “data warehouse”.
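The extract, transform, and load steps can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: the sample log lines, the `events` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw log lines; a real pipeline would read files or call an API.
RAW_LOG = [
    "2025-04-10T09:00:00 sensor-1 21.5",
    "2025-04-10T09:01:00 sensor-2 19.0",
    "not a valid line",
]


def extract(lines):
    """Extract: split each raw line into fields, skipping malformed lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield parts


def transform(records):
    """Transform: coerce every record into one common schema."""
    for timestamp, sensor, reading in records:
        yield (timestamp, sensor, float(reading))


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (ts TEXT, sensor TEXT, reading REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(lines):
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(transform(extract(lines)), conn)
    return conn


warehouse = run_etl(RAW_LOG)
```

Note that the data is cleaned before it ever reaches the warehouse, so the malformed line never lands in the `events` table.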
ELT for Data Warehouses: A Newer Picture (e.g. Original Snowflake)
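[Diagram: raw data is Loaded into the Data Warehouse first; in ELT the Transform step runs afterwards, inside the warehouse]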
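A minimal ELT sketch, under stated assumptions: raw records are loaded untouched, and the transform runs later as SQL inside the warehouse. The table names (`raw_events`, `events_clean`) and the sample rows are hypothetical, and in-memory SQLite stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse

# Load: land the raw strings as-is, with no cleaning beforehand.
conn.execute("CREATE TABLE raw_events (sensor TEXT, reading TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("sensor-1", "21.5"), ("sensor-2", "n/a"), ("sensor-1", "22.5")],
)

# Transform: done afterwards, inside the warehouse, using SQL.
conn.execute(
    """
    CREATE TABLE events_clean AS
    SELECT sensor, CAST(reading AS REAL) AS reading
    FROM raw_events
    WHERE reading GLOB '[0-9]*'
    """
)

averages = conn.execute(
    "SELECT sensor, AVG(reading) FROM events_clean GROUP BY sensor"
).fetchall()
```

Because the raw table is kept, the transform can be rewritten and rerun later without extracting the data again.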
From Warehouses → … Lakes??? 💦
[Diagram: Data Lake. Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform (Data Preparation) → Use-Case-Specific data (fit for purpose, self-service)]
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform (Data Preparation / Integration) → Use-Case-Specific data (fit for purpose, self-service), plus a governed Source of Truth in the Data Warehouse]
Sometimes start with a data lake.
Allow for datasets that “graduate” to a carefully managed warehouse.
This class will focus a lot on T: Transform.
A Modern Buzzword for the Modern Solution: A Data Lakehouse (2020)
…but that was just the beginning….
[Diagram: Data Lake and Data Warehouse. Raw Data (Transactions, Sensors, Log Files, Experiments) passes through Data Discovery & Assessment and Data Quality & Integrity, then Data Preparation / Integration, producing Use-Case-Specific data (fit for purpose, self-service) and a Source of Truth in the Data Warehouse (Governed, Secure, Audited, Managed)]
Important considerations
Data Discovery, Data Assessment
● Ad-Hoc: End-users land data, explore it, label it
● Systematic: Crawl/index the data lake for files
○ E.g., for CSV/JSON
● Very content-centric: really a form of analytics/prediction
○ Try to figure out what type of data you have.
● AI People!
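The systematic approach (crawl the lake, then try to figure out what type of data each file holds) might look like the sketch below. It is illustrative only: the `crawl` and `guess_type` helpers and the sample files are hypothetical, type guessing is deliberately naive, and JSON files are assumed to hold a non-empty list of flat objects.

```python
import csv
import json
import tempfile
from pathlib import Path


def guess_type(values):
    """Guess a column type from sample string values."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def crawl(root):
    """Walk a directory and build a small catalog of CSV/JSON files."""
    catalog = {}
    for path in sorted(Path(root).rglob("*")):
        if path.suffix == ".csv":
            with path.open(newline="") as f:
                rows = list(csv.DictReader(f))
            columns = rows[0].keys() if rows else []
            catalog[path.name] = {
                col: guess_type([r[col] for r in rows]) for col in columns
            }
        elif path.suffix == ".json":
            # Assumption: the file is a non-empty list of flat objects.
            records = json.loads(path.read_text())
            catalog[path.name] = {
                key: type(value).__name__ for key, value in records[0].items()
            }
    return catalog


# Build a tiny throwaway "lake" so the sketch runs end to end.
with tempfile.TemporaryDirectory() as lake:
    Path(lake, "products.csv").write_text("product_id,price\n1,9.99\n2,4.50\n")
    Path(lake, "users.json").write_text('[{"name": "Ada", "age": 36}]')
    catalog = crawl(lake)
```

The resulting `catalog` maps each file name to its inferred column types, which is the kind of index an end user could search before opening any file.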
Data Quality & Integrity
● Boolean integrity checks
● Often specified by people, also “mined” by AI
● Data changes ALL the time, especially from clients.
● Enforced: can “reject” or “sequester” data that violates a check
○ e.g., no two products may have the same product ID!
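The product ID example can be written as a small Boolean check that sequesters violating records instead of dropping them silently. This is a sketch under assumptions: `enforce_unique` and the sample `products` list are hypothetical, and "first record wins" is just one possible policy.

```python
def enforce_unique(records, key):
    """Split records into accepted rows and sequestered rows.

    The first record seen for each key value is accepted; any later record
    that reuses the same key value violates the check and is sequestered.
    """
    seen = set()
    accepted, sequestered = [], []
    for record in records:
        if record[key] in seen:
            sequestered.append(record)
        else:
            seen.add(record[key])
            accepted.append(record)
    return accepted, sequestered


products = [
    {"product_id": 1, "name": "widget"},
    {"product_id": 2, "name": "gadget"},
    {"product_id": 1, "name": "duplicate widget"},
]
accepted, sequestered = enforce_unique(products, "product_id")
```

Keeping the sequestered rows around lets someone inspect them later and decide whether the data or the check is wrong.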
Don’t forget about Metadata!
Data alone is not enough: we also need context, which means capturing metadata!
Application Metadata:
● Data entities (e.g. students, courses, employees for a university)
● Relationships between data
● Constraints
Behavioral Metadata:
● Data Lineage – where did it come from?
● Audit Trails of Usage – who ran this job, and what did it do?
Change Metadata:
● Version info for all the above
● Timestamps
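One way to picture the three kinds of metadata is as fields on a single record attached to each dataset. The `DatasetMetadata` class below is a hypothetical sketch, not the schema of any particular metadata store; the field names and sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    # Application metadata
    entity: str
    constraints: list = field(default_factory=list)
    # Behavioral metadata
    lineage: list = field(default_factory=list)      # where the data came from
    audit_trail: list = field(default_factory=list)  # who ran what
    # Change metadata
    version: int = 1
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def record_job(self, user, action):
        """Log a job in the audit trail and bump the version and timestamp."""
        self.audit_trail.append({"user": user, "action": action})
        self.version += 1
        self.updated_at = datetime.now(timezone.utc)


meta = DatasetMetadata(
    entity="students",
    constraints=["student_id is unique"],
    lineage=["registrar_export.csv"],
)
meta.record_job(user="alice", action="deduplicated rows")
```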
Modern solutions
[Diagram: Data Lake and Data Warehouse with a Metadata Store. Raw Data (Transactions, Sensors, Log Files, Experiments) passes through Data Discovery & Assessment and Data Quality & Integrity, then Data Preparation / Integration, producing Use-Case-Specific data (fit for purpose, self-service) and a Source of Truth in the Data Warehouse (Governed, Secure, Audited, Managed)]
Making things Dynamic: Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
● When do jobs kick off, and what do they do?
● How are tests registered, exceptions handled, and people alerted?
● How do experiments “graduate” into processes?
Feedback: Every data “product” is of interest!
● Some are datasets in their own right. If you produce a table, that’s also data!
● Many are new processes that generate new data feeds!
○ ML models: Constantly yielding predictions.
■ Compare old predictions to new predictions?
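The operationalization questions (what a job does, how tests are registered, how exceptions are handled, who gets alerted) can be made concrete with a toy job runner. Everything here is a hypothetical sketch: `run_job`, the check names, and the `alerts` list are assumptions, and a real deployment would typically rely on a scheduler or orchestration tool.

```python
def run_job(name, job, checks, alert):
    """Run one job, apply its registered checks, and alert on any problem."""
    try:
        output = job()
    except Exception as exc:  # the job itself crashed
        alert(f"{name}: job failed with {exc!r}")
        return None
    for check_name, check in checks:
        if not check(output):
            alert(f"{name}: check '{check_name}' failed")
            return None
    return output


alerts = []  # stand-in for paging or emailing someone

# Hypothetical job: today's model predictions, which are themselves new data.
predictions = run_job(
    "daily_predictions",
    job=lambda: [0.2, 0.9, 1.4],
    checks=[
        ("non-empty", lambda out: len(out) > 0),
        ("probabilities in [0, 1]", lambda out: all(0 <= p <= 1 for p in out)),
    ],
    alert=alerts.append,
)
```

Here the value 1.4 fails the range check, so the output is withheld and one alert is recorded instead of a bad feed flowing downstream.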
Real life is messy
[Diagram: Data Lake and Data Warehouse with a Metadata Store. Raw Data (Transactions, Sensors, Log Files, Experiments), Data Discovery & Assessment, Data Quality & Integrity, Data Preparation / Integration, Use-Case-Specific data (fit for purpose, self-service), and a Source of Truth in the Data Warehouse (Governed, Secure, Audited, Managed)]