Intro To Data Engineering!

LECTURE 01

Welcome to Data Engineering!
April 10, 2025

Why Learn Data Engineering?

Data engineering is an essential ingredient of real-world data science projects.

● A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● The backbone, plumbing, or infrastructure that supports data science.

Data engineering is as essential as plumbing!

● When it works well, you don't realize it exists.
● When it doesn't, you'll really know.
Data Science vs. Data Engineering

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean "rectangular" shape and fitting in main memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by "popular" Machine Learning, e.g., leaderboards, Kaggle, …
A lot of data engineering must happen to support the conventional view!

Nowadays, Data Science also involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Messy (often non-rectangular), dynamic, and large datasets
● One team generates the data, another team consumes it
● Unclear and ill-defined objectives
● Necessary precursor to real-world data science & ML
● etc.
[1/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.

● Most of the time spent in real-world data science projects involves data engineering:
  ○ cleaning, moving, restructuring, processing, …
● Often underappreciated compared to other activities (e.g., ML).
[2/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.

“… 70% more open roles at


companies in data engineering as
compared to data science.ˮ
Mihail Eric, Jan 2021.[blog]

A new specialized job category:


● Data scientist: Use various techniques
in statistics & ML to process & analyze
data.
● Data engineer: Develops a robust and
(we wonʼt be dogmatic about
scalable set of data processing these industry-level role
tools/platforms. distinctions…)
8
[3/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

"ML code" is not only a small fraction of the system; it is also often simple: calls to standard libraries (sklearn, pytorch, etc.).

Sculley et al., SE4ML 2014 [google research].


Data Engineering is Essential in ML/AI

"Under the strong influence of the current AI hype, people try to plug in data that's dirty & full of gaps, that spans years while changing in format and meaning, that's not understood yet, that's structured in ways that don't make sense, and expect those tools to magically handle it."

Monica Rogati, 2017 [blog].
NEW Machine Learning Engineer

"ML Engineer": a specialization of data engineer focused on operationalizing ML.

"A need for a person that would reunite two warring parties… One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running… ...taking data scientists' code and making it more effective and scalable. …"

Tomasz Dudek, 2018 [blog].

Scalability will be an important focus for us!
[4/4] Why Learn Data Engineering?

1. Data science projects largely focus on data engineering.


2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms… but you are likely less familiar with systems.

● You will learn systems and the infrastructure that enables these techniques.
● You'll start thinking about efficiency and scalability, esp. on large datasets.
● Various "plumbing analogies": data pipelines, data flows, …
All these Data Systems!!!

[2024 MAD (ML/AI/Data) Landscape diagram]

So…what is Data 101 about?

Data systems is a difficult subject! There are many, many data systems – too many for us to cover.
● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use what type of system.

2023 MAD (ML/AI/Data) Landscape: blog, interactive


Demystifying Industry Jargon

Data systems are tools that support data engineering.

(the same VC who made the MAD Landscape diagram)
What you mostly learn at Berkeley (e.g., DATA 100)

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific (Fit for purpose, Self-Service)

Data preparation example: Research experiments

"Experts are close to the data and should be the ones extracting / analyzing"
Alternative picture, but more traditional enterprise

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Integration → Source of Truth (Governed, Secure, Audited, Managed) → Data Preparation → Use-Case-Specific (Fit for purpose, Self-Service)

Data Integration example: UC Berkeley data (contracts, student info, grants, etc…) – must be centrally managed

"Compute is expensive, data is precious"
How does this actually happen? E, T, and L

Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, …
Transform: Apply a series of rules or functions; wrangle data into schema(s)/format(s)
Load: Load data into a data storage solution
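To make E, T, and L concrete, here is a minimal sketch in Python. The CSV source, column names, and SQLite target are hypothetical stand-ins for real source systems and a real warehouse:

```python
import csv
import sqlite3

RAW_LOG = "sensor_log.csv"   # assumed raw source file
WAREHOUSE = "warehouse.db"   # SQLite standing in for a warehouse

def extract(path):
    """Extract: scrape raw rows out of a source system (here, a CSV log)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply rules/functions to wrangle rows into a target schema."""
    return [{"sensor_id": int(r["sensor_id"]),
             "reading": float(r["reading"]),
             "ts": r["timestamp"].strip()} for r in rows]

def load(rows, db_path):
    """Load: write the transformed rows into a storage solution."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS readings "
                "(sensor_id INTEGER, reading REAL, ts TEXT)")
    con.executemany("INSERT INTO readings VALUES (:sensor_id, :reading, :ts)",
                    rows)
    con.commit()
    con.close()

load(transform(extract(RAW_LOG)), WAREHOUSE)
```

Real pipelines have the same shape; they differ in scale – many sources, parallel loads, and orchestration around every step.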
Traditional Single Source of Truth: Data Warehouses – through ETL

Entire organizations centered around this ETL process!

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Integration (Transform) → Load → Data Warehouse: Source of Truth (Governed, Secure, Audited, Managed)

Extract or scrape from an API or log file, transform into a common schema/format, load in parallel to the "data warehouse".
ELT for Data Warehouses: A Newer Picture (e.g., Original Snowflake)

Load without doing a lot of transformation, with transformations done in SQL. Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Load → Data Warehouse: Source of Truth (Governed, Secure, Audited, Managed), with Transform happening inside the warehouse
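For contrast, a sketch of the same toy pipeline reordered as ELT, with SQLite again standing in for a cloud warehouse like Snowflake (file and column names are hypothetical): raw rows are loaded untouched, and the transform runs later as SQL inside the warehouse.

```python
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")   # stand-in for a cloud warehouse

# Load first: dump raw rows as-is into a staging table, no up-front wrangling.
con.execute("CREATE TABLE IF NOT EXISTS raw_readings "
            "(sensor_id TEXT, reading TEXT, ts TEXT)")
with open("sensor_log.csv", newline="") as f:
    rows = [(r["sensor_id"], r["reading"], r["timestamp"])
            for r in csv.DictReader(f)]
con.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", rows)

# Transform later, in SQL, inside the warehouse itself.
con.execute("""
    CREATE TABLE IF NOT EXISTS readings AS
    SELECT CAST(sensor_id AS INTEGER) AS sensor_id,
           CAST(reading AS REAL)      AS reading,
           TRIM(ts)                   AS ts
    FROM raw_readings
""")
con.commit()
```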
From Warehouses → … Lakes??? 💦

Data Warehouses are expensive
● Warehouses expect some degree of structure
● Transformation is costly – not necessarily just compute time, but engineering time

What about skipping the "data warehouse" entirely?
No Loading! Just "dump" the data in
Let's be sloppy!
Let's be …agile…
Enter the data lake

[Editorial note: Understand, but do not try to make too much sense of why these terms came to be. Often just marketing…]
ET? For Data Lakes? (joke)

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake → Transform (Data Preparation) → Use-Case-Specific (Fit for purpose, Self-Service)

● No need to "load/manage" data
● Data is dumped in cheaply and massaged as needed for various use cases
● Usually code-centric (Spark)
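A sketch of that code-centric style using PySpark; the lake paths, columns, and aggregation are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Data was "dumped" into the lake as-is; read it only when a use case needs it.
events = spark.read.json("s3://my-lake/raw/events/")   # hypothetical lake path

# Massage on demand: each use case applies its own transform over the raw files.
daily = (events
         .filter(F.col("event_type") == "purchase")
         .groupBy("user_id")
         .count())

# Results are written back to the lake as new files, cataloged for reuse.
daily.write.mode("overwrite").parquet("s3://my-lake/derived/daily_purchases/")
```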
Why go through all this trouble?

Once data is "lost" (i.e., not saved, deleted, etc.) it cannot be recovered. So record everything.

Recreating history is exceptionally hard. Can't predict when a particular measurement will be crucial to understanding some situation.

How do you know something improved if you cannot measure change?
The Two Extremes

Data Warehouse, 1990s Data Lake, 2010s


● “Single source of truthˮ: A central, ● Emerged during Hadoop/Spark
organized repository of data used revolution
for analytics throughout an ● “Landing zoneˮ: unconstrained
enterprise. storage for any and all data
● Design the uber-schema up-front ● Data is then analyzed on demand
of all of the rectangular tables ● Extract into files/storage
youʼd ever want. ● Load into storage (easy!)
● Extract from trusted sources ● Transform on demand for any use.
● Transform to warehouse schema ○ Create new files in the lake,
using custom tools catalog files as they go for
● Load data warehouse reuse
● Old school ETL solution: ○ Often code-centric
Informatica
Modern solution is likely Many-to-Many: ETLT

[Diagram] Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake and/or Data Warehouse, with Transform steps before, between, and after, feeding Data Preparation / Integration → Use-Case-Specific (Fit for purpose, Self-Service) and Source of Truth (Governed, Secure, Audited, Managed)

● Some datasets may directly be loaded into a data warehouse
● Sometimes start with a data lake
● Empower data scientists to work on ad-hoc use cases
● Allow for datasets that "graduate" to a carefully managed warehouse

This class will focus a lot on T: Transform
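A small sketch of that "graduation" path under assumed names: ad-hoc transforms over a lake file with pandas, then one more transform and a load into a managed store (SQLite as a stand-in warehouse) once the dataset proves useful.

```python
import sqlite3
import pandas as pd

# Ad-hoc work starts in the lake: read a raw file and explore/transform freely.
raw = pd.read_csv("lake/experiments/run42.csv")        # hypothetical lake file
cleaned = (raw.dropna(subset=["measurement"])
              .rename(columns={"measurement": "value"}))

# When the dataset proves useful, it "graduates": transform once more into the
# warehouse's governed schema and load it there.
con = sqlite3.connect("warehouse.db")
cleaned.to_sql("experiment_run42", con, if_exists="replace", index=False)
```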
A Modern Buzzword for the Modern Solution: A Data Lakehouse (2020s)
…but that was just the beginning….

As we move away from a "managed" data warehouse, there are other considerations we need to worry about …
Really, really important considerations

[Diagram] The same Raw Data → Data Lake / Data Warehouse pipeline, now with two added components: Data Discovery & Assessment and Data Quality & Integrity
Important considerations
Data Discovery, Data Assessment
● Ad-Hoc: End-users land data, explore it, label it
● Systematic: Crawl/index the data lake for files (see the sketch below)
  ○ E.g., for CSV/JSON
● Very content-centric: really a form of analytics/prediction
  ○ Try to figure out what type of data you have.
● AI + People!
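A sketch of the systematic flavor, assuming a local directory stands in for the lake: crawl it and build a rough catalog by sniffing CSV headers and JSON keys.

```python
import json
import pathlib

def crawl(lake_root):
    """Systematic discovery: index lake files and guess what each one holds."""
    catalog = []
    for path in pathlib.Path(lake_root).rglob("*"):
        if path.suffix == ".csv":
            with path.open() as f:
                header = f.readline().strip().split(",")
            catalog.append({"file": str(path), "kind": "csv",
                            "columns": header})
        elif path.suffix == ".json":
            with path.open() as f:
                first_line = f.readline()
            try:
                # Assume JSON-lines; record the keys of the first record.
                record = json.loads(first_line)
                keys = sorted(record) if isinstance(record, dict) else None
            except json.JSONDecodeError:
                keys = None
            catalog.append({"file": str(path), "kind": "json", "keys": keys})
    return catalog

# catalog = crawl("lake/")   # hypothetical lake root
```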
Data Quality & Integrity (sketched below)
● Boolean integrity checks
● Often specified by people, also "mined" by AI
● Data changes ALL the time, especially from clients.
● Enforced: can "reject" or "sequester" data that violates a check
  ○ e.g., no two products may have the same product ID!
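A minimal sketch of enforcing that last rule with pandas (the products table is invented): rows failing the boolean check are sequestered rather than silently loaded.

```python
import pandas as pd

# Invented products feed; the uniqueness rule comes from the slide's example.
products = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "name": ["mug", "pen", "pen (duplicate)", "notebook"],
})

# Boolean integrity check: no two products may share a product ID.
violates = products["product_id"].duplicated(keep=False)

accepted = products[~violates]      # loaded normally
sequestered = products[violates]    # held back ("rejected") for human review
```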
Don't forget about Metadata!

Data alone is not enough; we also need context. So we also need to capture metadata!

Application Metadata:
● Data entities (e.g. students, courses, employees for a university)
● Relationships between data
● Constraints
Behavioral Metadata:
● Data Lineage – where did it come from?
● Audit Trails of Usage – who ran this job, and what did it do?
Change Metadata:
● Version info for all the above
● Timestamps
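One way to picture all three kinds of metadata is a single record per dataset. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    # Application metadata: what the data is about.
    entity: str                                        # e.g., "students"
    constraints: list = field(default_factory=list)

    # Behavioral metadata: lineage and audit trail.
    derived_from: list = field(default_factory=list)   # upstream datasets
    audit_log: list = field(default_factory=list)      # (who, job, what)

    # Change metadata: versions and timestamps.
    version: int = 1
    updated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

meta = DatasetMetadata(entity="students",
                       constraints=["student_id is unique"],
                       derived_from=["raw/enrollments.csv"])
```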
Modern solutions

[Diagram] The same pipeline, now also with a Metadata Store alongside Data Discovery & Assessment and Data Quality & Integrity
Making things Dynamic: Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
● When do jobs kick off, and what do they do?
● How are tests registered, exceptions handled, people alerted?
● How do experiments "graduate" into processes?
Feedback: Every data "product" is of interest!
● Some are datasets in their own right. If you produce a table, that's also data!
● Many are new processes that generate new data feeds!
  ○ ML models: Constantly yielding predictions.
    ■ Compare old predictions to new predictions?
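A toy sketch of the operational loop: a job kicks off on a schedule, exceptions are handled, and a person is alerted on failure. Real deployments delegate this to a workflow orchestrator, but the shape is the same (all names here are hypothetical).

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def alert(message):
    """Stand-in for paging/alerting; a real system would notify a person."""
    logging.error("ALERT: %s", message)

def run_feed(job, name, interval_seconds=3600):
    """Kick off a job on a schedule; handle exceptions and alert on failure."""
    while True:
        try:
            job()
            logging.info("%s succeeded", name)
        except Exception as exc:     # a registered exception-handling path
            alert(f"{name} failed: {exc}")
        time.sleep(interval_seconds)

# run_feed(nightly_etl, "nightly-etl")   # nightly_etl is a hypothetical job
```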
Real life is messy

[Diagram] The full pipeline once more – Raw Data → Data Lake / Data Warehouse, with Data Discovery & Assessment, Data Quality & Integrity, and a Metadata Store all in the loop
