Data Engineering Life Cycle
11/17/2024 1
Life Cycle
01. Data Engineering Life Cycle
04. Thank You, Q & A
Data Engineering & Data science
Life Cycle - Generation
• Characteristics of the application (source system: application DB, stream, e.g., an IoT swarm)
• How is data persisted in the source system?
• At what rate is data generated? How many events per second? How many gigabytes per hour?
• What level of consistency can data engineers expect from the output data? If you're running data-quality checks against the output data, how often do data inconsistencies occur: nulls where they aren't expected, lousy formatting, and so on?
• How often do errors occur?
Life Cycle - Generation: Key Considerations
• Will the data contain duplicates?
• Will some data values arrive late, possibly much later than other messages produced simultaneously?
• What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
• If the schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
• How frequently should data be pulled from the source system?
• For stateful systems (e.g., account information), is data provided as periodic snapshots or update events from change data capture (CDC)?
• Who/what is the data provider that will transmit the data for downstream consumption?
• Will reading from a data source impact its performance?
• Does the source system have upstream data dependencies? What are the characteristics of these
upstream systems?
• Are data-quality checks in place to check for late or missing data?
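The last question above can be made concrete with a small check. Below is a minimal, self-contained Python sketch of a data-quality pass over ingested events; the event fields (`id`, `value`, `event_time`) and the 30-minute lateness threshold are illustrative assumptions, not properties of any particular source system.

```python
from datetime import datetime, timedelta

# Hypothetical sample of ingested events; field names are assumptions.
events = [
    {"id": 1, "value": "42", "event_time": datetime(2024, 11, 17, 10, 0, 0)},
    {"id": 2, "value": None, "event_time": datetime(2024, 11, 17, 10, 0, 5)},
    {"id": 2, "value": "42", "event_time": datetime(2024, 11, 17, 10, 0, 5)},  # duplicate id
    {"id": 3, "value": "17", "event_time": datetime(2024, 11, 17, 9, 0, 0)},   # late arrival
]

def quality_report(events, ingest_time, late_threshold=timedelta(minutes=30)):
    """Count nulls, duplicate ids, and late-arriving events in one pass."""
    seen, nulls, dups, late = set(), 0, 0, 0
    for e in events:
        if e["value"] is None:
            nulls += 1
        if e["id"] in seen:
            dups += 1
        seen.add(e["id"])
        if ingest_time - e["event_time"] > late_threshold:
            late += 1
    return {"nulls": nulls, "duplicates": dups, "late": late}

report = quality_report(events, ingest_time=datetime(2024, 11, 17, 10, 1, 0))
```

In a real pipeline these counts would feed monitoring and alerting rather than a return value.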
Storage – Data Architecture
• Is this storage solution compatible with the architecture’s required write and read speeds?
• Will storage create a bottleneck for downstream processes?
• Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
• Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total
available storage, read operation rate, write volume, etc.
• Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
• Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant
impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and
institutional knowledge to streamline future projects and architecture changes.
• Is this a pure storage solution (object storage), or does it support complex query patterns (e.g., a cloud data warehouse)?
• Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data
warehouse)?
• How are you tracking master data, golden records, data quality, and data lineage for data governance?
• How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical
locations but not others?
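As one illustration of tracking schema evolution, the sketch below diffs two versions of a table schema so a change (a new column, say) can be logged and communicated to downstream stakeholders. The table, column names, and type strings are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} schemas and report added, removed,
    and type-changed columns."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c]) for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}

# Two hypothetical versions of an orders table schema.
v1 = {"order_id": "int", "amount": "float"}
v2 = {"order_id": "int", "amount": "float", "currency": "str"}
delta = diff_schema(v1, v2)  # {'added': {'currency': 'str'}, 'removed': {}, 'changed': {}}
```

Recording such deltas over time is one cheap way to build the schema-evolution metadata the bullet above asks about.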
Ingestion – Batch vs Stream
• What are the use cases for the data I’m ingesting? Can I reuse this data rather than create
multiple versions of the same dataset?
• Are the systems generating and ingesting this data reliably, and is the data available when I
need it?
• What is the data destination after ingestion?
• How frequently will I need to access the data?
• In what volume will the data typically arrive?
• What format is the data in? Can my downstream storage and transformation systems handle
this format?
• Is the source data in good shape for immediate downstream use? If so, for how long, and what
may cause it to be unusable?
• If the data is from a streaming source, does it need to be transformed before reaching its
destination? Would an in-flight transformation be appropriate, where the data is transformed
within the stream itself?
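The format question above can be sketched as a pre-load validation step: check that an incoming CSV carries the columns the downstream systems expect before ingesting it. The required column set is an assumed contract, not a real one.

```python
import csv
import io

REQUIRED_COLUMNS = {"order_id", "amount"}  # assumed downstream contract

def ingest_csv(raw):
    """Validate an incoming CSV against the expected columns, then parse rows."""
    reader = csv.DictReader(io.StringIO(raw))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"source missing required columns: {missing}")
    return list(reader)

rows = ingest_csv("order_id,amount\n1,9.99\n2,15.50\n")
```

Failing fast here is cheaper than discovering a format mismatch in a downstream transformation.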
Ingestion – Considerations for Streaming Data
• If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
• Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data,
say, every minute?
• What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in
real time, what actions can I take on that data that would be an improvement upon batch?
• Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than
simply doing batch?
• Are my streaming pipeline and system reliable and redundant if infrastructure fails?
• What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub,
Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage
it? What are the costs and trade-offs?
• If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
• Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?
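The micro-batch option mentioned above can be sketched in a few lines: accumulate events and flush them as one batch once the interval elapses. Timestamps are passed in explicitly to keep the example deterministic; a real pipeline would use a wall clock or the stream's event time.

```python
class MicroBatcher:
    """Accumulate streaming events and flush them as a batch every
    `interval` time units (timestamps supplied explicitly here)."""

    def __init__(self, interval):
        self.interval = interval
        self.buffer = []
        self.batches = []
        self.window_start = None

    def add(self, ts, event):
        if self.window_start is None:
            self.window_start = ts
        if ts - self.window_start >= self.interval:
            self.flush()
            self.window_start = ts
        self.buffer.append(event)

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []

b = MicroBatcher(interval=60)
for ts, ev in [(0, "a"), (10, "b"), (65, "c"), (130, "d")]:
    b.add(ts, ev)
b.flush()  # b.batches == [["a", "b"], ["c"], ["d"]]
```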
Ingestion – Tumbling Window
• Fixed intervals.
• Windows don't overlap.
• An event belongs to exactly one window.
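Tumbling-window assignment is essentially integer division. The sketch below assumes integer timestamps and half-open `[start, end)` windows.

```python
def tumbling_window(ts, size):
    """Map an integer timestamp to its single fixed-size, non-overlapping
    window, returned as a half-open (start, end) interval."""
    start = (ts // size) * size
    return (start, start + size)

assert tumbling_window(7, 5) == (5, 10)    # events at t=5..9 share one window
assert tumbling_window(10, 5) == (10, 15)  # t=10 starts the next window
```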
Ingestion – Hopping Window
• Fixed intervals.
• Windows can overlap.
• An event can belong to multiple windows at the same time.
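Under the same assumptions (integer timestamps, half-open windows, window starts aligned to multiples of the hop), hopping-window assignment looks like this; when the hop is smaller than the window size, one event lands in several windows.

```python
def hopping_windows(ts, size, hop):
    """Return every window [start, start + size) that contains the integer
    timestamp ts, with starts aligned to multiples of `hop`."""
    k_min = max(0, -(-(ts - size + 1) // hop))  # ceil((ts - size + 1) / hop)
    k_max = ts // hop
    return [(k * hop, k * hop + size) for k in range(k_min, k_max + 1)]

# size 10, hop 5: the event at t=7 falls into two overlapping windows
windows = hopping_windows(7, size=10, hop=5)
assert windows == [(0, 10), (5, 15)]
```

With `hop == size` this degenerates to the tumbling case, one window per event.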
Ingestion – Session Window
• Closed by a timeout (inactivity gap).
• Bounded by a max duration.
• No overlap; an event belongs to exactly one window.
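Session windows depend on the data rather than the clock. The sketch below groups sorted timestamps using the timeout and max-duration rules described above; time units are arbitrary.

```python
def session_windows(timestamps, timeout, max_duration):
    """Group sorted timestamps into sessions: a new session starts when the
    gap since the previous event exceeds `timeout`, or the session would
    exceed `max_duration`. Sessions never overlap."""
    sessions = []
    for ts in sorted(timestamps):
        if (sessions
                and ts - sessions[-1][-1] <= timeout
                and ts - sessions[-1][0] <= max_duration):
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# A gap of 40 exceeds timeout=30 and splits the stream into two sessions.
result = session_windows([0, 10, 50, 60], timeout=30, max_duration=100)
assert result == [[0, 10], [50, 60]]
```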
Transformation – Data Pipelines
• Data wrangling
• Data integration
Transformation – ETL Pipelines
Transformation – ETL VS ELT
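To make the ETL side of the comparison concrete, here is a minimal sketch: rows are cleaned and typed before loading. An in-memory SQLite database stands in for the warehouse, and all data is invented. In ELT, by contrast, the raw rows would be loaded first and transformed inside the warehouse with SQL.

```python
import sqlite3

# Hypothetical raw extract: inconsistent casing, whitespace, string-typed numbers.
raw = [("1", " alice ", "9.99"), ("2", "BOB", "15.50")]

# Transform *before* loading (the T precedes the L in ETL):
# cast types and normalize names outside the target system.
clean = [(int(i), name.strip().lower(), float(amt)) for i, name, amt in raw]

# Load into the target; an in-memory SQLite DB stands in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

ELT trades this pre-load cleanup for the warehouse's own compute, which is why it pairs naturally with cloud data warehouses.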
Transformation – Medallion Architecture
Serving
Serving - Analytics
Serving – Machine Learning
1. Is the data of sufficient quality to perform reliable feature engineering?
2. Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
3. Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
4. Does the dataset properly represent ground truth? Is it unfairly biased?
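The ground-truth question can be probed with a very simple first check: the class balance of the training labels. This is only one crude signal of possible bias, and the labels below are invented for illustration.

```python
from collections import Counter

def label_balance(labels):
    """Fraction of each class in a labeled dataset; a heavily skewed split
    is one quick signal the data may not represent ground truth fairly."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

balance = label_balance(["fraud", "ok", "ok", "ok"])  # {'fraud': 0.25, 'ok': 0.75}
```

Real fairness auditing goes much further (per-group error rates, representativeness of features), but class balance is a cheap place to start the conversation between data engineering and ML engineering.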
Serving - Reverse ETL
Under Currents of Data Engineering
Under Currents - Security
• Logging, monitoring, and alerting
• Usage of security vaults
• Always back up your data
Under Currents – Data Management
• Data governance
• Data accountability
• Data discoverability
• Data lineage
• Data quality
• Data life cycle management
• Ethics and privacy
Under Currents – Data Ops
• Automation
• Observation and monitoring
• Incident response
Under Currents – Data Architecture
• Operational architecture: the "what"
• Technical architecture: the "how"
Well-Architected framework pillars (AWS, Google Cloud):
• Operational excellence
• Security
• Reliability
• Performance efficiency
• Cost optimization
• Sustainability
Principles of Data Engineering Architecture
Under Currents – Software Engineering
Best Practices - Conclusions
• Practice modularity.
• Follow proper naming conventions and proper documentation.
• Strive for easy-to-maintain code.
• Use common data wrangling design patterns.
• Select the right tool for the use case.
• Build a scalable data pipeline architecture.
• Ensure the reliability of your data pipeline.
• Follow general coding principles: DRY (Don't Repeat Yourself), KISS (Keep It Simple, Stupid), functional programming.
• Set a security policy for data.
• Optimize cloud costs.
• Follow CI/CD and implement version control.
• Manage incidents efficiently.
• Automate data pipelines and monitoring.
• Focus on business value.
• Avoid data duplicates with idempotent pipelines.
• Track pipeline metadata for easier debugging.
• Use workflow management (Apache Airflow).
• Ensure data quality.
• Implement thorough testing.
• Optimize heavy computational tasks.
• Embrace DataOps.
• Use standard data transformation patterns.
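The idempotent-pipeline practice can be sketched with a toy sink that upserts records by a natural key, so replaying a batch after a failure does not duplicate rows. The record shape is hypothetical.

```python
class IdempotentSink:
    """Toy sink that keys each record by a natural id, so re-running a
    pipeline (e.g., after a retry) overwrites instead of duplicating rows."""

    def __init__(self):
        self.rows = {}

    def upsert(self, record):
        self.rows[record["id"]] = record

sink = IdempotentSink()
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
for _ in range(2):          # the same batch replayed twice, as after a failure
    for r in batch:
        sink.upsert(r)
assert len(sink.rows) == 2  # no duplicates despite the replay
```

In a real warehouse the same effect comes from `MERGE`/upsert statements or partition overwrites keyed on the run's logical date.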
Thank You.
Q & A
https://eg.linkedin.com/in/hebied