Data Engineering Life Cycle
11/17/2024 1
Life Cycle
01. Data Engineering Life Cycle
04. Thank You, Q & A
Data Engineering & Data science
Life Cycle - Generation
• Characteristics of the application (source system: application DB, stream, e.g., an IoT swarm)
• How is data persisted in the source system?
• At what rate is data generated? How many events per second? How many gigabytes per hour?
• What level of consistency can data engineers expect from the output data? If you're running data-quality checks against the output data, how often do data inconsistencies occur: nulls where they aren't expected, lousy formatting, and so on?
• How often do errors occur?
Life Cycle - Generation: Key Considerations
• Will the data contain duplicates?
• Will some data values arrive late, possibly much later than other messages produced simultaneously?
• What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
• If the schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
• How frequently should data be pulled from the source system?
• For stateful systems (e.g., account information), is data provided as periodic snapshots or update events from change data capture (CDC)?
• Who/what is the data provider that will transmit the data for downstream consumption?
• Will reading from a data source impact its performance?
• Does the source system have upstream data dependencies? What are the characteristics of these
upstream systems?
• Are data-quality checks in place to check for late or missing data?
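The last question above can be made concrete with a small check. Below is a minimal, self-contained Python sketch of a data-quality pass over ingested events; the event fields (`id`, `value`, `event_time`) and the 30-minute lateness threshold are illustrative assumptions, not properties of any particular source system.

```python
from datetime import datetime, timedelta

# Hypothetical sample of ingested events; field names are assumptions.
events = [
    {"id": 1, "value": "42", "event_time": datetime(2024, 11, 17, 10, 0, 0)},
    {"id": 2, "value": None, "event_time": datetime(2024, 11, 17, 10, 0, 5)},
    {"id": 2, "value": "42", "event_time": datetime(2024, 11, 17, 10, 0, 5)},  # duplicate id
    {"id": 3, "value": "17", "event_time": datetime(2024, 11, 17, 9, 0, 0)},   # late arrival
]

def quality_report(events, ingest_time, late_threshold=timedelta(minutes=30)):
    """Count nulls, duplicate ids, and late-arriving events in one pass."""
    seen, nulls, dups, late = set(), 0, 0, 0
    for e in events:
        if e["value"] is None:
            nulls += 1
        if e["id"] in seen:
            dups += 1
        seen.add(e["id"])
        if ingest_time - e["event_time"] > late_threshold:
            late += 1
    return {"nulls": nulls, "duplicates": dups, "late": late}

report = quality_report(events, ingest_time=datetime(2024, 11, 17, 10, 1, 0))
```

In a real pipeline these counts would feed monitoring and alerting rather than a return value.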
Storage – Data Architecture
• Is this storage solution compatible with the architecture’s required write and read speeds?
• Will storage create a bottleneck for downstream processes?
• Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
• Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total
available storage, read operation rate, write volume, etc.
• Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
• Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant
impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and
institutional knowledge to streamline future projects and architecture changes.
• Is this a pure storage solution (object storage), or does it support complex query patterns (e.g., a cloud data warehouse)?
• Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data
warehouse)?
• How are you tracking master data, golden records, data quality, and data lineage for data governance?
• How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical
locations but not others?
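As one illustration of tracking schema evolution, the sketch below diffs two versions of a table schema so a change (a new column, say) can be logged and communicated to downstream stakeholders. The table, column names, and type strings are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} schemas and report added, removed,
    and type-changed columns."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c]) for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}

# Two hypothetical versions of an orders table schema.
v1 = {"order_id": "int", "amount": "float"}
v2 = {"order_id": "int", "amount": "float", "currency": "str"}
delta = diff_schema(v1, v2)  # {'added': {'currency': 'str'}, 'removed': {}, 'changed': {}}
```

Recording such deltas over time is one cheap way to build the schema-evolution metadata the bullet above asks about.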
Ingestion – Batch vs Stream
• What are the use cases for the data I’m ingesting? Can I reuse this data rather than create
multiple versions of the same dataset?
• Are the systems generating and ingesting this data reliably, and is the data available when I
need it?
• What is the data destination after ingestion?
• How frequently will I need to access the data?
• In what volume will the data typically arrive?
• What format is the data in? Can my downstream storage and transformation systems handle
this format?
• Is the source data in good shape for immediate downstream use? If so, for how long, and what
may cause it to be unusable?
• If the data is from a streaming source, does it need to be transformed before reaching its
destination? Would an in-flight transformation be appropriate, where the data is transformed
within the stream itself?
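The format question above can be sketched as a pre-load validation step: check that an incoming CSV carries the columns the downstream systems expect before ingesting it. The required column set is an assumed contract, not a real one.

```python
import csv
import io

REQUIRED_COLUMNS = {"order_id", "amount"}  # assumed downstream contract

def ingest_csv(raw):
    """Validate an incoming CSV against the expected columns, then parse rows."""
    reader = csv.DictReader(io.StringIO(raw))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"source missing required columns: {missing}")
    return list(reader)

rows = ingest_csv("order_id,amount\n1,9.99\n2,15.50\n")
```

Failing fast here is cheaper than discovering a format mismatch in a downstream transformation.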
Ingestion – Considerations for Streaming Data
• If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
• Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data,
say, every minute?
• What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in
real time, what actions can I take on that data that would be an improvement upon batch?
• Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than
simply doing batch?
• Are my streaming pipeline and system reliable and redundant if infrastructure fails?
• What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub,
Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage
it? What are the costs and trade-offs?
• If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
• Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?
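The micro-batch option mentioned above can be sketched in a few lines: accumulate events and flush them as one batch once the interval elapses. Timestamps are passed in explicitly to keep the example deterministic; a real pipeline would use a wall clock or the stream's event time.

```python
class MicroBatcher:
    """Accumulate streaming events and flush them as a batch every
    `interval` time units (timestamps supplied explicitly here)."""

    def __init__(self, interval):
        self.interval = interval
        self.buffer = []
        self.batches = []
        self.window_start = None

    def add(self, ts, event):
        if self.window_start is None:
            self.window_start = ts
        if ts - self.window_start >= self.interval:
            self.flush()
            self.window_start = ts
        self.buffer.append(event)

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []

b = MicroBatcher(interval=60)
for ts, ev in [(0, "a"), (10, "b"), (65, "c"), (130, "d")]:
    b.add(ts, ev)
b.flush()  # b.batches == [["a", "b"], ["c"], ["d"]]
```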
Ingestion – Tumbling Window
• Fixed intervals.
• Windows don't overlap.
• An event belongs to exactly one window.
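Tumbling-window assignment is essentially integer division. The sketch below assumes integer timestamps and half-open `[start, end)` windows.

```python
def tumbling_window(ts, size):
    """Map an integer timestamp to its single fixed-size, non-overlapping
    window, returned as a half-open (start, end) interval."""
    start = (ts // size) * size
    return (start, start + size)

assert tumbling_window(7, 5) == (5, 10)    # events at t=5..9 share one window
assert tumbling_window(10, 5) == (10, 15)  # t=10 starts the next window
```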
Ingestion – Hopping Window
• Fixed intervals.
• Windows can overlap.
• An event can belong to multiple windows at the same time.
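Under the same assumptions (integer timestamps, half-open windows, window starts aligned to multiples of the hop), hopping-window assignment looks like this; when the hop is smaller than the window size, one event lands in several windows.

```python
def hopping_windows(ts, size, hop):
    """Return every window [start, start + size) that contains the integer
    timestamp ts, with starts aligned to multiples of `hop`."""
    k_min = max(0, -(-(ts - size + 1) // hop))  # ceil((ts - size + 1) / hop)
    k_max = ts // hop
    return [(k * hop, k * hop + size) for k in range(k_min, k_max + 1)]

# size 10, hop 5: the event at t=7 falls into two overlapping windows
windows = hopping_windows(7, size=10, hop=5)
assert windows == [(0, 10), (5, 15)]
```

With `hop == size` this degenerates to the tumbling case, one window per event.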
Ingestion – Session Window
• Closed by a timeout (inactivity gap).
• Bounded by a max duration.
• No overlap; an event belongs to exactly one window.
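Session windows depend on the data rather than the clock. The sketch below groups sorted timestamps using the timeout and max-duration rules described above; time units are arbitrary.

```python
def session_windows(timestamps, timeout, max_duration):
    """Group sorted timestamps into sessions: a new session starts when the
    gap since the previous event exceeds `timeout`, or the session would
    exceed `max_duration`. Sessions never overlap."""
    sessions = []
    for ts in sorted(timestamps):
        if (sessions
                and ts - sessions[-1][-1] <= timeout
                and ts - sessions[-1][0] <= max_duration):
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# A gap of 40 exceeds timeout=30 and splits the stream into two sessions.
result = session_windows([0, 10, 50, 60], timeout=30, max_duration=100)
assert result == [[0, 10], [50, 60]]
```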
Transformation – Data Pipelines
• Data wrangling
• Data integration
Transformation – ETL Pipelines
Transformation – ETL VS ELT
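To make the ETL side of the comparison concrete, here is a minimal sketch: rows are cleaned and typed before loading. An in-memory SQLite database stands in for the warehouse, and all data is invented. In ELT, by contrast, the raw rows would be loaded first and transformed inside the warehouse with SQL.

```python
import sqlite3

# Hypothetical raw extract: inconsistent casing, whitespace, string-typed numbers.
raw = [("1", " alice ", "9.99"), ("2", "BOB", "15.50")]

# Transform *before* loading (the T precedes the L in ETL):
# cast types and normalize names outside the target system.
clean = [(int(i), name.strip().lower(), float(amt)) for i, name, amt in raw]

# Load into the target; an in-memory SQLite DB stands in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

ELT trades this pre-load cleanup for the warehouse's own compute, which is why it pairs naturally with cloud data warehouses.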
Transformation – Medallion Architecture
Serving
Serving - Analytics
Serving – Machine Learning
1. Is the data of sufficient quality to perform reliable feature engineering?
2. Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
3. Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
4. Does the dataset properly represent ground truth? Is it unfairly biased?
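The ground-truth question can be probed with a very simple first check: the class balance of the training labels. This is only one crude signal of possible bias, and the labels below are invented for illustration.

```python
from collections import Counter

def label_balance(labels):
    """Fraction of each class in a labeled dataset; a heavily skewed split
    is one quick signal the data may not represent ground truth fairly."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

balance = label_balance(["fraud", "ok", "ok", "ok"])  # {'fraud': 0.25, 'ok': 0.75}
```

Real fairness auditing goes much further (per-group error rates, representativeness of features), but class balance is a cheap place to start the conversation between data engineering and ML engineering.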
Serving - Reverse ETL
Under Currents of Data Engineering
Under Currents - Security
• Logging, monitoring, and alerting
• Usage of security vaults
• Always back up your data
Under Currents – Data Management
• Data governance
• Data accountability
• Data discoverability
• Data lineage
• Data quality
• Data life cycle management
• Ethics and privacy
Under Currents – Data Ops
• Automation
• Observation and monitoring
• Incident response
Under Currents – Data Architecture
• Operational architecture: the "what"
• Technical architecture: the "how"
Well-Architected framework pillars (AWS, Google Cloud):
• Operational excellence
• Security
• Reliability
• Performance efficiency
• Cost optimization
• Sustainability
Principles of Data Engineering Architecture
Under Currents – Software Engineering
Best Practices - Conclusions
• Practice modularity.
• Follow proper naming conventions and proper documentation.
• Strive for easy-to-maintain code.
• Use common data wrangling design patterns.
• Select the right tool for the use case.
• Build a scalable data pipeline architecture.
• Ensure the reliability of your data pipeline.
• Follow general coding principles: DRY (Don't Repeat Yourself), KISS (Keep It Simple, Stupid), functional programming.
• Set a security policy for data.
• Optimize cloud costs.
• Follow CI/CD and implement version control.
• Manage incidents efficiently.
• Automate data pipelines and monitoring.
• Focus on business value.
• Avoid data duplicates with idempotent pipelines.
• Track pipeline metadata for easier debugging.
• Use workflow management (Apache Airflow).
• Ensure data quality.
• Implement thorough testing.
• Optimize heavy computational tasks.
• Embrace DataOps.
• Use standard data transformation patterns.
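The idempotent-pipeline practice can be sketched with a toy sink that upserts records by a natural key, so replaying a batch after a failure does not duplicate rows. The record shape is hypothetical.

```python
class IdempotentSink:
    """Toy sink that keys each record by a natural id, so re-running a
    pipeline (e.g., after a retry) overwrites instead of duplicating rows."""

    def __init__(self):
        self.rows = {}

    def upsert(self, record):
        self.rows[record["id"]] = record

sink = IdempotentSink()
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
for _ in range(2):          # the same batch replayed twice, as after a failure
    for r in batch:
        sink.upsert(r)
assert len(sink.rows) == 2  # no duplicates despite the replay
```

In a real warehouse the same effect comes from `MERGE`/upsert statements or partition overwrites keyed on the run's logical date.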
Thank You.
Q & A
https://eg.linkedin.com/in/hebied