
Data Engineering Architecture and Best Practices for Cloud Development

Hassan Ebied
Principal Technical Consultant @ SAS
https://eg.linkedin.com/in/hebied
Agenda

01  Data Engineering Life Cycle
02  Under Currents of Data Engineering
03  Best Practices for Building Resilient Data Pipelines
04  Q & A
Data Engineering & Data Science

• Data engineering sits upstream of data science.
• Hierarchy of data science needs.
Data Engineering Lifecycle
Life Cycle – Generation

• Application database as a source system.
• IoT swarm and message queue as source systems.
Life Cycle – Generation: Key Considerations

Characteristics of the source system (application database, stream, e.g., an IoT swarm):

• How is data persisted in the source system?
• At what rate is data generated? How many events per second? How many gigabytes per hour?
• What level of consistency can data engineers expect from the output data? If you’re running data-quality checks against the output data, how often do data inconsistencies occur: nulls where they aren’t expected, lousy formatting, etc.?
• How often do errors occur?
• Will the data contain duplicates?
• Will some data values arrive late, possibly much later than other messages produced simultaneously?
• What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
• If the schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
• How frequently should data be pulled from the source system?
• For stateful systems (e.g., account information), is data provided as periodic snapshots or as update events from change data capture (CDC)?
• Who/what is the data provider that will transmit the data for downstream consumption?
• Will reading from a data source impact its performance?
• Does the source system have upstream data dependencies? What are the characteristics of these upstream systems?
• Are data-quality checks in place to check for late or missing data? (See the sketch below.)
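The last question lends itself to automation. A minimal sketch of such a data-quality check in Python with pandas; the column names, lateness threshold, and sample data are hypothetical:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, ts_col: str = "event_time") -> dict:
    """Summarize common source-data issues: nulls, duplicates, late arrivals."""
    now = pd.Timestamp.now(tz="UTC")
    late = now - pd.to_datetime(df[ts_col], utc=True) > pd.Timedelta(hours=1)
    return {
        "rows": len(df),
        "null_counts": df.isna().sum().to_dict(),      # nulls where they aren't expected
        "duplicate_rows": int(df.duplicated().sum()),  # will the data contain duplicates?
        "late_events": int(late.sum()),                # values arriving much later than produced
    }

# Hypothetical sample batch with one duplicate row and one unexpected null.
events = pd.DataFrame({
    "id": [1, 1, 2],
    "event_time": ["2024-11-17T00:00:00Z"] * 3,
    "value": [10.0, 10.0, None],
})
print(quality_report(events))
```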
Storage – Data Architecture

• Relational DWH – early ages
• Data Lake – 2010
• Modern DWH – 2011
• Data Fabric – 2016
• Data Lakehouse – 2020
Storage

• Is this storage solution compatible with the architecture’s required write and read speeds?
• Will storage create a bottleneck for downstream processes?
• Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random-access updates in an object storage system? (This is an antipattern with significant performance overhead; see the sketch below.)
• Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
• Will downstream users and processes be able to retrieve data within the required service-level agreement (SLA)?
• Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
• Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
• Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
• How are you tracking master data, golden records, data quality, and data lineage for data governance?
• How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
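As a concrete illustration of avoiding the random-access-update antipattern: write immutable, partitioned files and replace whole partitions rather than updating objects in place. A minimal sketch with pandas and pyarrow; the path and column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2024-11-17", "2024-11-17", "2024-11-18"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Write date-partitioned Parquet files; in production the destination would be
# an object-store URI such as s3://... (object stores have no efficient
# random-access update, so whole partitions are rewritten instead).
df.to_parquet("events_parquet/", engine="pyarrow", partition_cols=["event_date"])
```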
Ingestion – Batch vs Stream

Definition
  Batch: processes large volumes of data collected over a period of time.
  Stream: processes data as it arrives, in near real time or real time.

Use Cases
  Batch: periodic reporting, data warehousing, ETL (Extract, Transform, Load).
  Stream: real-time monitoring, fraud detection, recommendation systems.

Tools/Frameworks
  Batch: Hadoop, Apache Spark (batch mode), AWS Batch, Google Dataflow (batch mode).
  Stream: Apache Kafka, Apache Flink, Apache Spark Streaming, AWS Kinesis, Google Dataflow.

Resource Efficiency
  Batch: can be optimized for throughput by processing large volumes at once.
  Stream: requires systems designed to handle constant data flow with minimal delays.

Typical Scenarios
  Batch: daily sales reporting, archiving logs for analysis, aggregation tasks.
  Stream: IoT data processing, stock market analytics, real-time notifications.

Data Size
  Batch: handles larger volumes of historical data.
  Stream: typically handles smaller, more recent data, but at high velocity.
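To make the contrast concrete, here is a toy Python sketch (illustrative only, not from the slides): the same aggregation computed once over a finished batch versus incrementally as each event arrives:

```python
from typing import Iterable, Iterator

events = [4, 7, 1, 9, 3]  # hypothetical event values

# Batch: wait until the whole dataset is collected, then process it at once.
def batch_total(collected: Iterable[int]) -> int:
    return sum(collected)

# Stream: update the result as each event arrives.
def streaming_totals(stream: Iterable[int]) -> Iterator[int]:
    total = 0
    for value in stream:
        total += value
        yield total  # a fresh result in near real time

print(batch_total(events))             # 24
print(list(streaming_totals(events)))  # [4, 11, 12, 21, 24]
```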
Ingestion – Key Considerations

• What are the use cases for the data I’m ingesting? Can I reuse this data rather than create multiple versions of the same dataset?
• Are the systems generating and ingesting this data reliably, and is the data available when I need it?
• What is the data destination after ingestion?
• How frequently will I need to access the data?
• In what volume will the data typically arrive?
• What format is the data in? Can my downstream storage and transformation systems handle this format? (See the sketch below.)
• Is the source data in good shape for immediate downstream use? If so, for how long, and what may cause it to be unusable?
• If the data is from a streaming source, does it need to be transformed before reaching its destination? Would an in-flight transformation be appropriate, where the data is transformed within the stream itself?
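The format question is often enforced right at the ingestion boundary. A minimal validation sketch; the required fields and sample record are hypothetical:

```python
import json

REQUIRED_FIELDS = {"id", "event_time", "payload"}  # hypothetical contract

def validate_record(raw: str) -> dict:
    """Reject records the downstream systems can't handle."""
    record = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record missing fields: {sorted(missing)}")
    return record

print(validate_record('{"id": 1, "event_time": "2024-11-17T00:00:00Z", "payload": {}}'))
```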
Ingestion – Considerations for Streaming Data

• If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
• Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute? (See the sketch below.)
• What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
• Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch?
• Are my streaming pipeline and system reliable and redundant if infrastructure fails?
• What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs?
• If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
• Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?
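As a sketch of the micro-batch option (accumulate, then ingest roughly every minute): `consume` and `write_batch` are hypothetical stand-ins for a streaming source and a batch sink:

```python
import time

BATCH_WINDOW_SECONDS = 60  # flush roughly once a minute; arbitrary example

def micro_batch_loop(consume, write_batch):
    """Accumulate streamed records and flush them as small batches.

    Flushes on the first record that arrives after the window elapses.
    """
    buffer, deadline = [], time.monotonic() + BATCH_WINDOW_SECONDS
    for record in consume():            # hypothetical streaming source
        buffer.append(record)
        if time.monotonic() >= deadline:
            write_batch(buffer)         # hypothetical batch sink
            buffer = []
            deadline = time.monotonic() + BATCH_WINDOW_SECONDS
    if buffer:                          # flush the tail when the stream ends
        write_batch(buffer)
```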
Ingestion – Tumbling Window

• Fixed intervals.
• Windows don’t overlap.
• An event belongs to exactly one window (see the sketch below).
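Assigning an event to its tumbling window is simple arithmetic; a minimal sketch (the 60-second window size is an arbitrary example):

```python
WINDOW_SIZE = 60  # seconds; arbitrary example

def tumbling_window(event_ts: float) -> tuple[float, float]:
    """Return the single [start, end) window that owns this event."""
    start = event_ts - (event_ts % WINDOW_SIZE)
    return start, start + WINDOW_SIZE

print(tumbling_window(125.0))  # (120.0, 180.0): exactly one window
```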
Ingestion – Hopping Window

• Fixed intervals.
• Windows can overlap.
• An event can belong to multiple windows at the same time (see the sketch below).
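Because the hop is smaller than the window size, one event can land in several hopping windows; a sketch with hypothetical sizes:

```python
WINDOW_SIZE, HOP = 60, 30  # seconds; hypothetical values (hop < size, so windows overlap)

def hopping_windows(event_ts: float) -> list[tuple[float, float]]:
    """Return every [start, end) window containing this event."""
    # Smallest hop-aligned start whose window still covers the event.
    start = ((event_ts - WINDOW_SIZE) // HOP + 1) * HOP
    windows = []
    while start <= event_ts:
        if start >= 0:
            windows.append((start, start + WINDOW_SIZE))
        start += HOP
    return windows

print(hopping_windows(125.0))  # [(90.0, 150.0), (120.0, 180.0)]: two windows
```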
Ingestion – Session Window

• Closed by an inactivity timeout.
• Can also be capped by a max duration.
• No overlap: an event belongs to exactly one window (see the sketch below).
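A session window groups events separated by less than an inactivity timeout; a minimal sketch (the timeout is an arbitrary example, and the max-duration cap is omitted):

```python
TIMEOUT = 30  # seconds of inactivity closes the session; arbitrary example

def session_windows(timestamps: list[float]) -> list[list[float]]:
    """Split event times into sessions separated by inactivity gaps."""
    sessions: list[list[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < TIMEOUT:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

print(session_windows([0, 10, 20, 100, 110]))  # [[0, 10, 20], [100, 110]]
```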
Transformation – Data Pipelines

• Data migration between systems
• Data wrangling
• Data integrations
Transformation – ETL Pipelines

Transformation – ETL vs ELT

• ETL transforms data before loading it into the target system; ELT loads raw data first and transforms it inside the target, typically a cloud data warehouse.
Transformation – Medallion Architecture
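In the medallion pattern, bronze holds raw ingested data, silver holds cleaned and conformed data, and gold holds business-level aggregates. A hedged PySpark sketch; the paths, table names, and columns are all hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw events as-is (hypothetical source path).
bronze = spark.read.json("s3://example-bucket/raw/events/")
bronze.write.mode("append").saveAsTable("bronze_events")

# Silver: cleaned and conformed (deduplicated, typed, nulls handled).
silver = (spark.table("bronze_events")
          .dropDuplicates(["event_id"])
          .withColumn("event_time", F.to_timestamp("event_time"))
          .na.drop(subset=["event_id", "event_time"]))
silver.write.mode("overwrite").saveAsTable("silver_events")

# Gold: business-level aggregates ready for serving.
gold = silver.groupBy(F.to_date("event_time").alias("day")).count()
gold.write.mode("overwrite").saveAsTable("gold_daily_counts")
```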
Serving

• Analytics
• Machine Learning
• Reverse ETL
Serving - Analytics

Serving – Machine Learning

1. Is the data of sufficient quality to perform reliable feature engineering?
2. Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
3. Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
4. Does the dataset properly represent ground truth? Is it unfairly biased?
Serving - Reverse ETL

Under Currents of Data Engineering
Under Currents – Security

• Shared responsibility model
• Security for low-level data
• Principle of least privilege
• Logging, monitoring, and alerting
• Always back up your data
• Usage of security vaults
Under Currents – Data Management

• Data governance
• Data accountability
• Data discoverability
• Data lineage
• Data quality
• Data life cycle management
• Ethics and privacy
Under Currents – Data Ops

• Automation
• Observation and monitoring
• Incident response
Under Currents – Data Architecture

Operational (What)
• Functional requirements of people, process, and technology.
• What business processes does the data serve?
• How does the organization manage data quality?
• What is the latency requirement from when the data is produced?

Technical (How)
• How data is ingested, stored, transformed, and served along the data engineering lifecycle.
• How will you move 10 TB of data every hour from a source database to your data lake?
Under Currents – Good Architecture Principles

Both AWS and Google publish cloud architecture frameworks. The AWS Well-Architected Framework defines six pillars:

• Operational excellence
• Security
• Reliability
• Performance efficiency
• Cost optimization
• Sustainability
Principles of Data Engineering Architecture

• Principle 1: Choose Common Components Wisely
• Principle 2: Plan for Failure
• Principle 3: Architect for Scalability
• Principle 4: Architecture Is Leadership
• Principle 5: Always Be Architecting
• Principle 6: Build Loosely Coupled Systems
• Principle 7: Make Reversible Decisions
• Principle 8: Prioritize Security
• Principle 9: Embrace FinOps
Under Currents – Software Engineering

• Core data processing code
• Development of open source frameworks
• Streaming
• Infrastructure as code
• Pipelines as code (see the DAG sketch below)
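"Pipelines as code" means the workflow definition itself lives in version-controlled source files. A minimal sketch of an Apache Airflow DAG (assuming Airflow 2.x; the DAG id, schedule, and task bodies are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")       # placeholder task body

def transform():
    print("clean and reshape the extracted data")   # placeholder task body

with DAG(
    dag_id="example_pipeline",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task      # dependencies live in version control too
```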
Best Practices – Conclusions

• Follow CI/CD and implement version control.
• Practice modularity and functional programming.
• Follow proper naming conventions and proper documentation.
• Strive for easy-to-maintain code; follow general coding principles such as DRY (Don’t Repeat Yourself) and KISS (Keep It Simple, Stupid).
• Select the right tool for data wrangling.
• Use common data design patterns and standard data transformation patterns.
• Build a scalable data pipeline architecture and ensure the reliability of your data pipelines.
• Automate data pipelines and monitoring; use workflow management (Apache Airflow).
• Avoid data duplicates with idempotent pipelines (see the sketch below).
• Track pipeline metadata for easier debugging.
• Manage incidents efficiently and embrace DataOps.
• Ensure data quality and implement thorough testing.
• Set a security policy for data.
• Optimize cloud costs and heavy computational tasks.
• Focus on business value.
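A common way to get the idempotent pipelines recommended above is to overwrite a deterministic partition per run instead of blindly appending, so re-running a failed job cannot create duplicates. A hedged sketch; `extract` and `write_partition` are hypothetical helpers:

```python
def run_pipeline(run_date: str, extract, write_partition):
    """Idempotent daily load: the same input date always produces the same output.

    `write_partition` is assumed to atomically replace the partition for
    `run_date` (e.g., an INSERT OVERWRITE, or delete-then-insert in one
    transaction), rather than appending rows.
    """
    rows = extract(run_date)                        # deterministic slice of the source
    write_partition(partition=run_date, rows=rows)  # replaces, never appends

# Re-running after a failure is now safe: the partition is simply rewritten.
# run_pipeline("2024-11-17", extract, write_partition)
```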
Thank You

Q & A
https://eg.linkedin.com/in/hebied
