Day5 Patterns Use Cases
LUAN MORENO
CEO & CDO
Data Engineer & Data Platform MVP
The Spark Lifecycle

• Data Lake ~ Repository of Raw Data, Without Schema Enforcement
• Apache Spark ~ Distributed Cluster-Computing Framework, Optimized for In-Memory Computation
• Data Warehouse ~ Analytics Platform for Enterprises, Scalability ~ Horizontal & Vertical

Batch & Stream Processing Across Layers:
• Bronze ~ Ingestion Tables
• Silver ~ Refined Tables
• Gold ~ Feature & Aggregated Data Store
Job, Stage & Task

• Job ~ Sequence of Stages, Triggered by an Action Such as count(), foreachRDD(), collect(), read(), write(). Spark Transforms Each Job into a DAG
• Stage ~ Sequence of Tasks Run in Parallel, Based on Computation Boundaries. The Number of Tasks Matches the Number of Partitions of the Dataset
• Task ~ Single Operation Applied to a Single Partition, Executed as a Single Thread in an Executor (Unit of Execution)
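The Job → Stage → Task split can be sketched in plain Python (an illustrative simulation, not Spark code): a count() action spawns one task per partition, each task runs as a single thread, and the job result combines the task outputs.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Task ~ a single operation applied to a single partition."""
    return len(partition)

def count_action(partitions):
    """Job ~ triggered by an action; runs one task per partition in parallel."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        task_results = list(pool.map(count_partition, partitions))
    return sum(task_results)  # combine per-task results into the job result

# A dataset of 10 records split into 4 partitions -> the action spawns 4 tasks.
dataset = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(count_action(dataset))  # 10
```

This is why repartitioning changes parallelism: more partitions means more tasks per stage.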
Apache Spark Query Plans Distilled ~ The Catalyst Optimizer

Logical & Physical Plans ~ Catalyst Takes a Computational Query & Converts It into an Execution Plan through Four Stages ~ Analysis, Logical Optimization, Physical Planning & Code Generation. The Catalyst Optimizer Provides Rule-Based and Cost-Based Optimizations.

1. Analysis ~ The Optimizer Takes the Unresolved Plan & Cross-Checks It with the Catalog to Verify the Plan is Correct. During Resolution, It Tries to Identify the Data Type, Existence, and Location of Columns. The Analyzer Validates Operations; If the Query Resolves Successfully, an Analyzed Query Plan with Additional Information Is Produced
2. Logical Optimization ~ Once the Query is Analyzed, Catalyst Optimizes It using Rule-Based Optimization ~ Deriving an Optimized Logical Plan
3. Physical Planning ~ The Optimizer Uses the Optimized Logical Plan to Generate Physical Plans. Apache Spark Decides Which Algorithm Must Be Used for Every Operator ~ e.g. SortMergeJoin & BroadcastHashJoin. The Best Plan is Selected using a Cost-Based Model ~ Modeled Costs for the Engine
4. Code Generation ~ Once the Best Physical Plan is Chosen, Apache Spark Uses the Tungsten Backend to Generate Java ByteCode to Run on Each Machine ~ Executor
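The shape of rule-based optimization can be illustrated with a tiny, hypothetical Python sketch (not Catalyst's actual code): a plan is a nested expression tree and a rewrite rule, here constant folding, is applied recursively until it no longer fires, which is the same kind of transformation Catalyst applies when deriving an optimized logical plan.

```python
# Hypothetical rule-based optimizer sketch: expressions are nested tuples
# like ('+', left, right); the rule collapses additions of two constants.

def fold_constants(expr):
    if isinstance(expr, tuple) and expr[0] == '+':
        left = fold_constants(expr[1])
        right = fold_constants(expr[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right          # rule fires: fold the constant add
        return ('+', left, right)        # rule does not apply; keep the node
    return expr                          # leaf: a literal or a column name

plan = ('+', ('+', 1, 2), ('+', 'col_a', ('+', 10, 20)))
print(fold_constants(plan))  # ('+', 3, ('+', 'col_a', 30))
```

Catalyst applies many such rules (predicate pushdown, column pruning, constant folding) in batches until the plan reaches a fixed point.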
Apache Spark Query Plans Distilled ~ Query Plan Operators

• WholeStageCodegen ~ Operators Grouped Together. During Physical Planning, the Catalyst Optimizer Follows a Rule, CollapseCodegenStages, and Groups Operators that Support Code Generation Together ~ Speeding Up the Execution Process
• Scan ~ Read Operations on Source Files ~ Apache Parquet & Delta. Objective ~ Pull Data from the Source, Return Only the Requested and Selected Columns (Column Pruning), Filter Rows using Pushed & Partition Filters. Provides Additional Information Regarding Reading from the Storage System ~ Number of Files Read and Size of Files ~ Useful for Understanding the Source Data
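What the Scan operator does conceptually, column pruning plus a pushed filter, can be sketched in plain Python over a toy CSV source (illustrative only; real Parquet/Delta scans happen inside Spark's readers and skip data at the file and row-group level):

```python
import csv
import io

RAW = "id,name,country\n1,ana,BR\n2,bo,US\n3,caio,BR\n"

def scan(source, columns, pushed_filter):
    """Return only requested columns, dropping rows at the scan itself."""
    reader = csv.DictReader(io.StringIO(source))
    for row in reader:
        if pushed_filter(row):                  # filter pushed down to the scan
            yield {c: row[c] for c in columns}  # column pruning

rows = list(scan(RAW, columns=["name"], pushed_filter=lambda r: r["country"] == "BR"))
print(rows)  # [{'name': 'ana'}, {'name': 'caio'}]
```

The point is that filtering and pruning at the source reads less data than materializing everything and filtering afterwards.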
Exchange ~ Simply Means Shuffle ~ Physical Data Movement in the Cluster. One of the Most Expensive Operations, Triggered By:
• Joins ~ Between DataSets & DataFrames
• Repartition ~ Repartitioning Data ~ e.g. to Reduce Data Skew
• Coalesce ~ Moving All Data to a Single Executor ~ e.g. to Output a Single CSV
• Sort ~ Producing Sorted Output Data

Joins ~ Types of Joins Used By the Apache Spark Engine:
• BHJ (BroadcastHashJoin) ~ One Side is Very Small (MBs); the Smaller Table is Broadcast to Every Executor (Exchange) and Joined with the Bigger Table using a Hash Join
• SHJ (ShuffleHashJoin) ~ One Side is Roughly 3x Smaller, and the Average Partition Size is Small Enough to Build a Hash Table. During the Join, Partitions are Shuffled and Then Joined using a Hash Join
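The BHJ mechanics can be sketched in plain Python (hypothetical tables; in Spark the hash table is shipped to every executor): build a hash table from the small side, then probe it while streaming the big side, so the big table is never shuffled.

```python
# Illustrative broadcast hash join sketch, not Spark's implementation.

def broadcast_hash_join(big_rows, small_rows, key):
    hash_table = {}
    for row in small_rows:                       # build side: the small table
        hash_table.setdefault(row[key], []).append(row)
    joined = []
    for row in big_rows:                         # probe side: the big table
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})      # inner-join semantics
    return joined

orders = [{"cid": 1, "amount": 10}, {"cid": 2, "amount": 20}, {"cid": 1, "amount": 5}]
customers = [{"cid": 1, "name": "ana"}, {"cid": 2, "name": "bo"}]
result = broadcast_hash_join(orders, customers, key="cid")
print(len(result))  # 3
```

In Spark, BHJ is picked automatically when one side is under spark.sql.autoBroadcastJoinThreshold, or forced with a broadcast() hint.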
Managed ETL Services Across Clouds

Azure Data Factory ~ Fully Managed, Serverless Data Integration Solution for Ingesting, Preparing, and Transforming Data at Scale
1. Easy-to-Use ~ Rehost SSIS Effortlessly
2. Cost-Effective ~ Pay-as-You-Go
3. 90+ Built-In Connectors

AWS Glue ~ Serverless Data Integration Service, Fully Managed and Cost-Effective ETL Service to Clean, Enrich & Move Data
1. Discover, Prepare, & Combine Data for Analytics, Machine Learning, & Application Development
2. Automatic Schema Discovery using Crawlers
3. Manage and Enforce Schemas for Data Streams [AWS Glue Schema Registry]

Google Cloud Data Fusion ~ Fully Managed, Cloud-Native Data Integration Service at Any Scale, Open Core, Delivering Hybrid Integration using CDAP
1. Visual Point-and-Click Interface Enabling Code-Free Deployment of ETL/ELT Data Pipelines
2. Broad Library of 150+ Pre-Configured Connectors & Transformations
3. Natively Integrated Best-in-Class Google Cloud Services
Apache Airflow Managed Deployment Options

Amazon Managed Workflows for Apache Airflow [MWAA] ~ Managed Orchestration Service for Apache Airflow; Operate End-to-End Data Pipelines in the Cloud at Scale with Minimum Effort & Configuration
1. Data Secured by Default, Running in an Isolated and Secure Cloud Environment using VPC; Data is Automatically Encrypted using KMS
2. Connect ~ AWS or On-Premises Resources Required for Workflows, Including Athena, Batch, CloudWatch, DynamoDB, DataSync, EMR, Fargate, EKS, Firehose, Glue, Lambda, Redshift, SQS, SNS, SageMaker & S3

Google Cloud Composer ~ Fully Managed Workflow Orchestration Service Built On Apache Airflow
1. Author, Schedule, and Monitor Pipelines ~ Spanning Hybrid and Multi-Cloud Environments
2. Frees You from Lock-In and is Easy to Use
3. Supports Hybrid and Multi-Cloud

Kubernetes ~ Open-Source System for Automating Deployment, Scaling, and Management of Containerized Applications
1. Planet Scale
2. Runs Anywhere
3. Batch Execution
4. Self Healing
5. Designed for Extensibility
6. Multi-Cloud Approach
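All of these options run Apache Airflow, which models a workflow as a DAG and starts a task only after its upstream dependencies complete. The ordering idea can be shown in miniature with Python's standard-library graphlib (the task names here are hypothetical, not a real Airflow DAG):

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on, the same
# dependency shape an Airflow DAG expresses with >> operators.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A scheduler may only run a task after everything upstream has finished.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Airflow's scheduler does the same resolution continuously, plus retries, scheduling intervals, and parallel execution of independent branches.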
Lambda Architecture – Cloud Agnostic & Simplified

Data Sources ~ JSON & CSV Files, SQL Server & Redis, Internet – Siri | Spotify | YouTube

Batch Layer
• Data Storage ~ Data Lake Storage Gen2 | GCS | S3 | HDFS
• Batch Processing ~ Apache Spark | Databricks

Speed Layer
• Real-Time Ingestion ~ Apache Kafka [Confluent]
• Stream Processing ~ Apache Kafka [Confluent] | Apache Spark [Databricks]

Serving Layer ~ Azure SQL Data Warehouse | Amazon Redshift | Google BigQuery | Apache Hive | CosmosDB
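The batch/speed/serving split can be sketched in a few lines of Python (hypothetical page-view counts): the serving layer answers a query by merging the precomputed batch view with the speed layer's view of events that arrived after the last batch run.

```python
# Minimal Lambda-architecture sketch with made-up data.

batch_view = {"page_a": 100, "page_b": 40}   # recomputed periodically (batch layer)
speed_view = {"page_a": 3, "page_c": 1}      # events since last batch (speed layer)

def serve(page):
    """Serving layer ~ merge batch and speed views at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("page_a"))  # 103
print(serve("page_c"))  # 1
```

The operational cost of Lambda is visible even here: the same counting logic effectively exists twice, once per layer, which is the duplication Kappa removes.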
Lambda Architecture ~ Azure Example
[Diagram: numbered flow, steps 1–8]
• Sources ~ SQL Server, MySQL, PostgreSQL & a Mobile Application via a Python Producer
• Flow ~ Azure Data Factory → Data Lake [Gen2] → Azure Databricks → Serving → Visualization
Kappa Architecture – Cloud Agnostic & Simplified

Speed Layer
• Apache Kafka ~ [Confluent]
• Apache Spark ~ [Databricks]

Serving Layer
• Output of Batch & Speed Layers
• Processed & Computed Data
• Ad-Hoc Queries & Data Analysis
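In contrast to Lambda, Kappa keeps a single stream-processing path; "reprocessing" simply means replaying the immutable log through the same computation to rebuild the serving view. A minimal sketch with hypothetical events:

```python
# Append-only event log, the role Kafka plays in a Kappa architecture.
log = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

def rebuild_view(event_log):
    """One processing path: fold the log into the serving view."""
    view = {}
    for key, n in event_log:
        view[key] = view.get(key, 0) + n
    return view

print(rebuild_view(log))  # {'page_a': 2, 'page_b': 1}
```

Fixing a bug or changing the aggregation does not require a separate batch layer: redeploy the processor and replay the log from the start.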
Enterprise Data Hub [EDH] Example
Flow ~ Sending Data In → Processing Data Inside the Enterprise Data Hub → Getting Data Out → Visualizing Data
• Sources [In] ~ SQL Server [RDBMS], PostgreSQL [RDBMS], MySQL [RDBMS], MongoDB [NoSQL]
• Sinks [Out] ~ ElasticSearch & Kibana, PostgreSQL [RDBMS], Minio [Storage]
Microsoft Azure Big Data Landscape for Data Pipelines

• Data Ingestion ~ Azure Data Factory, Azure Event Hubs, HDInsight ~ [Apache Kafka], Confluent Cloud
• Data Processing ~ ADF ~ Mapping Data Flows, Azure Databricks, Azure Stream Analytics, Azure Functions, Snowflake
• Data Serving ~ HDInsight ~ [Apache Hive], HDInsight ~ [Interactive Query], Azure CosmosDB, Azure Synapse Analytics, Azure Blob Storage
• RDBMS ~ Azure SQL DB, Azure DB for MySQL, Azure DB for PostgreSQL
• NoSQL ~ Azure CosmosDB, Azure Cache for Redis
• Search ~ Azure Cognitive Search
• Data Viz ~ Power BI
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ Azure Purview | Orchestration ~ Azure Data Factory | Monitoring ~ Azure Monitor
Cost of a Data Pipeline on Microsoft Azure

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Azure Blob Storage
• Region – EastUS2
• Tier – Premium
• Redundancy – LRS
• Capacity – 1 TB
• Monthly Cost – R$ 847

Total Cost for Data Pipelines on Microsoft Azure
• Storage Layer = R$ 2.414
• Data Processing Layer = R$ 6.556
• Data Serving Layer = R$ 14.580
• Total Monthly Cost – R$ 23.550
Amazon AWS Big Data Landscape for Data Pipelines

• NoSQL ~ Amazon DynamoDB
• RDBMS ~ Amazon RDS
• Graph ~ Amazon Neptune
• Cache ~ Amazon ElastiCache
• Search ~ Amazon CloudSearch
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ AWS Glue | Data Orchestration ~ AWS Glue & MWAA | Monitoring ~ Amazon CloudWatch
Cost of a Data Pipeline on Amazon AWS

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Amazon S3
• Region – USEast
• Capacity – 1 TB
• Scanned & Returned – 100 GB
• Monthly Cost – R$ 170

Total Cost for Data Pipelines on Amazon AWS
• Storage Layer = R$ 2.634
• Data Processing Layer = R$ 5.791
• Data Serving Layer = R$ 13.238
• Total Monthly Cost – R$ 21.663
Google GCP Big Data Landscape for Data Pipelines

• Data Exploration ~ Google Cloud DataPrep, Google Cloud DataLab
• Data Storage ~ Google Cloud Storage [GCS]
• NoSQL ~ Google Cloud BigTable, Google Cloud Firestore
• RDBMS ~ Google Cloud SQL, Google Cloud Spanner
• Cache ~ Google Cloud MemoryStore
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ Google Cloud Data Catalog | Data Orchestration ~ Google Cloud Composer | Monitoring ~ Google Cloud Stackdriver
Cost of a Data Pipeline on Google GCP

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Google GCS
• Region – US-Central
• Capacity – 1 TB
• Class A & B Operations – 1 Million
• Monthly Cost – R$ 143

Total Cost for Data Pipelines on Google GCP
• Storage Layer = R$ 336
• Data Processing Layer = R$ 2.156
• Data Serving Layer = R$ 170
• Total Monthly Cost – R$ 2.662
Total Monthly Cost Comparison
• Microsoft Azure ~ R$ 23.550
• Amazon AWS ~ R$ 21.663
• Google GCP ~ R$ 2.662
OSS Big Data Products on [Spotlight]

• Apache Kafka ~ Trusted by 80% of ALL Fortune 100 Companies. Ingest and Process Data Effortlessly
• Apache Pulsar ~ Messaging & Streaming Platform. Pulsar Functions, Persistent Storage, Multi-Tenancy with Low Latency
• Apache Spark ~ PySpark, Spark SQL, Java, Scala, R, .NET. The Most Used Big Data Product
• Apache Airflow ~ Programmatically Author, Schedule & Monitor Workflows using Python. Newest 2.0 Version Out
• Apache Pinot ~ Real-Time Distributed OLAP Data Store, Designed for Low-Latency Queries at Scale
• Trino ~ Data Processing Engine Unleashing SQL at Scale & Providing Data Virtualization
• Dremio ~ Next-Generation Data Lake Engine for Interactive Queries at Blazing-Fast Speed
• YugaByteDB ~ Cloud-Native Database Platform Offering Different APIs – Redis, Postgres & Cassandra
Azure Big Data Products on [Spotlight]

Azure Purview ~ Unified Data Governance with Data Discovery, Sensitive Data Classification & End-to-End Data Lineage
1. Data Discovery, Classification and Mapping
2. Data Catalog ~ Searching & Web-Based Experience
3. Data Governance ~ Enabling Key Insights and Understanding of Data Quality Rules

Azure Databricks ~ Fast, Easy & Collaborative Apache Spark-Based Analytics Service Providing a Fast Deployment Process
1. Databricks Runtime ~ Optimized for Cloud Storage
2. Managed Delta Lake
3. Integrated Workspace – GitHub
4. Production Jobs & Workflows
5. Enterprise Security
6. Integrations using ODBC & JDBC
7. SQL Analytics – Redash + Delta Lake Engine

Azure Synapse Analytics ~ MDW with Limitless Analytics Service and Unmatched Time to Insight – PaaS & SaaS Approaches
1. Serverless & Dedicated Options
2. Data Lake Exploration
3. Code-Free ETL & ELT
4. Deep Integration with Apache Spark & SQL Engines
5. Languages – T-SQL, Python, Scala, Spark SQL and .NET
6. Cloud-Native HTAP with Azure Synapse Link ~ CosmosDB
7. AI & BI
AWS Big Data Products on [Spotlight]

Managed Streaming for Apache Kafka [MSK] ~ Fully Managed Service to Build and Run Applications ~ Apache Kafka to Process Streaming Data Effortlessly
1. Amazon MSK Runs and Manages Apache Kafka, Maintaining Open-Source Compatibility ~ MirrorMaker, Apache Flink, and Prometheus
2. VPC Network Isolation, AWS IAM for Control-Plane API Authorization, Encryption at Rest, TLS In-Transit Encryption

Amazon Glue & DataBrew ~ Serverless Data Integration Service for ETL, ELT, Catalog, Lineage & Transformations for Cleaning and Enriching Data
1. Discover, Prepare, & Combine Data for Analytics, Machine Learning, and Application Development
2. DataBrew is a New Visual Data Preparation Tool ~ Clean and Normalize Data for Analytics and Machine Learning

Amazon Managed Workflows for Apache Airflow [MWAA] ~ Managed Orchestration Service for Apache Airflow; Operate End-to-End Data Pipelines in the Cloud at Scale with Minimum Effort & Configuration
1. Data Secured by Default, Running in an Isolated and Secure Cloud Environment using VPC; Data is Automatically Encrypted using KMS
2. Connect ~ AWS or On-Premises Resources Required for Workflows, Including Athena, Batch, CloudWatch, DynamoDB, DataSync, EMR, Fargate, EKS, Firehose, Glue, Lambda, Redshift, SQS, SNS, SageMaker & S3
GCP Big Data Products on [Spotlight]

Google Cloud Data Fusion ~ Fully Managed, Cloud-Native Data Integration at Any Scale, using Ephemeral DataProc Clusters Underneath
1. Code-Free ETL & ELT Deployment of Data Pipelines
2. Library of 150+ Configured Connectors & Transformations
3. Built with OSS Core CDAP for Pipeline Portability

Google Cloud Dataflow ~ Unified Stream and Batch Data Processing ~ Serverless, Fast, and Cost-Effective, using the Apache Beam Framework
1. Automated Provisioning and Management of Processing Resources
2. Horizontal Autoscaling of Worker Resources
3. OSS Community-Driven Innovation with the Apache Beam SDK
4. Reliable and Consistent Exactly-Once Processing

Google BigQuery ~ Serverless, Highly Scalable, and Cost-Effective Multi-Cloud Data Warehouse Designed for Business Agility
1. Analyze Petabytes of Data Using ANSI SQL at Blazing-Fast Speeds, with Zero Operational Overhead
2. Democratize Insights with a Trusted and Secure Platform that Scales
3. Gain Insights from Data Across Clouds with a Flexible, Multi-Cloud Analytics Solution ~ Omni
“War is 90% information.”
- Napoleon Bonaparte
“A scientist can discover a new star, but he cannot make
one. He would have to ask an engineer to do it for him.”
- Gordon Lindsay Glegg
“In God we trust. All others must bring data.”
- W. Edwards Deming
Data Engineer Career - Part 1 ~ Data Engineer Technical Skills

1. OS & Programming Languages ~ Linux, SQL, Python, Scala
2. DBMS & NoSQL ~ SQL Server, Oracle, PostgreSQL, MySQL, MongoDB, Cassandra, Redis Cache
4. Distributed Systems & Big Data Frameworks
5. Data Pipelines & Cloud Computing ~ Lambda & Kappa, Google GCP, Amazon AWS, Microsoft Azure
• Industry Knowledge ~ Understanding the Way Your Chosen Industry Functions and How Data Can Be Collected, Analyzed and Utilized; Maintaining Flexibility in the Face of Big Data Developments
• Effective Collaboration ~ Carefully Listening to Management, Data Scientists and Data Architects to Establish Their Needs
Data Engineer Certifications
Data Engineer Career - Part 3

Contact
• LinkedIn ~ Luan Moreno Medeiros Maciel
• Facebook ~ Luan Moreno Medeiros Maciel
• Instagram ~ engenhariadedados
• Podcast ~ engenhariadedadoscast
Thank You