Day5 Patterns Use Cases
LUAN MORENO
CEO & CDO
Data Engineer & Data Platform MVP
The Spark Lifecycle

• Data Lake ~ Repository of Raw Data, Without Schema Enforcement
• Apache Spark ~ Distributed Cluster-Computing Framework, Optimized for In-Memory Computation
• Data Warehouse ~ Analytics Platform for Enterprises, Scalability ~ Horizontal & Vertical

Batch & Stream Processing Across Layers:
• Bronze ~ Ingestion Tables
• Silver ~ Refined Tables
• Gold ~ Feature & Aggregated Data Store
Job, Stage & Task

• Job ~ Sequence of Stages, Triggered by an Action Such as count(), foreachRDD(), collect(), read(), write(). Spark Transforms Each Job into a DAG
• Stage ~ Sequence of Tasks Run in Parallel, Based on Computation Boundaries. The Number of Tasks Matches the Number of Partitions of the Dataset
• Task ~ Single Operation Applied to a Single Partition, Executed as a Single Thread in an Executor (Unit of Execution)
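The Job → Stage → Task split can be sketched in plain Python (an illustrative simulation, not Spark code): a count() action spawns one task per partition, each task runs as a single thread, and the job result combines the task outputs.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Task ~ a single operation applied to a single partition."""
    return len(partition)

def count_action(partitions):
    """Job ~ triggered by an action; runs one task per partition in parallel."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        task_results = list(pool.map(count_partition, partitions))
    return sum(task_results)  # combine per-task results into the job result

# A dataset of 10 records split into 4 partitions -> the action spawns 4 tasks.
dataset = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(count_action(dataset))  # 10
```

This is why repartitioning changes parallelism: more partitions means more tasks per stage.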
Apache Spark Query Plans Distilled ~ The Catalyst Optimizer

Logical & Physical Plans ~ Catalyst Takes a Computational Query & Converts It into an Execution Plan through Four Stages ~ Analysis, Logical Optimization, Physical Planning & Code Generation. The Catalyst Optimizer Provides Rule-Based and Cost-Based Optimizations.

1. Analysis ~ The Optimizer Takes the Unresolved Plan & Cross-Checks It with the Catalog to Verify the Plan is Correct. During Resolution, It Tries to Identify the Data Type, Existence, and Location of Columns. The Analyzer Validates Operations; If the Query Resolves Successfully, an Analyzed Query Plan with Additional Information Is Produced
2. Logical Optimization ~ Once the Query is Analyzed, Catalyst Optimizes It using Rule-Based Optimization ~ Deriving an Optimized Logical Plan
3. Physical Planning ~ The Optimizer Uses the Optimized Logical Plan to Generate Physical Plans. Apache Spark Decides Which Algorithm Must Be Used for Every Operator ~ e.g. SortMergeJoin & BroadcastHashJoin. The Best Plan is Selected using a Cost-Based Model ~ Modeled Costs for the Engine
4. Code Generation ~ Once the Best Physical Plan is Chosen, Apache Spark Uses the Tungsten Backend to Generate Java ByteCode to Run on Each Machine ~ Executor
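The shape of rule-based optimization can be illustrated with a tiny, hypothetical Python sketch (not Catalyst's actual code): a plan is a nested expression tree and a rewrite rule, here constant folding, is applied recursively until it no longer fires, which is the same kind of transformation Catalyst applies when deriving an optimized logical plan.

```python
# Hypothetical rule-based optimizer sketch: expressions are nested tuples
# like ('+', left, right); the rule collapses additions of two constants.

def fold_constants(expr):
    if isinstance(expr, tuple) and expr[0] == '+':
        left = fold_constants(expr[1])
        right = fold_constants(expr[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right          # rule fires: fold the constant add
        return ('+', left, right)        # rule does not apply; keep the node
    return expr                          # leaf: a literal or a column name

plan = ('+', ('+', 1, 2), ('+', 'col_a', ('+', 10, 20)))
print(fold_constants(plan))  # ('+', 3, ('+', 'col_a', 30))
```

Catalyst applies many such rules (predicate pushdown, column pruning, constant folding) in batches until the plan reaches a fixed point.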
Apache Spark Query Plans Distilled ~ Query Plan Operators

• WholeStageCodegen ~ Operators Grouped Together. During Physical Planning, the Catalyst Optimizer Follows a Rule, CollapseCodegenStages, and Groups Operators that Support Code Generation Together ~ Speeding Up the Execution Process
• Scan ~ Read Operations on Source Files ~ Apache Parquet & Delta. Objective ~ Pull Data from the Source, Return Only the Requested and Selected Columns (Column Pruning), Filter Rows using Pushed & Partition Filters. Provides Additional Information Regarding Reading from the Storage System ~ Number of Files Read and Size of Files ~ Useful for Understanding the Source Data
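What the Scan operator does conceptually, column pruning plus a pushed filter, can be sketched in plain Python over a toy CSV source (illustrative only; real Parquet/Delta scans happen inside Spark's readers and skip data at the file and row-group level):

```python
import csv
import io

RAW = "id,name,country\n1,ana,BR\n2,bo,US\n3,caio,BR\n"

def scan(source, columns, pushed_filter):
    """Return only requested columns, dropping rows at the scan itself."""
    reader = csv.DictReader(io.StringIO(source))
    for row in reader:
        if pushed_filter(row):                  # filter pushed down to the scan
            yield {c: row[c] for c in columns}  # column pruning

rows = list(scan(RAW, columns=["name"], pushed_filter=lambda r: r["country"] == "BR"))
print(rows)  # [{'name': 'ana'}, {'name': 'caio'}]
```

The point is that filtering and pruning at the source reads less data than materializing everything and filtering afterwards.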
Exchange ~ Simply Means Shuffle ~ Physical Data Movement in the Cluster. One of the Most Expensive Operations, Triggered By:
• Joins ~ Between DataSets & DataFrames
• Repartition ~ Repartitioning Data ~ e.g. to Reduce Data Skew
• Coalesce ~ Moving All Data to a Single Executor ~ e.g. to Output a Single CSV
• Sort ~ Producing Sorted Output Data

Joins ~ Types of Joins Used By the Apache Spark Engine:
• BHJ (BroadcastHashJoin) ~ One Side is Very Small (MBs); the Smaller Table is Broadcast to Every Executor (Exchange) and Joined with the Bigger Table using a Hash Join
• SHJ (ShuffleHashJoin) ~ One Side is Roughly 3x Smaller, and the Average Partition Size is Small Enough to Build a Hash Table. During the Join, Partitions are Shuffled and Then Joined using a Hash Join
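The BHJ mechanics can be sketched in plain Python (hypothetical tables; in Spark the hash table is shipped to every executor): build a hash table from the small side, then probe it while streaming the big side, so the big table is never shuffled.

```python
# Illustrative broadcast hash join sketch, not Spark's implementation.

def broadcast_hash_join(big_rows, small_rows, key):
    hash_table = {}
    for row in small_rows:                       # build side: the small table
        hash_table.setdefault(row[key], []).append(row)
    joined = []
    for row in big_rows:                         # probe side: the big table
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})      # inner-join semantics
    return joined

orders = [{"cid": 1, "amount": 10}, {"cid": 2, "amount": 20}, {"cid": 1, "amount": 5}]
customers = [{"cid": 1, "name": "ana"}, {"cid": 2, "name": "bo"}]
result = broadcast_hash_join(orders, customers, key="cid")
print(len(result))  # 3
```

In Spark, BHJ is picked automatically when one side is under spark.sql.autoBroadcastJoinThreshold, or forced with a broadcast() hint.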
Managed ETL Services Across Clouds

Azure Data Factory ~ Fully Managed, Serverless Data Integration Solution for Ingesting, Preparing, and Transforming Data at Scale
1. Easy-to-Use ~ Rehost SSIS Effortlessly
2. Cost-Effective ~ Pay-as-You-Go
3. 90+ Built-In Connectors

AWS Glue ~ Serverless Data Integration Service, Fully Managed and Cost-Effective ETL Service to Clean, Enrich & Move Data
1. Discover, Prepare, & Combine Data for Analytics, Machine Learning, & Application Development
2. Automatic Schema Discovery using Crawlers
3. Manage and Enforce Schemas for Data Streams [AWS Glue Schema Registry]

Google Cloud Data Fusion ~ Fully Managed, Cloud-Native Data Integration Service at Any Scale, Open Core, Delivering Hybrid Integration using CDAP
1. Visual Point-and-Click Interface Enabling Code-Free Deployment of ETL/ELT Data Pipelines
2. Broad Library of 150+ Pre-Configured Connectors & Transformations
3. Natively Integrated Best-in-Class Google Cloud Services
Apache Airflow Managed Deployment Options

Amazon Managed Workflows for Apache Airflow [MWAA] ~ Managed Orchestration Service for Apache Airflow; Operate End-to-End Data Pipelines in the Cloud at Scale with Minimum Effort & Configuration
1. Data Secured by Default, Running in an Isolated and Secure Cloud Environment using VPC; Data is Automatically Encrypted using KMS
2. Connect ~ AWS or On-Premises Resources Required for Workflows, Including Athena, Batch, CloudWatch, DynamoDB, DataSync, EMR, Fargate, EKS, Firehose, Glue, Lambda, Redshift, SQS, SNS, SageMaker & S3

Google Cloud Composer ~ Fully Managed Workflow Orchestration Service Built On Apache Airflow
1. Author, Schedule, and Monitor Pipelines ~ Spanning Hybrid and Multi-Cloud Environments
2. Frees You from Lock-In and is Easy to Use
3. Supports Hybrid and Multi-Cloud

Kubernetes ~ Open-Source System for Automating Deployment, Scaling, and Management of Containerized Applications
1. Planet Scale
2. Runs Anywhere
3. Batch Execution
4. Self Healing
5. Designed for Extensibility
6. Multi-Cloud Approach
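All of these options run Apache Airflow, which models a workflow as a DAG and starts a task only after its upstream dependencies complete. The ordering idea can be shown in miniature with Python's standard-library graphlib (the task names here are hypothetical, not a real Airflow DAG):

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks it depends on, the same
# dependency shape an Airflow DAG expresses with >> operators.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# A scheduler may only run a task after everything upstream has finished.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Airflow's scheduler does the same resolution continuously, plus retries, scheduling intervals, and parallel execution of independent branches.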
Lambda Architecture – Cloud Agnostic & Simplified

Data Sources ~ JSON & CSV Files, SQL Server & Redis, Internet – Siri | Spotify | YouTube

Batch Layer
• Data Storage ~ Data Lake Storage Gen2 | GCS | S3 | HDFS
• Batch Processing ~ Apache Spark | Databricks

Speed Layer
• Real-Time Ingestion ~ Apache Kafka [Confluent]
• Stream Processing ~ Apache Kafka [Confluent] | Apache Spark [Databricks]

Serving Layer ~ Azure SQL Data Warehouse | Amazon Redshift | Google BigQuery | Apache Hive | CosmosDB
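The batch/speed/serving split can be sketched in a few lines of Python (hypothetical page-view counts): the serving layer answers a query by merging the precomputed batch view with the speed layer's view of events that arrived after the last batch run.

```python
# Minimal Lambda-architecture sketch with made-up data.

batch_view = {"page_a": 100, "page_b": 40}   # recomputed periodically (batch layer)
speed_view = {"page_a": 3, "page_c": 1}      # events since last batch (speed layer)

def serve(page):
    """Serving layer ~ merge batch and speed views at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("page_a"))  # 103
print(serve("page_c"))  # 1
```

The operational cost of Lambda is visible even here: the same counting logic effectively exists twice, once per layer, which is the duplication Kappa removes.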
Lambda Architecture ~ Azure Example
[Diagram: numbered flow, steps 1–8]
• Sources ~ SQL Server, MySQL, PostgreSQL & a Mobile Application via a Python Producer
• Flow ~ Azure Data Factory → Data Lake [Gen2] → Azure Databricks → Serving → Visualization
Kappa Architecture – Cloud Agnostic & Simplified

Speed Layer
• Apache Kafka ~ [Confluent]
• Apache Spark ~ [Databricks]

Serving Layer
• Output of Batch & Speed Layers
• Processed & Computed Data
• Ad-Hoc Queries & Data Analysis
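In contrast to Lambda, Kappa keeps a single stream-processing path; "reprocessing" simply means replaying the immutable log through the same computation to rebuild the serving view. A minimal sketch with hypothetical events:

```python
# Append-only event log, the role Kafka plays in a Kappa architecture.
log = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

def rebuild_view(event_log):
    """One processing path: fold the log into the serving view."""
    view = {}
    for key, n in event_log:
        view[key] = view.get(key, 0) + n
    return view

print(rebuild_view(log))  # {'page_a': 2, 'page_b': 1}
```

Fixing a bug or changing the aggregation does not require a separate batch layer: redeploy the processor and replay the log from the start.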
Enterprise Data Hub [EDH] Example
Flow ~ Sending Data In → Processing Data Inside the Enterprise Data Hub → Getting Data Out → Visualizing Data
• Sources [In] ~ SQL Server [RDBMS], PostgreSQL [RDBMS], MySQL [RDBMS], MongoDB [NoSQL]
• Sinks [Out] ~ ElasticSearch & Kibana, PostgreSQL [RDBMS], Minio [Storage]
Microsoft Azure Big Data Landscape for Data Pipelines

• Data Ingestion ~ Azure Data Factory, Azure Event Hubs, HDInsight ~ [Apache Kafka], Confluent Cloud
• Data Processing ~ ADF ~ Mapping Data Flows, Azure Databricks, Azure Stream Analytics, Azure Functions, Snowflake
• Data Serving ~ HDInsight ~ [Apache Hive], HDInsight ~ [Interactive Query], Azure CosmosDB, Azure Synapse Analytics, Azure Blob Storage
• RDBMS ~ Azure SQL DB, Azure DB for MySQL, Azure DB for PostgreSQL
• NoSQL ~ Azure CosmosDB, Azure Cache for Redis
• Search ~ Azure Cognitive Search
• Data Viz ~ Power BI
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ Azure Purview | Orchestration ~ Azure Data Factory | Monitoring ~ Azure Monitor
Cost of a Data Pipeline on Microsoft Azure

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Azure Blob Storage
• Region – EastUS2
• Tier – Premium
• Redundancy – LRS
• Capacity – 1 TB
• Monthly Cost – R$ 847

Total Cost for Data Pipelines on Microsoft Azure
• Storage Layer = R$ 2.414
• Data Processing Layer = R$ 6.556
• Data Serving Layer = R$ 14.580
• Total Monthly Cost – R$ 23.550
Amazon AWS Big Data Landscape for Data Pipelines

• NoSQL ~ Amazon DynamoDB
• RDBMS ~ Amazon RDS
• Graph ~ Amazon Neptune
• Cache ~ Amazon ElastiCache
• Search ~ Amazon CloudSearch
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ AWS Glue | Data Orchestration ~ AWS Glue & MWAA | Monitoring ~ Amazon CloudWatch
Cost of a Data Pipeline on Amazon AWS

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Amazon S3
• Region – USEast
• Capacity – 1 TB
• Scanned & Returned – 100 GB
• Monthly Cost – R$ 170

Total Cost for Data Pipelines on Amazon AWS
• Storage Layer = R$ 2.634
• Data Processing Layer = R$ 5.791
• Data Serving Layer = R$ 13.238
• Total Monthly Cost – R$ 21.663
Google GCP Big Data Landscape for Data Pipelines

• Data Exploration ~ Google Cloud DataPrep, Google Cloud DataLab
• Data Storage ~ Google Cloud Storage [GCS]
• NoSQL ~ Google Cloud BigTable, Google Cloud Firestore
• RDBMS ~ Google Cloud SQL, Google Cloud Spanner
• Cache ~ Google Cloud MemoryStore
• Shared Resources [Shared Among Pipelines] ~ Data Discovery ~ Google Cloud Data Catalog | Data Orchestration ~ Google Cloud Composer | Monitoring ~ Google Cloud Stackdriver
Cost of a Data Pipeline on Google GCP

Pipeline Components ~ Data Lake [Raw Repository, Low-Cost Storage], Real-Time Ingestion, Data Processing [Data Analytics Platform ~ Apache Spark Engine], Data Serving [Massively Parallel Processing Engine [MPP] ~ Analytical & Modern Data Warehouse [MDW]]

Cost for Google GCS
• Region – US-Central
• Capacity – 1 TB
• Class A & B Operations – 1 Million
• Monthly Cost – R$ 143

Total Cost for Data Pipelines on Google GCP
• Storage Layer = R$ 336
• Data Processing Layer = R$ 2.156
• Data Serving Layer = R$ 170
• Total Monthly Cost – R$ 2.662
Total Monthly Cost Comparison
• Microsoft Azure ~ R$ 23.550
• Amazon AWS ~ R$ 21.663
• Google GCP ~ R$ 2.662
OSS Big Data Products on [Spotlight]

• Apache Kafka ~ Trusted by 80% of ALL Fortune 100 Companies. Ingest and Process Data Effortlessly
• Apache Pulsar ~ Messaging & Streaming Platform. Pulsar Functions, Persistent Storage, Multi-Tenancy with Low Latency
• Apache Spark ~ PySpark, Spark SQL, Java, Scala, R, .NET. The Most Used Big Data Product
• Apache Airflow ~ Programmatically Author, Schedule & Monitor Workflows using Python. Newest 2.0 Version Out
• Apache Pinot ~ Real-Time Distributed OLAP Data Store, Designed for Low-Latency Queries at Scale
• Trino ~ Data Processing Engine Unleashing SQL at Scale & Providing Data Virtualization
• Dremio ~ Next-Generation Data Lake Engine for Interactive Queries at Blazing-Fast Speed
• YugaByteDB ~ Cloud-Native Database Platform Offering Different APIs – Redis, Postgres & Cassandra
Azure Big Data Products on [Spotlight]

Azure Purview ~ Unified Data Governance with Data Discovery, Sensitive Data Classification & End-to-End Data Lineage
1. Data Discovery, Classification and Mapping
2. Data Catalog ~ Searching & Web-Based Experience
3. Data Governance ~ Enabling Key Insights and Understanding of Data Quality Rules

Azure Databricks ~ Fast, Easy & Collaborative Apache Spark-Based Analytics Service Providing a Fast Deployment Process
1. Databricks Runtime ~ Optimized for Cloud Storage
2. Managed Delta Lake
3. Integrated Workspace – GitHub
4. Production Jobs & Workflows
5. Enterprise Security
6. Integrations using ODBC & JDBC
7. SQL Analytics – Redash + Delta Lake Engine

Azure Synapse Analytics ~ MDW with Limitless Analytics Service and Unmatched Time to Insight – PaaS & SaaS Approaches
1. Serverless & Dedicated Options
2. Data Lake Exploration
3. Code-Free ETL & ELT
4. Deep Integration with Apache Spark & SQL Engines
5. Languages – T-SQL, Python, Scala, Spark SQL and .NET
6. Cloud-Native HTAP with Azure Synapse Link ~ CosmosDB
7. AI & BI
AWS Big Data Products on [Spotlight]

Managed Streaming for Apache Kafka [MSK] ~ Fully Managed Service to Build and Run Applications ~ Apache Kafka to Process Streaming Data Effortlessly
1. Amazon MSK Runs and Manages Apache Kafka, Maintaining Open-Source Compatibility ~ MirrorMaker, Apache Flink, and Prometheus
2. VPC Network Isolation, AWS IAM for Control-Plane API Authorization, Encryption at Rest, TLS In-Transit Encryption

Amazon Glue & DataBrew ~ Serverless Data Integration Service for ETL, ELT, Catalog, Lineage & Transformations for Cleaning and Enriching Data
1. Discover, Prepare, & Combine Data for Analytics, Machine Learning, and Application Development
2. DataBrew is a New Visual Data Preparation Tool ~ Clean and Normalize Data for Analytics and Machine Learning

Amazon Managed Workflows for Apache Airflow [MWAA] ~ Managed Orchestration Service for Apache Airflow; Operate End-to-End Data Pipelines in the Cloud at Scale with Minimum Effort & Configuration
1. Data Secured by Default, Running in an Isolated and Secure Cloud Environment using VPC; Data is Automatically Encrypted using KMS
2. Connect ~ AWS or On-Premises Resources Required for Workflows, Including Athena, Batch, CloudWatch, DynamoDB, DataSync, EMR, Fargate, EKS, Firehose, Glue, Lambda, Redshift, SQS, SNS, SageMaker & S3
GCP Big Data Products on [Spotlight]

Google Cloud Data Fusion ~ Fully Managed, Cloud-Native Data Integration at Any Scale, using Ephemeral DataProc Clusters Underneath
1. Code-Free ETL & ELT Deployment of Data Pipelines
2. Library of 150+ Configured Connectors & Transformations
3. Built with OSS Core CDAP for Pipeline Portability

Google Cloud Dataflow ~ Unified Stream and Batch Data Processing ~ Serverless, Fast, and Cost-Effective, using the Apache Beam Framework
1. Automated Provisioning and Management of Processing Resources
2. Horizontal Autoscaling of Worker Resources
3. OSS Community-Driven Innovation with the Apache Beam SDK
4. Reliable and Consistent Exactly-Once Processing

Google BigQuery ~ Serverless, Highly Scalable, and Cost-Effective Multi-Cloud Data Warehouse Designed for Business Agility
1. Analyze Petabytes of Data Using ANSI SQL at Blazing-Fast Speeds, with Zero Operational Overhead
2. Democratize Insights with a Trusted and Secure Platform that Scales
3. Gain Insights from Data Across Clouds with a Flexible, Multi-Cloud Analytics Solution ~ Omni
“War is 90% information.”
- Napoleon Bonaparte
“A scientist can discover a new star, but he cannot make
one. He would have to ask an engineer to do it for him.”
- Gordon Lindsay Glegg
“In God we trust. All others must bring data.”
- W. Edwards Deming
Data Engineer Career - Part 1 ~ Data Engineer Technical Skills

1. OS & Programming Languages ~ Linux, SQL, Python, Scala
2. DBMS & NoSQL ~ SQL Server, Oracle, PostgreSQL, MySQL, MongoDB, Cassandra, Redis Cache
4. Distributed Systems & Big Data Frameworks
5. Data Pipelines & Cloud Computing ~ Lambda & Kappa, Google GCP, Amazon AWS, Microsoft Azure
• Industry Knowledge ~ Understanding the Way Your Chosen Industry Functions and How Data Can Be Collected, Analyzed and Utilized; Maintaining Flexibility in the Face of Big Data Developments
• Effective Collaboration ~ Carefully Listening to Management, Data Scientists and Data Architects to Establish Their Needs
Data Engineer Certifications
Data Engineer Career - Part 3

Contact
• LinkedIn ~ Luan Moreno Medeiros Maciel
• Facebook ~ Luan Moreno Medeiros Maciel
• Instagram ~ engenhariadedados
• Podcast ~ engenhariadedadoscast
Thank You