
Incremental Data Ingestion from Files


Learning Objectives

- What incremental data ingestion from files is
- COPY INTO
- Auto Loader

Derar Alhussein © Udemy | Databricks Certified Data Engineer Associate - Preparation


Incremental Data Ingestion

- Loading only the new data files encountered since the last ingestion
- Reduces redundant processing
- Two mechanisms:
  - COPY INTO
  - Auto Loader



COPY INTO

- SQL command
- Idempotently and incrementally loads new data files
- Files that have already been loaded are skipped


COPY INTO

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS (<format options>)
COPY_OPTIONS (<copy options>);



Example

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('delimiter' = '|',
                'header' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');



Auto Loader

- Based on Spark Structured Streaming
- Can process billions of files
- Supports near real-time ingestion of millions of files per hour



Auto Loader Checkpointing

- Stores metadata of the discovered files
- Provides exactly-once guarantees
- Enables fault tolerance
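A minimal sketch of what checkpointing buys (the file names and JSON format here are illustrative; Auto Loader's real checkpoint format is internal): metadata about discovered files is persisted to durable storage, so a restarted stream resumes where it left off instead of reprocessing everything.

```python
import json
import os
import tempfile

# Conceptual sketch of checkpoint-based recovery (illustrative, not Auto Loader's
# actual format): persisting the set of discovered files gives fault tolerance,
# because a restart reads the checkpoint and skips files already processed.

def load_checkpoint(path):
    """Read the set of previously processed files, or an empty set on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def process_new_files(path, discovered):
    """Process only unseen files, then persist the updated checkpoint."""
    seen = load_checkpoint(path)
    new = sorted(set(discovered) - seen)
    with open(path, "w") as f:  # persist before acknowledging → survives restarts
        json.dump(sorted(seen | set(new)), f)
    return new

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
print(process_new_files(ckpt, ["a.json", "b.json"]))            # → ['a.json', 'b.json']
# Simulated restart: a fresh call reads the checkpoint and skips processed files
print(process_new_files(ckpt, ["a.json", "b.json", "c.json"]))  # → ['c.json']
```

Because the checkpoint lives on disk rather than in memory, even a crashed process picks up the file set on restart, which is the essence of the exactly-once guarantee above.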



Auto Loader in PySpark API

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", <source_format>)
    .load('/path/to/files')
    .writeStream
    .option("checkpointLocation", <checkpoint_directory>)
    .table(<table_name>))



Auto Loader + Schema

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", <source_format>)
    .option("cloudFiles.schemaLocation", <schema_directory>)
    .load('/path/to/files')
    .writeStream
    .option("checkpointLocation", <checkpoint_directory>)
    .option("mergeSchema", "true")
    .table(<table_name>))
COPY INTO vs. Auto Loader

COPY INTO:
- Thousands of files
- Less efficient at scale

Auto Loader:
- Millions of files
- Efficient at scale

