Use Delta Lake in Azure Synapse Analytics

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It adds relational database semantics like tables, queries, and data modification, along with features like ACID transactions, schema enforcement, and support for both batch and streaming data. Delta Lake tables can be queried and modified using Spark SQL and are compatible with other big data technologies.


Perform data engineering with Azure Synapse Apache Spark Pools
Use Delta Lake in Azure Synapse Analytics

© Copyright Microsoft Corporation. All rights reserved.


Agenda
• Analyze data with Apache Spark in Azure Synapse Analytics
• Transform data with Apache Spark in Azure Synapse Analytics
• Use Delta Lake in Azure Synapse Analytics

© Copyright Microsoft Corporation. All rights reserved.


Use Delta Lake in Azure Synapse Analytics

© Copyright Microsoft Corporation. All rights reserved.


What is Delta Lake?
• Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.
• Delta Lake adds features such as ACID transactions, schema enforcement, and lineage tracking to data lake storage. These features make a Delta Lake more reliable and easier to manage than a traditional data lake, and also make it a good choice for streaming data applications.
• A Delta Lake is a storage layer designed to run on top of an existing data lake and improve its reliability, security, and performance. Delta Lake supports ACID transactions, scalable metadata, and unified streaming and batch data processing.
• Delta Lake is the default storage format for all operations on Databricks. Unless otherwise specified, all tables on Databricks are Delta tables. Databricks originally developed the Delta Lake protocol and continues to actively contribute to the open-source project.

The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET and is compatible with Linux Foundation Delta Lake.

Learn more >> Delta Lake Official Website
© Copyright Microsoft Corporation. All rights reserved.
What is Delta Lake?
So, Delta Lake is an open-source storage layer that adds relational database semantics to Spark:
• Relational tables that support querying and data modification
• Support for ACID* transactions
• Data versioning and Time Travel
• Support for batch and streaming data
• Standard formats and interoperability
*ACID is an acronym that refers to the set of four key properties that define a transaction:
• Atomicity: Multiple operations can be grouped into a single logical entity. Each statement in a transaction (to read, write, update, or delete data) is treated as a single unit: either the entire statement is executed, or none of it is executed. This property prevents data loss and corruption from occurring if, for example, your streaming data source fails mid-stream.
• Consistency: Ensures that transactions only make changes to tables in predefined, predictable ways. Transactional consistency ensures that corruption or errors in your data do not create unintended consequences for the integrity of your table.
• Isolation: When multiple users are reading and writing from the same table at once, isolation of their transactions ensures that concurrent transactions don't interfere with or affect one another. Each request can proceed as though it were occurring on its own, even though the requests are actually occurring simultaneously.
• Durability: Ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure.
If a database operation has the above ACID properties, it can be called an ACID transaction, and data storage systems that apply these operations are called transactional systems.
© Copyright Microsoft Corporation. All rights reserved.

ACID (Atomicity, Consistency, Isolation, Durability) transactions are fundamental concepts in database systems that ensure data integrity and reliability. Here's a simplified example of an ACID transaction:
Let's consider a banking application where a user transfers money from one account to another (a code sketch follows the list below). We'll ensure the transaction adheres to the ACID properties:
1. Atomicity: The entire transaction must complete successfully, or none of it should be applied.
2. Consistency: The database must transition from one consistent state to another after the transaction. For example, the total sum of all account balances must remain the same before and after the transfer.
3. Isolation: Transactions should appear isolated from each other. Even if multiple transactions are executing concurrently, the result should be the same as if they were executed serially.
4. Durability: Once a transaction is committed, its effects are permanent, even in the case of system failure.
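
As a minimal sketch (assuming a hypothetical accounts Delta Lake table at /delta/accounts with account_id and balance columns), a single Delta Lake MERGE applies both sides of the transfer in one atomic commit: either both balances change or neither does.

from delta.tables import DeltaTable

accounts = DeltaTable.forPath(spark, "/delta/accounts")   # hypothetical table path

# Debit account A and credit account B as one atomic Delta Lake commit
transfer = spark.createDataFrame([("A", -100.0), ("B", 100.0)], ["account_id", "delta_amount"])

(accounts.alias("a")
    .merge(transfer.alias("t"), "a.account_id = t.account_id")
    .whenMatchedUpdate(set = { "balance": "a.balance + t.delta_amount" })
    .execute())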

© Copyright Microsoft Corporation. All rights reserved.


[Good to know]
Please note that you don’t need to use Delta Lake to manipulate or query data in Spark using SQL. You can create
tables in the Hive metastore with data files in CSV or Parquet format and query them with SELECT statements.
However, Delta Lake saves the data in Parquet format with additional metadata that enables tracking of transactions
and versioning, providing an experience much more similar to a relational database system like SQL Server. Most
new “data lakehouse” solutions built on Spark are based on Delta Lake, enabling you to combine the flexibility of
file-based data lake storage with the transactional data integrity capabilities of a relational data warehouse.

© Copyright Microsoft Corporation. All rights reserved.


Create Delta Lake tables

Create a Delta Lake table from a dataframe


df = spark.read.load("/data/mydata.csv", format="csv", header=True)

delta_table_path = "/delta/mydata”

df.write.format("delta").save(delta_table_path)

[Good to know]
Delta Lake tables are just like any other metastore tables except that the delta file format is
used to save the data. This results in a folder structure that not only contains the data files, but
also metadata files that enable transactional functionality.
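
As an optional check (a sketch assuming the Synapse mssparkutils helper is available in your notebook), you can list the table folder to see the Parquet data files alongside the _delta_log metadata folder:

from notebookutils import mssparkutils

# List the contents of the Delta Lake table folder
for f in mssparkutils.fs.ls(delta_table_path):
    print(f.name)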

Learn more >> How to create and append to Delta Lake tables with pandas
© Copyright Microsoft Corporation. All rights reserved.
Create Delta Lake tables

Make conditional updates

To update a delta lake table using the API, you create a DeltaTable object based on the path where the data
files are stored, and use its methods to perform data updates, inserts, and deletes.

from delta.tables import *
from pyspark.sql.functions import *

# spark is the active SparkSession (pyspark.sql.session.SparkSession)
# From the previous slides: delta_table_path = "/delta/mydata"
deltaTable = DeltaTable.forPath(spark, delta_table_path)   # deltaTable is a DeltaTable object

# Apply a 10% discount to all rows in the Accessories category
deltaTable.update(
    condition = "Category == 'Accessories'",
    set = { "Price": "Price * 0.9" })

Learn more >> Table deletes, updates, and merges


Delta Lake supports several statements to facilitate deleting data from and updating data in Delta tables.
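
For example, rows can also be removed through the same API (a minimal sketch continuing from the deltaTable object above; the Discontinued column is illustrative):

# Delete all rows for discontinued products
deltaTable.delete(condition = "Discontinued == True")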

© Copyright Microsoft Corporation. All rights reserved.


Create Delta Lake tables
If you want to access the data that you overwrote, you can query a snapshot of the table before you overwrote the first set of
data using the versionAsOf option.

Query a previous version (Time Travel)

# From the previous slides: delta_table_path = "/delta/mydata"
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)

The "Time Travel" feature of Delta Lake takes advantage of the table metadata, which tracks every transaction against the table. This enables you to retrieve and compare earlier versions of the data, either by version number or by a given point in time.
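
To query by time rather than by version number, you can use the timestampAsOf option instead (a minimal sketch; the timestamp value is illustrative):

# Read the table as it was at a specific point in time
df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load(delta_table_path)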

© Copyright Microsoft Corporation. All rights reserved.


Create catalog tables

• So far you’ve worked with Delta Lake tables by loading data from the folder containing the Parquet files on which the table is based.

• You can define catalog tables that encapsulate the data and provide a named table entity that you can reference in SQL code.

• Spark supports two kinds of catalog tables for Delta Lake:


• Managed tables
• External tables

© Copyright Microsoft Corporation. All rights reserved.


Create catalog tables
Catalog tables are how the metastore defines relational tables “on top of” file locations.
• They’re not unique to Delta Lake (you can create managed and external tables for Parquet and CSV
formats too)
• [but] increasingly Delta Lake is the preferred way to build relational semantics into a data lakehouse
solution on Spark.

Key differences between managed and external tables

• Managed tables
• Defined without a specific location – files are created in metastore folder [stored in the default
metastore file system location]
• [tightly bound to the files] >> Dropping the table deletes the files
• External tables
• Defined with a specific file location
• Dropping the table does not delete the files
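
If you need to check which kind a given table is, its catalog metadata reports the type (a minimal sketch; MyTable is an illustrative table name):

# The "Type" row in the output shows MANAGED or EXTERNAL
spark.sql("DESCRIBE EXTENDED MyTable").show(truncate=False)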

© Copyright Microsoft Corporation. All rights reserved.


Create catalog tables

df.write.format("delta").option("path","/mydata").saveAsTable("MyExternalTable")

OR

spark.sql("CREATE TABLE MyExternalTable USING DELTA LOCATION '/mydata'")

OR

%%sql
CREATE TABLE MyExternalTable
USING DELTA
LOCATION '/mydata'
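
For comparison, a managed table is created without specifying a path, so its files are stored in the default metastore location (a minimal sketch; MyManagedTable is an illustrative name):

df.write.format("delta").saveAsTable("MyManagedTable")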

© Copyright Microsoft Corporation. All rights reserved.


Use Delta Lake with streaming data

Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream.

Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:
• Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
• Efficiently discovering which files are new when using files as the source for a stream

The key point is that Spark Structured Streaming provides a way to handle a stream of real-time data by
using Dataframe semantics – essentially you can query a stream of data like a boundless dataframe that is
perpetually receiving new data.

Streaming support in Delta Lake builds on this by enabling you to treat a table as a source of streaming
data for a Spark Structured Streaming dataframe, or as a sink (target) to which a Spark Structured Streaming
dataframe writes its data.

Learn more >> Table streaming reads and writes


© Copyright Microsoft Corporation. All rights reserved.
Use Delta Lake with streaming data
Use Delta Lake table as a streaming source
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Load a streaming dataframe from the Delta Lake table folder
stream_df = spark.readStream.format("delta").option("ignoreChanges", "true").load("/delta/internetorders")

# A streaming dataframe can't be displayed with show(); it must be written to a sink.
# For example, write it to the console to inspect the incoming rows:
stream_df.writeStream.format("console").start()

Use Delta Lake table as a streaming sink


from pyspark.sql.types import *
from pyspark.sql.functions import *

# Define your schema if it's known (rather than relying on Spark to infer it), e.g.:
jsonSchema = StructType([
    StructField("time", TimestampType(), True),
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True)])

# Read a stream of JSON files from the folder referenced by inputPath
stream_df = spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger", 1).json(inputPath)

table_path = '/delta/devicetable'
checkpoint_path = '/delta/checkpoint'

# Write the stream to a Delta Lake table, tracking progress in the checkpoint folder
delta_stream = stream_df.writeStream.format("delta").option("checkpointLocation", checkpoint_path).start(table_path)
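
While the stream is running, the target Delta Lake table can be queried like any other Delta table, and the streaming query can be stopped when it's no longer needed (a minimal sketch continuing from the code above):

# Read the data that has been written to the sink table so far
df = spark.read.format("delta").load(table_path)
df.show()

# Stop the streaming query when finished
delta_stream.stop()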

© Copyright Microsoft Corporation. All rights reserved.


Learn more >> Structured streaming
Use Delta Lake in a SQL pool
Until now, we’ve focused on working with Delta Lake tables using Spark.

In Azure Synapse Analytics, you can also work with Delta Lake tables using SQL in a SQL pool.

• The OPENROWSET function enables you to read the content of Delta Lake files by providing the URL
to your root folder.

Query delta table files using OPENROWSET


SELECT *
FROM OPENROWSET(
    BULK 'https://mystore.dfs.core.windows.net/files/delta/mytable/',   -- folder location
    FORMAT = 'DELTA'                                                    -- specify the DELTA format
) AS deltadata
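
To make the query easier to reuse, you could wrap it in a view in a serverless SQL database (a minimal sketch; the database name mydb is illustrative and must already exist):

USE mydb;
GO
CREATE VIEW DeltaData AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mystore.dfs.core.windows.net/files/delta/mytable/',
    FORMAT = 'DELTA'
) AS deltadata;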

Learn more >> Query Delta Lake files using serverless SQL pool in Azure Synapse Analytics

© Copyright Microsoft Corporation. All rights reserved.


Use Delta Lake in a SQL pool

You can also use native SQL SELECT statements to query Delta Lake tables in the Spark metastore.

• By default, metastore tables are defined in a database named default; but you can create additional
databases in the metastore just as you can in SQL Server and query the tables they contain by using
the specific database name.

Query delta tables in Spark metastore databases


USE default;

SELECT * FROM MyDeltaTable;
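
For a table defined in a different metastore database, switch to that database by name before querying (a minimal sketch; mydb is an illustrative database name):

USE mydb;

SELECT * FROM MyDeltaTable;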

© Copyright Microsoft Corporation. All rights reserved.


Exercise: Use Delta Lake in Azure Synapse Analytics

Use the hosted lab environment provided, or view the lab instructions at the link below:

https://aka.ms/mslearn-delta-lake

© Copyright Microsoft Corporation. All rights reserved.


Knowledge check
Which of the following descriptions best fits Delta Lake?
☐ A Spark API for exporting data from a relational database into CSV files
☐ A relational storage layer for Spark that supports tables based on Parquet files
☐ A synchronization solution that replicates data between SQL pools and Spark pools

You've loaded a Spark dataframe with data that you now want to use in a Delta Lake table. What format should you use to write the dataframe to storage?
☐ CSV
☐ PARQUET
☐ DELTA

What feature of Delta Lake enables you to retrieve data from previous versions of a table?
☐ Spark Structured Streaming
☐ Time Travel
☐ Catalog Tables

© Copyright Microsoft Corporation. All rights reserved.


Further reading

Perform data engineering with Azure Synapse Apache Spark Pools


https://aka.ms/mslearn-spark

© Copyright Microsoft Corporation. All rights reserved.


What is Delta Lake? [A Good Read]

Delta Lake is an open-source data management system that runs on top of Apache Spark. It aims to bring atomicity, consistency, isolation, and durability (ACID) transactions closer to the data lake, which was previously challenging. Delta Lake improves the reliability, consistency, and scalability of data lakes, making them suitable for use cases that were not previously possible. Delta Lake also supports schema enforcement, ensuring your data conforms to pre-defined schemas. This helps maintain data quality and consistency, reducing the risk of errors and inconsistencies in downstream processes.

Delta Lakes vs Data lakehouse >> A Good Read


© Copyright Microsoft Corporation. All rights reserved.
Key differences between Data Lakehouse and Delta Lake
1. Architecture: Data Lakehouse is a hybrid architecture that combines the best of data
lake and data warehouse capabilities. Delta Lake, on the other hand, is a data management
system running on Apache Spark.
2. Reliability: Although data lakes are highly scalable and flexible, they are not known for
their reliability. Delta Lake, on the other hand, adds ACID transactions to data lakes, making
them much more reliable.
3. Consistency: Data lakes are designed to store data in its raw form, which can make it
difficult to ensure consistency. Delta Lake supports schema enforcement, which ensures
your data conforms to a predefined schema, helping you maintain consistency and reduce
the risk of errors.
4. Processing: Data lakehouses provide SQL-based interfaces that simplify data access for analysts and data scientists. Delta Lake, on the other hand, is designed to work with Apache Spark, a powerful processing engine capable of handling large amounts of data and complex analytics workloads.

Delta Lakes vs Data lakehouse >> A Good Read


© Copyright Microsoft Corporation. All rights reserved.
