Use Delta Lake in Azure Synapse Analytics
Agenda
Transform data with Apache Spark in Azure Synapse Analytics
What is Delta Lake?
• Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.
• Delta Lake is a type of data lake that adds features such as ACID transactions, schema enforcement, and lineage tracking. These features make Delta Lake more reliable and easier to manage than a traditional data lake, and a good choice for streaming data applications.
• Delta Lake is a storage layer designed to run on top of an existing data lake and improve its reliability, security, and performance. It supports ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
• Delta Lake is the default storage format for all operations on Databricks. Unless otherwise specified, all tables on Databricks are Delta tables. Databricks originally developed the Delta Lake protocol and continues to actively contribute to the open-source project.
The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET and is compatible with Linux Foundation Delta Lake.
Learn more >> Delta Lake Official Website
In short, Delta Lake is an open-source storage layer that adds relational database semantics to Spark:
• Relational tables that support querying and data modification
ACID (Atomicity, Consistency, Isolation, Durability) transactions are fundamental concepts in database systems that
ensure data integrity and reliability. Here's a simplified example of an ACID transaction:
Let's consider a banking application where a user transfers money from one account to another. We'll ensure the transaction adheres to the ACID properties (a sketch of such a transfer in Delta Lake follows this list):
1. Atomicity: The entire transaction must complete successfully, or none of it should be applied.
2. Consistency: The database must transition from one consistent state to another. For example, the total sum of all account balances must remain the same before and after the transaction.
3. Isolation: Transactions should appear isolated from each other. Even if multiple transactions execute concurrently, the result should be the same as if they were executed serially.
4. Durability: Once a transaction is committed, its effects are permanent, even in the case of system failure.
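To relate this back to Delta Lake, here is a minimal PySpark sketch; the /delta/accounts path, account IDs, and balance column are hypothetical. Both sides of the transfer are applied in a single MERGE, which Delta Lake commits as one atomic transaction.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hypothetical Delta table of account balances
accounts = DeltaTable.forPath(spark, "/delta/accounts")

# The transfer expressed as two balance adjustments: debit one account, credit the other
transfer = spark.createDataFrame(
    [("A-001", -100.0), ("A-002", 100.0)],
    ["account_id", "amount"])

# A single MERGE commits as one Delta Lake transaction: both rows are updated or neither is (atomicity),
# readers never see a half-applied transfer (isolation), and the committed result survives failures (durability)
accounts.alias("a") \
    .merge(transfer.alias("t"), "a.account_id = t.account_id") \
    .whenMatchedUpdate(set = { "balance": F.col("a.balance") + F.col("t.amount") }) \
    .execute()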
[Good to know]
Please note that you don’t need to use Delta Lake to manipulate or query data in Spark using SQL. You can create
tables in the Hive metastore with data files in CSV or Parquet format and query them with SELECT statements.
However, Delta Lake saves the data in Parquet format with additional metadata that enables tracking of transactions
and versioning, providing an experience much more similar to a relational database system like SQL Server. Most
new “data lakehouse” solutions built on Spark are based on Delta Lake, enabling you to combine the flexibility of
file-based data lake storage with the transactional data integrity capabilities of a relational data warehouse.
delta_table_path = "/delta/mydata"
# Save the dataframe in delta format at the chosen path
df.write.format("delta").save(delta_table_path)
[Good to know]
Delta Lake tables are just like any other metastore tables except that the delta file format is
used to save the data. This results in a folder structure that not only contains the data files, but
also metadata files that enable transactional functionality.
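As a quick illustration (a sketch assuming the Synapse mssparkutils helper and the delta_table_path defined above), listing the table folder shows the Parquet data files alongside the _delta_log folder that holds the transaction metadata:

from notebookutils import mssparkutils

# List the contents of the Delta table folder: Parquet data files plus the _delta_log metadata folder
for item in mssparkutils.fs.ls(delta_table_path):
    print(item.name)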
Learn more >> How to create and append to Delta Lake tables with pandas
Create Delta Lake tables
To update a Delta Lake table using the API, you create a DeltaTable object based on the path where the data files are stored, and use its methods to perform data updates, inserts, and deletes.
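For example, a minimal sketch (the column names and values are hypothetical) of conditional updates and deletes through the DeltaTable API:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Create a DeltaTable object from the folder containing the table's data files
deltaTable = DeltaTable.forPath(spark, delta_table_path)

# Update rows that match a condition
deltaTable.update(
    condition = F.col("category") == "Accessories",
    set = { "price": F.col("price") * 0.9 })

# Delete rows that match a condition
deltaTable.delete(F.col("price") == 0)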
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
The “Time Travel” feature of Delta Lake takes advantage of the metadata that tracks transactions in the table. This enables you to retrieve and compare different versions of the same row based on the sequence of modifications made to the table, or as of a given point in time.
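In addition to versionAsOf shown above, you can read the table as it existed at a given point in time (the timestamp below is only an illustrative value):

df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load(delta_table_path)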
• So far you've worked with Delta tables by loading data from the folder containing the Parquet files on which the table is based.
• You can define catalog tables that encapsulate the data and provide a named table
entity that you can reference in SQL code.
• Managed tables
• Defined without a specific location – files are created in metastore folder [stored in the default
metastore file system location]
• [tightly bound to the files] >> Dropping the table deletes the files
• External tables
• Defined with a specific file location
• Dropping the table does not delete the files
df.write.format("delta").option("path","/mydata").saveAsTable("MyExternalTable")
OR
%%sql
CREATE TABLE MyExternalTable
USING DELTA
LOCATION '/mydata'
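For comparison, a minimal sketch of a managed table created without a specified location (the table name is hypothetical); dropping it would also delete the underlying files:

df.write.format("delta").saveAsTable("MyManagedTable")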
Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream.
Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including:
• Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
• Efficiently discovering which files are new when using files as the source for a stream
The key point is that Spark Structured Streaming provides a way to handle a stream of real-time data by
using Dataframe semantics – essentially you can query a stream of data like a boundless dataframe that is
perpetually receiving new data.
Streaming support in Delta Lake builds on this by enabling you to treat a table as a source of streaming
data for a Spark Structured Streaming dataframe, or as a sink (target) to which a Spark Structured Streaming
dataframe writes its data.
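Below is a minimal sketch (the folder paths and checkpoint location are hypothetical) that reads one Delta table as a streaming source and writes the stream to another Delta table as a sink:

# Read a Delta table as a streaming source
stream_df = spark.readStream.format("delta") \
    .option("ignoreChanges", "true") \
    .load("/delta/source_table")

# Write the stream to another Delta table; the checkpoint folder tracks progress for exactly-once processing
delta_stream = stream_df.writeStream.format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/sink_table") \
    .start("/delta/sink_table")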
# To inspect the streamed output, read the sink folder (from the sketch above) as a static dataframe and show it
spark.read.format("delta").load("/delta/sink_table").show()
In Azure Synapse Analytics, you can also work with Delta Lake tables using SQL in a SQL pool.
• The OPENROWSET function enables you to read the content of Delta Lake files by providing the URL
to your root folder.
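For example, a minimal serverless SQL pool sketch (the storage account and folder path are hypothetical placeholders):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/files/delta/mytable/',
    FORMAT = 'DELTA'
) AS deltadata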
Learn more >> Query Delta Lake files using serverless SQL pool in Azure Synapse Analytics
You can also use native SQL SELECT statements to query Delta Lake tables in the Spark metastore.
• By default, metastore tables are defined in a database named default; but you can create additional
databases in the metastore just as you can in SQL Server and query the tables they contain by using
the specific database name.
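For example, a sketch (shown here as a Spark SQL notebook cell, assuming the external table created earlier lives in the default database):

%%sql
SELECT * FROM default.MyExternalTable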
https://aka.ms/mslearn-delta-lake
You've loaded a Spark dataframe with data that you now want to use in a Delta Lake table. What format should you use to write the dataframe to storage?
• CSV
• PARQUET
• DELTA
What feature of Delta Lake enables you to retrieve data from previous versions of a table?
• Spark Structured Streaming
• Time Travel
• Catalog Tables