Databricks Performance Tuning
Optimization
What will be covered?
● Databricks Fundamental Architecture
● Concurrency Handling in Databricks
● Resources
▪ Databricks builds on open source Spark and Delta Lake to provide collaborative notebooks, integrated
workflows, and enterprise security, all in a fully managed cloud platform
▪ Databricks uses a lakehouse architecture, which combines data lakes and data warehouses. A lakehouse
implements warehouse-style data structures and management features on top of cloud storage in open formats.
▪ One of the key tenets of the Lakehouse architecture is the open storage format using Parquet
▪ Delta Lake is the default storage format for all operations on Databricks. Delta Lake extends Parquet
data files with a file-based transaction log for ACID transactions and scalable metadata handling
▪ Modern data warehouses employ an MPP (massively parallel processing) architecture, which uses
multiple nodes to process a single query and achieve high SQL performance. Databricks uses Photon, a
native vectorized engine written from scratch in C++ for modern SIMD hardware, to perform heavily
parallel query processing
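The file-based transaction log mentioned above can be pictured with a small sketch. This is a deliberately simplified model, not the actual Delta Lake protocol: it assumes each commit is just a list of "add" and "remove" actions on file paths, and the file names and the `live_files` helper are hypothetical. Replaying the commits in order gives the set of Parquet files that make up a given table version.

```python
# Hypothetical commits, loosely modelled on Delta Lake's add/remove log actions.
# Real commits carry more information (statistics, partition values, metadata).
commits = [
    # version 0: initial write of two Parquet files
    [{"add": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0001.parquet"}}],
    # version 1: an append adds one more file
    [{"add": {"path": "part-0002.parquet"}}],
    # version 2: compaction replaces two small files with one larger file
    [{"remove": {"path": "part-0000.parquet"}},
     {"remove": {"path": "part-0001.parquet"}},
     {"add": {"path": "part-0003.parquet"}}],
]


def live_files(commits, as_of_version=None):
    """Replay commits in order and return the files in the table snapshot."""
    last = len(commits) - 1 if as_of_version is None else as_of_version
    files = set()
    for actions in commits[: last + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(commits)))                   # latest version
print(sorted(live_files(commits, as_of_version=1)))  # an older version
```

Because a reader only ever sees the files listed by a fully written commit, a half-finished write is invisible to it, and older versions stay readable for as long as their files are retained.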
Processes:
1. Identify the longest running queries
2. Identify the expensive query operators
3. Analyze the statistics (time spent) for each query operator
4. Identify the elements to be improved or optimized
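The four steps above can be sketched in a few lines of Python. The records below are made-up sample data in a hypothetical `(query_id, operator, time_ms)` layout, not the format of any Databricks export; in practice the numbers would come from the query history and query profile. Summing operator times is also a simplification, since operators can run in parallel.

```python
from collections import defaultdict

# Hypothetical profile records: (query_id, operator, time_ms).
profile = [
    ("q1", "Scan", 1200),
    ("q1", "HashJoin", 5400),
    ("q1", "Aggregate", 800),
    ("q2", "Scan", 300),
    ("q2", "Sort", 450),
    ("q3", "Scan", 9000),
    ("q3", "Shuffle", 2500),
]


def rank_queries(records):
    """Step 1: total time per query, longest first."""
    totals = defaultdict(int)
    for query_id, _operator, time_ms in records:
        totals[query_id] += time_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def operator_breakdown(records, query_id):
    """Steps 2 and 3: operators of one query with time and share of the total."""
    ops = [(op, ms) for qid, op, ms in records if qid == query_id]
    total = sum(ms for _op, ms in ops)
    rows = [(op, ms, ms / total) for op, ms in ops]
    return sorted(rows, key=lambda row: row[1], reverse=True)


slowest_query, slowest_total = rank_queries(profile)[0]
print(slowest_query, slowest_total)
# Step 4: the operators at the top of this list are the first tuning candidates.
for operator, time_ms, share in operator_breakdown(profile, slowest_query):
    print(f"{operator:10s} {time_ms:6d} ms  {share:.0%}")
```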
Databricks sets many default parameters for Delta Lake that impact the size of data files and number of
table versions that are retained in history. Delta Lake uses a combination of metadata parsing and
physical data layout to reduce the number of files scanned to fulfill any query.
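A toy illustration of how metadata can reduce the number of files scanned: assume each data file records the minimum and maximum value of a column (the file names, values, and the `files_to_scan` helper are hypothetical). A range filter then only needs to read files whose value range overlaps the filter.

```python
# Hypothetical per-file statistics for one column, e.g. an order id.
files = [
    {"path": "part-0000.parquet", "min": 1, "max": 100},
    {"path": "part-0001.parquet", "min": 101, "max": 200},
    {"path": "part-0002.parquet", "min": 150, "max": 300},
]


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter [lo, hi]."""
    return [f["path"] for f in files if f["max"] >= lo and f["min"] <= hi]


# WHERE col BETWEEN 120 AND 140 -> only one of the three files can match.
print(files_to_scan(files, 120, 140))
# WHERE col BETWEEN 160 AND 180 -> two files overlap, so both are read.
print(files_to_scan(files, 160, 180))
```

The second query shows why physical layout matters: the more the value ranges of files overlap, the fewer files can be skipped, which is the reason for clustering related values into the same files.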
Databricks provides configuration options, available when creating and editing clusters, that help handle concurrent workloads.
Autoscaling allows a cluster to resize automatically based on its workload, so it can absorb changes in concurrent demand.
It can take minutes for the additional instances to warm up when Databricks scales up. To reduce cluster
start time, you can attach a cluster to a predefined pool of idle instances for the driver and worker nodes.
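As a rough sketch, autoscaling and pools can be combined in one cluster specification. The field names below follow the Databricks Clusters API as commonly documented (`autoscale`, `instance_pool_id`, `driver_instance_pool_id`), but treat them as assumptions to check against the current API reference. The cluster name, pool ID, runtime version string, and the `build_cluster_spec` helper are placeholders for illustration.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, pool_id, driver_pool_id=None):
    """Build a cluster spec that autoscales and draws instances from a pool."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, set to a real runtime
        # Autoscaling: the cluster resizes between these bounds based on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Pool of idle instances used for the worker nodes.
        "instance_pool_id": pool_id,
        # Pool used for the driver node; defaults to the worker pool here.
        "driver_instance_pool_id": driver_pool_id or pool_id,
    }


spec = build_cluster_spec("etl-cluster", 2, 8, "<pool-id>")
print(json.dumps(spec, indent=2))
```

A higher `min_workers` keeps more capacity warm at a higher baseline cost, while a lower value relies more on the pool to make scale-up fast.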
● https://www.databricks.com/blog/2022/03/10/top-5-databricks-performance-tips.html
● https://docs.databricks.com/optimizations/disk-cache.html
● https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html
● https://docs.databricks.com/delta/data-skipping.html