Databricks Performance Optimization

What will be covered?

● Triage Query Performance

● Databricks Fundamental Architecture

● Scan Reduction in Databricks

● Concurrency Handling in Databricks

● Resources

© 2022 ThoughtSpot, Inc.


Triage Query Performance

Triage Performance: ThoughtSpot or Databricks

Initial triage should identify whether ThoughtSpot or the warehouse is the source of the problem:


▪ Define acceptable criteria (SLA) for search response time, that is, how long a query should take to
execute. Ideally this is on the order of seconds for the majority of queries and under 15 seconds for longer-running ones.
▪ Execute the search in ThoughtSpot.
▪ Copy the generated SQL from the SQL shown in the search's information. Note that the SQL uses the syntax of the
underlying warehouse.
▪ Execute this SQL in Databricks.
▪ Verify the results. Note that caching may affect the response times.
▪ Compare the Databricks query response time with the ThoughtSpot response time.
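
The comparison in the last step can be sketched as a small decision helper. This is only an illustration: the `Timing` class, the `triage` function, and the result labels are hypothetical names, and the 15-second default simply mirrors the upper SLA bound mentioned above. Timings are assumed to be measured by hand.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Hand-measured timings for one search (hypothetical structure)."""
    thoughtspot_seconds: float  # end-to-end search time seen in ThoughtSpot
    databricks_seconds: float   # same generated SQL run directly in Databricks


def triage(timing: Timing, sla_seconds: float = 15.0) -> str:
    """Suggest where to look first, given an SLA in seconds."""
    if timing.thoughtspot_seconds <= sla_seconds:
        # The search already meets the agreed SLA.
        return "within SLA"
    if timing.databricks_seconds > sla_seconds:
        # The warehouse alone exceeds the SLA, so start there.
        return "investigate Databricks"
    # The SQL is fast in the warehouse but the search is slow end to end.
    return "investigate ThoughtSpot"


print(triage(Timing(thoughtspot_seconds=40.0, databricks_seconds=35.0)))
```

Because caching can shorten a repeated run, it is worth timing both sides under comparable conditions before trusting the verdict.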

SA: William Huang; Last updated: 01/13/2023 3


Databricks Fundamental Architecture

▪ Databricks builds on open source Spark and Delta Lake to provide collaborative notebooks, integrated
workflows, and enterprise security, all in a fully managed cloud platform.
▪ Databricks uses a Lakehouse architecture that combines data lakes and data warehouses. The Lakehouse
implements data structures and management features on top of cloud storage in open formats.
▪ One of the key tenets of the Lakehouse architecture is the open storage format, using Parquet.
▪ Delta Lake is the default storage format for all operations on Databricks. Delta Lake extends Parquet
data files with a file-based transaction log for ACID transactions and scalable metadata handling.
▪ Modern data warehouses employ an MPP (massively parallel processing) architecture, which uses
multiple nodes to process a single query and achieve high SQL performance. Databricks uses Photon, a
native vectorized engine built from scratch in C++ that targets modern SIMD hardware and performs heavily
parallel query processing.

Understand Query Execution in Databricks

Use the query profile to understand query performance:


When a SQL query is submitted, the optimizer builds a plan for how to execute the query and then
executes that plan. To troubleshoot slow queries, the first step is to understand the query execution
plan.

Process:
1. Identify the longest running queries
2. Identify the expensive query operators
3. Analyze the statistics (time spent) for each query operator
4. Identify the elements to be improved or optimized
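
Steps 2 and 3 amount to ranking operators by the share of time they consume. A minimal sketch follows, assuming the per-operator timings have been copied out of the query profile by hand; the `OperatorStat` type, the `rank_operators` function, and the sample figures are all hypothetical and not part of any Databricks API.

```python
from typing import List, NamedTuple, Tuple


class OperatorStat(NamedTuple):
    """One operator from a query profile (hypothetical, hand-entered)."""
    name: str
    time_ms: float


def rank_operators(stats: List[OperatorStat], top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n operators with their fraction of total time spent."""
    total = sum(s.time_ms for s in stats)
    ranked = sorted(stats, key=lambda s: s.time_ms, reverse=True)[:top_n]
    return [(s.name, s.time_ms / total if total else 0.0) for s in ranked]


# Made-up numbers for illustration only.
sample = [
    OperatorStat("Aggregate", 500.0),
    OperatorStat("Scan sales", 8200.0),
    OperatorStat("Hash join", 1300.0),
]

for name, share in rank_operators(sample):
    print(f"{name}: {share:.0%} of operator time")
```

An operator that dominates the total, such as a large scan, is usually the first candidate for the optimizations discussed in the following sections.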

Scan Reduction in Databricks

Databricks sets many default parameters for Delta Lake that affect the size of data files and the number of
table versions retained in history. Delta Lake uses a combination of metadata parsing and
physical data layout to reduce the number of files scanned to fulfill any query.

● Data skipping with Z-order indexes for Delta Lake
● Compact data files with OPTIMIZE on Delta Lake
● Remove unused data files with VACUUM
● Configure Delta Lake to control data file size
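
The techniques above map onto a handful of Delta Lake SQL statements. The sketch below only builds the statement strings; the `maintenance_statements` helper, the table and column names, and the chosen sizes are assumptions for illustration. In a Databricks notebook each string could then be passed to `spark.sql(...)`, and suitable values depend on the workload.

```python
from typing import List, Sequence


def maintenance_statements(
    table: str,
    zorder_columns: Sequence[str] = (),
    retain_hours: int = 168,           # 7 days, the usual Delta retention default
    target_file_size: str = "128mb",   # illustrative value, not a recommendation
) -> List[str]:
    """Build Delta Lake maintenance SQL for one table (hypothetical helper)."""
    optimize = f"OPTIMIZE {table}"
    if zorder_columns:
        # Z-ordering co-locates related values so more files can be skipped.
        optimize += f" ZORDER BY ({', '.join(zorder_columns)})"
    return [
        # Set the target file size first so the compaction below can honor it.
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.targetFileSize' = '{target_file_size}')",
        # Compact small files (and apply Z-ordering if columns were given).
        optimize,
        # Remove data files no longer referenced by retained table versions.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


for stmt in maintenance_statements("sales.orders", ["customer_id", "order_date"]):
    print(stmt)
```

Columns that appear most often in filters are the natural Z-order candidates, since data skipping only helps when queries filter on the clustered columns.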

Concurrency Handling

Several configuration options are available when creating and editing Databricks clusters to handle concurrent workloads.

Autoscaling allows clusters to resize automatically based on workload and so absorb changes in concurrency.

Set the cluster mode to “High Concurrency”.

It can take minutes for additional clusters to warm up when Databricks scales up. To reduce cluster
start time, you can attach a cluster to a predefined pool of idle instances for the driver and worker nodes.
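
As a rough sketch, autoscaling and an instance pool can be combined in a single cluster specification. The field names below follow the Databricks Clusters API as far as I know it (`autoscale`, `instance_pool_id`, `driver_instance_pool_id`), but the `build_cluster_spec` helper and every value passed to it are placeholders, so check the current API reference before relying on this.

```python
from typing import Any, Dict


def build_cluster_spec(
    name: str,
    spark_version: str,
    pool_id: str,
    min_workers: int,
    max_workers: int,
) -> Dict[str, Any]:
    """Build a cluster spec with autoscaling and a pool (hypothetical helper)."""
    if min_workers < 0 or min_workers > max_workers:
        raise ValueError("require 0 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        # Draw driver and workers from a pool of idle instances to cut start time.
        "instance_pool_id": pool_id,
        "driver_instance_pool_id": pool_id,
        # Let Databricks resize the cluster between these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


# All values here are placeholders, not real identifiers.
spec = build_cluster_spec(
    name="thoughtspot-queries",
    spark_version="<runtime-version>",
    pool_id="<instance-pool-id>",
    min_workers=2,
    max_workers=8,
)
print(spec["autoscale"])
```

The node type is left out on purpose: when a cluster draws from a pool, the instance type comes from the pool definition.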

Resources
Documentation

● https://www.databricks.com/blog/2022/03/10/top-5-databricks-performance-tips.html
● https://docs.databricks.com/optimizations/disk-cache.html
● https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html
● https://docs.databricks.com/delta/data-skipping.html
