
Databricks Performance Optimization

What will be covered?

● Triage Query Performance

● Databricks Fundamental Architecture

● Scan Reduction in Databricks

● Concurrency Handling in Databricks

● Resources

© 2022 ThoughtSpot, Inc.

Triage Query Performance

Triage Performance: ThoughtSpot or Databricks

Initial triage should identify whether ThoughtSpot or the warehouse is the source of the issue:

▪ Define the acceptable criteria (SLA) for search response time, i.e. how long the query should take to execute. Ideally this should be a few seconds for the majority of queries and under 15 seconds for longer-running ones.
▪ Execute the search in ThoughtSpot.
▪ Copy the generated SQL from the Information's SQL. Note that the SQL syntax will be that of the underlying warehouse.
▪ Execute this SQL in Databricks.
▪ Verify the results. Note that caching may affect the response times.
▪ Compare the Databricks query response time with the ThoughtSpot response time.
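
The comparison in the steps above can be sketched in Python. This is only an illustration under stated assumptions: `time_runs`, `triage`, and `run_generated_sql` are hypothetical names, the placeholder function stands in for actually executing the copied SQL in Databricks, and the 15-second default mirrors the SLA suggested above. Running the query more than once separates the first (possibly uncached) run from later runs, since caching may affect the timings.

```python
import time
from statistics import median
from typing import Callable, List


def time_runs(run_query: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run the query callable several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def triage(thoughtspot_secs: float, databricks_secs: float, sla_secs: float = 15.0) -> str:
    """Suggest where to look first, given both response times and the SLA."""
    if thoughtspot_secs <= sla_secs:
        return "within SLA"
    if databricks_secs > sla_secs:
        # The same SQL is slow when run directly in the warehouse.
        return "investigate Databricks"
    # The warehouse answers in time, so the delay is added elsewhere.
    return "investigate ThoughtSpot"


def run_generated_sql() -> None:
    """Hypothetical placeholder for executing the copied SQL in Databricks."""
    time.sleep(0.01)


durations = time_runs(run_generated_sql, repeats=3)
first_run, later_runs = durations[0], median(durations[1:])
print(f"first run: {first_run:.3f}s, median of later runs: {later_runs:.3f}s")
print(triage(thoughtspot_secs=22.0, databricks_secs=19.5))
```

The three outcomes are deliberately coarse. They only indicate which side to examine first, not the root cause.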

SA: William Huang; Last updated: 01/13/2023

Databricks Fundamental Architecture

▪ Databricks builds on open source Spark and Delta Lake to provide collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.
▪ Databricks uses the Lakehouse architecture, which combines data lakes and data warehouses. A Lakehouse implements data structures and management features on top of cloud storage in open formats.
▪ One of the key tenets of the Lakehouse architecture is an open storage format based on Parquet.
▪ Delta Lake is the default storage format for all operations on Databricks. Delta Lake extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.
▪ Modern data warehouses employ an MPP (massively parallel processing) architecture, which uses multiple nodes to process a single query and achieve SQL performance. Databricks uses Photon, a native vectorized engine written from scratch in C++ for modern SIMD hardware, to perform heavily parallel query processing.


Use Query Profile to understand query performance:

When a SQL query is submitted, the optimizer builds an execution plan and then executes it. To troubleshoot slow queries, the first step is to understand the query execution plan.

Process:
1. Identify the longest-running queries
2. Identify the expensive query operators
3. Analyze the statistics (time spent) for each query operator
4. Identify the elements to be improved or optimized
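
Steps 2 and 3 amount to ranking operators by the time they consume. The sketch below assumes the per-operator timings have been copied by hand from the query profile into a list of dictionaries. The `name` and `time_ms` keys, the `rank_operators` helper, and the numbers are all made up for illustration and are not an export format that Databricks provides.

```python
from typing import Dict, List, Tuple


def rank_operators(operators: List[Dict[str, object]], top_n: int = 3) -> List[Tuple[str, float, float]]:
    """Return (name, time in ms, share of total time) for the slowest operators."""
    total_ms = sum(float(op["time_ms"]) for op in operators)
    if total_ms == 0:
        return []
    ranked = sorted(operators, key=lambda op: float(op["time_ms"]), reverse=True)
    return [
        (str(op["name"]), float(op["time_ms"]), float(op["time_ms"]) / total_ms)
        for op in ranked[:top_n]
    ]


# Hypothetical timings transcribed from a query profile.
profile = [
    {"name": "Aggregate", "time_ms": 1000},
    {"name": "Scan sales", "time_ms": 6000},
    {"name": "Join", "time_ms": 3000},
]

for name, ms, share in rank_operators(profile, top_n=2):
    print(f"{name}: {ms:.0f} ms ({share:.0%} of operator time)")
```

An operator that dominates the total, such as the scan in this made-up profile, is the natural first candidate for step 4.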


Scan Reduction in Databricks

Databricks sets many default parameters for Delta Lake that impact the size of data files and the number of table versions retained in history. Delta Lake uses a combination of metadata parsing and physical data layout to reduce the number of files scanned to fulfill any query.
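
The idea of using file-level metadata to avoid scanning files can be illustrated with a toy model. This is not Delta Lake's implementation. It is a minimal sketch that assumes each data file carries the minimum and maximum value of one column, with hypothetical names (`FileStats`, `files_to_scan`) and made-up file names and ranges.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the filter range [lower, upper]."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


# Well-clustered layout: each file covers a narrow, distinct value range.
clustered = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Poorly clustered layout: every file spans almost the whole value range.
scattered = [
    FileStats("part-0.parquet", 1, 299),
    FileStats("part-1.parquet", 2, 300),
    FileStats("part-2.parquet", 1, 300),
]

print(files_to_scan(clustered, 150, 160))  # ['part-1.parquet']
print(files_to_scan(scattered, 150, 160))  # all three files
```

The same filter skips two of three files in the clustered layout and none in the scattered one, which is why physical data layout matters as much as the metadata itself.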

● Data skipping with Z-order indexes for Delta Lake
● Compact data files with OPTIMIZE on Delta Lake
● Remove unused data files with VACUUM
● Configure Delta Lake to control data file size
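
The techniques listed above correspond to a small set of SQL statements, which a short Python helper can assemble. The table name `sales.orders`, the column names, and the helper `maintenance_statements` are hypothetical, and the file size and retention values are illustrative rather than recommendations. In a Databricks notebook each statement could be passed to `spark.sql(...)`, assuming a `spark` session is available. Here they are only printed.

```python
from typing import List, Sequence


def maintenance_statements(
    table: str,
    zorder_columns: Sequence[str],
    retain_hours: int = 168,
    target_file_size: str = "128mb",
) -> List[str]:
    """Build the SQL for file-size tuning, compaction with Z-ordering, and cleanup."""
    if not zorder_columns:
        raise ValueError("at least one Z-order column is required")
    columns = ", ".join(zorder_columns)
    return [
        # Control the target size of data files for this table.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.targetFileSize' = '{target_file_size}')",
        # Compact small files and co-locate rows by the filter columns.
        f"OPTIMIZE {table} ZORDER BY ({columns})",
        # Remove data files no longer referenced by retained table versions.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


# Hypothetical table and columns, chosen only for illustration.
for statement in maintenance_statements("sales.orders", ["order_date", "customer_id"]):
    print(statement)
```

Z-order columns are usually the ones that appear most often in query filters. Which columns those are depends on the generated SQL observed during triage.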


Concurrency Handling in Databricks

Several configuration options are available when creating and editing Databricks clusters to handle concurrency.

Autoscaling allows clusters to resize automatically based on workload and so absorb concurrent queries.

Set the cluster mode to “High Concurrency”.

It can take minutes for additional clusters to warm up when Databricks scales up. To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances for the driver and worker nodes.
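
As a rough sketch, the autoscaling and pool settings described above could be expressed as a cluster specification like the one below. The field names (`autoscale`, `instance_pool_id`, `driver_instance_pool_id`) follow the Databricks Clusters API as best understood here and should be checked against the current API reference. The helper `cluster_spec` and every value passed to it are hypothetical placeholders. The sketch does not set the cluster mode, which is configured separately.

```python
import json
from typing import Any, Dict, Optional


def cluster_spec(
    name: str,
    spark_version: str,
    min_workers: int,
    max_workers: int,
    pool_id: str,
    driver_pool_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a cluster specification with autoscaling and instance pools."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec: Dict[str, Any] = {
        "cluster_name": name,
        "spark_version": spark_version,
        # Let the cluster resize between these bounds as load changes.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Draw worker nodes from a pool of idle instances to shorten start time.
        "instance_pool_id": pool_id,
    }
    if driver_pool_id is not None:
        # Optionally draw the driver node from a pool as well.
        spec["driver_instance_pool_id"] = driver_pool_id
    return spec


# All values below are placeholders, not real identifiers.
spec = cluster_spec(
    name="thoughtspot-queries",
    spark_version="<runtime-version>",
    min_workers=2,
    max_workers=8,
    pool_id="<worker-pool-id>",
    driver_pool_id="<driver-pool-id>",
)
print(json.dumps(spec, indent=2))
```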


Resources

● https://www.databricks.com/blog/2022/03/10/top-5-databricks-performance-tips.html
● https://docs.databricks.com/optimizations/disk-cache.html
● https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html
● https://docs.databricks.com/delta/data-skipping.html
