Building Cost-Based Query Optimizers with Apache Calcite

Vladimir Ozerov
Querify Labs, CEO
SQL use cases: technology

● “Old-school” databases (MySQL, Postgres, SQL Server, Oracle)

● “New” products
○ Relational (CockroachDB, TiDB, YugaByte)
○ BigData/Analytics (Hive, Snowflake, Dremio, ClickHouse, Presto)
○ NoSQL (DataStax*, Couchbase*)
○ Compute/streaming (Spark, ksqlDB, Apache Flink)
○ In-memory (Apache Ignite, Hazelcast, Gigaspaces)
● Rebels:
○ MongoDB
○ Redis

* Uses SQL-like languages or builds SQL engine right now

https://insights.stackoverflow.com/survey/2020

SQL use cases: applied

● Query custom data sources

○ Internal business systems
○ Infrastructure: logs, metrics, configs, events, …
● Federated SQL - run queries across multiple sources
○ Data lakes
● Custom requirements
○ New syntax / DSL
○ UDFs
○ Internal optimizations

What does it take to build an SQL engine?

Optimization with Apache Calcite

Projects that already use Apache Calcite

● Data Management:
○ Apache Hive
○ Apache Flink
○ Dremio
○ VoltDB
○ IMDGs (Apache Ignite, Hazelcast, Gigaspaces)
○ …
● Applied:
○ Alibaba / Ant Group
○ Uber
○ LinkedIn
○ …

https://calcite.apache.org/docs/powered_by.html

Parsing

● Goal: convert query string to AST

● How to create a parser?
○ Write a parser by hand? Not practical
○ Use parser generator? Better, but still a lot of work
○ Use Apache Calcite
● Parsing with Apache Calcite
○ Uses JavaCC parser generator under the hood
○ Provides a ready-to-use generated parser with the ANSI SQL grammar
○ Allows for custom extensions to the syntax

Semantic Analysis

● Goal: verify that AST makes any sense

● Semantic analysis with Apache Calcite
○ Provide a schema
○ (optionally) Provide custom operators
○ Run Calcite’s SQL validator
● Validator responsibilities
○ Bind tables and columns
○ Bind operators
○ Resolve data types
○ Verify relational semantics
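
The validator's binding and type-resolution steps can be illustrated with a toy model. This is a minimal sketch, not Calcite's validator: the `SCHEMA` catalog, the `SelectAst` class, and the `validate` function are all hypothetical names invented for the example, and the "AST" covers only `SELECT <columns> FROM <table>`.

```python
from dataclasses import dataclass

# Hypothetical catalog: table name -> {column name -> type name}.
SCHEMA = {"emp": {"id": "INTEGER", "name": "VARCHAR", "salary": "DOUBLE"}}


@dataclass(frozen=True)
class SelectAst:
    """A drastically simplified AST for: SELECT <columns> FROM <table>."""
    columns: tuple
    table: str


class ValidationError(Exception):
    pass


def validate(ast, schema):
    """Bind the table and columns, and return the resolved row type."""
    if ast.table not in schema:
        raise ValidationError(f"Table not found: {ast.table}")
    table_columns = schema[ast.table]
    row_type = []
    for column in ast.columns:
        if column not in table_columns:
            raise ValidationError(f"Column not found: {ast.table}.{column}")
        row_type.append((column, table_columns[column]))
    return row_type


# SELECT name, salary FROM emp
row_type = validate(SelectAst(("name", "salary"), "emp"), SCHEMA)
```

A query that references an unknown column, such as `SelectAst(("age",), "emp")`, raises `ValidationError` instead of reaching the optimizer.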

Relational tree

● AST is not convenient for optimization: complex operator semantics

● A relational tree is a better IR: simple operators with well-defined scopes
● Apache Calcite can translate AST to relational tree
Relational tree

Operator               Description
Scan                   Scan a data source
Project                Transform tuple attributes (e.g. a+b)
Filter                 Filter rows according to a predicate (WHERE, HAVING)
Sort                   ORDER BY / LIMIT / OFFSET
Aggregate              Aggregate operator
Window                 Window aggregation
Join                   2-way join
Union/Minus/Intersect  N-way set operators
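
To make the idea of a relational tree concrete, here is a small sketch that models three of the operators above as plain Python classes and builds the tree for one query. The class names follow the operator names, but the fields and the `explain` helper are assumptions made for illustration, not Calcite's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    condition: str
    input: object


@dataclass(frozen=True)
class Project:
    expressions: tuple
    input: object


def explain(node, depth=0):
    """Render the tree one operator per line, with inputs indented."""
    indent = "  " * depth
    if isinstance(node, Scan):
        return f"{indent}Scan(table={node.table})"
    if isinstance(node, Filter):
        head = f"{indent}Filter(condition={node.condition})"
    else:
        head = f"{indent}Project(expressions={list(node.expressions)})"
    return head + "\n" + explain(node.input, depth + 1)


# SELECT name FROM emp WHERE salary > 1000
plan = Project(("name",), Filter("salary > 1000", Scan("emp")))
text = explain(plan)
```

Each operator has one narrow job and refers only to its input, which is what makes such trees easier to rewrite than an AST.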

Transformations

● Every query might be executed in multiple alternative ways

● We need to apply transformations to find better plans
● Apache Calcite: custom transformations (visitors) or rule-based transformations
Transformations: custom
Custom transformations implemented using a visitor pattern
(traverse the relational tree, create a new tree):

● Field trimming: remove unused columns from the plan

● Subquery elimination: rewrite subqueries to joins/aggregates

Transformations: rule-based

● A rule is a self-contained optimization unit: pattern + transformation

● There are hundreds of valid transformations in relational algebra
● Apache Calcite provides ~100 transformation rules out-of-the-box!
Rules
Examples of rules:

● Operator transpose - move operators with respect to each other (e.g., filter push-down)
● Operator simplification - merge or eliminate operators, convert to simpler equivalents
● Join planning - commute, associate

https://github.com/apache/calcite/tree/master/core/src/main/java/org/apache/calcite/rel/rules
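
The "pattern + transformation" structure of a rule can be sketched in a few lines. The example below is a toy filter push-down, assuming a `Project` that only passes columns through (no expressions or renames), so every column the filter reads is still available below the projection. The classes and the `matches`/`transform` functions are hypothetical and do not mirror Calcite's rule API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: tuple  # plain column references only, no expressions or renames
    input: object


@dataclass(frozen=True)
class Filter:
    predicate: str
    input: object


def matches(node):
    """Pattern: a Filter whose input is a Project."""
    return isinstance(node, Filter) and isinstance(node.input, Project)


def transform(node):
    """Transformation: swap the two operators so the Filter runs first."""
    project = node.input
    return Project(project.columns, Filter(node.predicate, project.input))


before = Filter("a > 10", Project(("a", "b"), Scan("t")))
after = transform(before) if matches(before) else before
```

After the rewrite the pattern no longer matches, so applying the rule again is a no-op.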

Rule drivers: heuristic (HepPlanner)

● Apply transformations until there is nothing left to transform
● Fast, but cannot guarantee optimality
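
A heuristic driver of this kind is essentially a fixpoint loop. The sketch below is a toy model under stated assumptions: operators are plain tuples, the two rules are invented for the example, and the traversal order is a simple bottom-up pass. It shows the "apply until nothing changes" idea, not HepPlanner's actual implementation.

```python
# Operators are plain tuples: ("Scan", table), ("Filter", predicate, input),
# ("Project", columns, input).

def push_filter_below_project(node):
    """Filter(Project(x)) -> Project(Filter(x)); None if the pattern does not match."""
    if node[0] == "Filter" and node[2][0] == "Project":
        _, predicate, (_, columns, child) = node
        return ("Project", columns, ("Filter", predicate, child))
    return None


def merge_filters(node):
    """Filter(Filter(x)) -> a single Filter with a conjunctive predicate."""
    if node[0] == "Filter" and node[2][0] == "Filter":
        _, outer, (_, inner, child) = node
        return ("Filter", f"({inner}) AND ({outer})", child)
    return None


def apply_once(node, rules):
    """Rewrite the input first, then try each rule here. Returns (node, changed)."""
    changed = False
    if node[0] != "Scan":
        child, changed = apply_once(node[2], rules)
        node = (node[0], node[1], child)
    for rule in rules:
        result = rule(node)
        if result is not None:
            return result, True
    return node, changed


def optimize(node, rules):
    """Keep applying rules until a full pass changes nothing."""
    changed = True
    while changed:
        node, changed = apply_once(node, rules)
    return node


plan = ("Filter", "b < 5", ("Project", ("a", "b"), ("Filter", "a > 10", ("Scan", "t"))))
optimized = optimize(plan, [push_filter_below_project, merge_filters])
```

The driver never compares alternatives: whatever tree remains when the rules stop firing is the result, which is why it is fast but gives no optimality guarantee.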

Rule drivers: cost-based (VolcanoPlanner)

● Consider multiple plans simultaneously in a special data structure (MEMO)
● Assign non-cumulative costs to operators
● Maintain the winner for every equivalence group
● Heavier than the heuristic driver but guarantees optimality with respect to the cost model and the applied rules
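
The relationship between equivalence groups, non-cumulative (self) costs, and winners can be sketched with a toy MEMO. Everything here is hypothetical: the group ids, the operator names, and the cost numbers are made up, and a real MEMO is populated by rules during planning rather than written by hand.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Alternative:
    name: str
    self_cost: float      # non-cumulative: the cost of this operator alone
    inputs: tuple = ()    # ids of the input equivalence groups


# Hypothetical MEMO: group id -> alternatives that produce the same result.
MEMO = {
    "scan_a": (Alternative("TableScan(a)", 100.0), Alternative("IndexScan(a)", 40.0)),
    "scan_b": (Alternative("TableScan(b)", 80.0),),
    "join": (
        Alternative("NestedLoopJoin", 500.0, ("scan_a", "scan_b")),
        Alternative("HashJoin", 150.0, ("scan_a", "scan_b")),
    ),
}


@lru_cache(maxsize=None)
def winner(group_id):
    """Return (total cost, alternative) of the cheapest plan for the group."""
    best = None
    for alt in MEMO[group_id]:
        # Total cost = own cost + cost of the winner of every input group.
        total = alt.self_cost + sum(winner(g)[0] for g in alt.inputs)
        if best is None or total < best[0]:
            best = (total, alt)
    return best


best_cost, best_join = winner("join")
```

Because each group stores only its winner, the cost of a parent is computed from the winners of its inputs without enumerating every complete plan separately.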

Metadata

Metadata is a set of properties, common to all operators in the given equivalence group. Used extensively in rules
and cost functions.

Examples:

● Statistics (cardinalities, selectivities, min/max, NDV)

● Attribute uniqueness
○ SELECT a … GROUP BY a -> the first attribute is unique
● Attribute constraints
○ WHERE a.a1=1 and a.a1=b.b1 -> both a.a1 and b.b1 are always 1 and their NDV is 1
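
As a worked example of how such statistics feed a cost function, the sketch below estimates the output cardinality of an equality filter. It uses the common textbook assumption that values are uniformly distributed, so the selectivity of `col = constant` is 1/NDV. The function names and the table statistics are assumptions for illustration and are not Calcite's metadata API.

```python
def equality_selectivity(ndv):
    """Textbook estimate for `col = constant`, assuming uniformly distributed values."""
    return 1.0 / ndv


def filter_row_count(input_row_count, selectivity):
    """Estimated number of rows that survive the filter."""
    return input_row_count * selectivity


# Hypothetical statistics for table a: 10,000 rows, 100 distinct values in a1.
rows_after_filter = filter_row_count(10_000, equality_selectivity(100))
```

With these made-up numbers the filter `a.a1 = 1` is estimated to keep about 100 of the 10,000 rows, and that estimate becomes the input cardinality seen by the operators above it.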

Implementing an operator

● Create your custom operator by implementing the RelNode interface (usually by extending AbstractRelNode) or one of the existing abstract operators
● Override the copy routine to allow for operator copying to/from MEMO (copy)
● Override the operator’s digest for proper deduplication (explainTerms)
○ Usually: dump a minimal set of fields that makes the operator unique with respect to other operators.
● Override the cost function (computeSelfCost)
○ Usually: consult metadata, first of all the input’s cardinality, and apply some coefficients.
○ You may even provide your own definition of the cost
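
The three overrides can be illustrated with a toy operator. This is a language-neutral sketch in Python, not Calcite code: the method names `copy`, `digest`, and `compute_self_cost` only mirror the steps above, and the cost coefficients are arbitrary assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashJoin:
    """A toy physical operator; names mirror the steps above, not Calcite's API."""
    condition: str
    left: object
    right: object

    def copy(self, left, right):
        # Re-create the operator over different inputs, keeping its own attributes.
        return HashJoin(self.condition, left, right)

    def digest(self):
        # Minimal set of fields that distinguishes this operator from others.
        return f"HashJoin(condition={self.condition})"

    def compute_self_cost(self, left_rows, right_rows):
        # Assumed cost model: build a hash table on the right, probe with the left.
        build_factor, probe_factor = 2.0, 1.0
        return build_factor * right_rows + probe_factor * left_rows


join = HashJoin("a.id = b.id", "scan_a", "scan_b")
cost = join.compute_self_cost(left_rows=1000, right_rows=100)
```

The cost covers only the join itself; the cost of producing the inputs is accounted for by the input operators.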

Enforcers

● Operators may expose physical properties
● Parent operator may demand a certain property on the input
● If the input cannot satisfy the requested property, an enforcer operator is injected
● Examples:
○ Collation (Sort)
○ Distribution (Exchange)
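
The enforcer mechanism for collation can be sketched as follows. This is a minimal model, assuming a collation is just a tuple of column names and that a delivered collation satisfies a request when the request is its prefix. The `Node` class and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    collation: tuple = ()  # columns the output is sorted by; () means unsorted
    input: object = None


def satisfies(delivered, required):
    """Output sorted by (a, b) also satisfies a request for (a): check the prefix."""
    return delivered[:len(required)] == required


def enforce_collation(node, required):
    """Return the node itself if it is sorted as required, else wrap it in a Sort."""
    if satisfies(node.collation, required):
        return node
    return Node("Sort", collation=required, input=node)


index_scan = Node("IndexScan", collation=("a", "b"))
table_scan = Node("TableScan")

sorted_index_scan = enforce_collation(index_scan, ("a",))  # no enforcer needed
sorted_table_scan = enforce_collation(table_scan, ("a",))  # a Sort is injected
```

The same shape applies to distribution, with an Exchange operator taking the place of Sort.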

VolcanoOptimizer

Vanilla:
● The original implementation of the cost-based optimizer in Apache Calcite.
● Optimizes nodes in an arbitrary order.
● Cannot propagate physical properties.
● Cannot do efficient pruning.

Top-down:
● Implemented recently by Alibaba engineers.
● Based on the Cascades algorithm: the guided top-down search.
● Propagates the physical properties between operators (requires manual implementation).
● Applies branch-and-bound pruning to limit the search space.

Physical property propagation

● Available only in the top-down optimizer
● Pass-through (1, 2, 3) - propagate optimization request to inputs
● Derive (4, 5) - notify the parent about the new implementation

Branch-and-bound pruning

Accumulated cost bounding:

● There is a viable aggregate

○ Total cost = 500
○ Self cost = 150
○ Input’s budget = 350
● The new join is created
○ Self cost = 450
○ Exceeds the input’s budget of 350, so it may never be part of an optimal plan: prune
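
The arithmetic behind this example can be written out directly. The two helper functions are hypothetical names for the steps above; the numbers are the ones from the slide.

```python
def input_budget(total_cost, self_cost):
    """Upper bound on what the inputs may cost without exceeding the known plan."""
    return total_cost - self_cost


def should_prune(candidate_self_cost, budget):
    """A candidate whose own cost already exceeds the budget cannot win."""
    return candidate_self_cost > budget


# The known aggregate plan costs 500 in total, 150 of which is the aggregate itself.
budget = input_budget(total_cost=500, self_cost=150)  # 350
# The new join costs 450 on its own, before its inputs are even counted.
prune_join = should_prune(candidate_self_cost=450, budget=budget)
```

Since the join's self cost alone is above the budget, no choice of inputs can make a plan through it cheaper than the plan already found, so its subtree does not need to be explored.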

Multi-phase optimization

● Practical optimizers often split optimization into several phases to reduce the search space, at the cost of possibly missing the optimal plan
● Apache Calcite allows you to implement a multi-phase optimizer
Federated queries

● You may optimize towards different backends simultaneously (federated queries)

○ E.g., JDBC + Apache Cassandra
● Apache Calcite has the built-in Enumerable execution backend that compiles operators into Java bytecode at runtime
Your optimizer

● Define operators specific to your backend

● Provide custom rules that convert abstract Calcite operators to your operators
○ E.g., LogicalJoin -> HashJoin
● Run Calcite driver(s) with the built-in and/or custom rules
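
A conversion rule such as LogicalJoin -> HashJoin can be sketched as a function from an abstract operator to a backend-specific one. The classes below are toy stand-ins that only borrow the names from the slide, and the `equi_keys` field is an assumption of this example: it encodes the fact that a hash join needs at least one equality condition to hash on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalJoin:
    left: object
    right: object
    equi_keys: tuple  # pairs of (left column, right column); empty for non-equi joins


@dataclass(frozen=True)
class HashJoin:
    left: object
    right: object
    equi_keys: tuple


def logical_join_to_hash_join(node):
    """Convert an abstract join to a hash join; None if the rule does not apply."""
    if isinstance(node, LogicalJoin) and node.equi_keys:
        return HashJoin(node.left, node.right, node.equi_keys)
    return None


equi_join = LogicalJoin("scan_a", "scan_b", (("id", "a_id"),))
theta_join = LogicalJoin("scan_a", "scan_b", ())

physical = logical_join_to_hash_join(equi_join)
not_applicable = logical_join_to_hash_join(theta_join)
```

A join without equality keys is left for some other rule to implement, for example one that produces a nested loop join.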
Example: Apache Flink

● Custom physical batch and streaming operators

● Custom cost: row count, cpu, IO, network, memory
● The custom distribution property with an Exchange enforcer
● Custom rules (e.g., subquery rewrite, physical rules)
● Multi-phase optimization: heuristic and cost-based phases

https://github.com/apache/flink/tree/release-1.12.2/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner

Summary

● Apache Calcite is a toolbox to build query engines

○ Syntax analyzer
○ Semantic analyzer
○ Translator
○ Optimization drivers and rules
○ The Enumerable backend
● Apache Calcite dramatically reduces the effort required to build an optimizer for your backend
○ Weeks to have a working prototype
○ Months to have an MVP
○ Year(s) to have a solid product, but not decades!

Links

● Speaker
○ https://www.linkedin.com/in/devozerov/
○ https://twitter.com/devozerov
● Apache Calcite:
○ https://calcite.apache.org/
○ https://github.com/apache/calcite
● Demo:
○ https://github.com/querifylabs/talks-2021-percona
● Our blog:
○ https://www.querifylabs.com/blog
