Building Cost-Based Query Optimizers With Apache Calcite
SQL use cases: technology
https://insights.stackoverflow.com/survey/2020
SQL use cases: applied
What does it take to build an SQL engine?
Optimization with Apache Calcite
Projects that already use Apache Calcite
● Data Management:
○ Apache Hive
○ Apache Flink
○ Dremio
○ VoltDB
○ IMDGs (Apache Ignite, Hazelcast, Gigaspaces)
○ …
● Applied:
○ Alibaba / Ant Group
○ Uber
○ LinkedIn
○ …
https://calcite.apache.org/docs/powered_by.html
Parsing
Semantic Analysis
Relational tree
Operator Description
Transformations
Transformations: rule-based
● Operator transpose - move operators relative to each other (e.g., filter push-down)
● Operator simplification - merge or eliminate operators, or convert them to simpler equivalents
● Join planning - commute and associate joins
https://github.com/apache/calcite/tree/master/core/src/main/java/org/apache/calcite/rel/rules
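As an illustration of an operator transpose rule, here is a minimal sketch of filter push-down on a toy relational tree. All types and method names (`Rel`, `Scan`, `Project`, `Filter`, `pushFilterBelowProject`) are hypothetical and invented for this example; they are not Calcite's rule API, and the projection is assumed to pass columns through without computing expressions.

```java
import java.util.List;

// Toy model with hypothetical names; this is NOT Calcite's RelNode/RelOptRule API.
// Requires Java 16+ (records, pattern matching for instanceof).
final class TransposeRuleSketch {
    interface Rel {}
    record Scan(String table) implements Rel {}
    // Assumption: the projection only selects existing columns (no expressions).
    record Project(List<String> columns, Rel input) implements Rel {}
    // Assumption: the predicate is "column >= minValue".
    record Filter(String column, int minValue, Rel input) implements Rel {}

    /**
     * Transpose rule: Filter(Project(x)) => Project(Filter(x)).
     * Returns the node unchanged when the pattern does not match.
     */
    static Rel pushFilterBelowProject(Rel node) {
        if (node instanceof Filter f
                && f.input() instanceof Project p
                && p.columns().contains(f.column())) { // sanity guard
            Rel pushedFilter = new Filter(f.column(), f.minValue(), p.input());
            return new Project(p.columns(), pushedFilter);
        }
        return node;
    }

    public static void main(String[] args) {
        Rel before = new Filter("price", 100,
                new Project(List.of("id", "price"), new Scan("orders")));
        System.out.println("before: " + before);
        System.out.println("after:  " + pushFilterBelowProject(before));
    }
}
```

The rule only describes a local pattern and its equivalent replacement; deciding where and how often to apply it is left to a rule driver.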
Rule drivers: heuristic (HepPlanner)
Rule drivers: cost-based (VolcanoPlanner)
Metadata
Metadata is a set of properties common to all operators in a given equivalence group. It is used extensively in rules
and cost functions.
Examples:
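One commonly used metadata property is the estimated row count (cardinality), which Calcite exposes through `RelMetadataQuery.getRowCount`. The sketch below is not Calcite code: it is a toy model with hypothetical types (`Scan`, `Filter`, `Limit`) and assumed selectivity values, showing how such a property can be derived bottom-up from the inputs.

```java
// Toy model with hypothetical names; this is NOT Calcite's metadata API.
// Requires Java 16+ (records, pattern matching for instanceof).
final class MetadataSketch {
    interface Rel {}
    record Scan(String table, double tableRows) implements Rel {}
    // Assumption: the selectivity is supplied by the caller, not estimated.
    record Filter(double selectivity, Rel input) implements Rel {}
    record Limit(long fetch, Rel input) implements Rel {}

    /** Estimated number of output rows, derived recursively from the inputs. */
    static double rowCount(Rel rel) {
        if (rel instanceof Scan s) {
            return s.tableRows();
        }
        if (rel instanceof Filter f) {
            return f.selectivity() * rowCount(f.input());
        }
        if (rel instanceof Limit l) {
            return Math.min(l.fetch(), rowCount(l.input()));
        }
        throw new IllegalArgumentException("Unknown operator: " + rel);
    }

    public static void main(String[] args) {
        Rel plan = new Limit(10, new Filter(0.25, new Scan("orders", 1000)));
        System.out.println("estimated rows: " + rowCount(plan));
    }
}
```

Because the row count depends only on what an operator computes and not on how it is executed, every equivalent operator in the same group would report the same value.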
Implementing an operator
● Create your custom operator by implementing the RelNode interface or extending one of the existing abstract operators
● Override the copy routine to allow for operator copying to/from MEMO (copy)
● Override the operator’s digest for proper deduplication (explainTerms)
○ Usually: dump a minimal set of fields that makes the operator unique with respect to other operators.
● Override the cost function (computeSelfCost)
○ Usually: consult metadata, primarily the input’s cardinality, and apply some coefficients.
○ You may even provide your own definition of the cost
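The three overrides listed above (copy, digest, self cost) can be sketched on a toy operator hierarchy. The `Operator` base class below is a hypothetical stand-in written for this example, not Calcite's `RelNode`; the coefficients `CPU_PER_INPUT_ROW` and `GROUPING_FACTOR` are assumed values chosen only for illustration.

```java
import java.util.List;

// Toy model with hypothetical names; this is NOT Calcite's RelNode API.
final class OperatorSketch {
    abstract static class Operator {
        final List<Operator> inputs;
        Operator(List<Operator> inputs) { this.inputs = List.copyOf(inputs); }
        /** Creates the same operator over different inputs. */
        abstract Operator copy(List<Operator> newInputs);
        /** Minimal description that identifies the operator for deduplication. */
        abstract String digest();
        abstract double rowCount();
        /** Cost of this operator alone, excluding its inputs. */
        abstract double selfCost();
    }

    static final class Scan extends Operator {
        final String table;
        final double rows;
        Scan(String table, double rows) {
            super(List.of());
            this.table = table;
            this.rows = rows;
        }
        @Override Operator copy(List<Operator> newInputs) { return new Scan(table, rows); }
        @Override String digest() { return "Scan(" + table + ")"; }
        @Override double rowCount() { return rows; }
        @Override double selfCost() { return rows; }
    }

    static final class HashAggregate extends Operator {
        static final double CPU_PER_INPUT_ROW = 2.0; // assumed coefficient
        static final double GROUPING_FACTOR = 0.1;   // assumed output/input ratio
        final List<String> groupKeys;
        HashAggregate(List<String> groupKeys, Operator input) {
            super(List.of(input));
            this.groupKeys = List.copyOf(groupKeys);
        }
        @Override Operator copy(List<Operator> newInputs) {
            return new HashAggregate(groupKeys, newInputs.get(0));
        }
        @Override String digest() {
            return "HashAggregate(keys=" + groupKeys + ", input=" + inputs.get(0).digest() + ")";
        }
        @Override double rowCount() { return inputs.get(0).rowCount() * GROUPING_FACTOR; }
        // Consults the input's cardinality and applies a coefficient.
        @Override double selfCost() { return inputs.get(0).rowCount() * CPU_PER_INPUT_ROW; }
    }

    public static void main(String[] args) {
        Operator agg = new HashAggregate(List.of("customer_id"), new Scan("orders", 1000));
        System.out.println(agg.digest() + " selfCost=" + agg.selfCost());
    }
}
```

Two operators with equal digests are treated as the same operator in this sketch, which is why the digest should include every field that changes the result and nothing else.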
Enforcers
VolcanoOptimizer
● Vanilla
○ The original implementation of the cost-based optimizer in Apache Calcite.
○ Optimize nodes in an arbitrary order.
○ Cannot propagate physical properties.
○ Cannot do efficient pruning.
● Top-down
○ Implemented recently by Alibaba engineers
○ Based on the Cascades algorithm: the guided top-down search.
○ Propagates the physical properties between operators (requires manual implementation).
○ Applies branch-and-bound pruning to limit the search space.
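To make the idea of branch-and-bound pruning in a top-down search concrete, here is a minimal sketch on a toy search space. The `Group` and `Alternative` types, the cost numbers, and the `prunedAlternatives` counter are hypothetical; the sketch omits memoization and physical properties, so it should not be read as Calcite's actual planner logic.

```java
import java.util.List;

// Toy model with hypothetical names; this is NOT how VolcanoPlanner is implemented.
// Requires Java 16+ (records).
final class BranchAndBoundSketch {
    /** One way to implement a group: its own cost plus the groups it reads from. */
    record Alternative(String name, double selfCost, List<Group> inputs) {}
    /** A set of equivalent alternatives. */
    record Group(List<Alternative> alternatives) {}

    /** Number of alternatives abandoned because they could not beat the bound. */
    static int prunedAlternatives = 0;

    /**
     * Returns the cheapest cost of the group that is strictly below the limit,
     * or positive infinity if no alternative fits under the limit.
     */
    static double bestCost(Group group, double limit) {
        double best = Double.POSITIVE_INFINITY;
        for (Alternative alt : group.alternatives()) {
            // The bound tightens as soon as a cheaper alternative is known.
            double bound = Math.min(limit, best);
            double cost = alt.selfCost();
            boolean pruned = cost >= bound;
            for (Group input : alt.inputs()) {
                if (pruned) {
                    break; // do not explore the remaining inputs
                }
                // The input may only use the budget that is still left.
                cost += bestCost(input, bound - cost);
                pruned = cost >= bound;
            }
            if (pruned) {
                prunedAlternatives++;
            } else {
                best = cost;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Group scan = new Group(List.of(new Alternative("Scan", 5, List.of())));
        Group join = new Group(List.of(
                new Alternative("HashJoin", 10, List.of(scan)),
                new Alternative("NestedLoopJoin", 20, List.of(scan))));
        System.out.println("best cost: " + bestCost(join, Double.POSITIVE_INFINITY));
        System.out.println("pruned alternatives: " + prunedAlternatives);
    }
}
```

In this example the second join alternative is rejected on its own cost alone, before its input is examined, because the first alternative already established a lower bound.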
Physical property propagation
Branch-and-bound pruning
Multi-phase optimization
● Practical optimizers often split optimization into several phases to reduce the search space, at the cost of
possibly missing the optimal plan
● Apache Calcite allows you to implement a multi-phase optimizer
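The phase structure can be sketched as a sequence of rule sets, where each phase runs its own rules until the plan stops changing and hands the result to the next phase. This is a deliberately simplified model: plans are plain strings, rules are string rewrites, and the names `Phase`, `applyUntilStable`, and `examplePhases` are hypothetical, not part of Calcite.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model with hypothetical names; plans are strings purely for brevity.
// Requires Java 16+ (records).
final class MultiPhaseSketch {
    /** A named set of rewrite rules that are applied together. */
    record Phase(String name, List<UnaryOperator<String>> rules) {}

    /** Applies the rules repeatedly until none of them changes the plan. */
    static String applyUntilStable(String plan, List<UnaryOperator<String>> rules) {
        boolean changed;
        do {
            changed = false;
            for (UnaryOperator<String> rule : rules) {
                String rewritten = rule.apply(plan);
                if (!rewritten.equals(plan)) {
                    plan = rewritten;
                    changed = true;
                }
            }
        } while (changed);
        return plan;
    }

    /** Runs the phases in order; each phase only sees the previous phase's output. */
    static String optimize(String plan, List<Phase> phases) {
        for (Phase phase : phases) {
            plan = applyUntilStable(plan, phase.rules());
        }
        return plan;
    }

    static List<Phase> examplePhases() {
        // Phase 1 (assumed): logical rewrite, push the filter below the project.
        UnaryOperator<String> pushFilter = p ->
                p.replace("LogicalFilter(LogicalProject(", "LogicalProject(LogicalFilter(");
        // Phase 2 (assumed): convert logical operators to physical ones.
        UnaryOperator<String> toPhysical = p -> p.replace("Logical", "Physical");
        return List.of(
                new Phase("filter push-down", List.of(pushFilter)),
                new Phase("physical conversion", List.of(toPhysical)));
    }

    public static void main(String[] args) {
        String plan = "LogicalFilter(LogicalProject(LogicalScan))";
        System.out.println(optimize(plan, examplePhases()));
    }
}
```

Each phase explores a smaller rule set than a single combined pass would, which is where the reduced search space comes from; the trade-off is that a later phase cannot revisit decisions made by an earlier one.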
Federated queries
https://github.com/apache/flink/tree/release-1.12.2/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner
Summary
Links
● Speaker
○ https://www.linkedin.com/in/devozerov/
○ https://twitter.com/devozerov
● Apache Calcite:
○ https://calcite.apache.org/
○ https://github.com/apache/calcite
● Demo:
○ https://github.com/querifylabs/talks-2021-percona
● Our blog:
○ https://www.querifylabs.com/blog