Building Cost-Based Query Optimizers With Apache Calcite


Vladimir Ozerov
Querify Labs, CEO
SQL use cases: technology

● “Old-school” databases (MySQL, Postgres, SQL Server, Oracle)
● “New” products
○ Relational (CockroachDB, TiDB, YugaByte)
○ BigData/Analytics (Hive, Snowflake, Dremio, ClickHouse, Presto)
○ NoSQL (DataStax*, Couchbase*)
○ Compute/streaming (Spark, ksqlDB, Apache Flink)
○ In-memory (Apache Ignite, Hazelcast, Gigaspaces)
● Rebels:
○ MongoDB
○ Redis

* Uses a SQL-like language or is building an SQL engine right now

https://insights.stackoverflow.com/survey/2020

SQL use cases: applied

● Query custom data sources
○ Internal business systems
○ Infrastructure: logs, metrics, configs, events, …
● Federated SQL - run queries across multiple sources
○ Data lakes
● Custom requirements
○ New syntax / DSL
○ UDFs
○ Internal optimizations

What does it take to build an SQL engine?

Optimization with Apache Calcite

Projects that already use Apache Calcite

● Data Management:
○ Apache Hive
○ Apache Flink
○ Dremio
○ VoltDB
○ IMDGs (Apache Ignite, Hazelcast, Gigaspaces)
○ …
● Applied:
○ Alibaba / Ant Group
○ Uber
○ LinkedIn
○ …

https://calcite.apache.org/docs/powered_by.html

Parsing

● Goal: convert a query string to an AST
● How to create a parser?
○ Write a parser by hand? Not practical
○ Use a parser generator? Better, but still a lot of work
○ Use Apache Calcite
● Parsing with Apache Calcite
○ Uses the JavaCC parser generator under the hood
○ Provides a ready-to-use generated parser with the ANSI SQL grammar
○ Allows for custom extensions to the syntax

Semantic Analysis

● Goal: verify that the AST makes sense
● Semantic analysis with Apache Calcite
○ Provide a schema
○ (optionally) Provide custom operators
○ Run Calcite’s SQL validator
● Validator responsibilities
○ Bind tables and columns
○ Bind operators
○ Resolve data types
○ Verify relational semantics

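The binding and type-resolution steps can be illustrated with a toy model. This is a minimal sketch in Python, not Calcite's Java validator: `SCHEMA`, `ValidationError`, and `validate` are hypothetical names, and the schema contents are made up for the example.

```python
# Toy validator: binds a table and its columns against a schema and
# resolves column types. All names are hypothetical, not Calcite API.

SCHEMA = {
    "emp": {"id": "INTEGER", "name": "VARCHAR", "dept_id": "INTEGER"},
    "dept": {"id": "INTEGER", "title": "VARCHAR"},
}


class ValidationError(Exception):
    """Raised when a table or column cannot be bound to the schema."""


def validate(table, columns):
    """Return {column: type} for the referenced columns, or raise."""
    if table not in SCHEMA:
        raise ValidationError(f"Table '{table}' not found")
    resolved = {}
    for column in columns:
        if column not in SCHEMA[table]:
            raise ValidationError(
                f"Column '{column}' not found in table '{table}'")
        resolved[column] = SCHEMA[table][column]
    return resolved


print(validate("emp", ["id", "name"]))
```

A reference to an unknown column, such as `validate("emp", ["salary"])`, raises `ValidationError` instead of producing a plan.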
Relational tree

● The AST is not convenient for optimization: complex operator semantics
● A relational tree is a better IR: simple operators with well-defined scopes
● Apache Calcite can translate the AST to a relational tree

Operator               Description
Scan                   Scan a data source
Project                Transform tuple attributes (e.g. a+b)
Filter                 Filter rows according to a predicate (WHERE, HAVING)
Sort                   ORDER BY / LIMIT / OFFSET
Aggregate              Aggregate operator
Window                 Window aggregation
Join                   2-way join
Union/Minus/Intersect  N-way set operators

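To make the idea of a relational tree concrete, here is a small sketch that models three of the operators above as immutable Python classes and prints a plan for a query like SELECT name FROM emp WHERE dept_id = 10. The classes and the `explain` helper are illustrative assumptions; they only mimic the shape of a relational tree and are not Calcite's RelNode hierarchy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    condition: str  # kept as plain text in this toy model


@dataclass(frozen=True)
class Project:
    input: object
    exprs: Tuple[str, ...]


def explain(node, indent=0):
    """Render the tree top-down, one operator per line."""
    pad = "  " * indent
    if isinstance(node, Scan):
        return f"{pad}Scan(table={node.table})"
    if isinstance(node, Filter):
        return (f"{pad}Filter(condition={node.condition})\n"
                + explain(node.input, indent + 1))
    if isinstance(node, Project):
        return (f"{pad}Project(exprs={', '.join(node.exprs)})\n"
                + explain(node.input, indent + 1))
    raise TypeError(f"Unknown operator: {node!r}")


# SELECT name FROM emp WHERE dept_id = 10
plan = Project(Filter(Scan("emp"), "dept_id = 10"), ("name",))
print(explain(plan))
```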
Transformations

● Every query might be executed in multiple alternative ways
● We need to apply transformations to find better plans
● Apache Calcite: custom transformations (visitors) or rule-based transformations

Transformations: custom
Custom transformations are implemented using the visitor pattern (traverse the relational tree, create a new tree):

● Field trimming: remove unused columns from the plan
● Subquery elimination: rewrite subqueries to joins/aggregates

Transformations: rule-based

● A rule is a self-contained optimization unit: pattern + transformation
● There are hundreds of valid transformations in relational algebra
● Apache Calcite provides ~100 transformation rules out-of-the-box!

Rules
Examples of rules:

● Operator transpose - move operators with respect to each other (e.g., filter push-down)
● Operator simplification - merge or eliminate operators, convert to simpler equivalents
● Join planning - commute, associate

https://github.com/apache/calcite/tree/master/core/src/main/java/org/apache/calcite/rel/rules

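The "pattern + transformation" structure of a rule can be sketched as follows. This is a toy filter-merge rule in Python, an instance of operator simplification; the class `FilterMergeRule` and its `matches`/`apply` methods are hypothetical names chosen for the example and do not mirror the signatures of Calcite's Java rule classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    condition: str


class FilterMergeRule:
    """Pattern: a Filter directly on top of another Filter.
    Transformation: one Filter with the conjunction of both conditions."""

    def matches(self, node):
        return isinstance(node, Filter) and isinstance(node.input, Filter)

    def apply(self, node):
        child = node.input
        merged = f"({child.condition}) AND ({node.condition})"
        return Filter(child.input, merged)


rule = FilterMergeRule()
plan = Filter(Filter(Scan("emp"), "salary > 100"), "dept_id = 10")
if rule.matches(plan):
    plan = rule.apply(plan)
print(plan)
```

Keeping the pattern check separate from the transformation lets a driver decide where and when to fire the rule.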
Rule drivers: heuristic (HepPlanner)

● Apply transformations until there is nothing left to transform
● Fast, but cannot guarantee optimality

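A heuristic driver is essentially a fixpoint loop over the rules. The sketch below is a toy Python model of that loop and makes no claim about how HepPlanner is implemented internally; the rule functions and `optimize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    condition: str


def remove_true_filter(node):
    """Drop a filter whose condition is trivially TRUE."""
    if isinstance(node, Filter) and node.condition == "TRUE":
        return node.input
    return None


def merge_filters(node):
    """Collapse two stacked filters into one conjunction."""
    if isinstance(node, Filter) and isinstance(node.input, Filter):
        inner = node.input
        return Filter(inner.input, f"({inner.condition}) AND ({node.condition})")
    return None


def rewrite_once(node, rules):
    """Fire the first rule that matches, searching top-down."""
    for rule in rules:
        result = rule(node)
        if result is not None:
            return result, True
    if isinstance(node, Filter):
        new_input, changed = rewrite_once(node.input, rules)
        if changed:
            return Filter(new_input, node.condition), True
    return node, False


def optimize(node, rules):
    """Repeat until no rule fires anywhere in the tree."""
    changed = True
    while changed:
        node, changed = rewrite_once(node, rules)
    return node


plan = Filter(Filter(Filter(Scan("emp"), "a > 1"), "b < 2"), "TRUE")
print(optimize(plan, [remove_true_filter, merge_filters]))
```

The result depends on rule order and never compares costs, which is why this style of driver is fast but gives no optimality guarantee.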
Rule drivers: cost-based (VolcanoPlanner)

● Consider multiple plans simultaneously in a special data structure (MEMO)
● Assign non-cumulative costs to operators
● Maintain the winner for every equivalence group
● Heavier than the heuristic driver, but finds the cheapest plan among all explored alternatives (with respect to the cost model)

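The MEMO idea can be sketched with a toy model: each equivalence group holds alternative expressions, every expression carries only its own (non-cumulative) cost, and the winner of a group is the expression with the lowest self cost plus the winners of its input groups. The `Expr`, `Group`, and `best` names and all cost numbers are assumptions made for this example, and the memo is assumed to be acyclic, which a real planner cannot take for granted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Expr:
    name: str
    self_cost: float
    inputs: Tuple[int, ...] = ()  # ids of input equivalence groups


@dataclass
class Group:
    exprs: List[Expr] = field(default_factory=list)


def best(memo, group_id, cache=None):
    """Return (total_cost, plan) for the winner of the given group."""
    if cache is None:
        cache = {}
    if group_id in cache:
        return cache[group_id]
    winner = None
    for expr in memo[group_id].exprs:
        total = expr.self_cost
        children = []
        for input_id in expr.inputs:
            cost, plan = best(memo, input_id, cache)
            total += cost
            children.append(plan)
        if winner is None or total < winner[0]:
            winner = (total, (expr.name, tuple(children)))
    cache[group_id] = winner
    return winner


memo = {
    0: Group([Expr("TableScan(emp)", 100), Expr("IndexScan(emp)", 40)]),
    1: Group([Expr("TableScan(dept)", 10)]),
    2: Group([Expr("NestedLoopJoin", 500, (0, 1)),
              Expr("HashJoin", 120, (0, 1))]),
}

cost, plan = best(memo, 2)
print(cost, plan)
```

Here the winner of group 2 is the hash join over the index scan of `emp` and the table scan of `dept`, with a total cost of 120 + 40 + 10 = 170.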
Metadata

Metadata is a set of properties common to all operators in a given equivalence group. It is used extensively in rules and cost functions.

Examples:

● Statistics (cardinalities, selectivities, min/max, NDV)
● Attribute uniqueness
○ SELECT a … GROUP BY a -> the first attribute is unique
● Attribute constraints
○ WHERE a.a1=1 and a.a1=b.b1 -> both a.a1 and b.b1 are always 1 and their NDV is 1

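As a small worked example of how statistics feed an estimate, the sketch below derives the output cardinality of an equality filter from the table's row count and the column's NDV. It assumes uniformly distributed values, which is a common textbook simplification and not necessarily what Calcite's default metadata providers compute; `STATS` and `equality_filter_row_count` are hypothetical names with made-up numbers.

```python
# Hypothetical statistics for one table.
STATS = {
    "emp": {"row_count": 10_000, "ndv": {"dept_id": 50}},
}


def equality_filter_row_count(table, column):
    """Estimate rows matching `column = <constant>`.

    Assumes a uniform distribution: each distinct value matches
    row_count / NDV rows.
    """
    stats = STATS[table]
    return stats["row_count"] / stats["ndv"][column]


print(equality_filter_row_count("emp", "dept_id"))  # 10000 / 50 = 200.0
```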
Implementing an operator

● Create your custom operator by extending the RelNode class or one of the existing abstract operators
● Override the copy routine to allow for operator copying to/from MEMO (copy)
● Override the operator’s digest for proper deduplication (explainTerms)
○ Usually: dump a minimal set of fields that makes the operator unique with respect to other operators.
● Override the cost function (computeSelfCost)
○ Usually: consult metadata (first of all, the input’s cardinality) and apply some coefficients.
○ You may even provide your own definition of the cost

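The three responsibilities (copy, digest, self cost) can be shown on a toy operator. This sketch is written in Python with hypothetical classes `ToyScan` and `ToyFilter`; it imitates the roles of `copy`, `explainTerms`, and `computeSelfCost` but does not use or reproduce their Java signatures, and the coefficient is an arbitrary assumption.

```python
CPU_COST_PER_ROW = 2  # hypothetical coefficient


class ToyScan:
    def __init__(self, table, row_count):
        self.table = table
        self.row_count = row_count  # stands in for cardinality metadata

    def digest(self):
        return f"ToyScan(table={self.table})"


class ToyFilter:
    def __init__(self, input_node, condition):
        self.input_node = input_node
        self.condition = condition

    def copy(self, new_input):
        """Create the same operator on top of a different input."""
        return ToyFilter(new_input, self.condition)

    def digest(self):
        """Minimal set of fields that makes this operator unique."""
        return (f"ToyFilter(input={self.input_node.digest()}, "
                f"condition={self.condition})")

    def compute_self_cost(self):
        """Cost of this operator alone, excluding its input."""
        return self.input_node.row_count * CPU_COST_PER_ROW


scan = ToyScan("emp", 1000)
original = ToyFilter(scan, "dept_id = 10")
duplicate = original.copy(scan)

# Operators with equal digests collapse into a single entry.
memo = {}
for op in (original, duplicate):
    memo.setdefault(op.digest(), op)

print(len(memo), original.compute_self_cost())
```

If the digest omitted `condition`, two different filters over the same input would be wrongly treated as duplicates.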
Enforcers

● Operators may expose physical properties
● Parent operator may demand a certain property on the input
● If the input cannot satisfy the requested property, an enforcer operator is injected
● Examples:
○ Collation (Sort)
○ Distribution (Exchange)

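The enforcer mechanism for collation can be sketched like this: if the input is already sorted as required, it is used as-is; otherwise a Sort is injected on top. This is a toy Python model with hypothetical names (`satisfies`, `enforce_collation`); collations are simplified to tuples of column names with no sort direction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str
    collation: Tuple[str, ...] = ()  # columns the output is sorted by


@dataclass(frozen=True)
class Sort:
    input: object
    collation: Tuple[str, ...]


def satisfies(actual, required):
    """Data sorted by (a, b) also satisfies a request for (a)."""
    return actual[:len(required)] == required


def enforce_collation(node, required):
    """Return the node itself, or the node wrapped in a Sort enforcer."""
    if satisfies(node.collation, required):
        return node
    return Sort(node, required)


sorted_scan = Scan("emp", ("id", "name"))
unsorted_scan = Scan("emp")

print(enforce_collation(sorted_scan, ("id",)))
print(enforce_collation(unsorted_scan, ("id",)))
```

The first call returns the scan unchanged; the second wraps the scan in a Sort.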
VolcanoOptimizer

Vanilla
● The original implementation of the cost-based optimizer in Apache Calcite.
● Optimize nodes in an arbitrary order.
● Cannot propagate physical properties.
● Cannot do efficient pruning.

Top-down
● Implemented recently by Alibaba engineers
● Based on the Cascades algorithm: the guided top-down search.
● Propagates the physical properties between operators (requires manual implementation).
● Applies branch-and-bound pruning to limit the search space.

Physical property propagation

● Available only in the top-down optimizer
● Pass-through (1, 2, 3) - propagate optimization request to inputs
● Derive (4, 5) - notify the parent about the new implementation

Branch-and-bound pruning

Accumulated cost bounding:

● There is a viable aggregate
○ Total cost = 500
○ Self cost = 150
○ Input’s budget = 350
● The new join is created
○ Self cost = 450
○ May never be part of an optimal plan, prune

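The arithmetic behind this pruning decision can be written out directly. The sketch below uses the numbers from the slide; the function names are hypothetical, and costs are assumed to be non-negative, so a candidate whose self cost alone exceeds the budget can never beat the existing plan.

```python
def input_budget(parent_total_cost, parent_self_cost):
    """Cost the parent's input may spend without exceeding the known plan."""
    return parent_total_cost - parent_self_cost


def can_prune(candidate_self_cost, budget):
    """A candidate that alone exceeds the budget cannot improve the plan."""
    return candidate_self_cost > budget


budget = input_budget(parent_total_cost=500, parent_self_cost=150)
print(budget)                  # 350
print(can_prune(450, budget))  # True: 450 > 350, so the join is pruned
```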
Multi-phase optimization

● Practical optimizers often split optimization into several phases to reduce the search space, at the cost of possibly missing the optimal plan
● Apache Calcite allows you to implement a multi-phase optimizer

Federated queries

● You may optimize towards different backends simultaneously (federated queries)
○ E.g., JDBC + Apache Cassandra
● Apache Calcite has the built-in Enumerable execution backend that compiles operators into Java bytecode at runtime

Your optimizer

● Define operators specific to your backend
● Provide custom rules that convert abstract Calcite operators to your operators
○ E.g., LogicalJoin -> HashJoin
● Run Calcite driver(s) with the built-in and/or custom rules

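The conversion from abstract to backend-specific operators can be sketched as a recursive mapping. This toy Python example mirrors the LogicalJoin -> HashJoin case from the list above; all classes and `to_physical` are hypothetical stand-ins, and a real Calcite integration would express each mapping as a separate Java rule run by a planner rather than as one function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalScan:
    table: str


@dataclass(frozen=True)
class LogicalJoin:
    left: object
    right: object
    condition: str


@dataclass(frozen=True)
class PhysicalScan:
    table: str


@dataclass(frozen=True)
class HashJoin:
    left: object
    right: object
    condition: str


def to_physical(node):
    """Convert a logical tree to backend-specific operators, bottom-up."""
    if isinstance(node, LogicalScan):
        return PhysicalScan(node.table)
    if isinstance(node, LogicalJoin):
        return HashJoin(to_physical(node.left),
                        to_physical(node.right),
                        node.condition)
    raise TypeError(f"No conversion for operator: {node!r}")


logical = LogicalJoin(LogicalScan("emp"), LogicalScan("dept"),
                      "emp.dept_id = dept.id")
print(to_physical(logical))
```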
Example: Apache Flink

● Custom physical batch and streaming operators
● Custom cost: row count, CPU, IO, network, memory
● The custom distribution property with an Exchange enforcer
● Custom rules (e.g., subquery rewrite, physical rules)
● Multi-phase optimization: heuristic and cost-based phases

https://github.com/apache/flink/tree/release-1.12.2/flink-table/flink-table-planner-blink/src/main/scala/org/apache/flink/table/planner

Summary

● Apache Calcite is a toolbox to build query engines
○ Syntax analyzer
○ Semantic analyzer
○ Translator
○ Optimization drivers and rules
○ The Enumerable backend
● Apache Calcite dramatically reduces the effort required to build an optimizer for your backend
○ Weeks to have a working prototype
○ Months to have an MVP
○ Year(s) to have a solid product, but not decades!

Links

● Speaker
○ https://www.linkedin.com/in/devozerov/
○ https://twitter.com/devozerov
● Apache Calcite:
○ https://calcite.apache.org/
○ https://github.com/apache/calcite
● Demo:
○ https://github.com/querifylabs/talks-2021-percona
● Our blog:
○ https://www.querifylabs.com/blog
