Ch13-Query Optimization
Find the names of all instructors in the Music department together with the course title of all
the courses that the instructors teach.
Introduction (Cont.)
An evaluation plan defines exactly what algorithm is used for each
operation, and how the execution of the operations is coordinated.
Find out how to view query execution plans on your favorite database
Introduction (Cont.)
$\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$
Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative.
$E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
6. (a) Natural join operations are associative:
$(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
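7. (a) When all the attributes in $\theta_0$ involve only the attributes of one of the expressions
($E_1$) being joined, the selection can be applied to that expression before the join:
$\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$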
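As a worked instance of rule 7a, take the Music-department query from the start of the chapter. The selection on dept_name mentions only an attribute of instructor, so (assuming teaches has no dept_name attribute) it can be pushed below the join:

```latex
\sigma_{dept\_name = \text{``Music''}}(instructor \bowtie teaches)
  \;=\; \sigma_{dept\_name = \text{``Music''}}(instructor) \bowtie teaches
```

The right-hand side joins only the Music instructors with teaches, rather than filtering after joining every instructor.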
$(r_1 \bowtie r_2) \bowtie r_3$
so that we compute and store a smaller temporary relation.
Join Ordering Example (Cont.)
Consider the expression
$\Pi_{name,\,title}(\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie (teaches \bowtie \Pi_{course\_id,\,title}(course)))$
Could compute $teaches \bowtie \Pi_{course\_id,\,title}(course)$ first, and
join the result with
$\sigma_{dept\_name = \text{"Music"}}(instructor)$,
but the result of the first join is likely to be a large relation.
Only a small fraction of the university's instructors are likely to
be from the Music department, so
it is better to compute
$\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches$
first.
Enumeration of Equivalent Expressions
Consider finding the best join order for $r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$.
There are $(2(n-1))!/(n-1)!$ different join orders for the above expression.
With n = 7, the number is 665,280; with n = 10, the number is greater
than 17.6 billion!
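The join-order count can be checked by evaluating the formula directly. The short sketch below does only that; the function name `join_orders` is chosen here for illustration. For n = 7 it gives 665,280 and for n = 10 it gives 17,643,225,600.

```python
from math import factorial


def join_orders(n: int) -> int:
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    if n < 1:
        raise ValueError("need at least one relation")
    # The division is always exact, so integer division is safe.
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (3, 7, 10):
    print(n, join_orders(n))
```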
No need to generate all the join orders. Using dynamic programming,
the least-cost join order for any subset of
{r1, r2, . . . rn} is computed only once and stored for future use.
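The idea of computing the best plan for each subset once and reusing it can be sketched as a memoized recursion. This is a minimal illustration, not the optimizer of any particular system: the cardinalities in `CARD`, the selectivities in `SEL`, and the cost model (sum of estimated intermediate result sizes) are all assumptions invented for the example.

```python
from functools import lru_cache
from itertools import combinations
from math import inf, prod

# Hypothetical statistics, invented for illustration only.
CARD = {"r1": 10, "r2": 1000, "r3": 1000}
SEL = {frozenset({"r1", "r2"}): 0.001, frozenset({"r2", "r3"}): 0.001}


def size(rels):
    """Estimated tuples in the join of rels (cross product if no selectivity is listed)."""
    n = prod(CARD[r] for r in rels)
    for pair in combinations(sorted(rels), 2):
        n *= SEL.get(frozenset(pair), 1.0)
    return n


@lru_cache(maxsize=None)  # memoization: each subset is optimized only once
def find_best_plan(rels):
    """Return (cost, plan) for a frozenset of relation names, allowing bushy trees."""
    if len(rels) == 1:
        (name,) = rels
        return 0.0, name
    best_cost, best_plan = inf, None
    members = sorted(rels)
    # Try every split of rels into a non-empty proper subset s1 and its complement s2.
    for k in range(1, len(members)):
        for left in combinations(members, k):
            s1 = frozenset(left)
            s2 = rels - s1
            c1, p1 = find_best_plan(s1)
            c2, p2 = find_best_plan(s2)
            cost = c1 + c2 + size(rels)  # assumed cost: size of each intermediate result
            if cost < best_cost:
                best_cost, best_plan = cost, (p1, p2)
    return best_cost, best_plan


cost, plan = find_best_plan(frozenset(CARD))
print(cost, plan)
```

With these made-up statistics the cheapest plan joins r1 with r2 first, because that intermediate result is the smallest, and only then joins r3.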
Cost Based Optimization with Equivalence Rules
Physical equivalence rules allow a logical query plan to be converted
to a physical query plan specifying which algorithms are used for each
operation.
An efficient optimizer based on equivalence rules depends on:
A space-efficient representation of expressions that avoids
making multiple copies of subexpressions
Efficient techniques for detecting duplicate derivations of
expressions
A form of dynamic programming based on memoization, which
stores the best plan for a subexpression the first time it is
optimized, and reuses it on repeated optimization calls on the same
subexpression
Cost-based pruning techniques that avoid generating all plans
Pioneered by the Volcano project and implemented in the SQL Server
optimizer
Dynamic Programming in Optimization
Database System Concepts - 6th Edition 1.30 Silberschatz, Korth and Sudarshan
Structure of Query Optimizers (Cont.)
Some query optimizers integrate heuristic selection and the generation of
alternative access plans.
Frequently used approach
heuristic rewriting of nested block structure and aggregation
followed by cost-based join-order optimization for each block
Some optimizers (e.g. SQL Server) apply transformations to entire query
and do not depend on block structure
Optimization cost budget to stop optimization early (if cost of plan is
less than cost of optimization)
Plan caching to reuse previously computed plan if query is resubmitted
Even with different constants in query
Even with the use of heuristics, cost-based query optimization imposes a
substantial overhead.
But it is worth it for expensive queries
Optimizers often use simple heuristics for very cheap queries, and
perform exhaustive enumeration for more expensive queries
Cost of Optimization
With dynamic programming, the time complexity of optimization with bushy
trees is $O(3^n)$.
With n = 10, this number is about 59,000 instead of 17.6 billion!
Space complexity is $O(2^n)$
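One way to see where the bound for bushy trees comes from: the dynamic program visits every subset S of the n relations and, for each, every subset S1 of S. A subset of size k has 2^k subsets, so the total work is the sum over k of C(n, k) * 2^k, which equals 3^n by the binomial theorem. The sketch below just evaluates that sum; the function name is illustrative, and the count includes the empty and full choices of S1 for simplicity.

```python
from math import comb


def dp_split_count(n: int) -> int:
    """Count (S, S1) pairs with S1 a subset of S, over all subsets S of n relations."""
    return sum(comb(n, k) * 2**k for k in range(n + 1))


print(dp_split_count(10), 3**10)
```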
To find best left-deep join tree for a set of n relations:
Consider n alternatives with one relation as right-hand side input and
the other relations as left-hand side input.
Modify optimization algorithm:
Replace "for each non-empty subset $S_1$ of $S$ such that $S_1 \neq S$"
by: "for each relation $r$ in $S$,
let $S_1 = S - r$".
If only left-deep trees are considered, time complexity of finding best join
order is $O(n \, 2^n)$
Space complexity remains at $O(2^n)$
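The left-deep modification can be sketched as follows: instead of trying every split of S, the loop tries each relation r as the right-hand input and recurses on S - r. As before, this is an illustration under stated assumptions: the statistics in `CARD` and `SEL` and the cost model (sum of estimated intermediate result sizes) are invented for the example.

```python
from functools import lru_cache
from itertools import combinations
from math import inf, prod

# Hypothetical statistics, invented for illustration only.
CARD = {"r1": 10, "r2": 1000, "r3": 1000}
SEL = {frozenset({"r1", "r2"}): 0.001, frozenset({"r2", "r3"}): 0.001}


def size(rels):
    """Estimated tuples in the join of rels (cross product if no selectivity is listed)."""
    n = prod(CARD[r] for r in rels)
    for pair in combinations(sorted(rels), 2):
        n *= SEL.get(frozenset(pair), 1.0)
    return n


@lru_cache(maxsize=None)  # still one memo entry per subset: O(2^n) space
def best_left_deep(rels):
    """Return (cost, plan) for a frozenset of relation names, left-deep trees only."""
    if len(rels) == 1:
        (name,) = rels
        return 0.0, name
    best_cost, best_plan = inf, None
    for r in sorted(rels):       # only n alternatives per subset
        s1 = rels - {r}          # S1 = S - r is the left-hand input
        c1, p1 = best_left_deep(s1)
        cost = c1 + size(rels)   # the single relation r contributes no prior cost
        if cost < best_cost:
            best_cost, best_plan = cost, (p1, r)
    return best_cost, best_plan


cost, plan = best_left_deep(frozenset(CARD))
print(cost, plan)
```

In every plan this returns, the right-hand input of each join is a single base relation, which is what makes the tree left-deep.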
Cost-based optimization is expensive, but worthwhile for queries on
large datasets (typical queries have small n, generally < 10)
Statistical Information for Cost Estimation
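$n_r$: number of tuples in relation $r$
$b_r$: number of blocks containing tuples of $r$
$f_r$: blocking factor of $r$, i.e., the number of tuples of $r$ that fit into one block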
Histograms
[Figure: histogram on attribute age of relation person; vertical axis: frequency]
Selection Size Estimation
$\sigma_{A=v}(r)$
$n_r / V(A,r)$: number of records that will satisfy the selection
Equality condition on a key attribute: size estimate = 1
$\sigma_{A \le v}(r)$ (the case of $\sigma_{A \ge v}(r)$ is symmetric)
Let c denote the estimated number of tuples satisfying the condition.
If min(A,r) and max(A,r) are available in the catalog:
$c = 0$ if $v < \min(A,r)$
$c = n_r$ if $v \ge \max(A,r)$
$c = n_r \cdot \dfrac{v - \min(A,r)}{\max(A,r) - \min(A,r)}$ otherwise
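The selection estimates above translate directly into code. The sketch below assumes values are uniformly distributed, which is the assumption behind both formulas; the function names and the sample numbers in the usage lines are made up for illustration.

```python
def estimate_equality(n_r: int, distinct_values: int) -> float:
    """Estimated size of sigma_{A=v}(r): n_r / V(A, r)."""
    return n_r / distinct_values


def estimate_less_equal(n_r: int, v: float, min_a: float, max_a: float) -> float:
    """Estimated size of sigma_{A<=v}(r), given min(A, r) and max(A, r) from the catalog."""
    if v < min_a:
        return 0.0
    if v >= max_a:
        return float(n_r)
    # Linear interpolation between the minimum and maximum values.
    return n_r * (v - min_a) / (max_a - min_a)


# Hypothetical relation: 1000 tuples, attribute values ranging from 20 to 60.
print(estimate_less_equal(1000, 30, 20, 60))
print(estimate_equality(1000, 50))
```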
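Two estimates of the size of $r \bowtie s$ on common attribute $A$:
$\dfrac{n_r \cdot n_s}{V(A, s)}$ and $\dfrac{n_r \cdot n_s}{V(A, r)}$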
The lower of these two estimates is probably the more accurate one.
Can improve on above if histograms are available
Use formula similar to above, for each cell of histograms on the
two relations
Estimation of the Size of Joins (Cont.)
Compute the size estimates for $student \bowtie takes$ without using
information about foreign keys:
V(ID, takes) = 2500, and
V(ID, student) = 5000
The two estimates are 5000 * 10000/2500 = 20,000 and 5000 *
10000/5000 = 10,000
We choose the lower estimate, which in this case, is the same as
our earlier computation using foreign keys.
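The arithmetic in this example can be reproduced with a few lines. The sketch computes both estimates, dividing the product of the relation sizes by the distinct-value count of each side in turn, and keeps the lower one; the function name is illustrative, and the relation sizes 5000 and 10000 are the ones used in the calculation above.

```python
def join_size_estimates(n_r: int, n_s: int, v_a_r: int, v_a_s: int):
    """Return (estimate using V(A, s), estimate using V(A, r), the lower of the two)."""
    est_s = n_r * n_s / v_a_s
    est_r = n_r * n_s / v_a_r
    return est_s, est_r, min(est_s, est_r)


# 5000 and 10000 tuples, with V(ID, student) = 5000 and V(ID, takes) = 2500.
est_takes, est_student, chosen = join_size_estimates(5000, 10000, 5000, 2500)
print(est_takes, est_student, chosen)
```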
End of Chapter