CAS CS 460/660 Introduction To Database Systems Query Optimization
Query Optimization
Review
Implementation of Relational Operations as Iterators
Focus largely on External algorithms (sorting/hashing)
Choices depend on indexes, memory, stats,…
Joins
Blocked nested loops:
simple, exploits extra memory
Indexed nested loops:
best if one relation is small and the other is indexed
Sort/Merge Join
good with small amount of memory, bad with duplicates
Hash Join
fast (enough memory), bad with skewed data
Relatively easy to parallelize
Sort and Hash-Based Aggs and DupElim
Query Optimization Overview
Query can be converted to relational algebra
Rel. Algebra converted to tree, joins as branches
Each operator has implementation choices
Operators can also be applied in different order!
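$\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$
[Figure: the same expression drawn as a tree, with Reserves and Sailors as the leaves.]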
Iterator Interface (pull from the top)
Recall:
•Relational operators at nodes support uniform iterator interface: Open( ), get_next( ), close( )
•Unary Ops – On Open() call Open() on child.
•Binary Ops – call Open() on left child then on right.
•By convention, outer is on left.
[Figure: plan tree with projection on sname at the root, then selection bid=100 AND rating > 5, then join on sid=sid, with Reserves (left) and Sailors (right) as the leaves.]
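The pull-based interface above can be sketched in a few lines of Python. This is only an illustration, not the course's reference implementation: the class names (Scan, Select, Project, NestedLoopsJoin), the helper run(), and the sample rows are all assumptions made for the example. Tuples are plain dicts and relations are in-memory lists.

```python
class Scan:
    """Leaf operator over an in-memory list of rows (hypothetical stand-in for a file scan)."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = None
    def open(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None                      # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        self.pos = None

class Select:
    """Unary operator: open() simply opens the child."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def get_next(self):
        row = self.child.get_next()
        while row is not None and not self.predicate(row):
            row = self.child.get_next()
        return row
    def close(self):
        self.child.close()

class Project:
    """Unary operator that keeps only the listed columns."""
    def __init__(self, child, columns):
        self.child, self.columns = child, columns
    def open(self):
        self.child.open()
    def get_next(self):
        row = self.child.get_next()
        return None if row is None else {c: row[c] for c in self.columns}
    def close(self):
        self.child.close()

class NestedLoopsJoin:
    """Binary operator: opens the left (outer) child first, then the right (inner)."""
    def __init__(self, left, right, predicate):
        self.left, self.right, self.predicate = left, right, predicate
        self.outer_row = None
    def open(self):
        self.left.open()
        self.right.open()
        self.outer_row = self.left.get_next()
    def get_next(self):
        while self.outer_row is not None:
            inner_row = self.right.get_next()
            if inner_row is None:
                # Inner exhausted: advance the outer and rescan the inner.
                self.outer_row = self.left.get_next()
                self.right.close()
                self.right.open()
                continue
            if self.predicate(self.outer_row, inner_row):
                return {**self.outer_row, **inner_row}
        return None
    def close(self):
        self.left.close()
        self.right.close()

def run(plan):
    """Pull every row from the root of the plan."""
    plan.open()
    out = []
    row = plan.get_next()
    while row is not None:
        out.append(row)
        row = plan.get_next()
    plan.close()
    return out

# Made-up sample data for the example query.
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 100}, {"sid": 3, "bid": 101}]
sailors = [{"sid": 1, "sname": "Dustin", "rating": 7},
           {"sid": 2, "sname": "Lubber", "rating": 3},
           {"sid": 3, "sname": "Rusty", "rating": 10}]

plan = Project(
    Select(NestedLoopsJoin(Scan(reserves), Scan(sailors),
                           lambda r, s: r["sid"] == s["sid"]),
           lambda t: t["bid"] == 100 and t["rating"] > 5),
    ["sname"])
result = run(plan)
```

Calling run(plan) drives the whole tree from the root: each get_next() on the projection pulls from the selection, which pulls from the join, which pulls from the two scans.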
Query Optimization Overview (cont)
Cost-based Query Sub-System
Example query: SELECT * FROM Blah B WHERE B.blah = blah
Usually there is a heuristics-based rewriting step before the cost-based steps.
[Figure: queries flow through the Query Parser, then the Query Optimizer, then the Query Plan Evaluator; the figure also shows Schema and Statistics.]
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
[Figure: plan tree with Sailors and Reserves as the leaves.]
Alternative Plans – Push Selects
(No Indexes)
Each plan projects sname on-the-fly and joins on sid=sid with page-oriented nested loops (outer on the left). The plans differ in where the selections run and in which relation is the outer:
•Outer: rating > 5 (on-the-fly) on Sailors. Inner: Reserves. bid=100 (on-the-fly) above the join. 250,500 IOs
•Outer: rating > 5 (on-the-fly) on Sailors. Inner: bid=100 (on-the-fly) on Reserves. 250,500 IOs
•Outer: bid=100 (on-the-fly) on Reserves. Inner: Sailors. rating > 5 (on-the-fly) above the join. 6000 IOs
•Outer: bid=100 (on-the-fly) on Reserves. Inner: rating > 5 on Sailors (scan & write to temp T2). 1000 + 500 + 250 + (10 * 250) = 4250 IOs
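The I/O counts quoted for the push-select plans (250,500, 6000 and 4250) can be re-derived with a small calculation. The sketch below assumes the sizes used elsewhere in these slides (Reserves 1000 pages, Sailors 500 pages, bid=100 keeping 10 pages, rating > 5 keeping 250 pages). The function name page_nl_cost and its parameters are made up for illustration.

```python
def page_nl_cost(outer_pages, outer_pages_kept, inner_pages, setup=0):
    """I/O cost of page-oriented nested loops.

    The outer is read once; the inner is scanned once for every outer page
    that survives a selection applied on-the-fly to the outer.
    `setup` covers any work done beforehand, such as materializing the inner.
    """
    return setup + outer_pages + outer_pages_kept * inner_pages

RESERVES_PAGES = 1000
SAILORS_PAGES = 500
T1_PAGES = 10    # Reserves pages left after bid=100 (100 boats, uniform)
T2_PAGES = 250   # Sailors pages left after rating > 5 (10 ratings)

# Sailors (rating > 5) as outer, Reserves as inner: 500 + 250 * 1000
sailors_outer = page_nl_cost(SAILORS_PAGES, T2_PAGES, RESERVES_PAGES)

# Reserves (bid=100) as outer, Sailors as inner: 1000 + 10 * 500
reserves_outer = page_nl_cost(RESERVES_PAGES, T1_PAGES, SAILORS_PAGES)

# Same, but Sailors is first scanned (500) and written to temp T2 (250),
# so each inner scan reads only the 250 pages of T2.
materialized_inner = page_nl_cost(RESERVES_PAGES, T1_PAGES, T2_PAGES,
                                  setup=SAILORS_PAGES + T2_PAGES)

print(sailors_outer, reserves_outer, materialized_inner)
```

Pushing bid=100 onto the inner Reserves on-the-fly does not change the first figure, because the inner is still rescanned in full for every surviving outer page.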
Alternative Plans 1
(No Indexes)
Main difference: Sort Merge Join
[Figure: plan tree with projection on sname (on-the-fly) over a sort-merge join on sid=sid. Left input: selection bid=100 on Reserves (scan; write to temp T1). Right input: selection rating > 5 on Sailors (scan; write to temp T2).]
With 5 buffers, cost of plan:
Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution).
Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
Sort T1 (2*2*10), sort T2 (2*4*250), merge (10+250).
Total: 4060 page I/Os. (Note: T2 sort takes 4 passes with B=5.)
If we use BNL join, join cost = 10+4*250, total cost = 2770.
Can also "push" projections, but must be careful!
T1 has only sid, T2 only sid, sname:
T1 fits in 3 pgs, cost of BNL under 250 pgs, total < 2000.
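The arithmetic behind the 4060 and 2770 totals can be checked with a short script. It assumes the standard external merge sort model (pass 0 builds runs of B pages, each later pass merges B-1 runs) and a block nested loops join that uses B-2 pages per outer block. The function names are illustrative only.

```python
import math

def external_sort_passes(pages, buffers):
    """Number of passes of external merge sort with `buffers` buffer pages."""
    runs = math.ceil(pages / buffers)             # pass 0: sorted runs of B pages
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))    # (B-1)-way merge
        passes += 1
    return passes

def external_sort_cost(pages, buffers):
    """Each pass reads and writes every page once."""
    return 2 * pages * external_sort_passes(pages, buffers)

B = 5
RESERVES_PAGES, SAILORS_PAGES = 1000, 500
T1_PAGES, T2_PAGES = 10, 250

# Scan both relations and write the filtered temps T1 and T2.
materialize = RESERVES_PAGES + T1_PAGES + SAILORS_PAGES + T2_PAGES

# Sort-merge join: sort both temps, then one merging pass over each.
smj_total = (materialize
             + external_sort_cost(T1_PAGES, B)
             + external_sort_cost(T2_PAGES, B)
             + T1_PAGES + T2_PAGES)

# Block nested loops with T1 as outer: read T1 once, scan T2 once per block.
outer_blocks = math.ceil(T1_PAGES / (B - 2))
bnl_join = T1_PAGES + outer_blocks * T2_PAGES
bnl_total = materialize + bnl_join

print(smj_total, bnl_total)
```

The sort of T2 dominates the sort-merge plan (2000 of the 4060 I/Os), which is why the block nested loops variant comes out cheaper here.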
Alt Plan 2: Indexes
[Figure: plan tree; the top operators are projection on sname (on-the-fly) and selection rating > 5 (on-the-fly).]
What is needed for optimization?
Iterator Interface
Cost Estimation
Statistics and Catalogs
Size Estimation and Reduction Factors
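As a rough illustration of the last item, the sketch below estimates the size of a selection's output by multiplying the input size by a reduction factor. It assumes uniformly distributed values and uses the numbers from the example plans (100 boats, 10 ratings, Reserves 1000 pages, Sailors 500 pages). The helper names are hypothetical.

```python
from fractions import Fraction

def reduction_factor(qualifying_values, distinct_values):
    """Fraction of tuples expected to pass a predicate, assuming a uniform
    distribution over `distinct_values` possible values."""
    return Fraction(qualifying_values, distinct_values)

def estimated_pages(input_pages, *factors):
    """Input size multiplied by the reduction factor of each predicate."""
    size = Fraction(input_pages)
    for rf in factors:
        size *= rf
    return size

rf_bid = reduction_factor(1, 100)     # bid = 100, with 100 boats
rf_rating = reduction_factor(5, 10)   # rating > 5, with ratings 1..10

t1_pages = estimated_pages(1000, rf_bid)     # filtered Reserves
t2_pages = estimated_pages(500, rf_rating)   # filtered Sailors

print(t1_pages, t2_pages)
```

Exact fractions are used so that the estimates do not pick up floating-point noise.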
Query Blocks: Units of Optimization
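SELECT S.sname
FROM Sailors S
WHERE S.age IN
    (SELECT MAX (S2.age)
     FROM Sailors S2
     GROUP BY S2.rating)
Outer block: the top-level SELECT ... FROM Sailors S WHERE S.age IN (...).
Nested block: the parenthesized subquery over Sailors S2.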
An SQL query is parsed into a collection of query blocks, and these are
optimized one block at a time.
Translating SQL to Relational Algebra
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
AND S.rating = ( SELECT MAX (S2.rating) FROM Sailors S2)
GROUP BY S.sid
HAVING COUNT (*) >= 2
For each sailor with the highest rating (over all sailors), and at least two
reservations for red boats, find the sailor id and the earliest date on which the
sailor has a reservation for a red boat.
Relational Algebra Equivalences
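Allow us to choose different operator orders and to "push" selections and projections ahead of joins.
Selections:
$\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots(\sigma_{c_n}(R))\dots)$ (Cascade)
$\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
Projections:
$\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R))\dots)$ (Cascade)
Joins:
$R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
$(R \bowtie S) \equiv (S \bowtie R)$ (Commute)
These two mean we can do joins in any order.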
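The selection and join equivalences can be checked on a toy instance. The sketch below is an assumption-laden illustration: relations are lists of dicts, the join is a natural join on shared column names, and results are compared as sets of tuples, so duplicates are ignored. The sample rows and helper names are made up.

```python
def select(rel, pred):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

def natural_join(r, s):
    """Natural join on whatever column names the two inputs share."""
    out = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                out.append({**a, **b})
    return out

def as_set(rel):
    """Order-insensitive view of a relation (set semantics)."""
    return {frozenset(t.items()) for t in rel}

def same(r, s):
    return as_set(r) == as_set(s)

sailors = [{"sid": 1, "sname": "Dustin", "rating": 7},
           {"sid": 2, "sname": "Lubber", "rating": 3}]
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 100}, {"sid": 1, "bid": 101}]
boats = [{"bid": 100, "color": "red"}, {"bid": 101, "color": "green"}]

joined = natural_join(reserves, sailors)
c1 = lambda t: t["bid"] == 100
c2 = lambda t: t["rating"] > 5

# Cascade: one conjunctive selection equals a chain of single selections.
cascade_ok = same(select(joined, lambda t: c1(t) and c2(t)),
                  select(select(joined, c2), c1))

# Commute: the order of two selections does not matter.
select_commute_ok = same(select(select(joined, c1), c2),
                         select(select(joined, c2), c1))

# Joins commute and associate, so any join order gives the same result.
join_commute_ok = same(natural_join(reserves, sailors),
                       natural_join(sailors, reserves))
join_assoc_ok = same(natural_join(sailors, natural_join(reserves, boats)),
                     natural_join(natural_join(sailors, reserves), boats))

print(cascade_ok, select_commute_ok, join_commute_ok, join_assoc_ok)
```

A check on one instance is of course not a proof, but it shows concretely what the optimizer is allowed to reorder.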