CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization
Fall 2016
Announcements
Additional resources:
• Chaudhuri, "An Overview of Query Optimization in
Relational Systems," Proceedings of ACM PODS, 1998
[Figure: query processing pipeline. Query optimization selects a logical plan and then a physical plan; the physical plan is passed to query execution, which reads from disk.]
What We Already Know…
Supplier(sno,sname,scity,sstate)
Part(pno,pname,psize,pcolor)
Supply(sno,pno,price)
For each SQL query….
SELECT S.sname
FROM Supplier S, Supply U
WHERE S.scity='Seattle' AND S.sstate='WA'
AND S.sno = U.sno
AND U.pno = 2
[Figure: logical plan tree with π sname at the root over a join on sno = sno between Supplier and Supply.]
CSE 544 - Fall 2016 6
Example Query: Logical Plan 2
[Figure: logical plan tree with π sname at the root over a join on sno = sno between Supplier and Supply.]
What We Also Know
[Figure: physical plan tree. Nested loop join on sno = sno over Supplier (file scan) and Supply (file scan).]
Example Query: Physical Plan 2
[Figure: physical plan tree. π sname (on the fly) at the root, over Supplier (file scan) and Supply (index scan).]
Query Optimization
• We still need to do
– Access path selection: compute cost of retrieving tuples from disk
with different access paths
– Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
Access Path
• A file scan
• Supplier(sid,sname,scity,sstate)
• Let’s assume
– V(Supplier,scity) = 20
– Max(Supplier, sid) = 1000, Min(Supplier,sid)=1
– B(Supplier) = 100, T(Supplier) = 1000
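A minimal sketch of access path selection using the statistics above, for an equality selection such as scity='Seattle'. The three cost formulas (B(R) for a file scan, B(R)/V(R,a) for a clustered index, T(R)/V(R,a) for an unclustered index) are the usual textbook estimates and are assumed here, not taken from the slide; the names `ColumnStats` and `access_path_costs` are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ColumnStats:
    blocks: int    # B(R): number of disk blocks
    tuples: int    # T(R): number of tuples
    distinct: int  # V(R, a): number of distinct values of attribute a


def access_path_costs(s: ColumnStats) -> dict:
    """Estimated I/Os for an equality selection on attribute a.

    Assumes values are uniformly distributed and ignores the cost of
    probing the index itself (both are simplifying assumptions).
    """
    return {
        "file scan": s.blocks,
        "clustered index": ceil(s.blocks / s.distinct),
        "unclustered index": ceil(s.tuples / s.distinct),
    }


# Statistics from the slide: B(Supplier) = 100, T(Supplier) = 1000,
# V(Supplier, scity) = 20.
supplier_scity = ColumnStats(blocks=100, tuples=1000, distinct=20)
costs = access_path_costs(supplier_scity)
cheapest = min(costs, key=costs.get)

for path, cost in costs.items():
    print(f"{path}: {cost} I/Os")
print("cheapest:", cheapest)
```

Under these assumptions the optimizer would compare the three estimates and keep the cheapest access path for the selection.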
Statistics on Base Data
Join R ⋈ S
T(Supplier) = 1000, B(Supplier) = 100, V(Supplier,scity) = 20, V(Supplier,sstate) = 10
T(Supply) = 10,000, B(Supply) = 100, V(Supply,pno) = 2,500
M = 11
[Figure: physical plan over Supplier (file scan) and Supply (file scan).]
Plan 2 with Different Numbers
What if we had:
– 10K pages of Suppliers
– 10K pages of Supplies
Plan, bottom up:
(1) σ scity='Seattle' ∧ sstate='WA' over Supplier (file scan); scan, write to T1
(2) σ pno=2 over Supply (file scan); scan, write to T2
(3) Sort-merge join on sno = sno, assuming the naive two-pass sort algorithm
(4) π sname
Total cost = 10000 + 50 (1)
+ 10000 + 4 (2)
+ 4*50 + 2*4 + 4 + 50 (3)
+ 0 (4)
Total cost ≈ 20,316 I/Os
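The arithmetic for Plan 2 with 10K-page inputs can be checked with a short sketch. It assumes uniformly distributed values and independent predicates, so the size of T1 is B(Supplier) / (V(Supplier,scity) * V(Supplier,sstate)) and the size of T2 is B(Supply) / V(Supply,pno). Reading the slide's term 4*50 as 4*B(T1) and 2*4 as 2*B(T2) is an interpretation, and the helper name `selection_pages` is hypothetical.

```python
from math import ceil


def selection_pages(pages, *distinct_counts):
    """Pages left after equality selections, one per V(R, a) given.

    Assumes uniform values and independent predicates.
    """
    size = pages
    for v in distinct_counts:
        size /= v
    return ceil(size)


B_SUPPLIER = 10_000  # "10K pages of Suppliers"
B_SUPPLY = 10_000    # "10K pages of Supplies"

t1 = selection_pages(B_SUPPLIER, 20, 10)  # scity='Seattle' AND sstate='WA'
t2 = selection_pages(B_SUPPLY, 2_500)     # pno=2

step1 = B_SUPPLIER + t1            # scan Supplier, write T1
step2 = B_SUPPLY + t2              # scan Supply, write T2
# Sort-merge join term copied from the slide: 4*50 + 2*4 + 4 + 50.
# Mapping 50 to B(T1) and 4 to B(T2) is an assumption.
step3 = 4 * t1 + 2 * t2 + t2 + t1
step4 = 0                          # projection adds no I/O
total = step1 + step2 + step3 + step4

print(t1, t2, step3, total)
```

The four steps sum to the slide's total of 20,316 I/Os; the two initial file scans dominate the cost.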
T(Supplier) = 1000, B(Supplier) = 100, V(Supplier,scity) = 20, V(Supplier,sstate) = 10
T(Supply) = 10,000, B(Supply) = 100, V(Supply,pno) = 2,500
M = 11
[Figure: physical plan over Supply (hash index on pno; assume clustered) and Supplier (hash index on sno; clustering does not matter).]
Simplifications
Query Optimization
• Selections
– Commutative: σ_c1(σ_c2(R)) same as σ_c2(σ_c1(R))
– Cascading: σ_{c1∧c2}(R) same as σ_c2(σ_c1(R))
• Projections
– Cascading
• Joins
– Commutative: R ⋈ S same as S ⋈ R
– Associative: R ⋈ (S ⋈ T) same as (R ⋈ S) ⋈ T
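The laws above can be checked on small data. The sketch below is only an illustration: relations are lists of dicts, the join is a natural join, and results are compared as sets of tuples. The sample rows and the helper names `select`, `join` and `as_set` are made up for the example.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def join(r, s):
    """Natural join: combine tuples that agree on all shared attributes."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in rel}


# Hypothetical sample data for R(A,B), S(B,C), T(C,D).
R = [{"A": 1, "B": 10}, {"A": 50, "B": 20}, {"A": 7, "B": 20}]
S = [{"B": 10, "C": 100}, {"B": 20, "C": 200}]
T = [{"C": 100, "D": "x"}, {"C": 200, "D": "y"}, {"C": 300, "D": "z"}]


def c1(t):
    return t["A"] < 40


def c2(t):
    return t["B"] == 20


laws = {
    "selection commutative":
        as_set(select(c1, select(c2, R))) == as_set(select(c2, select(c1, R))),
    "selection cascading":
        as_set(select(lambda t: c1(t) and c2(t), R))
        == as_set(select(c2, select(c1, R))),
    "join commutative":
        as_set(join(R, S)) == as_set(join(S, R)),
    "join associative":
        as_set(join(R, join(S, T))) == as_set(join(join(R, S), T)),
}

for name, holds in laws.items():
    print(name, holds)
```

A check on one data set is not a proof, but it shows why an optimizer can apply these rewrites freely: each side produces the same result, so only the cost differs.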
[Figure: two join trees over R1, R2, R3, R4. Left-deep plan: the right input of every join is a base relation. Bushy plan: (R3 ⋈ R1) ⋈ (R2 ⋈ R4).]
• Typical compromises:
– Only left-deep plans
– Only plans without cartesian products
– Always push selections down to the leaves
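To see how much the left-deep restriction shrinks the search space, the sketch below counts join trees for n relations. It uses the standard counts n! for left-deep trees and n! * Catalan(n-1) for all binary trees, treating the left and right inputs of a join as distinct and allowing cartesian products. These formulas are not on the slide, and the function names are hypothetical.

```python
from math import comb, factorial


def catalan(k):
    """k-th Catalan number: the number of binary tree shapes with k+1 leaves."""
    return comb(2 * k, k) // (k + 1)


def left_deep_plans(n):
    """Left-deep join trees over n relations: one per ordering of the leaves."""
    return factorial(n)


def all_plans(n):
    """All binary join trees (left-deep and bushy) over n relations."""
    return factorial(n) * catalan(n - 1)


for n in (2, 3, 4, 6, 8, 10):
    print(n, left_deep_plans(n), all_plans(n))
```

Both counts grow very quickly with n, which is why practical optimizers combine such restrictions with pruning rather than enumerating every tree.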
Query Optimization
• Heuristic-based optimizers:
– Greedily apply rules that always improve the plan
• Typically: push selections down
– Very limited: no longer used today
• Cost-based optimizers:
– Use a cost model to estimate the cost of each plan
– Select the “cheapest” plan
– We focus on cost-based optimizers
• Bottom-up plans
• Top-down plans
R(A,B)
S(B,C)
T(C,D)

SELECT *
FROM R, S, T
WHERE R.B=S.B and S.C=T.C and R.A<40

[Figure: alternative complete plans for this query, each a tree of ⨝ and σ A<40 operators over R, S, T.]
Why is this search space inefficient?
Bottom-up Partial Plans
R(A,B)
S(B,C)
T(C,D)

SELECT *
FROM R, S, T
WHERE R.B=S.B and S.C=T.C and R.A<40

[Figure: partial plans built bottom up, each a small tree of ⨝ and σ A<40 operators over R, S, T, labeled with the subquery it computes:]

SELECT *
FROM R
WHERE R.A < 40

SELECT *
FROM R, S
WHERE R.B=S.B
and R.A < 40

SELECT R.A, T.D
FROM R, S, T
WHERE R.B=S.B
and S.C=T.C

…..

Why is this better?
• Inspired by System R
– Left-deep plans and dynamic programming
– Cost-based optimization (CPU and IO)
• Rule-based
– Extensible collection of rules
– Rule = Algebraic law with a direction
– Algorithm for firing these rules
• Generate many alternative plans, in some order
• Prune by cost
– Starburst (later DB2) and Volcano (later SQL Server)
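The System R style of search (left-deep plans, dynamic programming, prune by cost) can be sketched as follows. This is an illustration under loud assumptions, not System R's actual algorithm: the cost of a plan is taken to be the sum of its intermediate result sizes rather than a CPU and I/O estimate, cartesian products are skipped, and the cardinalities, selectivities and the name `best_left_deep_plan` are invented for the example.

```python
from fractions import Fraction
from itertools import combinations


def best_left_deep_plan(card, sel):
    """Cheapest left-deep join order found by bottom-up dynamic programming.

    card: {relation: number of tuples}
    sel:  {frozenset({r, s}): join selectivity}, one entry per join predicate
    Returns (cost, result size, join order). The cost model (sum of the
    intermediate result sizes) is an assumption made for this sketch, and
    the join graph is assumed to be connected.
    """
    rels = sorted(card)
    # best[subset] = cheapest (cost, size, order) found for that subset
    best = {frozenset([r]): (0, card[r], (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for r in sorted(subset):      # r is the relation joined last
                rest = subset - {r}
                if rest not in best:      # rest had no plan without a product
                    continue
                preds = [sel[frozenset({r, x})] for x in rest
                         if frozenset({r, x}) in sel]
                if not preds:             # would be a cartesian product
                    continue
                cost, size, order = best[rest]
                new_size = size * card[r]
                for p in preds:
                    new_size *= p
                new_cost = cost + new_size
                # Prune: keep only the cheapest plan per subset.
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, new_size, order + (r,))
    return best[frozenset(rels)]


# Hypothetical statistics for R(A,B), S(B,C), T(C,D).
card = {"R": 100, "S": 10_000, "T": 1_000}
sel = {
    frozenset({"R", "S"}): Fraction(1, 1000),  # R.B = S.B
    frozenset({"S", "T"}): Fraction(1, 100),   # S.C = T.C
}
cost, size, order = best_left_deep_plan(card, sel)
print(cost, size, order)
```

With these made-up numbers, joining R with S first yields a much smaller intermediate result than joining S with T first, so the plan that joins T last wins. Keeping one best plan per subset of relations is what lets the search avoid enumerating every complete plan.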