CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization

CSE 544
Principles of Database
Management Systems
Fall 2016
Lecture 8 - Query optimization
Announcements
•  HW2 (SimpleDB) is due next Friday!
•  Midterm in two weeks, Nov. 10, in class
•  Project Milestone due on Nov. 16
CSE 544 - Fall 2016 2

Recommended Readings
Access path selection in a relational database management

system.
Selinger. et. al. SIGMOD 1979
Additional resources:
•  Chaudhuri, "An Overview of Query Optimization in
Relational Systems," Proceedings of ACM PODS, 1998
•  Database management systems.

Ramakrishnan and Gehrke.
Third Ed. Chapter 15.
CSE 544 - Fall 2016 3
Query Optimization Motivation
SQL query
Declarative query
Recall physical and
Parse & Rewrite Query logical data independence
Logical
Query Select Logical Plan
plan
optimization
Select Physical Plan
Physical
plan
Query Execution
Disk 4
What We Already Know…
Supplier(sno,sname,scity,sstate)
Part(pno,pname,psize,pcolor)
Supply(sno,pno,price)
For each SQL query….
SELECT S.sname
FROM Supplier S, Supply U
WHERE S.scity='Seattle' AND S.sstate='WA’
AND S.sno = U.sno
AND U.pno = 2
There exist many logical query plan…
CSE 544 - Fall 2016 5

Example Query: Logical Plan 1
π sname
σ sscity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
sno = sno
Supplier Supply
CSE 544 - Fall 2016 6
Example Query: Logical Plan 2
π sname
sno = sno
σ sscity=‘Seattle’ ∧sstate=‘WA’ σ pno=2
Supplier Supply
CSE 544 - Fall 2016 7
What We Also Know
•  For each logical plan…
•  There exist many physical plans
CSE 544 - Fall 2016 8

Example Query: Physical Plan 1
(On the fly) π sname
(On the fly)

σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
(Nested loop)
sno = sno
Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 9
Example Query: Physical Plan 2
(On the fly) π sname
(On the fly)

(Index nested loop)

sno = sno
Supplier Supply
(File scan) (Index scan)
CSE 544 - Fall 2016 10
Query Optimization
Three major components:
1.  Cardinality and cost estimation
2.  Search space
3.  Plan enumeration algorithms
CSE 544 - Fall 2016 11

Estimating Cost of a Query Plan
Goal: compute the cost of an entire physical query plan
•  We already know how to

–  Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s
•  We still need to do
–  Access path selection: compute cost of retrieving tuples from disk
with different access paths
–  Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 12
Access Path
Access path: a way to retrieve tuples from a table
•  A file scan
•  An index plus a matching selection condition
CSE 544 - Fall 2016 13

Access Path Selection
•  Supplier(sid,sname,scity,sstate)
•  Selection condition: sid > 300 ∧ scity=‘Seattle’
•  Indexes: B+-tree on sid and B+-tree on scity
•  Which access path should we use?
•  We should pick the most selective access path
CSE 544 - Fall 2016 14

Access Path Selectivity
•  Access path selectivity is the number of pages

retrieved if we use this access path
–  Most selective retrieves fewest pages
•  As we saw earlier, for equality predicates

–  Selection on equality: σa=v(R)
–  V(R, a) = # of distinct values of attribute a
–  1/V(R,a) is thus the reduction factor
–  Clustered index on a: cost B(R)/V(R,a)
–  Unclustered index on a: cost T(R)/V(R,a)
–  (we are ignoring I/O cost of index pages for simplicity)
CSE 544 - Fall 2016 15

Selectivity for Range Predicates
Selection on range: σa>v(R)
•  How to compute the selectivity?

•  Assume values are uniformly distributed
•  Reduction factor X
•  X = (Max(R,a) - v) / (Max(R,a) - Min(R,a))
•  Clustered index on a: cost B(R)*X

•  Unclustered index on a: cost T(R)*X
CSE 544 - Fall 2016 16

Back to Our Example
•  Selection condition: sid > 300 ∧ scity=‘Seattle’

–  Index I1: B+-tree on sid clustered
–  Index I2: B+-tree on scity unclustered
•  Let’s assume
–  V(Supplier,scity) = 20
–  Max(Supplier, sid) = 1000, Min(Supplier,sid)=1
–  B(Supplier) = 100, T(Supplier) = 1000
•  Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70

•  Cost I2: T(R) * 1/V(Supplier,scity) = 1000/20 = 50
CSE 544 - Fall 2016 17

Selectivity with
Multiple Conditions
What if we have an index on multiple attributes?
•  Example selection σa=v1 ∧ b= v2(R) and index on <a,b>
How to compute the selectivity?

•  Assume attributes are independent
•  X = 1 / (V(R,a) * V(R,b))
•  Clustered index on <a,b>: cost B(R)*X

•  Unclustered index on <a,b>: cost T(R)*X
CSE 544 - Fall 2016 18

Estimating Cost of a Query Plan

CSE 544 - Fall 2016 19
Statistics on Base Data
•  Collected information for each relation

–  Number of tuples (cardinality) T(R)
–  Number of physical pages B(R), clustering info
–  Indexes, number of keys in the index V(R,a)
–  Statistical information on attributes
•  Min value, max value, number distinct values
•  Histograms
–  Correlations between columns (hard)
•  Collection approach: periodic, using sampling
CSE 544 - Fall 2016 20

Size Estimation
Projection: output size same as input size
Selection: multiply input size by reduction factor

•  Similar to what we did for estimating access path
selectivity
•  Assume independence between conditions in the
predicate
•  Examples:
T(σA=...(R)) = T(R) / V(R,A)
T(σA=...∧ B=... (R)) = T(R) / (V(R,A) * V(R,B))
CSE 544 - Fall 2016 21

Estimating Result Sizes
Join R ⋈ S
•  Take product of cardinalities of relations R and S

•  Apply reduction factors for each term in join condition
•  Terms are of the form: column1 = column2
•  Reduction: 1/ ( MAX( V(R,column1), V(S,column2))
•  Why? Will explain next...
CSE 544 - Fall 2016 22

Assumptions
•  Containment of values: if V(R,A) <= V(S,B), then

the set of A values of R is included in the set of B
values of S
–  Note: this indeed holds when A is a foreign key in R,
and B is a key in S
•  Preservation of values: for any other attribute C,

V(R ⨝A=B S, C) = V(R, C) (or V(S, C))
CSE 544 - Fall 2016 23

Selectivity of R ⨝A=B S
Assume V(R,A) <= V(S,B)

•  Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S
•  Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)
In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))
CSE 544 - Fall 2016 24

Complete Example
Supplier(sid, sname, scity, sstate) SELECT sname

Supply(sid, pno, quantity) FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
•  Some statistics
and x.scity = ‘Seattle’
–  T(Supplier) = 1000 records and x.sstate = ‘WA’
–  T(Supply) = 10,000 records
–  B(Supplier) = 100 pages
–  B(Supply) = 100 pages
–  V(Supplier,scity) = 20, V(Suppliers,state) = 10
–  V(Supply,pno) = 2,500
–  Both relations are clustered
•  M = 11
CSE 544 - Fall 2016 25
Computing the Cost of a Plan
•  Estimate cardinality in a bottom-up fashion

–  Cardinality is the size of a relation (nb of tuples)
–  Compute size of all intermediate relations in plan
•  Estimate cost by using the estimated cardinalities
CSE 544 - Fall 2016 26

T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 1

(On the fly) π sname Selection and project on-the-fly
-> No additional cost.
(On the fly)
Total cost of plan is thus cost of join:
= B(Supplier)+B(Supplier)*B(Supplies)
(Nested loop) = 100 + 100 * 100
sno = sno
= 10,100 I/Os
Supplier Supply
CSE 544 - Fall 2016 27

Total cost
(On the fly) π sname (4) = 100 + 100 * 1/20 * 1/10 (1)
+ 100 + 100 * 1/2500 (2)
+ 2 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 204 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2
Supplier Supply
CSE 544 - Fall 2016 28
Plan 2 with Different Numbers
Total cost
What if we had: π sname (4) = 10000 + 50 (1)
10K pages of Suppliers
+ 10000 + 4 (2)
10K pages of Supplies
+ 4*50 + 2*4 + 4 + 50 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 20,316 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2
Assuming naive
Supplier Supply two-pass sort
(File scan) (File scan) algorithm
CSE 544 - Fall 2016 29

(On the fly) (4) π sname
Total cost
= 1 (1)
(On the fly)
+ 4 (2)
(3) σ scity=‘Seattle’ ∧sstate=‘WA’ + 0 (3)
+ 0 (3)
Total cost ≈ 5 I/Os
(2) (Index nested loop)
sno = sno
(Use hash index) 4 tuples

(1) σ pno=2
Supply Supplier
(Hash index on pno ) (Hash index on sno)
30
Assume: clustered Clustering does not matter
Simplifications
•  In the previous examples, we assumed that all index

pages were in memory
•  When this is not the case, we need to add the cost of

fetching index pages from disk
CSE 544 - Fall 2016 31

Different Cost Models
•  In previous examples, we considered IO costs
•  Typically, want IO+CPU
•  For parallel/distributed queries, add network bandwidth
•  If need to compare logical plans

–  Compute the cardinality of each intermediate relation
–  Sum up all the cardinalities
CSE 544 - Fall 2016 32

Summary

CSE 544 - Fall 2016 33
Query Optimization
2.  Search space
CSE 544 - Fall 2016 34

Relational Algebra Laws
•  Selections
–  Commutative: σc1(σc2(R)) same as σc2(σc1(R))
–  Cascading: σc1∧c2(R) same as σc2(σc1(R))
•  Projections
–  Cascading
•  Joins
–  Commutative : R ⋈ S same as S ⋈ R
–  Associative: R ⋈ (S ⋈ T) same as (R ⋈ S) ⋈ T
CSE 544 - Fall 2016 35

Left-Deep Plans and
Bushy Plans
R2
R4
R3 R1 R3 R1 R2 R4
Left-deep plan Bushy plan
CSE 544 - Fall 2016 36

Relational Algebra Laws
•  Selects, projects, and joins

–  We can commute and combine all three types of operators
–  We just have to be careful that the fields we need are available
when we apply the operator
–  Relatively straightforward. See book 15.3.
•  More info in optional paper (by Chaudhuri), Section 4.
CSE 544 - Fall 2016 37

Group-by and Join
R(A, B), S(C,D)
γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) = ?
CSE 544 - Fall 2016 38

Group-by and Join
R(A, B), S(C,D)
γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) =

γA, sum(D)(R(A,B) ⨝ B=C (γC, sum(D)S(C,D)))
These are very powerful laws.

They were introduced only in the 90’s.
CSE 544 - Fall 2016 39

Search Space Challenges
•  Search space is huge!

–  Many possible equivalent trees (logical)
–  Many implementations for each operator (physical)
–  Many access paths for each relation (physical)
•  Cannot consider ALL plans

•  Want a search space that includes low-cost plans
•  Typical compromises:
–  Only left-deep plans
–  Only plans without cartesian products
–  Always push selections down to the leaves
40
Query Optimization
2.  Search space
CSE 544 - Fall 2016 41

Two Types of Optimizers
•  Heuristic-based optimizers:
–  Apply greedily rules that always improve plan
•  Typically: push selections down
–  Very limited: no longer used today
•  Cost-based optimizers:
–  Use a cost model to estimate the cost of each plan
–  Select the “cheapest” plan
–  We focus on cost-based optimizers
CSE 544 - Fall 2016 42

Three Approaches to Search
Space Enumeration
•  Complete plans
•  Bottom-up plans
•  Top-down plans
CSE 544 - Fall 2016 43

Complete Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
⨝
⨝
⨝ Why is this
T search space
σA<40 ⨝ inefficient ?
σA<40 S
R S T
R
CSE 544 - Fall 2016 44
Bottom-up Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
Why is this ⨝
better ?
⨝ ⨝ T
σA<40 ⨝ σA<40 S ⨝ σA<40 S

…..
R S T R R S R
CSE 544 - Fall 2016 45
Top-down Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
⨝ ⨝ σA<40
⨝
T T SELECT R.A, T.D
SELECT * FROM R, S, T
FROM R, S WHERE R.B=S.B
WHERE R.B=S.B
and R.A < 40 SELECT * S
and S.C=T.C …..
FROM R
WHERE R.A < 40
CSE 544 - Fall 2016 46

Two Types of Plan
Enumeration Algorithms
•  Dynamic programming (in class)
–  Based on System R (aka Selinger) style optimizer[1979]
–  Limited to joins: join reordering algorithm
–  Bottom-up
•  Rule-based algorithm (will not discuss)

–  Database of rules (=algebraic laws)
–  Usually: dynamic programming
–  Usually: top-down
CSE 544 - Fall 2016 47

System R Search Space
•  Only left-deep plans

–  Enable dynamic programming for enumeration
–  Facilitate tuple pipelining from outer relation
•  Consider plans with all “interesting orders”
•  Perform cross-products after all other joins (heuristic)
•  Only consider nested loop & sort-merge joins
•  Consider both file scan and indexes
•  Try to evaluate predicates early
CSE 544 - Fall 2016 48

Plan Enumeration Algorithm
•  Idea: use dynamic programming

•  For each subset of {R1, …, Rn}, compute the best plan
for that subset
•  In increasing order of set cardinality:
–  Step 1: for {R1}, {R2}, …, {Rn}
–  Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
–  …
–  Step n: for {R1, …, Rn}
•  It is a bottom-up strategy
•  A subset of {R1, …, Rn} is also called a subquery
CSE 544 - Fall 2016 49

Dynamic Programming Algo.
•  For each subquery Q ⊆{R1, …, Rn} compute the

following:
–  Size(Q)
–  A best plan for Q: Plan(Q)
–  The cost of that plan: Cost(Q)
CSE 544 - Fall 2016 50

•  Step 1: Enumerate all single-relation plans
–  Consider selections on attributes of relation

–  Consider all possible access paths
–  Consider attributes that are not needed
–  Compute cost for each plan
–  Keep cheapest plan per “interesting” output order
CSE 544 - Fall 2016 51

•  Step 2: Generate all two-relation plans
–  For each each single-relation plan from step 1

–  Consider that plan as outer relation
–  Consider every other relation as inner relation
CSE 544 - Fall 2016 52

•  Step 3: Generate all three-relation plans
–  For each each two-relation plan from step 2

–  Consider that plan as outer relation
–  Consider every other relation as inner relation
•  Steps 4 through n: repeat until plan contains all the

relations in the query
CSE 544 - Fall 2016 53

Commercial Query Optimizers
DB2, Informix, Microsoft SQL Server, Oracle 8
•  Inspired by System R
–  Left-deep plans and dynamic programming
–  Cost-based optimization (CPU and IO)
•  Go beyond System R style of optimization

–  Also consider right-deep and bushy plans (e.g., Oracle and DB2)
–  Variety of additional strategies for generating plans (e.g., DB2
and SQL Server)
CSE 544 - Fall 2016 54

Other Query Optimizers
•  Randomized plan generation

–  Genetic algorithm
–  PostgreSQL uses it for queries with many joins
•  Rule-based
–  Extensible collection of rules
–  Rule = Algebraic law with a direction
–  Algorithm for firing these rules
•  Generate many alternative plans, in some order
•  Prune by cost
–  Startburst (later DB2) and Volcano (later SQL Server)
CSE 544 - Fall 2016 55

CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization

Uploaded by

Copyright:

Available Formats

CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization

Uploaded by

Copyright:

Available Formats

CSE 544

• HW2 (SimpleDB) is due next Friday!

• Midterm in two weeks, Nov. 10, in class

• Project Milestone due on Nov. 16

CSE 544 - Fall 2016 2

Access path selection in a relational database management

• Database management systems.

There exist many logical query plan…

CSE 544 - Fall 2016 5

σ sscity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2

σ sscity=‘Seattle’ ∧sstate=‘WA’ σ pno=2

• For each logical plan…

• There exist many physical plans

CSE 544 - Fall 2016 8

(On the fly)

(On the fly)

(Index nested loop)

Three major components:

1. Cardinality and cost estimation

2. Search space

3. Plan enumeration algorithms

CSE 544 - Fall 2016 11

• We already know how to

Access path: a way to retrieve tuples from a table

• An index plus a matching selection condition

CSE 544 - Fall 2016 13

• Selection condition: sid > 300 ∧ scity=‘Seattle’

• Indexes: B+-tree on sid and B+-tree on scity

• Which access path should we use?

• We should pick the most selective access path

CSE 544 - Fall 2016 14

• Access path selectivity is the number of pages

• As we saw earlier, for equality predicates

CSE 544 - Fall 2016 15

Selection on range: σa>v(R)

• How to compute the selectivity?

• Clustered index on a: cost B(R)*X

CSE 544 - Fall 2016 16

• Selection condition: sid > 300 ∧ scity=‘Seattle’

• Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70

CSE 544 - Fall 2016 17

How to compute the selectivity?

• Clustered index on <a,b>: cost B(R)*X

CSE 544 - Fall 2016 18

• We already know how to

• Collected information for each relation

• Collection approach: periodic, using sampling

CSE 544 - Fall 2016 20

Projection: output size same as input size

Selection: multiply input size by reduction factor

CSE 544 - Fall 2016 21

• Take product of cardinalities of relations R and S

CSE 544 - Fall 2016 22

• Containment of values: if V(R,A) <= V(S,B), then

• Preservation of values: for any other attribute C,

CSE 544 - Fall 2016 23

Assume V(R,A) <= V(S,B)

• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

•  HW2 (SimpleDB) is due next Friday!

•  Midterm in two weeks, Nov. 10, in class

•  Project Milestone due on Nov. 16

•  Database management systems.

•  For each logical plan…

•  There exist many physical plans

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

•  We already know how to

•  An index plus a matching selection condition

•  Selection condition: sid > 300 ∧ scity=‘Seattle’

•  Indexes: B+-tree on sid and B+-tree on scity

•  Which access path should we use?

•  We should pick the most selective access path

•  Access path selectivity is the number of pages

•  As we saw earlier, for equality predicates

•  How to compute the selectivity?

•  Clustered index on a: cost B(R)*X

•  Selection condition: sid > 300 ∧ scity=‘Seattle’

•  Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70

•  Clustered index on <a,b>: cost B(R)*X

•  We already know how to

•  Collected information for each relation

•  Collection approach: periodic, using sampling

•  Take product of cardinalities of relations R and S

•  Containment of values: if V(R,A) <= V(S,B), then

•  Preservation of values: for any other attribute C,

•  Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

•  Estimate cardinality in a bottom-up fashion

•  Estimate cost by using the estimated cardinalities

•  In the previous examples, we assumed that all index

•  When this is not the case, we need to add the cost of

•  In previous examples, we considered IO costs

•  Typically, want IO+CPU

•  For parallel/distributed queries, add network bandwidth

•  If need to compare logical plans

•  We already know how to

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

•  Selects, projects, and joins

•  More info in optional paper (by Chaudhuri), Section 4.

•  Search space is huge!

•  Cannot consider ALL plans

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

•  Rule-based algorithm (will not discuss)

•  Only left-deep plans

•  Idea: use dynamic programming

•  For each subquery Q ⊆{R1, …, Rn} compute the

•  Step 1: Enumerate all single-relation plans

–  Consider selections on attributes of relation

–  Compute cost for each plan

–  Keep cheapest plan per “interesting” output order

•  Step 2: Generate all two-relation plans

–  For each each single-relation plan from step 1

–  Compute cost for each plan

–  Keep cheapest plan per “interesting” output order

•  Step 3: Generate all three-relation plans

–  For each each two-relation plan from step 2

•  Steps 4 through n: repeat until plan contains all the