CSE 544 Principles of Database Management Systems: Lecture 8 - Query Optimization

Download as pdf or txt
Download as pdf or txt
You are on page 1of 55

CSE 544

Principles of Database
Management Systems

Fall 2016
Lecture 8 - Query optimization
Announcements

•  HW2 (SimpleDB) is due next Friday!

•  Midterm in two weeks, Nov. 10, in class

•  Project Milestone due on Nov. 16

CSE 544 - Fall 2016 2


Recommended Readings

Access path selection in a relational database management


system.
Selinger. et. al. SIGMOD 1979

Additional resources:
•  Chaudhuri, "An Overview of Query Optimization in
Relational Systems," Proceedings of ACM PODS, 1998

•  Database management systems.


Ramakrishnan and Gehrke.
Third Ed. Chapter 15.
CSE 544 - Fall 2016 3
Query Optimization Motivation
SQL query
Declarative query
Recall physical and
Parse & Rewrite Query logical data independence

Logical
Query Select Logical Plan
plan
optimization
Select Physical Plan
Physical
plan
Query Execution

Disk 4
What We Already Know…
Supplier(sno,sname,scity,sstate)
Part(pno,pname,psize,pcolor)
Supply(sno,pno,price)
For each SQL query….
SELECT S.sname
FROM Supplier S, Supply U
WHERE S.scity='Seattle' AND S.sstate='WA’
AND S.sno = U.sno
AND U.pno = 2

There exist many logical query plan…

CSE 544 - Fall 2016 5


Example Query: Logical Plan 1

π sname

σ sscity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2

sno = sno

Supplier Supply
CSE 544 - Fall 2016 6
Example Query: Logical Plan 2

π sname

sno = sno

σ sscity=‘Seattle’ ∧sstate=‘WA’ σ pno=2

Supplier Supply
CSE 544 - Fall 2016 7
What We Also Know

•  For each logical plan…

•  There exist many physical plans

CSE 544 - Fall 2016 8


Example Query: Physical Plan 1
(On the fly) π sname

(On the fly)


σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2

(Nested loop)
sno = sno

Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 9
Example Query: Physical Plan 2
(On the fly) π sname

(On the fly)


σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2

(Index nested loop)


sno = sno

Supplier Supply
(File scan) (Index scan)
CSE 544 - Fall 2016 10
Query Optimization

Three major components:

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

CSE 544 - Fall 2016 11


Estimating Cost of a Query Plan
Goal: compute the cost of an entire physical query plan

•  We already know how to


–  Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s

•  We still need to do
–  Access path selection: compute cost of retrieving tuples from disk
with different access paths
–  Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 12
Access Path

Access path: a way to retrieve tuples from a table

•  A file scan

•  An index plus a matching selection condition

CSE 544 - Fall 2016 13


Access Path Selection

•  Supplier(sid,sname,scity,sstate)

•  Selection condition: sid > 300 ∧ scity=‘Seattle’

•  Indexes: B+-tree on sid and B+-tree on scity

•  Which access path should we use?

•  We should pick the most selective access path

CSE 544 - Fall 2016 14


Access Path Selectivity

•  Access path selectivity is the number of pages


retrieved if we use this access path
–  Most selective retrieves fewest pages

•  As we saw earlier, for equality predicates


–  Selection on equality: σa=v(R)
–  V(R, a) = # of distinct values of attribute a
–  1/V(R,a) is thus the reduction factor
–  Clustered index on a: cost B(R)/V(R,a)
–  Unclustered index on a: cost T(R)/V(R,a)
–  (we are ignoring I/O cost of index pages for simplicity)

CSE 544 - Fall 2016 15


Selectivity for Range Predicates

Selection on range: σa>v(R)

•  How to compute the selectivity?


•  Assume values are uniformly distributed
•  Reduction factor X
•  X = (Max(R,a) - v) / (Max(R,a) - Min(R,a))

•  Clustered index on a: cost B(R)*X


•  Unclustered index on a: cost T(R)*X

CSE 544 - Fall 2016 16


Back to Our Example

•  Selection condition: sid > 300 ∧ scity=‘Seattle’


–  Index I1: B+-tree on sid clustered
–  Index I2: B+-tree on scity unclustered

•  Let’s assume
–  V(Supplier,scity) = 20
–  Max(Supplier, sid) = 1000, Min(Supplier,sid)=1
–  B(Supplier) = 100, T(Supplier) = 1000

•  Cost I1: B(R) * (Max-v)/(Max-Min) = 100*700/999 ≈ 70


•  Cost I2: T(R) * 1/V(Supplier,scity) = 1000/20 = 50

CSE 544 - Fall 2016 17


Selectivity with
Multiple Conditions
What if we have an index on multiple attributes?
•  Example selection σa=v1 ∧ b= v2(R) and index on <a,b>

How to compute the selectivity?


•  Assume attributes are independent
•  X = 1 / (V(R,a) * V(R,b))

•  Clustered index on <a,b>: cost B(R)*X


•  Unclustered index on <a,b>: cost T(R)*X

CSE 544 - Fall 2016 18


Estimating Cost of a Query Plan
Goal: compute the cost of an entire physical query plan

•  We already know how to


–  Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s

•  We still need to do
–  Access path selection: compute cost of retrieving tuples from disk
with different access paths
–  Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 19
Statistics on Base Data

•  Collected information for each relation


–  Number of tuples (cardinality) T(R)
–  Number of physical pages B(R), clustering info
–  Indexes, number of keys in the index V(R,a)
–  Statistical information on attributes
•  Min value, max value, number distinct values
•  Histograms
–  Correlations between columns (hard)

•  Collection approach: periodic, using sampling

CSE 544 - Fall 2016 20


Size Estimation

Projection: output size same as input size

Selection: multiply input size by reduction factor


•  Similar to what we did for estimating access path
selectivity
•  Assume independence between conditions in the
predicate
•  Examples:
T(σA=...(R)) = T(R) / V(R,A)
T(σA=...∧ B=... (R)) = T(R) / (V(R,A) * V(R,B))

CSE 544 - Fall 2016 21


Estimating Result Sizes

Join R ⋈ S

•  Take product of cardinalities of relations R and S


•  Apply reduction factors for each term in join condition
•  Terms are of the form: column1 = column2
•  Reduction: 1/ ( MAX( V(R,column1), V(S,column2))
•  Why? Will explain next...

CSE 544 - Fall 2016 22


Assumptions

•  Containment of values: if V(R,A) <= V(S,B), then


the set of A values of R is included in the set of B
values of S
–  Note: this indeed holds when A is a foreign key in R,
and B is a key in S

•  Preservation of values: for any other attribute C,


V(R ⨝A=B S, C) = V(R, C) (or V(S, C))

CSE 544 - Fall 2016 23


Selectivity of R ⨝A=B S

Assume V(R,A) <= V(S,B)


•  Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S

•  Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))

CSE 544 - Fall 2016 24


Complete Example

Supplier(sid, sname, scity, sstate) SELECT sname


Supply(sid, pno, quantity) FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
•  Some statistics
and x.scity = ‘Seattle’
–  T(Supplier) = 1000 records and x.sstate = ‘WA’
–  T(Supply) = 10,000 records
–  B(Supplier) = 100 pages
–  B(Supply) = 100 pages
–  V(Supplier,scity) = 20, V(Suppliers,state) = 10
–  V(Supply,pno) = 2,500
–  Both relations are clustered
•  M = 11
CSE 544 - Fall 2016 25
Computing the Cost of a Plan

•  Estimate cardinality in a bottom-up fashion


–  Cardinality is the size of a relation (nb of tuples)
–  Compute size of all intermediate relations in plan

•  Estimate cost by using the estimated cardinalities

CSE 544 - Fall 2016 26


T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 1


(On the fly) π sname Selection and project on-the-fly
-> No additional cost.
(On the fly)
σ scity=‘Seattle’ ∧sstate=‘WA’ ∧ pno=2
Total cost of plan is thus cost of join:
= B(Supplier)+B(Supplier)*B(Supplies)
(Nested loop) = 100 + 100 * 100
sno = sno
= 10,100 I/Os

Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 27
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 2


Total cost
(On the fly) π sname (4) = 100 + 100 * 1/20 * 1/10 (1)
+ 100 + 100 * 1/2500 (2)
+ 2 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 204 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2

Supplier Supply
(File scan) (File scan)
CSE 544 - Fall 2016 28
Plan 2 with Different Numbers
Total cost
What if we had: π sname (4) = 10000 + 50 (1)
10K pages of Suppliers
+ 10000 + 4 (2)
10K pages of Supplies
+ 4*50 + 2*4 + 4 + 50 (3)
(Sort-merge join) (3) + 0 (4)
sno = sno Total cost ≈ 20,316 I/Os
(Scan
write to T1) (Scan
write to T2)
(1) σ scity=‘Seattle’ ∧sstate=‘WA’ (2) σ pno=2

Assuming naive
Supplier Supply two-pass sort
(File scan) (File scan) algorithm
CSE 544 - Fall 2016 29
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 11
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 3


(On the fly) (4) π sname
Total cost
= 1 (1)
(On the fly)
+ 4 (2)
(3) σ scity=‘Seattle’ ∧sstate=‘WA’ + 0 (3)
+ 0 (3)
Total cost ≈ 5 I/Os
(2) (Index nested loop)
sno = sno

(Use hash index) 4 tuples


(1) σ pno=2

Supply Supplier
(Hash index on pno ) (Hash index on sno)
30
Assume: clustered Clustering does not matter
Simplifications

•  In the previous examples, we assumed that all index


pages were in memory

•  When this is not the case, we need to add the cost of


fetching index pages from disk

CSE 544 - Fall 2016 31


Different Cost Models

•  In previous examples, we considered IO costs

•  Typically, want IO+CPU

•  For parallel/distributed queries, add network bandwidth

•  If need to compare logical plans


–  Compute the cardinality of each intermediate relation
–  Sum up all the cardinalities

CSE 544 - Fall 2016 32


Summary
Goal: compute the cost of an entire physical query plan

•  We already know how to


–  Compute the cost of different operations in terms of number Ios,
given the T(R)’s and the B(R)’s

•  We still need to do
–  Access path selection: compute cost of retrieving tuples from disk
with different access paths
–  Size estimation: compute the T(R)’s and the B(R)’s for
intermediate relations R
CSE 544 - Fall 2016 33
Query Optimization

Three major components:

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

CSE 544 - Fall 2016 34


Relational Algebra Laws

•  Selections
–  Commutative: σc1(σc2(R)) same as σc2(σc1(R))
–  Cascading: σc1∧c2(R) same as σc2(σc1(R))

•  Projections
–  Cascading

•  Joins
–  Commutative : R ⋈ S same as S ⋈ R
–  Associative: R ⋈ (S ⋈ T) same as (R ⋈ S) ⋈ T

CSE 544 - Fall 2016 35


Left-Deep Plans and
Bushy Plans

R2

R4

R3 R1 R3 R1 R2 R4
Left-deep plan Bushy plan

CSE 544 - Fall 2016 36


Relational Algebra Laws

•  Selects, projects, and joins


–  We can commute and combine all three types of operators
–  We just have to be careful that the fields we need are available
when we apply the operator
–  Relatively straightforward. See book 15.3.

•  More info in optional paper (by Chaudhuri), Section 4.

CSE 544 - Fall 2016 37


Group-by and Join

R(A, B), S(C,D)

γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) = ?

CSE 544 - Fall 2016 38


Group-by and Join

R(A, B), S(C,D)

γA, sum(D)(R(A,B) ⨝ B=C S(C,D)) =


γA, sum(D)(R(A,B) ⨝ B=C (γC, sum(D)S(C,D)))

These are very powerful laws.


They were introduced only in the 90’s.

CSE 544 - Fall 2016 39


Search Space Challenges

•  Search space is huge!


–  Many possible equivalent trees (logical)
–  Many implementations for each operator (physical)
–  Many access paths for each relation (physical)

•  Cannot consider ALL plans


•  Want a search space that includes low-cost plans

•  Typical compromises:
–  Only left-deep plans
–  Only plans without cartesian products
–  Always push selections down to the leaves
40
Query Optimization

Three major components:

1.  Cardinality and cost estimation

2.  Search space

3.  Plan enumeration algorithms

CSE 544 - Fall 2016 41


Two Types of Optimizers

•  Heuristic-based optimizers:
–  Apply greedily rules that always improve plan
•  Typically: push selections down
–  Very limited: no longer used today

•  Cost-based optimizers:
–  Use a cost model to estimate the cost of each plan
–  Select the “cheapest” plan
–  We focus on cost-based optimizers

CSE 544 - Fall 2016 42


Three Approaches to Search
Space Enumeration
•  Complete plans

•  Bottom-up plans

•  Top-down plans

CSE 544 - Fall 2016 43


Complete Plans

R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40



⨝ Why is this
T search space
σA<40 ⨝ inefficient ?
σA<40 S
R S T
R
CSE 544 - Fall 2016 44
Bottom-up Partial Plans

R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40

Why is this ⨝
better ?
⨝ ⨝ T

σA<40 ⨝ σA<40 S ⨝ σA<40 S


…..
R S T R R S R
CSE 544 - Fall 2016 45
Top-down Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40

⨝ ⨝ σA<40


T T SELECT R.A, T.D
SELECT * FROM R, S, T
FROM R, S WHERE R.B=S.B
WHERE R.B=S.B
and R.A < 40 SELECT * S
and S.C=T.C …..
FROM R
WHERE R.A < 40

CSE 544 - Fall 2016 46


Two Types of Plan
Enumeration Algorithms
•  Dynamic programming (in class)
–  Based on System R (aka Selinger) style optimizer[1979]
–  Limited to joins: join reordering algorithm
–  Bottom-up

•  Rule-based algorithm (will not discuss)


–  Database of rules (=algebraic laws)
–  Usually: dynamic programming
–  Usually: top-down

CSE 544 - Fall 2016 47


System R Search Space

•  Only left-deep plans


–  Enable dynamic programming for enumeration
–  Facilitate tuple pipelining from outer relation
•  Consider plans with all “interesting orders”
•  Perform cross-products after all other joins (heuristic)
•  Only consider nested loop & sort-merge joins
•  Consider both file scan and indexes
•  Try to evaluate predicates early

CSE 544 - Fall 2016 48


Plan Enumeration Algorithm

•  Idea: use dynamic programming


•  For each subset of {R1, …, Rn}, compute the best plan
for that subset
•  In increasing order of set cardinality:
–  Step 1: for {R1}, {R2}, …, {Rn}
–  Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
–  …
–  Step n: for {R1, …, Rn}
•  It is a bottom-up strategy
•  A subset of {R1, …, Rn} is also called a subquery

CSE 544 - Fall 2016 49


Dynamic Programming Algo.

•  For each subquery Q ⊆{R1, …, Rn} compute the


following:
–  Size(Q)
–  A best plan for Q: Plan(Q)
–  The cost of that plan: Cost(Q)

CSE 544 - Fall 2016 50


Dynamic Programming Algo.

•  Step 1: Enumerate all single-relation plans

–  Consider selections on attributes of relation


–  Consider all possible access paths
–  Consider attributes that are not needed

–  Compute cost for each plan

–  Keep cheapest plan per “interesting” output order

CSE 544 - Fall 2016 51


Dynamic Programming Algo.

•  Step 2: Generate all two-relation plans

–  For each each single-relation plan from step 1


–  Consider that plan as outer relation
–  Consider every other relation as inner relation

–  Compute cost for each plan

–  Keep cheapest plan per “interesting” output order

CSE 544 - Fall 2016 52


Dynamic Programming Algo.

•  Step 3: Generate all three-relation plans

–  For each each two-relation plan from step 2


–  Consider that plan as outer relation
–  Consider every other relation as inner relation
–  Compute cost for each plan
–  Keep cheapest plan per “interesting” output order

•  Steps 4 through n: repeat until plan contains all the


relations in the query

CSE 544 - Fall 2016 53


Commercial Query Optimizers

DB2, Informix, Microsoft SQL Server, Oracle 8

•  Inspired by System R
–  Left-deep plans and dynamic programming
–  Cost-based optimization (CPU and IO)

•  Go beyond System R style of optimization


–  Also consider right-deep and bushy plans (e.g., Oracle and DB2)
–  Variety of additional strategies for generating plans (e.g., DB2
and SQL Server)

CSE 544 - Fall 2016 54


Other Query Optimizers

•  Randomized plan generation


–  Genetic algorithm
–  PostgreSQL uses it for queries with many joins

•  Rule-based
–  Extensible collection of rules
–  Rule = Algebraic law with a direction
–  Algorithm for firing these rules
•  Generate many alternative plans, in some order
•  Prune by cost
–  Startburst (later DB2) and Volcano (later SQL Server)

CSE 544 - Fall 2016 55

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy