DB2_20240617 solution
2h
S. Comai, P. Fraternali, D. Martinenghi
A. Concurrency control (10 points, up to 2 per class)
Classify the following schedule with respect to VSR, CSR, 2PL, Strict 2PL, and TS Multi (with the
conventions adopted for TS Multi under Snapshot Isolation, used for the exercises). Motivate all your
answers.
If it is VSR, provide all the possible serializations. If it is not 2PL/Strict 2PL, explain which lock/unlock
requests cause this.
r1(Z) r2(X) w2(Y) w3(Y) w4(Z) w2(Z) w5(X) r5(Y) w4(X) r6(Y) w6(X)
Solution.
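The CSR part of the classification can be checked mechanically by building the conflict graph of the schedule and looking for a cycle. The following is a minimal sketch under that standard definition; the helper names (`parse`, `conflict_edges`, `has_cycle`) are illustrative and not part of the exam text, and the sketch says nothing about VSR, 2PL, Strict 2PL, or TS Multi.

```python
import re
from collections import defaultdict

SCHEDULE = "r1(Z) r2(X) w2(Y) w3(Y) w4(Z) w2(Z) w5(X) r5(Y) w4(X) r6(Y) w6(X)"

def parse(schedule):
    """Return the schedule as a list of (action, transaction, resource) triples."""
    return [(a, int(t), r) for a, t, r in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def conflict_edges(ops):
    """Edge Ti -> Tj for every pair of conflicting operations (same resource,
    different transactions, at least one write) where Ti's operation comes first."""
    edges = set()
    for i, (a1, t1, r1) in enumerate(ops):
        for a2, t2, r2 in ops[i + 1:]:
            if r1 == r2 and t1 != t2 and "w" in (a1, a2):
                edges.add((t1, t2))
    return edges

def has_cycle(edges):
    """Depth-first search with three colors; a grey successor means a cycle."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(graph))

edges = conflict_edges(parse(SCHEDULE))
is_csr = not has_cycle(edges)
print(sorted(edges), "CSR" if is_csr else "not CSR")
```

The graph contains both 2 → 4 (r2(X) precedes w4(X)) and 4 → 2 (w4(Z) precedes w2(Z)), so it is cyclic and the schedule is not CSR.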
A group of people from Milan and Como are looking for a meeting point allowing them to easily travel
back and forth. They want to minimize the mean travel time m to both cities and to have balanced times.
Thus, they devise the following scoring function for any place with a travel time x to Milan and y to Como:
\[ f(x, y) = m + s, \quad \text{with } m = \frac{x + y}{2} \text{ and } s = \sqrt{(x - m)^2 + (y - m)^2}, \]
where m is the mean time and s is akin to the standard deviation of the population {x, y} (geometrically, s
is the distance of ⟨x, y⟩ to the bisector of the first quadrant of the Cartesian plane, where x = y, i.e., times
are balanced). The group is only given three options, A, B, and C, vertically distributed over two rankings
as follows (for your calculations, consider that f (20, 100) ≈ 116.57):
Ranking by x      Ranking by y
B: 20             A: 20
A: 100            B: 100
C: 110            C: 110
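As a quick check of the scoring function, the sketch below evaluates f on the three options. It assumes the coordinates read off the two rankings, namely A = ⟨100, 20⟩, B = ⟨20, 100⟩, and C = ⟨110, 110⟩; the names `f`, `POINTS`, and `scores` are illustrative.

```python
from math import hypot

def f(x, y):
    """Score = mean travel time m plus the spread s of {x, y} around m."""
    m = (x + y) / 2
    s = hypot(x - m, y - m)  # sqrt((x - m)^2 + (y - m)^2)
    return m + s

# Assumed coordinates (x = time to Milan, y = time to Como), read from the rankings.
POINTS = {"A": (100, 20), "B": (20, 100), "C": (110, 110)}

scores = {name: round(f(x, y), 2) for name, (x, y) in POINTS.items()}
print(scores)
```

A and B score about 116.57, matching the hint in the text, while C scores exactly 110 because it lies on the bisector (s = 0).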
1. What is the skyline of this very small dataset? Do not show any step or calculation, just the result.
2. Compute the top-1 meeting point with TA according to f (x, y). Show depth and accesses.
3. The result found by TA is not a skyline point. Explain how this is possible.
4. Compute the top-1 meeting point with FA according to f (x, y). Show depth and accesses.
5. The result found by FA is wrong. Explain how this is possible.
6. (Bonus for the more mathematically inclined) Can you draw the shape of the iso-score curves of f ?
Solution.
1. The skyline is {A, B}.
2. A and B are discovered during round 1, with the same score ≈ 116.57 (they are symmetric and so is f).
The threshold is T = 20 at round 1 and T = 100 at round 2 (the threshold point lies on the bisector, so
its distance to the bisector is 0). At round 3, C is found, with f(C) = 110 (C is on the bisector), and T = 110.
Since f(C) ≤ T, we stop and output C. Depth = 3, with 6 sorted accesses and 3 random accesses (or 2,
depending on the implementation).
3. The skyline is the set of all optimal points according to some monotone function, but f is not monotone.
4. FA stops at depth 2 (with 4 sorted accesses), discovering only A and B and makes no random access
in this case. A and B are tied at 116.57; either is returned.
5. Again, f is not a monotone scoring function and FA’s correctness requires a monotone function.
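6. A few iso-score curves are shown in a figure (not reproduced here; axes x and y).
Indeed, the locus of points having a score c is characterized as follows:
\[ f(x, y) = c \iff \text{(by substitution)} \quad \frac{x + y}{2} + \sqrt{\left(\frac{x - y}{2}\right)^2 + \left(\frac{y - x}{2}\right)^2} = c \iff x + y + \sqrt{2}\,|x - y| = 2c \]
When x > y this reduces to the straight line
\[ y = x\,\frac{\sqrt{2} + 1}{\sqrt{2} - 1} - \frac{2c}{\sqrt{2} - 1}, \]
and for x < y to
\[ y = x\,\frac{\sqrt{2} - 1}{\sqrt{2} + 1} + \frac{2c}{\sqrt{2} + 1} \]
(and when x = y then x = y = c).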
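The skyline, TA, and FA answers above can be replayed with a small simulation. This is a sketch under stated assumptions: lower scores are better, the coordinates are those read off the two rankings (A = ⟨100, 20⟩, B = ⟨20, 100⟩, C = ⟨110, 110⟩), and only sorted accesses are counted, since the number of random accesses depends on the implementation. All function names are illustrative.

```python
from math import hypot

def f(x, y):
    m = (x + y) / 2
    return m + hypot(x - m, y - m)

# Assumed coordinates (x = time to Milan, y = time to Como).
POINTS = {"A": (100, 20), "B": (20, 100), "C": (110, 110)}
# One ranking per attribute, by ascending travel time: [B, A, C] and [A, B, C].
RANKINGS = [sorted(POINTS, key=lambda p: POINTS[p][i]) for i in range(2)]

def skyline(points):
    """Points not dominated by any other point (smaller is better)."""
    def dominates(p, q):
        return p != q and all(a <= b for a, b in zip(p, q))
    return sorted(n for n, q in points.items()
                  if not any(dominates(p, q) for p in points.values()))

def ta_top1(points, rankings, score):
    """Threshold Algorithm for the minimum score; returns (best, depth, sorted accesses)."""
    seen, sorted_accesses = {}, 0
    for depth in range(1, len(points) + 1):
        row = [r[depth - 1] for r in rankings]
        sorted_accesses += len(row)
        for name in row:
            seen.setdefault(name, score(*points[name]))  # score via random access
        # Threshold: score of the last value seen under sorted access in each ranking.
        threshold = score(*(points[name][i] for i, name in enumerate(row)))
        best = min(seen, key=seen.get)
        if seen[best] <= threshold:
            break
    return best, depth, sorted_accesses

def fa_top1(points, rankings, score):
    """Fagin's Algorithm: stop once one object has been seen in every ranking."""
    seen_in = {name: set() for name in points}
    sorted_accesses = 0
    for depth in range(1, len(points) + 1):
        for i, r in enumerate(rankings):
            seen_in[r[depth - 1]].add(i)
            sorted_accesses += 1
        if any(len(s) == len(rankings) for s in seen_in.values()):
            break
    candidates = [n for n, s in seen_in.items() if s]
    best = min(candidates, key=lambda n: score(*points[n]))
    return best, depth, sorted_accesses

print(skyline(POINTS))
print(ta_top1(POINTS, RANKINGS, f))
print(fa_top1(POINTS, RANKINGS, f))
```

Under these assumptions TA stops at depth 3 with C, whereas FA stops at depth 2 having seen only A and B, which is the behavior discussed in points 2 to 5.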
C. Physical databases (10 points)
Table Customer(CUSTID, LastName, FirstName, Country) contains 800K tuples (primary keys are in capital
letters) and Purchase(PRODUCTID, CUSTID, DATE, Qty) contains 8M tuples. For each of the scenarios
described below, describe an efficient plan for the following query and estimate its execution cost
(write the complete formula for the query), explaining all the steps of the plan and their costs.
The evaluation of the exercise also considers the plan’s degree of efficiency.
SELECT *
FROM Customer C join Purchase P ON C.CUSTID=P.CUSTID
WHERE Country = 'Italy' OR Country = 'Spain'
1. Table Purchase is entry-sequenced and occupies 40K blocks. Table Customer is stored in a hash table
with a hash function defined on the key. It occupies 8K blocks and has a negligible overflow chain.
Val(Country)=160.
2. Like in point 1 plus: Table Purchase has a secondary hash index built on attribute CustID with the
same hash function of Customer but with a cost of access due to overflow of 1.3. With the overflow
blocks, it occupies 10.5K blocks.
3. Like in point 2 plus: Table Customer has a B+ index (3 levels, 1K leaf nodes) with search key Country.
Solution.
1. Plan 1 (first scan Customer): Scan (sequentially) table Customer. For each tuple satisfying
Country=“Italy” or Country=“Spain”, scan table Purchase and evaluate the join condition.
Plan1 = 8K + 2*(800K/160) * 40K = 8K + 10K * 40K ≈ 400M I/O accesses
Plan 2 (first scan Purchase): Scan table Purchase; for each purchase, access its customer through the
hash table (1 access, since overflow is negligible) and check if Country=“Italy” or Country=“Spain”.
Plan2 = 40K + 8M * 1 = 8.04M I/O accesses (best plan)
Plan 5 (Hash join + retrieve purchases): compute the hash join; for each Italian/Spanish customer,
retrieve their purchases.
Plan5 = 8K + 10.5K + 2*(800K/160) * (8M/800K) = 8K + 10.5K + 10K * 10 = 118.5K I/O accesses
(best plan)
3. Plan 6 (extend Plan 4): Find Italy and Spain in the B+ index, follow the pointers to retrieve Italian
and Spanish customers, then look up these customers in the hash index and retrieve their purchases.
Plan6 = 2 * (2 + (5K/800) + 5K) + 10K * (1.3 + 10) ≈ 123K I/O accesses (it does not improve on Plan 4!)
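The arithmetic of the cost formulas above can be replayed as follows. This is a sketch that only re-evaluates the formulas as written for Plans 1, 2, 5, and 6; the variable names are illustrative, and the plans not shown in the text are not reconstructed.

```python
K, M = 1_000, 1_000_000

# Parameters stated in the exercise.
customer_tuples, customer_blocks = 800 * K, 8 * K
purchase_tuples, purchase_blocks = 8 * M, 40 * K
val_country = 160
index_blocks, overflow_cost = 10.5 * K, 1.3
leaf_nodes = 1 * K

# Derived quantities used in the formulas.
per_country = customer_tuples / val_country                  # 5K customers per country
selected = 2 * per_country                                   # 10K Italian or Spanish customers
purchases_per_customer = purchase_tuples / customer_tuples   # 10 purchases per customer
entries_per_leaf = customer_tuples / leaf_nodes              # 800 index entries per leaf

plan1 = customer_blocks + selected * purchase_blocks
plan2 = purchase_blocks + purchase_tuples * 1
plan5 = customer_blocks + index_blocks + selected * purchases_per_customer
plan6 = (2 * (2 + per_country / entries_per_leaf + per_country)
         + selected * (overflow_cost + purchases_per_customer))

for name, cost in [("Plan1", plan1), ("Plan2", plan2), ("Plan5", plan5), ("Plan6", plan6)]:
    print(f"{name}: {cost:,.1f} I/O accesses")
```

The exact values are 400,008,000 for Plan 1, 8,040,000 for Plan 2, 118,500 for Plan 5, and about 123,016.5 for Plan 6, in line with the rounded figures in the solution.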