Cs411fa09 Hw4 Sol
Fall 2009
HW#4
Due: 3:15pm CST, November 17, 2009
Note: Print your name and NetID in the upper right corner of every page of your submission.
Hand in your stapled homework to Donna Coleman in 2106 SC. If Donna is not in her office,
slide your homework under the door.
To speed up grading, the homework is partitioned into two parts. Please submit
each part separately. For each part, make sure to write down your name and NetID.
Handwritten submissions will be graded, but they will take longer to grade. For clarity,
machine-formatted text is preferable: expect to lose points if your handwritten answer is
unclear or misread by the grader.
This homework is partitioned into two parts as follows:
Using the equation given in Section 15.3.4 of the textbook, solve for M:
I/O = B(S) + B(S)B(R)/(M − 1)
(a) 100,000
100,000 = 10,000 + (10,000 × 10,000)/(M − 1)
M = 1,112.1 or ceil(M) = 1,113
(b) 25,000
25,000 = 10,000 + (10,000 × 10,000)/(M − 1)
M = 6,667.7 or ceil(M) = 6,668
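The two answers above can be checked by rearranging the formula to M = 1 + B(S)B(R)/(I/O − B(S)). The following is a minimal Python sketch of that rearrangement; the function name `min_memory_for_nested_loop` is an illustrative choice, not something from the textbook.

```python
import math

def min_memory_for_nested_loop(io_budget, b_s, b_r):
    """Solve io_budget = B(S) + B(S)*B(R)/(M - 1) for M.

    Returns the exact (fractional) M and its ceiling.
    """
    if io_budget <= b_s:
        # Reading S once already costs B(S), so no M can meet the budget.
        raise ValueError("I/O budget must exceed B(S)")
    exact_m = 1 + (b_s * b_r) / (io_budget - b_s)
    return exact_m, math.ceil(exact_m)

if __name__ == "__main__":
    for budget in (100_000, 25_000):
        exact_m, blocks = min_memory_for_nested_loop(budget, 10_000, 10_000)
        print(f"I/O = {budget:,}: M = {exact_m:.1f}, ceil(M) = {blocks:,}")
```

Rounding up is needed because M counts whole memory blocks.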
2. If two relations R and S are both unclustered, it seems that the nested-loop join algorithm
requires about T(R)T(S)/M disk I/Os. How can you do significantly better than this cost?
Describe your modified version of the nested-loop algorithm and give the number of disk I/Os
required for your algorithm. We assume that M is large enough such that M − 1 ≈ M, and
that B(R) ≪ T(R) and B(S) ≪ T(S); that is, the number of tuples of a relation is much
greater than the number of blocks of the relation. (8 points)
Note that the cost of the algorithm given in the question is T(R)T(S)/M, which means it is
using tuple-based nested-loop join. To improve the disk I/O cost of the nested-loop join
algorithm, we need to use block-based nested-loop join. To carry out block-based
nested-loop join efficiently, we need the inner relation to be clustered, and a search structure built
on the common attributes of R and S.
Let R be the inner relation (assuming S is smaller):
• Cost of reading all tuples of R, clustering them, and writing them back: T(R) + B(R)
• Cost of reading the tuples of S, plus the cost of joining them with R in main memory:
T(S) + B(S)B(R)/M
Therefore the total cost is T(R) + B(R) + T(S) + B(S)B(R)/M.
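To get a feel for how much the modified algorithm saves, the two cost formulas can be compared numerically. This is only a sketch: the function names are illustrative, and the sample statistics (T = 500,000, B = 10,000, M = 1,000) are hypothetical values chosen so that B ≪ T, not figures from the question.

```python
def tuple_based_cost(t_r, t_s, m):
    """Cost quoted in the question: about T(R)T(S)/M disk I/Os."""
    return t_r * t_s / m

def clustered_block_based_cost(t_r, b_r, t_s, b_s, m):
    """Cost of the modified algorithm: T(R) + B(R) + T(S) + B(S)B(R)/M."""
    cluster_r = t_r + b_r        # read R tuple by tuple, write it back clustered
    read_s = t_s                 # S stays unclustered, so one I/O per tuple
    join = b_s * b_r / m         # block-based join against the clustered R
    return cluster_r + read_s + join

if __name__ == "__main__":
    # Hypothetical statistics, for illustration only.
    t_r = t_s = 500_000
    b_r = b_s = 10_000
    m = 1_000
    print("tuple-based:  ", tuple_based_cost(t_r, t_s, m))
    print("block-based:  ", clustered_block_based_cost(t_r, b_r, t_s, b_s, m))
```

Under these assumed numbers, the one-time cost of clustering R is small compared with the T(R)T(S)/M term it replaces.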
Problem 2 Two-pass Algorithms Based on Sorting (20 points)
1. Suppose we have a relation with 1,000,000 records and each record requires 10 bytes.
Let the disk-block size be 4,096 bytes. (8 points, 4 points each)
(a) What is the minimum number of blocks in main memory required for using TPMMS
(Two-Phase Multiway Merge-Sort) to sort these records?
The size of the relation in bytes is 1,000,000 × 10 = 10,000,000 bytes, and each
disk block is 4,096 bytes. The minimum number of blocks to hold the relation is
ceil(10,000,000/4,096) = 2,442. TPMMS requires B ≤ M², so M must
be at least ceil(√2,442) = ceil(49.4) = 50.
(b) Following (a), how many disk I/Os are needed to sort all the records?
The number of disk I/Os for TPMMS is 3B, which is 3 × 2,442 = 7,326.
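The arithmetic for this question can be reproduced with a short Python sketch. It follows the byte-level block count used in part (a), which assumes records are packed contiguously across block boundaries; the function name `tpmms_requirements` is illustrative.

```python
import math

def tpmms_requirements(num_records, record_bytes, block_bytes):
    """Return (B, minimum M, disk I/Os) for Two-Phase Multiway Merge-Sort.

    B is computed from the total byte size, as in part (a).
    """
    total_bytes = num_records * record_bytes
    blocks = -(-total_bytes // block_bytes)      # integer ceiling division
    min_m = math.isqrt(blocks)                   # floor of sqrt(B)
    if min_m * min_m < blocks:                   # round up so that B <= M^2
        min_m += 1
    disk_io = 3 * blocks                         # TPMMS costs 3B disk I/Os
    return blocks, min_m, disk_io

if __name__ == "__main__":
    blocks, min_m, disk_io = tpmms_requirements(1_000_000, 10, 4_096)
    print(f"B = {blocks:,}, minimum M = {min_m}, disk I/Os = {disk_io:,}")
```

Integer arithmetic (`//` and `math.isqrt`) avoids floating-point rounding when taking the ceilings.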
2. We have two relations R and S where B(R) = B(S) = 10,000. Give an approximate size
of main memory M required and the number of disk I/Os in order to perform the two-pass
algorithms for the following operations: (12 points, 4 points each)
(c) the more efficient sort-join described in Section 15.4.8 of the textbook
I/O of the efficient sort-join = 3 × (B(S) + B(R))
= 3 × (10,000 + 10,000)
= 60,000
The requirement on M in the efficient sort-join algorithm is B(R) + B(S) ≤ M², i.e., 20,000
≤ 1,000,000, which is satisfied here.
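For reference, the smallest M that satisfies B(R) + B(S) ≤ M² can be computed directly. The sketch below is illustrative only (the function name is hypothetical); it pairs the 3(B(R) + B(S)) cost with the memory bound stated above.

```python
import math

def sort_join_estimate(b_r, b_s):
    """Return (disk I/Os, smallest M with B(R) + B(S) <= M^2)."""
    total_blocks = b_r + b_s
    disk_io = 3 * total_blocks
    min_m = math.isqrt(total_blocks)     # floor of the square root
    if min_m * min_m < total_blocks:     # round up to satisfy the bound
        min_m += 1
    return disk_io, min_m

if __name__ == "__main__":
    disk_io, min_m = sort_join_estimate(10_000, 10_000)
    print(f"disk I/Os = {disk_io:,}, smallest feasible M = {min_m}")
```

Any M at or above the returned value meets the requirement.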
Problem 3 Hash-based and Index-based joins (14 points)
1. If B(R) = 10,000 and B(S) = 30,000 and M = 101, what is the number of disk I/Os
required for the hash-join algorithm? (5 points)
The disk I/O estimate for the hash-join algorithm is 3 × (B(R) + B(S)) = 3 × (10,000 + 30,000)
= 120,000.
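A small Python sketch of this estimate follows. It also checks the approximate feasibility condition for a two-pass hash join, min(B(R), B(S)) ≤ M², which is an added sanity check and not part of the answer as written; the function name is illustrative.

```python
def two_pass_hash_join_io(b_r, b_s, m):
    """Estimate disk I/Os for a two-pass hash join as 3(B(R) + B(S)).

    Raises ValueError if the smaller relation is too large for two passes
    under the approximate condition min(B(R), B(S)) <= M^2.
    """
    if min(b_r, b_s) > m * m:
        raise ValueError("smaller relation exceeds M^2 blocks")
    return 3 * (b_r + b_s)

if __name__ == "__main__":
    print(two_pass_hash_join_io(10_000, 30_000, 101))
```

With M = 101 the bound is 101² = 10,201 blocks, so the smaller relation (10,000 blocks) just fits.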
2. Suppose B(R) = 10,000 and T(R) = 500,000. Let there be an index on R.a, and let
V(R, a) = k for some number k. Give the cost of σ_{a=0}(R), as a function of k, under the
following circumstances. You may neglect disk I/O’s needed to access the index itself.
(b) π_{A,D}(R ⋈ S)
Draw the table for dynamic programming, to show how you compute the optimal plan for
all possible join orders allowing all trees.
T(R) = 100, V(R, A) = 100, V(R, B) = 10, V(R, C) = 1, V(R, D) = 50; T(S) =
500, V(S, D) = 30, V(S, E) = 100.
B = 15 and B > 25 are mutually exclusive predicates; therefore T(σ_{(B>25) AND (B=15)}(R))
= 0.
(e) Estimate the number of tuples in R ⋈ S
T(R ⋈ S) = T(R)T(S) / max(V(R, D), V(S, D))
= 50,000 / 50
= 1,000
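The estimate in (e) is a direct application of the formula T(R)T(S)/max(V(R, D), V(S, D)) for a natural join on the single shared attribute D. A minimal sketch, with an illustrative function name:

```python
def estimate_join_size(t_r, t_s, v_r, v_s):
    """Estimate T(R join S) for a join on one shared attribute.

    v_r and v_s are the numbers of distinct values of that attribute
    in R and S, respectively.
    """
    return t_r * t_s / max(v_r, v_s)

if __name__ == "__main__":
    # Statistics from the problem: T(R) = 100, T(S) = 500,
    # V(R, D) = 50, V(S, D) = 30.
    print(estimate_join_size(100, 500, 50, 30))
```

Dividing by the larger of the two value counts reflects the assumption that every value in the smaller set also appears in the larger one.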
Problem 7 Pipelining Versus Materialization (10 points)
Consider physical query plans for the expression
(R(w, x) ⋈ S(x, y)) ⋈ U(y, z)
in Example 16.36 on page 831 of the textbook (we covered the same example in class). If
B(R) = 2,000, how would you update the table in Figure 16.38 on page 834 of the textbook?
Show the revised table. We use the same physical plans for the three respective cases in the
table.
With all the other conditions kept consistent with the example in the textbook, the following
are the only changes we need to consider:
• Consider R ⋈ S first. Neither relation fits in main memory, so we use the two-pass hash-join
method. The number of disk I/Os for a hash-based join is 3(B(R) + B(S)), given that
min(B(R), B(S)) ≤ M², which is satisfied here with B(R) = 2,000: min(2,000, 10,000)
≤ 101². So the disk I/O for R ⋈ S is 3 × (2,000 + 10,000) = 36,000.
• The number of blocks required to store one bucket of R in main memory is ceil(2,000/101) = 20. This
means that, of the remaining 81 blocks, 80 can be used in the join operation and 1
is used as an output buffer.
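The two changes listed above come down to a few lines of arithmetic, sketched here in Python. The variable names are illustrative, M = 101 is taken from the bound used above, and B(S) = 10,000 is the value used in the cost calculation.

```python
import math

M = 101                      # main-memory buffers
B_R, B_S = 2_000, 10_000     # B(R) is the revised value; B(S) is unchanged

# Two-pass hash join of R and S: 3(B(R) + B(S)) disk I/Os,
# valid because min(B(R), B(S)) <= M^2.
assert min(B_R, B_S) <= M * M
join_rs_io = 3 * (B_R + B_S)

# Memory split while the result is passed to the next join:
# blocks held for R, one output buffer, and the rest for the join.
blocks_for_r = math.ceil(B_R / M)
output_buffers = 1
free_for_join = M - blocks_for_r - output_buffers

if __name__ == "__main__":
    print(f"I/O for R join S: {join_rs_io:,}")
    print(f"blocks for R: {blocks_for_r}, free for the join: {free_for_join}")
```

The three parts of the split (20, 80, and 1) add up to the 101 available buffers.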