CS/SE 4AD3 Assignment 3
I. Joins, Query Optimization (92 marks)
Question 1 (8 marks) Join Costing
Consider the join R ◃▹ S on R.a = S.b, given the following information about the relations to be joined. The cost metric is the number of page I/Os unless otherwise noted, and the cost of writing out the result should be uniformly ignored. For R ◃▹ S assume R is the outer loop and S the inner loop.
Relation R contains 10,000 tuples and has 10 tuples per page. Relation S contains 2000 tuples and also has 10 tuples per page. Attribute b of relation S is the primary key for S. Both relations are stored as simple heap files. Neither relation has any indexes built on it. There are 52 buffer pages available.
-
(a) (2 marks) What is the cost of joining R and S using a page-oriented nested loop join?
-
(b) (2 marks) What is the cost of joining R and S using a block nested loop join?
-
(c) (2 marks) What is the cost of joining R and S using sort-merge join? (Not optimized sort- merge).
-
(d) (2 marks) What is the cost of joining R and S using a Grace hash join?
Question 2 (10 marks) Join Ordering
Suppose a SQL query must join three relations, A, B, and C. A join order specifies how to join the
three relations. For example, the join order (A ◃▹ B) ◃▹ C specifies that first we join A and B (with A
being the outer join), and then we join the result with C in a pipeline manner (with C being the inner
join). Current relational DBMS consider only certain join orders, not all possible ones. For the above
three relations, write down all join orders that a System R optimizer would consider. Explain why it
considers only those join orders, and why the remaining join orders are not considered. Give specific
reasons for each of the ignored (join order) cases.
Question 3 (40 marks) Plan Search and Costing
Consider two relations:
Company(cname, city, president) P roduct(pname, maker, price)
Suppose that the size of relation Company is 20,000 pages, with 100 tuples per page, and that the size of the Product relation is 100,000 pages, with 200 tuples per page. Suppose further that each company makes no more than 20 products, and that cname is a key in Company. Lastly, suppose that price is distributed uniformly between 1 and 100 in the P roduct relation.
Consider the following plan: first we do a selection on Product with the predicate condition (price ≥ 20) AND price ≤ 60), i.e., the price is between 20 and 60. We materialize the output of this selection on disk, then join this output with relation Company using a sort-merge join, on the condition (cname = maker). Then we perform on the fly the selection (city = ’Hamilton’) followed by the projection of attributes cname and president. The given memory size is 15,000 pages.
-
(a) (15 marks) Compute the cost of the above plan. Show the query plan tree, each step of your cost derivation.
-
(b) (15 marks) If we change the join to a block nested loop join, what is the cost of the new plan? Show the query plan tree, and the derivation for the plan with minimal cost.
-
(c) (10 marks) Are there further optimizations you can apply to reduce the cost of this plan? What is the minimal cost plan that can be derived? Show the query plan tree, and your cost derivation by applying the heuristics, equivalences, and physical operator implementations discussed in class. State the optimizations you used to obtain your minimal cost plan.
Question 4 (24 marks) Selectivity Estimation
Consider a relation R(A, B, C) with 1000 tuples. We have an index on A with 50 unique values in the range [1, 50] and an index on attribute B with 100 unique values in the range [1, 100]. We do not have an index on attribute C. Use selectivity estimation to estimate the number of tuples produced by the following queries. Show the derivation of your answer for each query.
(a) SELECT * FROM R
(b) SELECT * FROM R
(c) SELECT * FROM R
(d) SELECT * FROM R
(e) SELECT * FROM R
(f) SELECT * FROM R
(g) SELECT * FROM R
(h) SELECT*FROMRWHEREA≤25ANDB≤25
Question 5 (10 marks) Plan Search
Assume you have three relations R(A, B), S(B, C), and T (C, D). Relation R has a clustered hash index on attribute A, relation T has a clustered B+-tree index on attribute D. Consider the following query:
SELECT * FROM R, S, T WHERE R.B = S.B and S.C = T.C AND R.B = 10 AND T.D = 10
Use the dynamic programming (System R style) algorithm for query optimization that we dis- cussed in class. Remember that the algorithm creates the optimal plan through a bottom-up dynamic programming algorithm. Show in detail the intermediate plans that the query optimization algorithm considers in the second pass (i.e., all two relation plans). Ensure you describe the access methods and algorithms considered for each operator.
咨询 Alpha 小助手,获取更多课业帮助