Problem 3:: Ref: Homework Assignment. The George Washington University (Csci 243 - Data Mining, Spring 2007)
The George Washington University (CSci 243 – Data Mining, Spring 2007)
Problem 3:
Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two
measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in
(a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be
performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data are stored in a relational database
with the schema fee (day, month, year, doctor, hospital, patient, count, charge).
Solution:
(a) Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized
into sets of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellation: multiple fact tables share dimension tables. Because it can be viewed as a collection
of stars, it is also called a galaxy schema.
(b) As figures below
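One possible star schema for this warehouse can be sketched as table definitions. This is only an illustration under stated assumptions, not a reproduction of the figure: the table names, the key columns (time_key, doctor_key, patient_key), and the descriptive attributes are hypothetical. Only the dimensions time, doctor, and patient and the measures count and charge come from the problem statement.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing three dimension tables.
# All table, key, and attribute names are illustrative assumptions.
STAR_SCHEMA = """
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day INTEGER, month INTEGER, year INTEGER
);
CREATE TABLE doctor_dim (
    doctor_key INTEGER PRIMARY KEY,
    doctor_name TEXT, hospital TEXT
);
CREATE TABLE patient_dim (
    patient_key INTEGER PRIMARY KEY,
    patient_name TEXT
);
CREATE TABLE fee_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    doctor_key INTEGER REFERENCES doctor_dim(doctor_key),
    patient_key INTEGER REFERENCES patient_dim(patient_key),
    "count" INTEGER,   -- measure: number of visits
    charge REAL        -- measure: fee charged for a visit
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(STAR_SCHEMA)

# List the tables that were created: three dimensions plus the fact table.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The fact table holds only foreign keys and measures, while each dimension table carries its own descriptive attributes, which is what gives the schema its star shape.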
(c) Starting with the base cuboid [day, doctor, patient], perform the following OLAP operations:
1. roll up from day to month to year
2. slice for year = “2004”
3. roll up on patient from individual patient to all
4. slice for patient = “all”
5. get the list of total fee collected by each doctor in 2004
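The steps above can be sketched on a tiny in-memory cube. This is a minimal illustration, assuming cells are stored as a dictionary keyed by (day, doctor, patient); the sample data, the roll_up helper, and the variable names are all hypothetical, and the day-to-month-to-year roll-up is collapsed into a single step.

```python
from collections import defaultdict

# Hypothetical base-cuboid cells: (day, doctor, patient) -> charge.
# Days are ISO date strings; all values are made-up sample data.
base_cuboid = {
    ("2004-01-15", "Dr. A", "p1"): 100.0,
    ("2004-03-02", "Dr. A", "p2"): 50.0,
    ("2004-07-20", "Dr. B", "p1"): 80.0,
    ("2003-11-05", "Dr. B", "p3"): 70.0,
}


def roll_up(cells, key_fn):
    """Sum charges after mapping each cell key to a coarser key."""
    out = defaultdict(float)
    for key, charge in cells.items():
        out[key_fn(key)] += charge
    return dict(out)


# Step 1: roll up time from day to year (the first four characters of the date).
by_year = roll_up(base_cuboid, lambda k: (k[0][:4], k[1], k[2]))

# Step 2: slice for year = "2004", which fixes and drops the time dimension.
in_2004 = {(doc, pat): c for (year, doc, pat), c in by_year.items() if year == "2004"}

# Steps 3-4: roll up patient to "all", leaving one cell per doctor.
fee_by_doctor = roll_up(in_2004, lambda k: k[0])

# Step 5: the resulting list of total fees per doctor in 2004.
print(fee_by_doctor)
```

Because charge is summed, the roll-ups can be applied in any order without changing the final totals.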
(d)
SELECT doctor, SUM(charge)
FROM fee
WHERE year = 2004
GROUP BY doctor;
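The query can be checked against a small in-memory table. This sketch assumes SQLite through Python's sqlite3 module; the column types and the sample rows are made up, and an ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema from the problem statement; the column types are assumptions.
conn.execute(
    "CREATE TABLE fee (day INTEGER, month INTEGER, year INTEGER, "
    'doctor TEXT, hospital TEXT, patient TEXT, "count" INTEGER, charge REAL)'
)

# Made-up sample rows: three visits in 2004 and one in 2003.
rows = [
    (15, 1, 2004, "Dr. A", "H1", "p1", 1, 100.0),
    (2, 3, 2004, "Dr. A", "H1", "p2", 1, 50.0),
    (20, 7, 2004, "Dr. B", "H2", "p1", 1, 80.0),
    (5, 11, 2003, "Dr. B", "H2", "p3", 1, 70.0),
]
conn.executemany("INSERT INTO fee VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Total fee collected by each doctor in 2004.
result = conn.execute(
    "SELECT doctor, SUM(charge) FROM fee "
    "WHERE year = 2004 GROUP BY doctor ORDER BY doctor"
).fetchall()
print(result)
```

The WHERE clause plays the role of the slice on year, and GROUP BY doctor aggregates away the patient and time dimensions.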
Problem 4:
Suppose that a data warehouse for Big University consists of the following four dimensions: student,
course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual
level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure
stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average
grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP
operations (e.g., roll-up from semester to year) should one perform in order to list the average
grade of CS courses for each Big University student?
(c) If each dimension has five levels (including all), such as “student < major < status < university
< all”, how many cuboids will this cube contain (including the base and apex cuboids)?
Solution:
(a)
(b)
Starting with the base cuboid [student, course, semester, instructor]
1. roll-up on course from (course_key) to department
2. roll-up on student from (student_key) to university
3. Dice on course, student with department = “CS” and university = “Big University”
4. Drill-down on student from university to student name
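The end result of these operations can be sketched in a single pass over a toy cube. This is an illustration under stated assumptions rather than the exact operation sequence: the lookup tables, sample grades, and names are hypothetical, and the semester and instructor dimensions are assumed to be aggregated to all. The average is carried as a (sum, count) pair so that it stays correct when cells are merged.

```python
from collections import defaultdict

# Hypothetical dimension lookups (made-up sample data).
course_department = {"CS101": "CS", "CS202": "CS", "MATH101": "Math"}
student_university = {"alice": "Big University", "bob": "Big University"}

# Base cuboid: (student, course, semester, instructor) -> actual course grade.
base_cuboid = {
    ("alice", "CS101", "Fall 2006", "Smith"): 4.0,
    ("alice", "CS202", "Spring 2007", "Jones"): 3.0,
    ("alice", "MATH101", "Fall 2006", "Lee"): 2.0,
    ("bob", "CS101", "Fall 2006", "Smith"): 3.0,
}

# Running [sum, count] per student; averaging averages directly would be wrong.
totals = defaultdict(lambda: [0.0, 0])
for (student, course, _semester, _instructor), grade in base_cuboid.items():
    # Dice: keep only cells with department = "CS" and university = "Big University".
    if course_department[course] != "CS":
        continue
    if student_university[student] != "Big University":
        continue
    # Semester and instructor are aggregated away; student stays at the lowest level.
    totals[student][0] += grade
    totals[student][1] += 1

avg_cs_grade = {s: total / n for s, (total, n) in totals.items()}
print(avg_cs_grade)
```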
(c) Each of the 4 dimensions has 5 levels (including all), so the cube will contain 5^4 = 625 cuboids.
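The count can be verified with a short sketch, assuming a cuboid is identified by choosing one abstraction level per dimension. The variable names are illustrative only.

```python
from itertools import product
from math import prod

# Number of levels (including "all") for each of the four dimensions:
# student, course, semester, instructor.
levels_per_dimension = [5, 5, 5, 5]

# Closed form: the product of the level counts over all dimensions.
total_cuboids = prod(levels_per_dimension)

# Cross-check by enumerating every combination of one level per dimension.
enumerated = sum(1 for _ in product(*(range(n) for n in levels_per_dimension)))

print(total_cuboids, enumerated)
```

Both the closed form and the enumeration give the same count. The base cuboid corresponds to picking the lowest level in every dimension, and the apex cuboid to picking all in every dimension.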