Mock Final Examination Model Answer: Faculty of Computer Studies TM351 Data Management and Analysis

Faculty Of Computer Studies
TM351
Data management and analysis
Mock Final Examination Model Answer

Fall – 2018/2019
Date / /
Number of Exam Pages: (9 ) Time Allowed: ( 3 ) Hours
Question Type Max Mark

Part 1: Multiple Choice Questions (MCQ) 20
Part 2: Long Questions 80
Total grade 100
Instructions:
1. Answer all questions

2. Mobiles and calculators are not allowed.
Note: Only simple calculations that do not require a calculator are included in this exam.
Mock FINAL Questions and Model Answer TM351 Fall 2018/2019 1 of 9

Part One: [20 marks]
Please select the most appropriate answer and write it on the answer booklet
1. In SQL, a _________ subquery always returns a single column and a single row, that is, a single
value.
a. row b. scalar c. table d. none of the above
2. All SQL transaction isolation levels guarantee that the _____________ problem will never occur.
a. Lost update b. Dirty read c. Non-repeatable read d. Phantom rows
3. An overriding disadvantage of ______________ databases is that, although data interactions
within individual aggregates are optimised, interactions across aggregates are not.
a. document-databases b. key-value c. aggregate-oriented d. all of the above
4. The χ² (chi-squared) test does not show anything about correlation as it is used primarily for
___________ data. Instead, it looks at how many items there are in each category.
a. categorical b. ordinal c. interval d. ratio
5. MongoDB’s find operation will only work at the level of single documents. More complex processing,
including grouping, aggregation functions, and data renaming, is available through MongoDB’s ______.
a. aggregation pipeline b. SQL queries c. SELECT statements d. table structure
6. Replication is mostly concerned with data resilience whereas ________ is concerned with capacity
and performance.
a. sharding b. centralisation c. the user interface d. none of the above
7. In a data warehouse hypercube, _________ measure values can be combined along some but not
all dimensions.
a. semi-additive b. additive c. non-additive d. all of the above
8. Classification tasks like identifying email spam and classifying credit applications, which use
labelled training data to try to classify unseen instances, are known as _______________ tasks.
a. regression b. unsupervised learning c. supervised learning d. clustering
9. The ____________ clustering algorithm can have difficulty in detecting intuitively plausible clusters
in datasets which do not form natural clusters around their mean values.
a. density-based b. k-means c. neither a. nor b. above d. both a. and b. above
10. Cosine similarity is an idea from geometry, which is used to measure the __________________ as
a measure of document similarity.
a. angle between two vectors rather than the distance between points
b. distance between points rather than the angle between two vectors
c. both the angle and the distance
d. none of the above

Part Two: [80 marks]
Question One: [30 marks]
Use the Hospital relational headings below to answer the following SQL questions:
The Hospital relational headings (Relations)
patient (patient_id, patient_name, gender, height, weight, staff_no, ward_no)
ward (ward_no, ward_name, number_of_beds)
treatment (staff_no, patient_id, start_date, reason)
doctor (staff_no, doctor_name, position)
a. Write a SQL statement to list the patient names and weight for all patients whose
weight over height ratio is below 0.5. [5 marks]
b. Write a SQL statement to list the names and reason of treatment for all patients
treated by staff number 156. [5 marks]
c. Write a SQL statement to list the average height of all female patients. [5 marks]
d. Write a SQL statement to find the name of the tallest patient in each ward. [5 marks]
e. Write a SQL statement to update the table ward by increasing the number of beds in
ward number w2 by 5 beds. [5 marks]
f. To which normal form would the patient relation conform if we know that name
determines gender? Explain why you reached this conclusion and normalize the patient
relation to the next higher form. [5 marks]
Model Answers:
a. select patient_name, weight
   from patient
   where weight / height < 0.5;
b. select patient_name, reason
   from patient inner join treatment
     on patient.patient_id = treatment.patient_id
   where treatment.staff_no = '156';
c. select avg(height) from patient where gender = 'F';
d. select patient_name
   from patient p1
   where not exists (select * from patient p2
                     where p1.ward_no = p2.ward_no
                       and p1.height < p2.height);
e. update ward
   set number_of_beds = number_of_beds + 5
   where ward_no = 'w2';
f. The patient relation conforms to 2NF: its primary key consists of a single attribute
(patient_id), so no non-prime attribute can depend on only part of the key. It does not
conform to 3NF because one non-prime attribute (gender) depends on another non-prime
attribute (patient_name). To normalize the relation, we split it as follows:
patient1 (patient_id, patient_name, height, weight, staff_no, ward_no)
patient2 (patient_name, gender)
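The queries in answers a, c, d and e can be tried against a small in-memory database. The sketch below uses Python's built-in sqlite3 module. The sample rows are invented purely for illustration and are not part of the exam, and only the patient and ward relations are populated.

```python
import sqlite3

# In-memory database holding two of the Hospital relations.
# All rows below are hypothetical sample data, not taken from the exam.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ward (ward_no text primary key, ward_name text,
                       number_of_beds integer);
    create table patient (patient_id text primary key, patient_name text,
                          gender text, height real, weight real,
                          staff_no text, ward_no text);
    insert into ward values ('w1', 'Ward One', 10), ('w2', 'Ward Two', 12);
    insert into patient values
        ('p1', 'Amal',  'F', 160, 60, '156', 'w1'),
        ('p2', 'Basil', 'M', 180, 95, '156', 'w1'),
        ('p3', 'Carla', 'F', 170, 80, '201', 'w2'),
        ('p4', 'Dina',  'F', 150, 90, '201', 'w2');
""")

# Answer a: patients whose weight over height ratio is below 0.5.
light = sorted(conn.execute(
    "select patient_name, weight from patient where weight / height < 0.5"
).fetchall())

# Answer c: average height of all female patients.
avg_height = conn.execute(
    "select avg(height) from patient where gender = 'F'"
).fetchone()[0]

# Answer d: a patient is the tallest in a ward if no taller patient
# exists in the same ward.
tallest = sorted(conn.execute("""
    select patient_name
    from patient p1
    where not exists (select * from patient p2
                      where p1.ward_no = p2.ward_no
                        and p1.height < p2.height)
""").fetchall())

# Answer e: add 5 beds to ward w2, then read the new value back.
conn.execute(
    "update ward set number_of_beds = number_of_beds + 5 where ward_no = 'w2'"
)
beds = conn.execute(
    "select number_of_beds from ward where ward_no = 'w2'"
).fetchone()[0]

print(light, avg_height, tallest, beds)
```

Because height and weight are declared as real here, the ratio in answer a is computed with floating-point division; with integer columns some database systems would truncate the result.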

Question Two: [20 marks]
Use the sample document from a restaurants document database collection to answer the
following questions (assume the database name is "db" and the collection name is
"restaurants"):
{
"address": {
"building": "1007",
"coord": [ -73.856077, 40.848447 ],
"street": "Morris Park Ave",
"zipcode": "10462"
},
"borough": "Bronx",
"cuisine": "Bakery",
"grades": [
{ "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
{ "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
],
"name": "Morris Park Bake Shop",
"restaurant_id": "30075445"
}
Example: find an arbitrary document from the restaurant collection

Answer: db.restaurants.find_one()
a. find any document from the restaurant collection where the cuisine is "Chinese" [2 marks]
b. find all documents in the restaurant collection [4 marks]
c. find all documents whose borough field equals "Manhattan" [2 marks]
d. find all documents whose address has a zipcode equal to "10075" [6 marks]
e. find out how many Chinese restaurants there are in Manhattan [6 marks]
Model Answers:
Example: db.restaurants.find_one()
a. db.restaurants.find_one( { "cuisine": "Chinese" } )
b. db.restaurants.find()
c. db.restaurants.find( { "borough": "Manhattan" } )
d. db.restaurants.find( { "address.zipcode": "10075" } )
e. db.restaurants.find( { "borough": "Manhattan", "cuisine": "Chinese" } ).count()
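The filter documents in these answers can be read as field-by-field equality tests, with dot notation reaching into embedded documents. The sketch below emulates that matching rule in plain Python so that it runs without a MongoDB server. The helper names get_path and matches are hypothetical, only equality filters are covered, and the second and third sample documents are invented for illustration.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as "address.zipcode" into nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(doc, query):
    """True when every field named in the query equals the expected value."""
    return all(get_path(doc, key) == expected for key, expected in query.items())


# First document is abridged from the exam sample; the other two are made up.
restaurants = [
    {"address": {"zipcode": "10462"}, "borough": "Bronx",
     "cuisine": "Bakery", "name": "Morris Park Bake Shop"},
    {"address": {"zipcode": "10075"}, "borough": "Manhattan",
     "cuisine": "Chinese", "name": "Sample Noodle House"},
    {"address": {"zipcode": "10075"}, "borough": "Manhattan",
     "cuisine": "Bakery", "name": "Sample Bakery"},
]

# Dot notation reaches the zipcode inside the embedded address document.
by_zip = [d["name"] for d in restaurants
          if matches(d, {"address.zipcode": "10075"})]

# Both conditions must hold to count Chinese restaurants in Manhattan.
chinese_in_manhattan = sum(
    matches(d, {"borough": "Manhattan", "cuisine": "Chinese"})
    for d in restaurants
)

print(by_zip, chinese_in_manhattan)
```

With a live server and PyMongo, the same filter dictionaries would be passed to find or find_one. Recent PyMongo versions count matches with the collection's count_documents method rather than the cursor's count method.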

Question Three: [20 marks]
Students in a statistical experiment have been classified into two groups based on the
similarity of their letter grades in two subjects, Math and English, as follows:
Group A                          Group B
Student Id   Math   English      Student Id   Math   English
s01          A      C            s03          A      B
s02          C      A            s04          C      B
a. Find the Euclidean distance between a new student with id s20 and scores B in Math and
C in English, and each of the 4 students above. Letter grades translate to points as follows:
A =4; B = 3; C = 2. [4 marks]
b. To which student(s) is s20 most similar? [1 mark]
c. In which group would a K nearest neighbors (KNN) classification algorithm with K = 3 and
which uses simple majority voting classify s20? And why? [5 marks]
d. In which group would a K nearest neighbors (KNN) classification algorithm with K = 3 and
which uses weighted voting classify s20? And why? [5 marks]
e. Suppose we applied the KNN algorithm to each member of a larger training set of students
in the two groups (s01-s18) and obtained the results shown below. Based on
these results, which value of k would you choose, and why? [5 marks]
Correctly classified?
student Id Group k=3 k=5 k=7
s01 A Y Y Y
s02 A Y Y Y
s03 A Y Y Y
s04 A Y Y Y
s05 A N N N
s06 A N N N
s07 A Y Y Y
s08 A Y Y Y
s09 A Y Y Y
s10 A Y Y Y
s11 B N Y N
s12 B Y Y Y
s13 B Y Y Y
s14 B N Y Y
s15 B Y Y Y
s16 B Y Y Y
s17 B Y Y Y
s18 B Y N N

Model Answer:
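a.
Student Id   Group   Math   English
s01          A       A=4    C=2
s02          A       C=2    A=4
s03          B       A=4    B=3
s04          B       C=2    B=3
s20          new     B=3    C=2
distance between s20 and s01 = √((4 − 3)² + (2 − 2)²) = √1 = 1
distance between s20 and s02 = √((2 − 3)² + (4 − 2)²) = √5
distance between s20 and s03 = √((4 − 3)² + (3 − 2)²) = √2
distance between s20 and s04 = √((2 − 3)² + (3 − 2)²) = √2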
b. Student s20 is most similar to student s01, which is at the smallest distance (1).
c. KNN would classify s20 in group B because the K (=3) nearest neighbors to s20 are s01,
s03 and s04, and the majority of those students (s03 and s04) are in group B.
d. KNN with weighted voting would classify s20 in group B because the K (=3) nearest
neighbors to s20 are s01, s03 and s04. Weighting each vote by the inverse of its distance,
the weight of group A is 1/1 = 1, whereas the weight of group B is
1/√2 + 1/√2 = 2/√2 = √2 > 1.
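The distance, majority-vote and weighted-vote steps in answers a to d can be checked with a short script. This is a minimal sketch in Python (3.8 or later, for math.dist) that assumes inverse-distance weights, as the weighted-vote answer implies. The variable and function names are illustrative only.

```python
import math
from collections import Counter, defaultdict

POINTS = {"A": 4, "B": 3, "C": 2}  # letter grade -> points, as in the question

# (Math, English) letter grades and group label for each known student.
training = {
    "s01": (("A", "C"), "A"),
    "s02": (("C", "A"), "A"),
    "s03": (("A", "B"), "B"),
    "s04": (("C", "B"), "B"),
}
new_student = ("B", "C")  # s20


def to_points(grades):
    """Translate a tuple of letter grades into a tuple of points."""
    return tuple(POINTS[g] for g in grades)


def distance(grades_1, grades_2):
    """Euclidean distance between two students' grade points."""
    return math.dist(to_points(grades_1), to_points(grades_2))


# Part a: distance from s20 to every known student.
distances = {sid: distance(grades, new_student)
             for sid, (grades, _) in training.items()}

# Parts b and c: the three nearest neighbours and their majority group.
nearest = sorted(distances, key=distances.get)[:3]
majority = Counter(training[sid][1] for sid in nearest).most_common(1)[0][0]

# Part d: each neighbour votes with weight 1 / distance.
weights = defaultdict(float)
for sid in nearest:
    weights[training[sid][1]] += 1 / distances[sid]
weighted = max(weights, key=weights.get)

print(distances, nearest, majority, dict(weights), weighted)
```

Inverse-distance weighting breaks down when a neighbour lies at distance 0; that case does not arise with these four students.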
e. We would choose k = 5 because this value yields the smallest error as the error comes out
as follows:
k error
3 4/18
5 3/18
7 4/18
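The error counts in answer e can be reproduced by tallying the N entries in each column of the results table. A minimal sketch in Python, with the Y/N flags transcribed from the table (the first ten flags are the group A students s01 to s10, the last eight are the group B students s11 to s18):

```python
# Correctly-classified flags per value of k, in student order s01..s18.
results = {
    3: "YYYYNNYYYY" + "NYYNYYYY",
    5: "YYYYNNYYYY" + "YYYYYYYN",
    7: "YYYYNNYYYY" + "NYYYYYYN",
}

# Number of misclassified students for each k.
errors = {k: flags.count("N") for k, flags in results.items()}

# The preferred k is the one with the fewest misclassifications.
best_k = min(errors, key=errors.get)

for k, wrong in errors.items():
    print(f"k={k}: error {wrong}/{len(results[k])}")
print("best k:", best_k)
```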

Or
Question Three: [20 marks]
Below is a depiction of a data cube for sales figures of four products (Milk, Juice, Ham and
Jam) sold by a chain of stores. The sales are represented as a table. Examine this data cube
and answer the questions that follow:
Quarter   Product   Store location
                    London   York   Norwich   Bristol
Q1        Milk      3        1      2         2
Q1        Juice     2        1      2         2
Q1        Ham       3        2      3         1
Q1        Jam       2        1      2         2
Q2        Milk      2        1      3         1
Q2        Juice     3        1      1         2
Q2        Ham       4        1      3         2
Q2        Jam       3        2      2         2
Q3        Milk      2        2      1         3
Q3        Juice     1        1      1         1
Q3        Ham       3        2      2         3
Q3        Jam       2        1      3         1
Q4        Milk      3        1      1         2
Q4        Juice     2        1      1         2
Q4        Ham       2        3      2         3
Q4        Jam       2        2      1         1
a. Slice the cube by Store location and Quarter, showing only the data for Ham and show
the resulting cube as a table. [5 marks]
b. Dice by showing only the sales data for Juice and Ham products sold by stores in York
and Norwich. [5 marks]
c. Roll up this data cube by product to show only the sales figures per Store location. [5 marks]
d. Drill down this data cube by adding a time of day dimension indicating the time of day
(AM or PM) during which the sales were made. Only show the structure of the resulting
table with all the headings (i.e. show an empty table without any numbers). [5 marks]

Model Answers:
a.
Quarter   Store location
          London   York   Norwich   Bristol
Q1        3        2      3         1
Q2        4        1      3         2
Q3        3        2      2         3
Q4        2        3      2         3
b.
Quarter   Product   Store location
                    York   Norwich
Q1        Juice     1      2
Q1        Ham       2      3
Q2        Juice     1      1
Q2        Ham       1      3
Q3        Juice     1      1
Q3        Ham       2      2
Q4        Juice     1      1
Q4        Ham       3      2
c.
Store location
London   York   Norwich   Bristol
39       23     30        30
d.
                    Store location
                    London     York       Norwich    Bristol
Quarter   Product   AM   PM    AM   PM    AM   PM    AM   PM
Q1        Milk
Q1        Juice
Q1        Ham
Q1        Jam
Q2        Milk
Q2        Juice
Q2        Ham
Q2        Jam
Q3        Milk
Q3        Juice
Q3        Ham
Q3        Jam
Q4        Milk
Q4        Juice
Q4        Ham
Q4        Jam
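The slice, dice and roll-up answers can be verified by treating the cube as a mapping from (quarter, product, location) to a sales figure. The sketch below uses only plain Python dictionaries. The variable names are illustrative, and the roll-up sums over both quarter and product, as the model answer to part c does.

```python
LOCATIONS = ["London", "York", "Norwich", "Bristol"]

# Rows of the sales table: (quarter, product) -> sales per store location.
rows = {
    ("Q1", "Milk"): [3, 1, 2, 2], ("Q1", "Juice"): [2, 1, 2, 2],
    ("Q1", "Ham"): [3, 2, 3, 1], ("Q1", "Jam"): [2, 1, 2, 2],
    ("Q2", "Milk"): [2, 1, 3, 1], ("Q2", "Juice"): [3, 1, 1, 2],
    ("Q2", "Ham"): [4, 1, 3, 2], ("Q2", "Jam"): [3, 2, 2, 2],
    ("Q3", "Milk"): [2, 2, 1, 3], ("Q3", "Juice"): [1, 1, 1, 1],
    ("Q3", "Ham"): [3, 2, 2, 3], ("Q3", "Jam"): [2, 1, 3, 1],
    ("Q4", "Milk"): [3, 1, 1, 2], ("Q4", "Juice"): [2, 1, 1, 2],
    ("Q4", "Ham"): [2, 3, 2, 3], ("Q4", "Jam"): [2, 2, 1, 1],
}

# Flatten into a cube keyed by (quarter, product, location).
cube = {(q, p, loc): value
        for (q, p), values in rows.items()
        for loc, value in zip(LOCATIONS, values)}

# Slice: fix the product dimension at Ham, keeping quarter and location.
ham_slice = {(q, loc): v for (q, p, loc), v in cube.items() if p == "Ham"}

# Dice: keep a sub-cube restricted on two dimensions at once.
dice = {key: v for key, v in cube.items()
        if key[1] in {"Juice", "Ham"} and key[2] in {"York", "Norwich"}}

# Roll up: aggregate away quarter and product, leaving totals per location.
totals = {loc: 0 for loc in LOCATIONS}
for (_, _, loc), v in cube.items():
    totals[loc] += v

print(totals)
```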

Question Four: [10 marks]
a. Calculate the centroid for the following 5 data points: [5 marks]
p1 = (1, 2, 3)
p2 = (1, 3, 5)
p3 = (2, 3, 4)
p4 = (1, 4, 5)
p5 = (3, 2, 1)
b. In a silhouette analysis, the average distances from a point Pi, which is in cluster A, to
the members of each of the 3 existing clusters were found to be:
Cluster   Average distance from Pi to each cluster member
A         24
B         48
C         72
Calculate the silhouette coefficient of point Pi. Explain your steps. [5 marks]
Answer:
a.
Centroid_x = (1 + 1 + 2 + 1 + 3)/5 = 8/5 = 1.6
Centroid_y = (2 + 3 + 3 + 4 + 2)/5 = 14/5 = 2.8
Centroid_z = (3 + 5 + 4 + 5 + 1)/5 = 18/5 = 3.6
So, the centroid = (1.6, 2.8, 3.6)
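The coordinate-wise averaging can be checked in a couple of lines. A minimal sketch in Python, where zip(*points) regroups the five points into one tuple per coordinate:

```python
points = [(1, 2, 3), (1, 3, 5), (2, 3, 4), (1, 4, 5), (3, 2, 1)]

# Average each coordinate separately across all points.
centroid = tuple(sum(coords) / len(points) for coords in zip(*points))

print(centroid)
```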
b.
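Since point Pi is in cluster A, a(Pi) = 24 (the average distance to its own cluster).
Since the nearest other cluster to Pi is cluster B (48 < 72), b(Pi) = 48.
The silhouette coefficient for Pi is S(Pi) = (b(Pi) − a(Pi)) / max(a(Pi), b(Pi)); since
a(Pi) < b(Pi), this simplifies to S(Pi) = 1 − (a(Pi)/b(Pi)) = 1 − 24/48 = 0.5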
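The same steps can be written as a small function. This is a sketch in Python using the general form (b − a) / max(a, b), which reduces to 1 − a/b whenever a < b. The function name silhouette and its arguments are illustrative, not taken from any library.

```python
def silhouette(avg_distances, own_cluster):
    """Silhouette coefficient of one point from its per-cluster average distances."""
    # a: average distance to the point's own cluster.
    a = avg_distances[own_cluster]
    # b: smallest average distance to any other cluster.
    b = min(d for cluster, d in avg_distances.items() if cluster != own_cluster)
    return (b - a) / max(a, b)


s = silhouette({"A": 24, "B": 48, "C": 72}, "A")
print(s)
```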
End of Questions
