Mock Final Examination Model Answer: Faculty of Computer Studies TM351 Data Management and Analysis
TM351
Data management and analysis
Instructions:
1. In SQL, a _________ subquery always returns a single column and a single row, that is, a single
value.
a. row b. scalar c. table d. none of the above
2. All SQL transaction isolation levels guarantee that the _____________ problem will never occur.
a. Lost update b. Dirty read c. Non-repeatable read d. Phantom rows
3. An overriding disadvantage of ______________ databases is that, although data interactions
within individual aggregates are optimised, interactions across aggregates are not.
a. document-databases b. key-value c. aggregate-oriented d. all of the above
4. The χ² (chi-squared) test does not show anything about correlation as it is used primarily for
___________ data. Instead it looks at how many items there are in each category.
a. categorical b. ordinal c. interval d. ratio
5. MongoDB’s find operation will only work at the level of single documents. More complex processing,
including grouping, aggregation functions, and data renaming, is available through MongoDB’s ______.
a. aggregation pipeline b. SQL queries c. SELECT statements d. table structure
6. Replication is mostly concerned with data resilience whereas ________ is concerned with capacity
and performance.
a. sharding b. centralisation c. the user interface d. none of the above
7. In a data warehouse hypercube, _________ measure values can be combined along some but not
all dimensions.
a. semi-additive b. additive c. non-additive d. all of the above
8. Classification tasks like identifying email spam and classifying credit applications, which use
labelled training data to try to classify unseen instances, are known as _______________ tasks.
a. regression b. unsupervised learning c. supervised learning d. clustering
9. The ____________ clustering algorithm can have difficulty in detecting intuitively plausible clusters
in datasets which do not form natural clusters around their mean values.
a. density-based b. k-means c. neither a. nor b. above d. both a. and b. above
10. Cosine similarity is an idea from geometry, which is used to measure the __________________ as
a measure of document similarity.
a. angle between two vectors rather than the distance between points
b. the distance between points rather than the angle between two vectors
c. both the angle and the distance
d. none of the above.
Model Answers:
a. select patient_name, weight
from patient
where weight/height < 0.5;
b. select patient_name, reason
from patient inner join treatment
on patient.patient_id = treatment.patient_id
where treatment.staff_no = '156';
c. select avg(height)
from patient
where gender = 'F';
d. select patient_name
from patient p1
where not exists (select * from patient p2
where p1.ward_no = p2.ward_no and p1.height < p2.height);
e. update ward
set number_of_beds = number_of_beds + 5
where ward_no = 'w2';
f. The relation patient conforms to 2NF because its primary key consists of a single
attribute, so no partial dependency on the key is possible. It does not conform to 3NF
because one non-prime attribute (gender) depends on another non-prime attribute
(patient_name). To normalize the relation, we split it as follows:
patient1 (patient_id, patient_name, height, weight, staff_no, ward_no)
patient2 (patient_name, gender)
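The queries in answers c, d and e can be tried out with a minimal sketch such as the one below, which uses Python's built-in sqlite3 module. The table layout and every row of data are hypothetical, invented only for illustration; the exam's actual schema and data are not reproduced in this document.

```python
import sqlite3

# In-memory database with HYPOTHETICAL tables and rows (not the exam's data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ward (ward_no text primary key, number_of_beds integer);
    create table patient (
        patient_id text primary key, patient_name text, gender text,
        height real, weight real, ward_no text references ward(ward_no)
    );
    insert into ward values ('w1', 10), ('w2', 20);
    insert into patient values
        ('p1', 'Ann',  'F', 160, 60, 'w1'),
        ('p2', 'Bob',  'M', 180, 95, 'w1'),
        ('p3', 'Cara', 'F', 170, 70, 'w2'),
        ('p4', 'Dan',  'M', 165, 80, 'w2');
""")

# Answer c: average height of female patients.
avg_female_height = conn.execute(
    "select avg(height) from patient where gender = 'F'"
).fetchone()[0]

# Answer d: patients for whom no taller patient exists in the same ward.
tallest_per_ward = sorted(row[0] for row in conn.execute("""
    select patient_name
    from patient p1
    where not exists (select * from patient p2
                      where p1.ward_no = p2.ward_no
                        and p1.height < p2.height)
"""))

# Answer e: add five beds to ward w2.
conn.execute(
    "update ward set number_of_beds = number_of_beds + 5 where ward_no = 'w2'"
)
beds_w2 = conn.execute(
    "select number_of_beds from ward where ward_no = 'w2'"
).fetchone()[0]

print(avg_female_height, tallest_per_ward, beds_w2)
```

With this sample data the correlated subquery in answer d keeps one patient per ward, because a row survives only when no other row in the same ward has a greater height.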
a. Find the Euclidean distance between a new student with id s20 and scores B in Math and
C in English, and each of the 4 students above. Letter grades translate to points as follows:
A =4; B = 3; C = 2. [4 marks]
c. In which group would a K nearest neighbors (KNN) classification algorithm with K = 3 and
which uses simple majority voting classify s20? And why? [5 marks]
d. In which group would a K nearest neighbors (KNN) classification algorithm with K = 3 and
which uses weighted voting classify s20? And why? [5 marks]
e. Suppose we applied the KNN algorithm to each member of a larger training set of
students in the two groups (s01-s18) and we obtained the results shown below. Based on
these results, which value of k would you choose? And why? [5 marks]
Correctly classified?
student Id Group k=3 k=5 k=7
s01 A Y Y Y
s02 A Y Y Y
s03 A Y Y Y
s04 A Y Y Y
s05 A N N N
s06 A N N N
s07 A Y Y Y
s08 A Y Y Y
s09 A Y Y Y
s10 A Y Y Y
s11 B N Y N
s12 B Y Y Y
s13 B Y Y Y
s14 B N Y Y
s15 B Y Y Y
s16 B Y Y Y
s17 B Y Y Y
s18 B Y N N
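For part a, the distance calculation can be sketched as follows. The grade-to-points mapping and the grades of s20 come from the question; the second student is hypothetical, because the table of the four original students is not reproduced in this document.

```python
import math

# Letter grades translate to points as stated in part a.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2}


def to_point(math_grade, english_grade):
    """Convert a (Math, English) pair of letter grades to numeric coordinates."""
    return (GRADE_POINTS[math_grade], GRADE_POINTS[english_grade])


def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


s20 = to_point("B", "C")            # B in Math, C in English
hypothetical = to_point("A", "B")   # HYPOTHETICAL student, for illustration only
distance = euclidean(s20, hypothetical)

print(s20, hypothetical, distance)
```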
c. KNN would classify s20 in group B because the K (= 3) nearest neighbors to s20 are s01,
s03 and s04, and the majority of those students (s03 and s04) are in group B.
d. KNN with weighted voting would classify s20 in group B because the K (= 3) nearest
neighbors to s20 are s01, s03 and s04. Weighting each vote by the inverse of the neighbor's
distance, the weight of group A is 1/1 = 1 whereas the weight of group B is
1/√2 + 1/√2 = 2/√2 = √2 > 1.
e. We would choose k = 5 because this value yields the smallest error as the error comes out
as follows:
k error
3 4/18
5 3/18
7 4/18
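The reasoning in answers c, d and e can be checked with a short sketch. The neighbor distances (1 for s01, √2 for s03 and s04) are read off the weights in answer d, and the Y/N outcomes are copied from the table in part e; the function and variable names are illustrative choices, and inverse-distance weighting is assumed as the weighting scheme.

```python
import math
from collections import defaultdict

# The three nearest neighbors of s20: (student, group, distance to s20).
neighbours = [
    ("s01", "A", 1.0),
    ("s03", "B", math.sqrt(2)),
    ("s04", "B", math.sqrt(2)),
]


def majority_vote(neighbours):
    """Each neighbor casts one vote for its own group."""
    counts = defaultdict(int)
    for _, group, _ in neighbours:
        counts[group] += 1
    return max(counts, key=counts.get)


def weighted_vote(neighbours):
    """Each neighbor votes with weight 1/distance (assumed weighting scheme)."""
    weights = defaultdict(float)
    for _, group, dist in neighbours:
        weights[group] += 1.0 / dist
    return max(weights, key=weights.get), dict(weights)


simple_group = majority_vote(neighbours)
weighted_group, group_weights = weighted_vote(neighbours)

# Part e: outcomes for s01..s18 in table order; each string is k=3, k=5, k=7.
outcomes = ("YYY YYY YYY YYY NNN NNN YYY YYY YYY YYY "
            "NYN YYY YYY NYY YYY YYY YYY YNN").split()
ks = (3, 5, 7)

# Number of misclassified students (out of 18) for each value of k.
errors = {k: sum(row[i] == "N" for row in outcomes) for i, k in enumerate(ks)}
best_k = min(errors, key=errors.get)

print(simple_group, weighted_group, group_weights, errors, best_k)
```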
Below is a depiction of a data cube for the sales figures of four products (Milk, Juice, Ham
and Jam) sold by a chain of stores. The sales are represented as a table. Examine this data
cube and answer the questions afterwards:
Quarter  Product  Store location
                  London  York  Norwich  Bristol
Q1       Milk     3       1     2        2
Q1       Juice    2       1     2        2
Q1       Ham      3       2     3        1
Q1       Jam      2       1     2        2
Q2       Milk     2       1     3        1
Q2       Juice    3       1     1        2
Q2       Ham      4       1     3        2
Q2       Jam      3       2     2        2
Q3       Milk     2       2     1        3
Q3       Juice    1       1     1        1
Q3       Ham      3       2     2        3
Q3       Jam      2       1     3        1
Q4       Milk     3       1     1        2
Q4       Juice    2       1     1        2
Q4       Ham      2       3     2        3
Q4       Jam      2       2     1        1
a. Slice the cube by Store location and Quarter, showing only the data for Ham and show
the resulting cube as a table. [5 marks]
b. Dice by showing only the sales data for Juice and Ham products sold by stores in York
and Norwich. [5 marks]
c. Roll up this data cube by product to show only the sales figures per Store location. [5
marks]
d. Drill down this data cube by adding a time of day dimension indicating the time of day
(AM or PM) during which the sales were made. Only show the structure of the resulting
table with all the headings (i.e. show an empty table without any numbers). [5 marks]
b.
Quarter  Product  Store location
                  York  Norwich
Q1       Juice    1     2
Q1       Ham      2     3
Q2       Juice    1     1
Q2       Ham      1     3
Q3       Juice    1     1
Q3       Ham      2     2
Q4       Juice    1     1
Q4       Ham      3     2
c.
Store location
London York Norwich Bristol
39 23 30 30
d.
Store location
London London York York Norwich Norwich Bristol Bristol
Quarter Product AM PM AM PM AM PM AM PM
Q1 Milk
Q1 Juice
Q1 Ham
Q1 Jam
Q2 Milk
Q2 Juice
Q2 Ham
Q2 Jam
Q3 Milk
Q3 Juice
Q3 Ham
Q3 Jam
Q4 Milk
Q4 Juice
Q4 Ham
Q4 Jam
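The slice, dice and roll-up operations asked for in parts a to c can be sketched in plain Python as below. The sales figures are copied from the data cube table in the question; the dictionary layout and the function names are illustrative assumptions, not a prescribed representation.

```python
STORES = ("London", "York", "Norwich", "Bristol")

# (quarter, product) -> sales per store, in the order given by STORES.
cube = {
    ("Q1", "Milk"): (3, 1, 2, 2), ("Q1", "Juice"): (2, 1, 2, 2),
    ("Q1", "Ham"):  (3, 2, 3, 1), ("Q1", "Jam"):   (2, 1, 2, 2),
    ("Q2", "Milk"): (2, 1, 3, 1), ("Q2", "Juice"): (3, 1, 1, 2),
    ("Q2", "Ham"):  (4, 1, 3, 2), ("Q2", "Jam"):   (3, 2, 2, 2),
    ("Q3", "Milk"): (2, 2, 1, 3), ("Q3", "Juice"): (1, 1, 1, 1),
    ("Q3", "Ham"):  (3, 2, 2, 3), ("Q3", "Jam"):   (2, 1, 3, 1),
    ("Q4", "Milk"): (3, 1, 1, 2), ("Q4", "Juice"): (2, 1, 1, 2),
    ("Q4", "Ham"):  (2, 3, 2, 3), ("Q4", "Jam"):   (2, 2, 1, 1),
}


def slice_product(cube, product):
    """Slice: fix one product, keeping the Quarter and Store location dimensions."""
    return {q: dict(zip(STORES, sales))
            for (q, p), sales in cube.items() if p == product}


def dice(cube, products, stores):
    """Dice: keep only the chosen products and the chosen store columns."""
    idx = [STORES.index(s) for s in stores]
    return {(q, p): tuple(sales[i] for i in idx)
            for (q, p), sales in cube.items() if p in products}


def roll_up_by_store(cube):
    """Roll up: sum over quarters and products, leaving one total per store."""
    totals = [sum(column) for column in zip(*cube.values())]
    return dict(zip(STORES, totals))


ham_slice = slice_product(cube, "Ham")
juice_ham_dice = dice(cube, {"Juice", "Ham"}, ("York", "Norwich"))
store_totals = roll_up_by_store(cube)

print(ham_slice)
print(juice_ham_dice)
print(store_totals)
```

The roll-up totals produced this way agree with the figures given in answer c, and the diced values agree with the table in answer b.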
p1 = (1, 2, 3)
p2 = (1, 3, 5)
p3 = (2, 3, 4)
p4 = (1, 4, 5)
p5 = (3, 2, 1)
Calculate the silhouette coefficient of point Pi. Explain your steps. [5 marks]
Answer:
a.
Centroid_x = (1 + 1 + 2 + 1 + 3)/5 = 1.6
Centroid_y = (2 + 3 + 3 + 4 + 2)/5 = 2.8
Centroid_z = (3 + 5 + 4 + 5 + 1)/5 = 3.6
So, the centroid = (1.6, 2.8, 3.6)
b.
Since point Pi is in cluster A, then a(Pi) = 24
Since the nearest cluster to Pi is cluster B, then b(Pi) is 48
The silhouette coefficient for Pi is then S(Pi) = 1 – (a(Pi)/b(Pi)) = 1 – 24/48 = 0.5
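Both calculations can be verified with a brief sketch. The five points and the values a(Pi) = 24 and b(Pi) = 48 are taken from the answer above; the silhouette function uses the general form (b - a)/max(a, b), which reduces to 1 - a/b whenever a < b, as is the case here. The function names are illustrative.

```python
import math

# The five points p1..p5 from the question.
points = [(1, 2, 3), (1, 3, 5), (2, 3, 4), (1, 4, 5), (3, 2, 1)]


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def silhouette(a, b):
    """Silhouette coefficient from the mean intra-cluster distance a
    and the mean distance b to the nearest other cluster."""
    return (b - a) / max(a, b)


c = centroid(points)
s = silhouette(24, 48)

print(c, s)
```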
End of Questions