DM-I Q Paper 2024
Semester : IV
4192
Section A

1. (a) Differentiate between the unsupervised and supervised evaluation measures used for cluster validity. (3)

(b) What is the anti-monotone property of the support measure in association rule mining? Does the confidence measure follow the anti-monotone property? (2+1) (3)

(c) Consider a dataset with two class labels, News and Entertainment, and six labeled documents D1 to D6. A new document, D7, is to be classified. The similarity values of D7 with D1, D2, D3, D4, D5 and D6 are 0.75, 0.85, 0.66, 0.87, 0.70 and 0.84 respectively. Using the k-Nearest Neighbor classifier, predict the class label that should be assigned to D7 when k = 3. Will the predicted class label change with k = 5? (4)

Document   Class Label
D1         News
D2         Entertainment
D3         Entertainment
D4         News
D5         News
D6         Entertainment

(d) Consider the given dataset, which contains six objects, each with two attributes: Age and Salary. K-means clustering is used to cluster the given objects. Do you see any issue with applying K-means to the given dataset? If yes, then state the issue. Also apply the appropriate preprocessing technique to overcome it. If no, state explicitly that no preprocessing technique is required. (4)

           Age (in years)   Salary (in rupees)
Object 1   40               62000
Object 2   24               48000
Object 3   30               54000
Object 4   35               67000
Object 5   46               80000
Object 6   34               66000

(e) Define the curse of dimensionality. The Iris flower dataset comprises 150 data points and four features, namely sepal length, sepal width, petal width, and petal length. Is it high-dimensional data or low-dimensional data? Justify your answer. (4)

(f) Consider a decision tree to classify the health of an individual as Fit or Unfit, given below:
Age < 30 ?
  Yes -> Smokes/Drinks ?
    Yes -> UnFit
    No  -> Fit
  No -> Workout ?
    Yes -> Fit
    No  -> Diet Control ?
      Yes -> Fit
      No  -> UnFit

(i) Extract all classification rules from the decision tree.
(ii) Classify the following object: Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No, Health = ? (4)

(g) State whether each of the following tasks is predictive or descriptive:
(ii) Grouping the customers of a company.
(iii) Finding a group of genes such that genes in each group have related functionality.

(h) Given two objects X = (22, 1, 42, 10) and Y = (20, 0, 36, 8), compute the distance between these two objects using the following distance measures:
(i) Euclidean Distance
(ii) Manhattan Distance (4)
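The two distance measures in part (h) can be checked with a short script; a minimal sketch, with the object tuples taken from the question:

```python
import math

X = (22, 1, 42, 10)
Y = (20, 0, 36, 8)

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(X, Y)))

# Manhattan distance: sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(X, Y))

print(round(euclidean, 2))  # 6.71 (square root of 45)
print(manhattan)            # 11
```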
Section B
(a) What is an outlier? Spot an outlier in the provided dataset. (3)

(b) What is the need for sampling in data mining? What problems arise if the sample size is too small or too large? (6)

(c) Enumerate all association rules generated from the largest frequent itemset found in each dataset scan. Compute the confidence of each generated rule. Assuming that the minimum confidence threshold is 70%, find all the strong association rules.
4. Consider the following transactional data of a grocery store:

Transaction ID   Items
T1               Boots, Hoodie, Gloves
T2               Boots, Hoodie
T3               Hoodie, Coat, Cardigan
T4               Cardigan, Coat
T5               Cardigan, Gloves
T6               Hoodie, Coat, Cardigan

(a) What is the maximum number of rules that can be extracted from this data (including rules that have

5. (a) A medical team develops classification models for predicting the occurrence of a "genetic disorder" using Classifier A and Classifier B. Patients having genetic disorders are considered positive instances. In contrast, negative instances are ones with the absence of genetic disorders. The classifiers were tested on data from 500 patients and obtained the result as:

                                   Actual Label
                                   Presence of        Absence of
                                   Genetic Disorder   Genetic Disorder
Classifier A predicted
"presence of genetic disorder"     131 (TP)           155 (FP)
(i) List the confusion matrix for Classifier A and Classifier B. Find the accuracy, precision, sensitivity, recall and specificity for each classifier. (8)

(ii) What problem may occur if the provided training dataset of 500 patients had only 15 positive instances and the remaining negative instances? Which performance measure would you choose to evaluate the classifiers in such a scenario? Which is the better classifier between Classifier A and Classifier B in such a scenario? (4)

(b) Consider a categorical attribute Grade with three values {A, B, C}. Convert this attribute to asymmetric binary attributes. (3)

6. Consider the given COVID-19 dataset of ten patients.

ID    Age           Fever   BD         Outcome
P1    Young         Yes     High       In ICU
P2    Young         No      High       Hospitalized
P3    Elderly       Yes     High       In ICU
P4    Middle aged   Yes     Moderate   In ICU
P5    Middle aged   No      High       Home Care
P6    Middle aged   Yes     Moderate   In ICU
P7    Elderly       No      Moderate   In ICU
P8    Elderly       No      High       Deceased
P9    Elderly       Yes     High       In ICU
P10   Young         No      High       Hospitalized
BD: Breathing Difficulty

(a) Compute the Gini Index of the Age, Fever, and BD attributes. Given that you construct a decision tree using the Gini Index as the splitting criterion, which of the three attributes would you choose at the root? Justify your choice. (9)

(b) Compute the Gini Index of ID. Why should it not be used as a splitting attribute for constructing a decision tree? (3)

(c) Given ten objects in the dataset (P1-P10), mention all train and test distributions for performing k-fold cross-validation. Assume the value of k = 5. (3)
Q1(d) 2 marks for stating the issue and the preprocessing technique; 1 mark for the normalized values of the Age attribute; 1 mark for the normalized values of the Salary attribute

Yes, there would be an issue in applying K-means clustering on the given data. The ranges of the given attributes Age and Salary are 24-46 and 48000-80000 respectively. The Salary attribute, with its larger values, will dominate the computation of the Euclidean distance. Using K-means on the given dataset as it is would be incorrect because of its bias towards the Salary attribute. We can apply min-max normalization on the given data before using it for K-means clustering.

Min(Age) = 24; Max(Age) = 46; Min(Salary) = 48000; Max(Salary) = 80000

Min-Max Normalization: x' = (x - min(x)) / (max(x) - min(x))
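The normalization can be sketched in code; the attribute values below are taken from the Q1(d) table:

```python
def min_max_normalize(values):
    """Scale each value to [0, 1] using min-max normalization."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [40, 24, 30, 35, 46, 34]
salaries = [62000, 48000, 54000, 67000, 80000, 66000]

norm_ages = min_max_normalize(ages)
norm_salaries = min_max_normalize(salaries)

# Object 1: Age 40 -> (40 - 24) / (46 - 24) = 0.727...
print(round(norm_ages[0], 3))  # 0.727
# Object 1: Salary 62000 -> (62000 - 48000) / 32000 = 0.4375
print(norm_salaries[0])        # 0.4375
```

After scaling, both attributes lie in [0, 1], so neither dominates the Euclidean distance.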
Q1(e) 2 marks for defining the curse of dimensionality and its associated issues; 1 mark for stating the dimensionality; 1 mark for justification of the dimensionality

It is low-dimensional data, as it has only 4 attributes with 150 data points. The number of observations is significantly larger than the number of features.
Q1(f) 3 marks for the classification rules, 1 mark for the correct prediction

Classification Rules:
(Age < 30 = Yes) ^ (Smokes/Drinks = Yes) -> Health = UnFit
(Age < 30 = Yes) ^ (Smokes/Drinks = No) -> Health = Fit
(Age < 30 = No) ^ (Workout = Yes) -> Health = Fit
(Age < 30 = No) ^ (Workout = No) ^ (Diet Control = Yes) -> Health = Fit
(Age < 30 = No) ^ (Workout = No) ^ (Diet Control = No) -> Health = UnFit

The object (Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No) will be classified as Health = UnFit.
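The extracted rules can be encoded directly as a function; a minimal sketch (the function name and keyword arguments are illustrative, the logic follows the tree):

```python
def classify_health(age, workout=None, smokes_drinks=None, diet_control=None):
    """Apply the decision-tree rules; returns 'Fit' or 'UnFit'."""
    if age < 30:
        return "UnFit" if smokes_drinks == "Yes" else "Fit"
    if workout == "Yes":
        return "Fit"
    return "Fit" if diet_control == "Yes" else "UnFit"

# The object from part (ii): Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No
print(classify_health(50, workout="No", smokes_drinks="No", diet_control="No"))  # UnFit
```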
Q1(g) ½ mark for the correct answer and 2 marks for the justification of each part
(i) Predictive
(ii) Descriptive
(iii) Descriptive
(iv) Predictive

P(Education Level = PG | Low) * P(Career = Management | Low) * P(Years of Exp = 3 to 10 | Low) * P(Low) = 2/3 * 2/6 * 2/6 * 3/5 ≈ 0.04/k (1 mark)
As 0.056/k > 0.04/k, the instance will be predicted with Salary = "High". (1 mark)
Q2(b) 1 mark for each part. There could be multiple applications for a particular data type.
(i) Market basket analysis
(ii) Weather data
(iii) Molecule data, data for web-page linking
Q3(a)(i) ½ mark for the correct type, ½ mark for the justification
ID: Nominal; Dept. Name: Nominal; Location: Nominal; Established On: Interval; Size: Ordinal; Annual Budget: Ratio
(ii) 1½ marks for the correct answer for the "Location" attribute, 1½ marks for the correct answer for the "Annual Budget" attribute.
Since {Hoodie, Coat, Cardigan} is the largest frequent itemset, we generate all the rules from it.

Rule                              Confidence
{Hoodie} -> {Coat, Cardigan}      2/4 = 0.5
{Coat} -> {Hoodie, Cardigan}      2/3 ≈ 0.67
{Cardigan} -> {Hoodie, Coat}      2/4 = 0.5
{Hoodie, Coat} -> {Cardigan}      2/2 = 1
{Hoodie, Cardigan} -> {Coat}      2/2 = 1
{Coat, Cardigan} -> {Hoodie}      2/3 ≈ 0.67

As the confidence threshold is 70%, the strong rules are: {Hoodie, Coat} -> {Cardigan} and {Hoodie, Cardigan} -> {Coat}.
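These confidences can be recomputed from the six transactions; a minimal sketch, with the transaction contents taken from Q4:

```python
from itertools import combinations

transactions = [
    {"Boots", "Hoodie", "Gloves"},   # T1
    {"Boots", "Hoodie"},             # T2
    {"Hoodie", "Coat", "Cardigan"},  # T3
    {"Cardigan", "Coat"},            # T4
    {"Cardigan", "Gloves"},          # T5
    {"Hoodie", "Coat", "Cardigan"},  # T6
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Hoodie", "Coat", "Cardigan"}
# Generate every rule A -> (itemset - A) for each non-empty proper subset A
for r in (1, 2):
    for antecedent in combinations(sorted(itemset), r):
        a = set(antecedent)
        conf = support_count(itemset) / support_count(a)
        marker = "STRONG" if conf >= 0.7 else ""
        print(f"{sorted(a)} -> {sorted(itemset - a)}: {conf:.2f} {marker}")
```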
Q5(a)(i) ½ mark each for the confusion matrix; ½ mark each for accuracy; 1 mark if both recall and sensitivity are stated; 1 mark each for precision and specificity; all of the above for both Classifier A and Classifier B

Classifier A: TP = 131, FP = 155, FN = 19, TN = 195
Accuracy = (TP + TN) / (TP + FN + FP + TN) = (131 + 195) / 500 = 65.2%
Precision = TP / (TP + FP) = 131 / 286 ≈ 45.8%
Recall / Sensitivity = TP / (TP + FN) = 131 / 150 ≈ 87.3%
Specificity = TN / (TN + FP) = 195 / 350 ≈ 55.71%

Classifier B: TP = 82, FP = 72, FN = 68, TN = 278
Accuracy = (82 + 278) / 500 = 72%
Precision = 82 / 154 ≈ 53.25%
Recall / Sensitivity = 82 / 150 ≈ 54.67%
Specificity = 278 / 350 ≈ 79.43%
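The metric formulas can be verified from the confusion-matrix counts; a minimal sketch:

```python
def metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall/sensitivity and specificity in percent."""
    total = tp + fp + fn + tn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp),
        "recall": 100 * tp / (tp + fn),        # recall is also called sensitivity
        "specificity": 100 * tn / (tn + fp),
    }

a = metrics(tp=131, fp=155, fn=19, tn=195)
b = metrics(tp=82, fp=72, fn=68, tn=278)
print({k: round(v, 2) for k, v in a.items()})  # accuracy 65.2, precision 45.8, ...
print({k: round(v, 2) for k, v in b.items()})  # accuracy 72.0, specificity 79.43, ...
```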
Q6(a) 2½ marks for computing each attribute correctly; 1½ marks for the correct choice at the root

Computation of the Gini Index for the Age attribute:
It has three possible values: Young (3 examples), Middle aged (3 examples) and Elderly (4 examples).
For Age = Young, there are 2 examples with "Hospitalized" and 1 with "Admitted to ICU".
Gini = 1 - [(2/3)² + (1/3)²] = 0.444
For Age = Middle aged, there are 2 examples with "Admitted to ICU" and 1 example with "Home Care".
Gini = 1 - [(2/3)² + (1/3)²] = 0.444
For Age = Elderly, there are 3 examples with "Admitted to ICU" and 1 example with "Deceased".
Gini = 1 - [(3/4)² + (1/4)²] = 0.375
Weighted average: 0.444 * (3/10) + 0.444 * (3/10) + 0.375 * (4/10) ≈ 0.416

Computation of the Gini Index for the Fever attribute:
It has two possible values: Yes (5 examples) and No (5 examples).
For Fever = Yes, there are 5 examples, all with "Admitted to ICU".
Gini = 1 - [(5/5)²] = 0
For Fever = No, there are 2 examples with "Hospitalized" and 1 example each with "Deceased", "Home Care" and "Admitted to ICU".
Gini = 1 - [(2/5)² + (1/5)² + (1/5)² + (1/5)²] = 0.72
Weighted average: 0 * (5/10) + 0.72 * (5/10) = 0.36

Computation of the Gini Index for the Breathing Difficulty attribute:
It has two possible values: High (7 examples) and Moderate (3 examples).
For Breathing Difficulty = Moderate, there are 3 examples, all with "Admitted to ICU".
Gini = 1 - [(3/3)²] = 0
For Breathing Difficulty = High, there are 2 examples with "Hospitalized", 3 examples with "Admitted to ICU", and 1 example each with "Home Care" and "Deceased".
Gini = 1 - [(2/7)² + (3/7)² + (1/7)² + (1/7)²] ≈ 0.694
Weighted average: 0 * (3/10) + 0.694 * (7/10) ≈ 0.486

The Fever attribute is selected at the root as it has the smallest weighted Gini index.
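The three weighted Gini values can be reproduced in code; a minimal sketch, where each tuple encodes one row of the Q6 table as (Age, Fever, BD, Outcome). Note that with exact arithmetic the Age value is 0.417 (= 5/12); the 0.416 above comes from the rounded per-group value 0.444.

```python
from collections import Counter

# (Age, Fever, Breathing Difficulty, Outcome) for patients P1-P10
patients = [
    ("Young", "Yes", "High", "In ICU"),
    ("Young", "No", "High", "Hospitalized"),
    ("Elderly", "Yes", "High", "In ICU"),
    ("Middle aged", "Yes", "Moderate", "In ICU"),
    ("Middle aged", "No", "High", "Home Care"),
    ("Middle aged", "Yes", "Moderate", "In ICU"),
    ("Elderly", "No", "Moderate", "In ICU"),
    ("Elderly", "No", "High", "Deceased"),
    ("Elderly", "Yes", "High", "In ICU"),
    ("Young", "No", "High", "Hospitalized"),
]

def gini(outcomes):
    """Gini index of a list of class labels: 1 minus the sum of squared proportions."""
    n = len(outcomes)
    return 1 - sum((c / n) ** 2 for c in Counter(outcomes).values())

def weighted_gini(attr_index):
    """Size-weighted Gini of the split induced by one attribute column."""
    groups = {}
    for row in patients:
        groups.setdefault(row[attr_index], []).append(row[3])
    n = len(patients)
    return sum(len(g) / n * gini(g) for g in groups.values())

for name, idx in [("Age", 0), ("Fever", 1), ("BD", 2)]:
    print(name, round(weighted_gini(idx), 3))  # Age 0.417, Fever 0.36, BD 0.486
```

Fever has the smallest weighted Gini, matching the root choice above.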
Q6(b) Computation of the Gini Index: 2 marks; reason for not selecting ID: 1 mark
The Gini for each ID value is 0; therefore, the overall Gini for ID is 0. The ID attribute has no predictive power, since new patients will be allocated to new IDs.
Q6(c) Fold 1: (P1, P2), Fold 2: (P3, P4), Fold 3: (P5, P6), Fold 4: (P7, P8), Fold 5: (P9, P10)
Train: Fold1, Fold2, Fold3, Fold4; Test: Fold5
Train: Fold2, Fold3, Fold4, Fold5; Test: Fold1
Train: Fold1, Fold3, Fold4, Fold5; Test: Fold2
Train: Fold1, Fold2, Fold4, Fold5; Test: Fold3
Train: Fold1, Fold2, Fold3, Fold5; Test: Fold4
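The fold assignments can be generated programmatically; a minimal sketch, assuming consecutive patients are grouped into folds of equal size:

```python
patients = [f"P{i}" for i in range(1, 11)]
k = 5
size = len(patients) // k  # 2 patients per fold

# Fold 1 = (P1, P2), Fold 2 = (P3, P4), ..., Fold 5 = (P9, P10)
folds = [patients[i * size:(i + 1) * size] for i in range(k)]

# Each fold serves as the test set exactly once
for test_idx in range(k):
    train = [p for i, f in enumerate(folds) if i != test_idx for p in f]
    test = folds[test_idx]
    print(f"Train: {train}  Test: {test}")
```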
Q7 Iteration 1 and resulting clusters: 5 marks; computing new cluster centroids: 2 marks; Iteration 2 and resulting clusters: 5 marks; computing SSE: 3 marks

Given K = 2, Instance 1 -> C1, Instance 2 -> C2.

Iteration 1 (5 marks):
Instance   Number of Clients   Annual Turnover   Distance from C1   Distance from C2   Assigned Cluster
I1         185                 72                —                  —                  C1
I2         170                 56                —                  —                  C2
I3         168                 60                20.80              4.47               C2
I4         179                 68                7.21               15                 C1
I5         182                 72                3                  20                 C1
I6         188                 77                5.83               27.65              C1

Resulting clusters after the first iteration: Cluster 1: {I1, I4, I5, I6}; Cluster 2: {I2, I3}

Computation of new centroids (2 marks):
C1 = ((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
C2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58)

Iteration 2 (5 marks):
Instance   Number of Clients   Annual Turnover   Distance from C1   Distance from C2   Assigned Cluster
I1         185                 72                1.52               21.26              C1
I2         170                 56                21.12              2.23               C2
I3         168                 60                19.75              2.23               C2
I4         179                 68                6.18               14.14              C1
I5         182                 72                1.52               19.10              C1
I6         188                 77                6.54               26.87              C1

Resulting clusters after the second iteration: Cluster 1: {I1, I4, I5, I6}; Cluster 2: {I2, I3}. There is no change in the clusters, so we stop.

Computing the SSE (3 marks):
SSE of Cluster 1 = (distance of I1 from C1)² + (distance of I4 from C1)² + (distance of I5 from C1)² + (distance of I6 from C1)²
= (1.52)² + (6.18)² + (1.52)² + (6.54)² = 2.3104 + 38.1924 + 2.3104 + 42.7716 = 85.5848
SSE of Cluster 2 = (distance of I2 from C2)² + (distance of I3 from C2)² = (2.23)² + (2.23)² = 9.9458
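The two K-means iterations can be replayed in code; a minimal sketch, with the six instances from Q7. Note that with exact arithmetic the SSE values are 85.75 and 10; the figures above (85.5848 and 9.9458) come from squaring the rounded distances.

```python
import math

points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
c1, c2 = points[0], points[1]  # initial centroids: I1 and I2

def dist(p, q):
    """Euclidean distance between two points."""
    return math.dist(p, q)

for _ in range(2):  # two iterations are enough to converge on this data
    # Assign each point to the nearer centroid (ties go to cluster 1)
    cluster1 = [p for p in points if dist(p, c1) <= dist(p, c2)]
    cluster2 = [p for p in points if dist(p, c1) > dist(p, c2)]
    # Recompute each centroid as the mean of its cluster
    c1 = tuple(sum(v) / len(cluster1) for v in zip(*cluster1))
    c2 = tuple(sum(v) / len(cluster2) for v in zip(*cluster2))

sse1 = sum(dist(p, c1) ** 2 for p in cluster1)
sse2 = sum(dist(p, c2) ** 2 for p in cluster2)
print(c1, c2)                            # (183.5, 72.25) (169.0, 58.0)
print(round(sse1, 2), round(sse2, 2))    # 85.75 10.0
```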