DM-I Q Paper 2024

This document is a question paper for a Data Mining I course, containing various sections and questions related to data mining concepts, techniques, and applications. Students are required to answer a compulsory question and select additional questions from other sections, covering topics such as clustering, classification, and association rules. The paper includes practical tasks, theoretical questions, and data analysis scenarios to assess students' understanding of data mining principles.

Uploaded by

viyaasingh66
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views12 pages

DM-I Q Paper 2024

This document is a question paper for a Data Mining I course, containing various sections and questions related to data mining concepts, techniques, and applications. Students are required to answer a compulsory question and select additional questions from other sections, covering topics such as clustering, classification, and association rules. The paper includes practical tasks, theoretical questions, and data analysis scenarios to assess students' understanding of data mining principles.

Uploaded by

viyaasingh66
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12


[This question paper contains 12 printed pages.]

Your Roll No. ...

Sr. No. of Question Paper : 4192
Unique Paper Code : 2343012005
Name of the Paper : Data Mining I
Name of the Course : B.Sc. (Hons.) Computer Science
Semester : IV
Duration : 3 Hours          Maximum Marks : 90

Instructions for Candidates

1. Write your Roll No. on the top immediately on receipt of this question paper.

2. Section A (Question No. 1) is compulsory.

3. Attempt any four questions from Section B (Questions 2 to 7).

4. The use of a simple calculator is allowed.

5. Parts of a question must be answered together.

Section A

1. (a) Differentiate between the unsupervised and supervised evaluation measures used for cluster validity. (3)

(b) What is the anti-monotone property of the support measure in association rule mining? Does the confidence measure follow the anti-monotone property? (3)

(c) Consider a dataset with two class labels, News and Entertainment, and six labeled documents D1 to D6. A new document, D7, is to be classified. The similarity values of D7 with D1, D2, D3, D4, D5 and D6 are 0.75, 0.85, 0.66, 0.87, 0.70 and 0.84 respectively. Using the k-Nearest Neighbor classifier, predict the class label that should be assigned to D7 when k=3. Will the predicted class label change with k=5? (4)

Document    Class Label
D1          News
D2          Entertainment
D3          Entertainment
D4          News
D5          News
D6          Entertainment

(d) Consider the given dataset, which contains six objects, each with two attributes: Age and Salary. K-means clustering is used to cluster the given objects. Do you see any issue with applying K-means to the given dataset? If yes, then state the issue. Also apply the appropriate preprocessing technique to overcome it. If no, state explicitly that no preprocessing technique is required. (4)

            Age (in years)    Salary (in rupees)
Object 1    40                62000
Object 2    24                48000
Object 3    30                54000
Object 4    35                67000
Object 5    46                80000
Object 6    34                66000

(e) Define the curse of dimensionality. The Iris flower dataset comprises 150 data points and four features, namely sepal length, sepal width, petal width, and petal length. Is it high-dimensional data or low-dimensional data? Justify your answer. (4)

(f) Consider a decision tree to classify the health of an individual as Fit or Unfit, given below:

Age < 30 ?
  Yes → Smokes/Drinks ?
    Yes → UnFit
    No  → Fit
  No  → Workout ?
    Yes → Fit
    No  → Diet Control ?
      Yes → Fit
      No  → UnFit

(i) Extract all classification rules from the decision tree.

(ii) Classify the following object:
Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No, Health = ? (4)

(g) Classify the following tasks as "predictive" or "descriptive". Justify your answer. (4)

(i) Foretelling whether an online user will shop on Flipkart for a specific item.

(ii) Grouping the customers of a company according to their buying interests.

(iii) Finding a group of genes such that genes in each group have related functionality.

(iv) Using historical data from previous financial statements to project sales, revenue, and expenses for a company.

(h) Given two objects X = (22, 1, 42, 10) and Y = (20, 0, 36, 8), compute the distance between these two objects using the following distance measures: (4)

(i) Euclidean Distance

(ii) Manhattan Distance

Section B

2. (a) Given the following training dataset, compute all class conditional and prior probabilities. Use the Naive Bayes approach to predict the class label (Salary) for the test instance: Education Level = PG, Career = Management, Years of Experience = 3 to 10. (12)

Education Level    Career        Years of Experience    Salary
UG                 Management    Less than 3            Low
UG                 Management    3 to 10                Low
PG                 Management    Less than 3            High
PG                 Service       More than 10           Low
UG                 Service       3 to 10                Low
PG                 Service       3 to 10                High
PG                 Management    More than 10           High
PG                 Service       Less than 3            Low
UG                 Management    More than 10           High
UG                 Service       More than 10           Low

(b) A data mining application uses a particular type of data. Give one application for each of the following types: (3)

(i) Sparse dataset

(ii) Spatio-Temporal data

(iii) Graph-based data

3. (a) Consider the following dataset having details about different departments of a company:

ID      Dept. Name                  Location       Established On    Size      Annual Budget
DP12    Finance                     Nehru Place    5-01-2020         Large     460
DP19    Marketing                   Nehru Place    8-08-2020         Medium    300
DP21    Human Resource Management   Hauz Khas      2-01-2020         Medium    240
DP27    Production Management                      2-02-2020         Medium    290
DP33    Research Development        Nehru Place    4-07-2021         Small     90
DP39    Information Technology      Hauz Khas      6-08-2020         Medium    210
DP41    Sales                       Nehru Place    9-09-2020         Large     510
DP52    Customer Service            Hauz Khas      2-10-2020         Medium
DP55    Public Relations            Nehru Place    3-03-2021         Large     900

* Annual Budget is in Lakhs

(i) Identify the type of the attributes ID, Dept. Name, Location, Established On, Size, and Annual Budget as nominal, ordinal, interval, or ratio. Give justification for each. (6)

(ii) Suggest a technique for dealing with missing values in the attribute Location. Will the same technique apply to the attribute Annual Budget? Justify. (3)
(iii) What is an outlier? Spot an outlier in the provided dataset. (3)

(b) What is the need for sampling in data mining? What problems arise if the sample size is too small or too large? (3)

4. Consider the following transactional data of a grocery store:

Transaction ID    Items
T1                Boots, Hoodie, Gloves
T2                Boots, Hoodie
T3                Hoodie, Coat, Cardigan
T4                Cardigan, Coat
T5                Cardigan, Gloves
T6                Hoodie, Coat, Cardigan

(a) What is the maximum number of rules that can be extracted from this data (including rules that have zero support)? (3)

(b) Use the Apriori algorithm on the given transactional dataset and compute the candidate and frequent itemsets for each dataset scan. Assume a support threshold of 33.34%. (6)

(c) Enumerate all association rules generated from the largest frequent itemset found in each dataset scan. Compute the confidence of each generated rule. Assuming that the minimum confidence threshold is 70%, find all the strong association rules. (6)

5. (a) A medical team develops classification models for predicting the occurrence of a "genetic disorder" using Classifier A and Classifier B. Patients having genetic disorders are considered positive instances. In contrast, negative instances are ones with the absence of genetic disorders. The classifiers were tested on data from 500 patients and obtained the following results:

                                                 Actual Label
                                     Presence of Genetic    Absence of Genetic
                                     Disorder               Disorder
Classifier A, predicted
"presence of genetic disorder"       131 (TP)               155 (FP)
Classifier A, predicted
"absence of genetic disorder"        19 (FN)                195 (TN)
Classifier B, predicted
"presence of genetic disorder"       82                     72
Classifier B, predicted
"absence of genetic disorder"        68                     278

(i) List the confusion matrix for Classifier A and Classifier B. Find the accuracy, precision, sensitivity, recall and specificity for each classifier. (8)

(ii) What problem may occur if the provided training dataset of 500 patients had only 15 positive instances and the remaining negative instances? Which performance measure would you choose to evaluate the classifiers in such a scenario? Which is the better classifier between Classifier A and Classifier B in such a scenario? (4)

(b) Consider a categorical attribute Grade with three values {A, B, and C}. Convert this attribute to asymmetric binary attributes. (3)

6. Consider the given COVID-19 dataset of ten patients.

ID     Age           Fever    BD          Outcome
P1     Young         Yes      High        In ICU
P2     Young         No       High        Hospitalized
P3     Elderly       Yes      High        In ICU
P4     Middle-aged   Yes      Moderate    In ICU
P5     Middle-aged   No       High        Home Care
P6     Middle-aged   Yes      Moderate    In ICU
P7     Elderly       No       Moderate    In ICU
P8     Elderly       No       High        Deceased
P9     Elderly       Yes      High        In ICU
P10    Young         No       High        Hospitalized

BD: Breathing Difficulty

(a) Compute the Gini Index of the Age, Fever, and BD attributes. Given that you construct a decision tree using the Gini Index as the splitting criterion, which of the three attributes would you choose at the root? Justify your choice. (9)

(b) Compute the Gini Index of ID. Why should it not be used as a splitting attribute for constructing a decision tree? (3)

(c) Given ten objects in the dataset (P1 - P10), mention all train and test distributions for performing k-fold cross-validation. Assume the value of k = 5. (3)
7. Given a dataset with six records about startup companies, each record has two fields: Number of Clients and Annual Turnover. Assuming that k = 2 and the initial cluster centres are the first two records, compute the cluster centres of the resulting clusters until the stopping criterion is met. Use Euclidean distance as the distance metric. Also, compute the SSE (Sum of Squared Error) of each generated cluster.

Number of Clients    Annual Turnover (in Lakhs)
185                  72
170                  56
168                  60
179                  68
182                  72
188                  77
May-June 2024
B.Sc. (Hons.) Computer Science
Unique Paper Code: 2343012005
S. No. of Ques Paper: 4192
Data Mining-I Solution Set
Section A
Q1 (a) Unsupervised measures evaluate the goodness of a clustering structure without respect to external information. They are often called internal indices because they use only information present in the dataset. Examples: SSE, cluster cohesion or compactness, cluster separation or isolation.
Supervised measures evaluate the clustering structure discovered by a clustering algorithm with respect to some external structure. They are often called external indices because they use information not present in the dataset. Example: Entropy.
(2 marks for proper difference, 1/2 mark each for a correct example)
Q1 (b) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e. X ⊂ Y, we have f(Y) ≤ f(X). Support (s) follows the anti-monotone property, as the support of an itemset never exceeds the support of its subsets:

∀X,Y: (X ⊆ Y) → s(X) ≥ s(Y)    (2 marks)

The confidence measure does not follow the anti-monotone property. (1 mark)
Q1 (c) When k=3, the nearest neighbors are D2, D4 and D6. The label "Entertainment" should be assigned to D7. (2 marks: 1 mark for correct nearest neighbors, 1 mark for correct label)
When k=5, the nearest neighbors are D1, D2, D4, D5 and D6. The label "News" should be assigned to D7. (2 marks: 1 mark for listing the correct nearest neighbors, 1 mark for correct label)
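The two votes above can be reproduced with a short sketch (the similarity values and labels are taken from the question; the helper name `knn_predict` is ours):

```python
# k-NN by similarity: pick the k most similar documents, take a majority vote.
similarities = {"D1": 0.75, "D2": 0.85, "D3": 0.66,
                "D4": 0.87, "D5": 0.70, "D6": 0.84}
labels = {"D1": "News", "D2": "Entertainment", "D3": "Entertainment",
          "D4": "News", "D5": "News", "D6": "Entertainment"}

def knn_predict(k):
    # Sort neighbors by similarity, descending, and keep the top k.
    top_k = sorted(similarities, key=similarities.get, reverse=True)[:k]
    votes = [labels[d] for d in top_k]
    # Majority class among the k nearest neighbors.
    return max(set(votes), key=votes.count)

print(knn_predict(3))  # Entertainment (neighbors D4, D2, D6)
print(knn_predict(5))  # News (D4, D2, D6, D1, D5: 3 News vs 2 Entertainment)
```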

Q1 (d) 2 marks for stating the issue and the preprocessing technique, 1 mark for the normalized values of the Age attribute, 1 mark for the normalized values of the Salary attribute.
Yes, there would be an issue in applying K-means clustering to the given data. The ranges of the given attributes Age and Salary are 24-46 and 48000-80000 respectively. The Salary attribute, with larger values, will dominate the computation of the Euclidean distance. Using K-means on the given dataset as it is would be incorrect because of its bias towards the Salary attribute. We can apply min-max normalization on the given data before using it for K-means clustering.
Min (Age) = 24; Max (Age) = 46; Min (Salary) = 48000; Max (Salary) = 80000

Min-Max Normalization: x' = (x - min(x)) / (max(x) - min(x))

            Age    Normalized Age    Salary    Normalized Salary
Object 1    40     0.727273          62000     0.4375
Object 2    24     0                 48000     0
Object 3    30     0.272727          54000     0.1875
Object 4    35     0.5               67000     0.59375
Object 5    46     1                 80000     1
Object 6    34     0.454545          66000     0.5625
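The normalized columns can be checked with a minimal sketch of the min-max formula (the helper name `min_max` is ours):

```python
# Min-max normalization: x' = (x - min) / (max - min), applied per attribute.
ages = [40, 24, 30, 35, 46, 34]
salaries = [62000, 48000, 54000, 67000, 80000, 66000]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

norm_ages = min_max(ages)          # e.g. 40 -> (40 - 24) / 22 ≈ 0.727273
norm_salaries = min_max(salaries)  # e.g. 62000 -> 14000 / 32000 = 0.4375
```

The minimum of each attribute maps to 0 and the maximum to 1, which is why Objects 2 and 5 take those extreme values in the table.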

Q1 (e) 2 marks for defining the curse of dimensionality and its associated issues; 1 mark for stating the dimensionality; 1 mark for justification of the dimensionality.
It is low-dimensional data, as it has only 4 attributes with 150 data points. The number of observations is significantly larger than the number of features.
Q1 (f) 3 marks for classification rules, 1 mark for correct prediction.
Classification Rules:
(Age < 30 = Yes) ∧ (Smokes/Drinks = Yes) → Health = UnFit
(Age < 30 = Yes) ∧ (Smokes/Drinks = No) → Health = Fit
(Age < 30 = No) ∧ (Workout = Yes) → Health = Fit
(Age < 30 = No) ∧ (Workout = No) ∧ (Diet Control = Yes) → Health = Fit
(Age < 30 = No) ∧ (Workout = No) ∧ (Diet Control = No) → Health = UnFit
The object Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No will be classified as Health = UnFit.
Q1 (g) (1/2 mark for the correct answer and 1/2 mark for the justification of each part)
(i) Predictive
(ii) Descriptive
(iii) Descriptive
(iv) Predictive

Q1 (h) 2 marks for Euclidean distance (formula + computation), 2 marks for Manhattan distance (formula + computation).
(i) Distance(X, Y) = √((22-20)² + (1-0)² + (42-36)² + (10-8)²) = √45 ≈ 6.70
(ii) Distance(X, Y) = |22-20| + |1-0| + |42-36| + |10-8| = 11
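Both distances follow directly from their definitions; a quick sketch for the two objects in the question:

```python
import math

# Euclidean and Manhattan distances between the two objects from Q1(h).
X = (22, 1, 42, 10)
Y = (20, 0, 36, 8)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(X, Y)))  # sqrt(45) ≈ 6.7
manhattan = sum(abs(a - b) for a, b in zip(X, Y))               # 2 + 1 + 6 + 2 = 11
```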
Section B
Q2 (a) 8 marks if all the mentioned probabilities are present and correct (1/2 mark for each)
P (Salary = Low) = 6/10 = 3/5
P (Salary = High) = 4/10 = 2/5
P (Education Level = UG | Salary = Low) = 4/6 = 2/3
P (Education Level = UG | Salary = High) = 1/4
P (Education Level = PG | Salary = Low) = 2/6 = 1/3
P (Education Level = PG | Salary = High) = 3/4
P (Career = Management | Salary = High) = 3/4
P (Career = Management | Salary = Low) = 2/6 = 1/3
P (Career = Service | Salary = High) = 1/4
P (Career = Service | Salary = Low) = 4/6 = 2/3
P (Years of Experience = Less than 3 | Salary = High) = 1/4
P (Years of Experience = 3 to 10 | Salary = High) = 1/4
P (Years of Experience = More than 10 | Salary = High) = 2/4 = 1/2
P (Years of Experience = Less than 3 | Salary = Low) = 2/6 = 1/3
P (Years of Experience = 3 to 10 | Salary = Low) = 2/6 = 1/3
P (Years of Experience = More than 10 | Salary = Low) = 2/6 = 1/3
Let P(Education Level = PG, Career = Management, Years of Experience = 3 to 10) = k. (1 mark)
P(Education Level = PG | High) * P(Career = Management | High) * P(Years of Exp = 3 to 10 | High) * P(High) = 3/4 * 3/4 * 1/4 * 2/5 = 0.056 / k (1 mark)
P(Education Level = PG | Low) * P(Career = Management | Low) * P(Years of Exp = 3 to 10 | Low) * P(Low) = 1/3 * 1/3 * 1/3 * 3/5 = 0.022 / k (1 mark)
As 0.056/k > 0.022/k, the instance will be predicted with Salary = "High". (1 mark)
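The prediction can be verified by counting directly on the training table (a sketch; the function name `nb_score` is ours):

```python
# Naive Bayes by direct counting on the Q2(a) training data.
# Each row: (Education Level, Career, Years of Experience, Salary).
rows = [
    ("UG", "Management", "Less than 3", "Low"),
    ("UG", "Management", "3 to 10", "Low"),
    ("PG", "Management", "Less than 3", "High"),
    ("PG", "Service", "More than 10", "Low"),
    ("UG", "Service", "3 to 10", "Low"),
    ("PG", "Service", "3 to 10", "High"),
    ("PG", "Management", "More than 10", "High"),
    ("PG", "Service", "Less than 3", "Low"),
    ("UG", "Management", "More than 10", "High"),
    ("UG", "Service", "More than 10", "Low"),
]

def nb_score(test, label):
    # P(label) times the product of P(attribute value | label).
    in_class = [r for r in rows if r[-1] == label]
    score = len(in_class) / len(rows)  # prior probability
    for i, value in enumerate(test):
        match = sum(1 for r in in_class if r[i] == value)
        score *= match / len(in_class)  # class-conditional probability
    return score

test = ("PG", "Management", "3 to 10")
high = nb_score(test, "High")  # 2/5 * 3/4 * 3/4 * 1/4 = 0.05625
low = nb_score(test, "Low")    # 3/5 * 1/3 * 1/3 * 1/3 ≈ 0.0222
print("High" if high > low else "Low")  # High
```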
Q2 (b) 1 mark for each part. There could be multiple applications for a particular data type.
(i) Market basket analysis
(ii) Weather data
(iii) Molecule data, data for web page linking
Q3 (a) (i) 1/2 mark for the correct type, 1/2 mark for the justification.
ID: Nominal; Dept. Name: Nominal; Location: Nominal; Established On: Interval; Size: Ordinal; Annual Budget: Ratio

(ii) 1 1/2 marks for the correct answer for the Location attribute, 1 1/2 marks for the correct answer for the Annual Budget attribute.
You can ignore the missing value in the attribute Location. For the missing value in Annual Budget, you can replace it with the mean value of the attribute Annual Budget. (The mean used as the replacement value can also be calculated over only the departments whose size is "Medium".)

(iii) 1 1/2 marks for the definition of an outlier, 1 1/2 marks for spotting the outlier.
An outlier is a point that differs significantly from the other observations in the dataset. In the given dataset, DP55 can be regarded as an outlier, with an exceptionally high value in the Annual Budget attribute.
Q3 (b) 1 mark for the need for sampling, 1 mark for problems with a large sample size, 1 mark for problems with a small sample size.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming. In some cases, using a sample can reduce the dataset size to the point where a better but more expensive algorithm can be used.
A large sample size increases the probability that the sample will be representative, but it eliminates much of the advantage of sampling. With smaller sample sizes, patterns may be missed or erroneous patterns may be detected.

Q4 (a) 1 mark for stating the unique items, 1 mark for the correct formula, 1 mark for the correct calculation.
There are 5 unique items in the given transactional data (Boots, Hoodie, Gloves, Coat, Cardigan). Let the number of unique items be d. The total number of rules that can be made from these items is 3^d - 2^(d+1) + 1 = 3^5 - 2^6 + 1 = 243 - 64 + 1 = 180. Thus, a total of 180 rules can be made.
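The counting formula is a one-liner to check (the helper name `max_rules` is ours):

```python
# Number of possible association rules over d items: 3^d - 2^(d+1) + 1.
# Each item is in the antecedent, the consequent, or absent (3^d splits),
# minus the splits with an empty antecedent or an empty consequent.
def max_rules(d):
    return 3 ** d - 2 ** (d + 1) + 1

print(max_rules(5))  # 180
```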
Q4 (b) 2 marks for each correct dataset scan.
Support Threshold = 33.34% => at least 2 transactions.

For k = 1:
Item        Support Count
Boots       2
Hoodie      4
Gloves      2
Coat        3
Cardigan    4
All items qualify the minimum support threshold.

For k = 2:
Item                    Support Count    Frequent
{Boots, Hoodie}         2                Yes
{Boots, Gloves}         1                No
{Boots, Coat}           0                No
{Boots, Cardigan}       0                No
{Hoodie, Gloves}        1                No
{Hoodie, Coat}          2                Yes
{Hoodie, Cardigan}      2                Yes
{Gloves, Coat}          0                No
{Gloves, Cardigan}      1                No
{Coat, Cardigan}        3                Yes

For k = 3:
Item                        Support Count    Frequent
{Hoodie, Coat, Cardigan}    2                Yes

No need to go for k = 4, since the largest transaction has only 3 items.
Q4 (c) 3 marks for generating all the rules, 2 marks for generating all the confidence scores, 1 mark for listing the strong rules.

Since {Hoodie, Coat, Cardigan} is the largest frequent itemset, we will generate all the rules from it.

Rule                            Confidence
{Hoodie} → {Coat, Cardigan}     2/4 = 0.5
{Coat} → {Hoodie, Cardigan}     2/3 ≈ 0.66
{Cardigan} → {Hoodie, Coat}     2/4 = 0.5
{Hoodie, Coat} → {Cardigan}     2/2 = 1
{Hoodie, Cardigan} → {Coat}     2/2 = 1
{Coat, Cardigan} → {Hoodie}     2/3 ≈ 0.66

As the confidence threshold is 70%, the strong rules are: {Hoodie, Coat} → {Cardigan} and {Hoodie, Cardigan} → {Coat}.
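The supports and confidences above can be brute-forced from the six transactions (a sketch; the helper names are ours):

```python
from itertools import combinations

# Transactions from Q4; compute support counts and rule confidences directly.
transactions = [
    {"Boots", "Hoodie", "Gloves"},
    {"Boots", "Hoodie"},
    {"Hoodie", "Coat", "Cardigan"},
    {"Cardigan", "Coat"},
    {"Cardigan", "Gloves"},
    {"Hoodie", "Coat", "Cardigan"},
]

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if set(itemset) <= t)

# All rules from the largest frequent itemset {Hoodie, Coat, Cardigan}.
itemset = {"Hoodie", "Coat", "Cardigan"}
strong = []
for r in (1, 2):
    for antecedent in combinations(sorted(itemset), r):
        consequent = itemset - set(antecedent)
        conf = support(itemset) / support(antecedent)
        if conf >= 0.7:
            strong.append((set(antecedent), consequent))
# strong: {Hoodie, Coat} -> {Cardigan} and {Hoodie, Cardigan} -> {Coat}
```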
Q5 (a) (i) 1/2 mark each for the confusion matrix; 1/2 mark each for accuracy; 1 mark if stated both recall and sensitivity; 1 mark each for precision and specificity; all of the above for both Classifier A and Classifier B.

Classifier A
TP = 131    FP = 155
FN = 19     TN = 195

Accuracy = (TP+TN) / (TP+FN+FP+TN) = (131+195)/500 = 65.2%
Precision = TP / (TP+FP) = 131/286 = 45.8%
Recall / Sensitivity = TP / (TP+FN) = 131/150 = 87.3%
Specificity = TN / (TN+FP) = 195/350 = 55.71%

Classifier B
TP = 82    FP = 72
FN = 68    TN = 278

Accuracy = (TP+TN) / (TP+FN+FP+TN) = (82+278)/500 = 72%
Precision = TP / (TP+FP) = 82/154 = 53.24%
Recall / Sensitivity = TP / (TP+FN) = 82/150 = 54.6%
Specificity = TN / (TN+FP) = 278/350 = 79.42%

(ii) Marking scheme: 1 mark for mentioning class imbalance, 1 mark for mentioning the F1-score, 1 mark for calculating the F1-score of Classifier A and Classifier B, 1 mark for mentioning the better classifier.
The problem of class imbalance would occur if, in a dataset of 500 patients, there were only 15 positive instances. In such a scenario, the F-measure / F1-score should be evaluated.
F1-score = 2 * Precision * Recall / (Precision + Recall)
F1-score of Classifier A = 60.08    F1-score of Classifier B = 53.91
Classifier A is better in such a scenario.
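All the metrics in part (i) and (ii) can be derived from the four confusion-matrix cells; a small sketch (the helper name `metrics` is ours, and tiny differences from the hand-rounded percentages above are expected):

```python
# Confusion-matrix metrics for the two classifiers in Q5(a).
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

a = metrics(tp=131, fp=155, fn=19, tn=195)
b = metrics(tp=82, fp=72, fn=68, tn=278)
print(round(a["accuracy"], 3), round(b["accuracy"], 2))  # 0.652 0.72
```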
Q5 (b) Conversion to asymmetric binary attributes (3 marks):

Categorical Value    X1    X2    X3
A                    1     0     0
B                    0     1     0
C                    0     0     1

Q6. 2 1/2 marks for computing each attribute correctly, 1 1/2 marks for the correct choice at the root.
(a) Computation of the Gini Index for the Age attribute:
It has three possible values: Young (3 examples), Middle-aged (3 examples) and Elderly (4 examples).
For Age = Young, there are 2 examples with "Hospitalized" and 1 with "Admitted to ICU".
Gini(S) = 1 - [(2/3)² + (1/3)²] = 0.444
For Age = Middle-aged, there are 2 examples with "Admitted to ICU" and 1 example with "Home Care".
Gini(S) = 1 - [(2/3)² + (1/3)²] = 0.444
For Age = Elderly, there are 3 examples with "Admitted to ICU" and 1 example with "Deceased".
Gini(S) = 1 - [(3/4)² + (1/4)²] = 0.375
Weighted Average: 0.444 * (3/10) + 0.444 * (3/10) + 0.375 * (4/10) = 0.416

Computation of the Gini Index for the Fever attribute:
It has two possible values: Yes (5 examples) and No (5 examples).
For Fever = Yes, there are 5 examples, all with "Admitted to ICU".
Gini(S) = 1 - [(5/5)²] = 0
For Fever = No, there are 2 examples with "Hospitalized" and 1 example each with "Deceased", "Home Care" and "Admitted to ICU".
Gini(S) = 1 - [(2/5)² + (1/5)² + (1/5)² + (1/5)²] = 0.72
Weighted Average: 0 * (5/10) + 0.72 * (5/10) = 0.36

Computation of the Gini Index for the Breathing Difficulty attribute:
It has two possible values: High (7 examples) and Moderate (3 examples).
For Breathing Difficulty = Moderate, there are 3 examples, all with "Admitted to ICU".
Gini(S) = 1 - [(3/3)²] = 0
For Breathing Difficulty = High, there are 2 examples with "Hospitalized", 3 examples with "Admitted to ICU" and 1 example each with "Home Care" and "Deceased".
Gini(S) = 1 - [(2/7)² + (3/7)² + (1/7)² + (1/7)²] = 0.694
Weighted Average: 0 * (3/10) + 0.694 * (7/10) = 0.486

The Fever attribute is selected as it has the smallest Gini index.
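The three weighted Gini values can be recomputed from the patient table (a sketch; helper names are ours, and the per-value impurities are kept exact rather than rounded to three decimals as in the hand computation):

```python
from collections import Counter

# COVID-19 dataset from Q6: (Age, Fever, Breathing Difficulty, Outcome).
patients = [
    ("Young", "Yes", "High", "In ICU"),
    ("Young", "No", "High", "Hospitalized"),
    ("Elderly", "Yes", "High", "In ICU"),
    ("Middle-aged", "Yes", "Moderate", "In ICU"),
    ("Middle-aged", "No", "High", "Home Care"),
    ("Middle-aged", "Yes", "Moderate", "In ICU"),
    ("Elderly", "No", "Moderate", "In ICU"),
    ("Elderly", "No", "High", "Deceased"),
    ("Elderly", "Yes", "High", "In ICU"),
    ("Young", "No", "High", "Hospitalized"),
]

def gini(outcomes):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(outcomes)
    return 1 - sum((c / n) ** 2 for c in Counter(outcomes).values())

def weighted_gini(attr_index):
    # Weighted average of the Gini impurity over the attribute's partitions.
    values = {row[attr_index] for row in patients}
    total = 0.0
    for v in values:
        subset = [row[-1] for row in patients if row[attr_index] == v]
        total += len(subset) / len(patients) * gini(subset)
    return total

for name, i in [("Age", 0), ("Fever", 1), ("BD", 2)]:
    print(name, weighted_gini(i))  # Fever has the smallest value (0.36)
```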
(b) Computation of the Gini Index: 2 marks; reason for not selecting ID: 1 mark.
The Gini for each ID value is 0. Therefore, the overall Gini for ID is 0. The ID attribute has no predictive power, as new patients will be allocated new IDs.
(c) Fold 1: (P1, P2), Fold 2: (P3, P4), Fold 3: (P5, P6), Fold 4: (P7, P8), Fold 5: (P9, P10)
Train: Fold1, Fold2, Fold3, Fold4    Test: Fold5
Train: Fold2, Fold3, Fold4, Fold5    Test: Fold1
Train: Fold1, Fold3, Fold4, Fold5    Test: Fold2
Train: Fold1, Fold2, Fold4, Fold5    Test: Fold3
Train: Fold1, Fold2, Fold3, Fold5    Test: Fold4
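The five train/test splits can be generated mechanically (a sketch; the fold layout follows the listing above, with consecutive pairs of objects per fold):

```python
# 5-fold cross-validation splits over P1..P10, two objects per fold.
objects = [f"P{i}" for i in range(1, 11)]
folds = [objects[i:i + 2] for i in range(0, 10, 2)]

for test_fold in folds:
    # Train on the other four folds; each object is tested exactly once.
    train = [o for fold in folds if fold is not test_fold for o in fold]
    print("Train:", train, "Test:", test_fold)
```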
Q7. Iteration 1 and the resulting clusters: 5 marks; computing new cluster centroids: 2 marks; Iteration 2 and the resulting clusters: 5 marks; computing SSE: 3 marks.
Given k = 2, Instance 1 : C1, Instance 2 : C2.

Iteration 1 (5 marks):
Instance    Number of Clients    Annual Turnover    Distance from C1    Distance from C2    Assigned Cluster
I1          185                  72                 0                   21.93               C1
I2          170                  56                 21.93               0                   C2
I3          168                  60                 20.80               4.47                C2
I4          179                  68                 7.21                15                  C1
I5          182                  72                 3                   20                  C1
I6          188                  77                 5.83                27.65               C1

The resulting clusters after the first iteration:
Cluster 1: I1, I4, I5, I6    Cluster 2: I2 and I3

Computation of new centroids (2 marks):
C1 = ((185+179+182+188)/4, (72+68+72+77)/4) = (183.5, 72.25)
C2 = ((170+168)/2, (56+60)/2) = (169, 58)

Iteration 2 (5 marks):
Instance    Number of Clients    Annual Turnover    Distance from C1    Distance from C2    Assigned Cluster
I1          185                  72                 1.52                21.26               C1
I2          170                  56                 21.12               2.23                C2
I3          168                  60                 19.75               2.23                C2
I4          179                  68                 6.18                14.14               C1
I5          182                  72                 1.52                19.10               C1
I6          188                  77                 6.54                26.87               C1

The resulting clusters after the second iteration:
Cluster 1: I1, I4, I5, I6    Cluster 2: I2 and I3
There is no change in the clusters, so we stop.

Computing the SSE (3 marks):
SSE of Cluster 1 = (Distance of I1 from C1)² + (Distance of I4 from C1)² + (Distance of I5 from C1)² + (Distance of I6 from C1)² = (1.52)² + (6.18)² + (1.52)² + (6.54)² = 2.3104 + 38.1924 + 2.3104 + 42.7716 = 85.5848
SSE of Cluster 2 = (Distance of I2 from C2)² + (Distance of I3 from C2)² = (2.23)² + (2.23)² = 9.9458
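The whole iteration can be replayed programmatically (a sketch; variable names are ours). Note that the exact SSE values, 85.75 and 10, differ slightly from the hand computation above because the latter squares distances already rounded to two decimals:

```python
import math

# K-means (k=2) on the Q7 startup data; initial centres are the first two records.
points = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

centres = [points[0], points[1]]
while True:
    # Assign each point to its nearest centre (ties go to the first centre).
    clusters = [[], []]
    for p in points:
        clusters[0 if dist(p, centres[0]) <= dist(p, centres[1]) else 1].append(p)
    # Recompute each centre as the mean of its cluster.
    new = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    if new == centres:  # stopping criterion: centres no longer move
        break
    centres = new

# Sum of squared distances of each point to its cluster centre.
sse = [sum(dist(p, c) ** 2 for p in cl) for cl, c in zip(clusters, centres)]
print(centres)  # [(183.5, 72.25), (169.0, 58.0)]
```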
