Data Mining - Classification


• Classification is a task that involves dividing up objects so that each is assigned to one of a number of mutually exclusive and exhaustive categories known as classes.
• The term ‘mutually exclusive and exhaustive’ simply means that each object must be assigned to precisely one class, i.e. never to more than one and never to no class at all.
Example
• customers who are likely to buy or not buy a particular product
in a supermarket
• people who are at high, medium or low risk of acquiring a
certain illness
• objects on a radar display which correspond to vehicles, people,
buildings or trees
• houses that are likely to rise in value, fall in value or have an
unchanged value in 12 months’ time
• people who are at high, medium or low risk of a car accident
• people who are likely to vote for each of a number of political
parties (or none)
• the likelihood of rain the next day for a weather forecast (very
likely, likely, unlikely, very unlikely).
Eager Learning vs Lazy Learning

• In eager learning systems the training data is ‘eagerly’ generalized into some representation or model (e.g. a table of probabilities, a decision tree, a neural net) without waiting for a new (unseen) instance to be presented for classification.
• In lazy learning systems the training data is ‘lazily’ left unchanged until an unseen instance is presented for classification. When it is, only those calculations that are necessary to classify that single instance are performed, as sketched in the example below.
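This contrast can be made concrete with a minimal Python sketch (hypothetical classes and data, not part of the original slides): the eager learner builds its model when fit() is called, while the lazy learner merely stores the training data and does all of its work inside predict().

```python
# Minimal illustrative sketch (hypothetical): an "eager" learner does its work at
# training time, a "lazy" learner defers the work to prediction time.
from collections import Counter


class EagerMajorityClassifier:
    """Eager: builds a model (here, just the majority class) during fit()."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]  # model built up front
        return self

    def predict(self, x):
        return self.majority_  # prediction only consults the stored model


class LazyNearestNeighbour:
    """Lazy: stores the training data and does all computation in predict()."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)  # no generalisation yet
        return self

    def predict(self, x):
        # Work happens only now, for this single instance (1-NN, squared Euclidean distance).
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X_]
        return self.y_[dists.index(min(dists))]


if __name__ == "__main__":
    X = [(1.0, 2.0), (8.0, 9.0)]
    y = ["negative", "positive"]
    print(EagerMajorityClassifier().fit(X, y).predict((7.5, 8.0)))  # "negative"
    print(LazyNearestNeighbour().fit(X, y).predict((7.5, 8.0)))     # "positive"
```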
1. Naive Bayes Classifiers

• It uses the branch of mathematics known as probability theory to find the most likely of the possible classifications.
Example

• The 06.00 train from Stockholm to Goteborg is scheduled to arrive at 09.00.
• Given the condition that today is a weekday in the winter season, the rain is falling heavily and a strong wind is blowing.
• What would be the most likely outcome under these conditions?
Example

• Usually we are not interested in just one event but in a set of alternative possible events, which are mutually exclusive and exhaustive.
• In the train example, we might define four mutually exclusive and exhaustive events:
  – E1 – train cancelled
  – E2 – train ten minutes or more late
  – E3 – train less than ten minutes late
  – E4 – train on time or early
From the historical data, we have the outcome considering four attributes (day, season, wind, rain).

Given the condition with the following attributes:
1. weekday
2. winter
3. high
4. heavy

What would be the most likely classification for the given condition?
The straightforward (but flawed) way is to look at the frequency of the classifications and choose the most common one (on time).

P(Cancelled) = 0.05
P(Very late) = 0.15
P(Late) = 0.1
P(On time) = 0.7
Example
• The flaw in this approach is, of course, that all unseen
instances will be classified in the same way, in this case as on
time.
• Such a method of classification is not necessarily bad: if the
probability of on time is 0.7 and we guess that every unseen
instance should be classified as on time, we could expect to
be right about 70% of the time.
• However, the aim is to make correct predictions as often as
possible, which requires a more sophisticated approach.
• The instances in the training set record not only the
classification but also the values of four attributes: day,
season, wind and rain. Presumably they are recorded because
we believe that in some way the values of the four attributes
affect the outcome.
Bayes’ Theorem

Bayes' rule

• prob(A and B) = prob(B and A); so
• prob(A|B) prob(B) = prob(B|A) prob(A)
  (just using the definition of prob(X|Y));
• hence

  prob(A|B) = prob(B|A) prob(A) / prob(B)
Bayes’ theorem
• In order to calculate a conditional probability P(A|B)
when we know the other conditional probability
P(B|A), a simple formula known as Bayes’ theorem
is useful
• Let {A1, A2, …, Ak} be a partition of a sample space S.
  Let B be some fixed event. Then

  P(Aj | B) = P(Aj ∩ B) / P(B)
            = P(B | Aj) P(Aj) / [ P(B | A1) P(A1) + … + P(B | Ak) P(Ak) ]
Bayes’ theorem
• What does it mean?
• Bayes’ rule allows us to update our probabilities
when new information becomes available
• Usually in applications we are given (or know) a
priori probabilities P(Aj). We go out and collect some
data, which we represent by the event B. We want to
know: how do we update P(Aj) to P(Aj |B)? The
answer: Bayes’ Rule.

Example: Bayes’ theorem
• Jo has a test for a nasty disease. We denote Jo's state of health
by the variable a and the test result by b.
a = 1 Jo has the disease
a = 0 Jo does not have the disease.
• The result of the test is either `positive' (b = 1) or `negative' (b
= 0);
• The test is 95% reliable: in 95% of cases of people who really
have the disease, a positive result is returned, and in 95% of
cases of people who do not have the disease, a negative result
is obtained.
• The final piece of background information is that 1% of
people of Jo's age and background have the disease.
• Jo has the test, and the result is positive. What is the
probability that Jo has the disease?

Example: Bayes’ theorem
Solution:
First, we write down all the provided probabilities.
The test reliability specifies the conditional
probability of b given a:
P(b=1 | a=1) = 0.95 P(b=1 | a=0) = 0.05
P(b=0 | a=1) = 0.05 P(b=0 | a=0) = 0.95
and the disease prevalence tells us about the
marginal probability of a:

P(a=1) = 0.01 P(a=0) = 0.99


Example: Bayes’ theorem

• From the marginal probability P(a) and the conditional probability P(b | a) we can deduce the joint probability P(a, b) = P(a) P(b | a) and any other probabilities we are interested in.
• For example, by the sum rule, the marginal probability of b=1 (the probability of getting a positive result) is

  P(b=1) = P(b=1 | a=1) P(a=1) + P(b=1 | a=0) P(a=0)
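Plugging in the numbers given above: P(b=1) = 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059.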

Example: Bayes’ theorem

• Jo has received a positive result b=1 and is interested in how plausible it is that she has the disease (i.e., that a=1).
• The man in the street might be duped by the statement ‘the test is 95% reliable, so Jo's positive result implies that there is a 95% chance that Jo has the disease’, but this is incorrect.
Example: Bayes’ theorem

The correct solution to an inference problem is found using Bayes' theorem:

  P(a=1 | b=1) = P(b=1 | a=1) P(a=1) / [ P(b=1 | a=1) P(a=1) + P(b=1 | a=0) P(a=0) ]
               = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
               ≈ 0.16

So in spite of the positive result, the probability that Jo has the disease is only 16%.
This is exactly what Bayes' theorem does; it updates our prior belief to the posterior belief when new evidence becomes available.
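As a quick check, here is a minimal Python sketch (not part of the original slides) that reproduces the calculation directly from the stated probabilities.

```python
# Posterior probability that Jo has the disease given a positive test (Bayes' theorem).
p_a1 = 0.01           # prior: P(a=1), disease prevalence
p_a0 = 1 - p_a1       # P(a=0)
p_b1_given_a1 = 0.95  # P(b=1 | a=1), true-positive rate
p_b1_given_a0 = 0.05  # P(b=1 | a=0), false-positive rate

# Sum rule: marginal probability of a positive result.
p_b1 = p_b1_given_a1 * p_a1 + p_b1_given_a0 * p_a0

# Bayes' theorem: P(a=1 | b=1).
p_a1_given_b1 = p_b1_given_a1 * p_a1 / p_b1
print(round(p_b1, 3), round(p_a1_given_b1, 3))  # 0.059 0.161
```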
Given the condition with the following attributes:
1. weekday
2. winter
3. high
4. heavy

What would be the most likely classification for the given condition?

P(class = on time | day = weekday
and season = winter
and wind = high
and rain = heavy)

Only two instances in the training set match this combination of attribute values:
1. late
2. very late

This is not enough to give a reliable estimate.

To obtain a reliable estimate of the four classifications a more indirect approach is needed.
We could start by using conditional probabilities based on a single attribute.

From the data set:
P(class = on time | season = winter) = 2/6 = 0.33
P(class = late | season = winter) = 1/6 = 0.17
P(class = very late | season = winter) = 3/6 = 0.5
P(class = cancelled | season = winter) = 0/6 = 0

The third of these has the largest value, so we could conclude that the most likely classification is very late, a different result from using the prior probability as before.
We could also use conditional probabilities based on another attribute.

From the data set:
P(class = on time | day = weekday) = 9/13 = 0.69
P(class = late | day = weekday) = 1/13 = 0.08
P(class = very late | day = weekday) = 3/13 = 0.23
P(class = cancelled | day = weekday) = 0/13 = 0

The first of these has the largest value, so we could conclude that the most likely classification is on time, a similar result to using the prior probability.

We could do a similar calculation with the attributes rain and wind. This might result in other classifications having the largest value.
Which is the best one to take?
The method uses conditional probabilities, but the other way round from before.
Instead of (say) P(class = very late | season = winter), we use the conditional probability P(season = winter | class = very late).

From the data set:
P(day = weekday | class = on time) = 9/14 = 0.64
P(day = Saturday | class = on time) = 2/14 = 0.14
P(day = holiday | class = on time) = 2/14 = 0.14
P(day = Sunday | class = on time) = 1/14 = 0.07
Given the condition with the following attributes (weekday, winter, high wind, heavy rain), the probability that the train is on time can be calculated as follows:

P(on time) [the prior] × P(weekday | on time) × P(winter | on time) × P(high wind | on time) × P(heavy rain | on time)

= 0.70 × 0.64 × 0.14 × 0.29 × 0.07 ≈ 0.0013
The probability that the train is on time: 0.0013
The probability that the train is late: 0.0125
The probability that the train is very late: 0.0222
The probability that the train is cancelled: 0.0000

(Strictly, these values are scores proportional to the class probabilities rather than probabilities themselves; we simply choose the class with the largest value.)

The largest value is for class Very Late.

Thus, given the condition that today is a weekday in the winter season, with the rain falling heavily and a high wind blowing, the train is most likely to be Very late.
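The whole calculation can be written down compactly. Below is a minimal Python sketch (not from the original slides); only the on-time figures quoted above are used, and the conditionals for the other classes would be read off the full training table in the same way.

```python
# Minimal sketch of the Naive Bayes scoring step (illustrative, not from the slides).
# score(class) = P(class) * product over attributes of P(attribute value | class);
# the class with the largest score wins.
import math

def naive_bayes_score(prior, conditionals):
    """prior: P(class); conditionals: P(attribute value | class) for each attribute."""
    return prior * math.prod(conditionals)

# Figures quoted on the slides for class "on time":
# P(on time) = 0.70, P(weekday | on time) = 0.64, P(winter | on time) = 0.14,
# P(high wind | on time) = 0.29, P(heavy rain | on time) = 0.07.
print(round(naive_bayes_score(0.70, [0.64, 0.14, 0.29, 0.07]), 4))  # 0.0013

# Repeating this for late, very late and cancelled (with their own conditionals
# read from the training table) gives 0.0125, 0.0222 and 0.0000, so "very late" is chosen.
```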
Drawbacks of Naive Bayes Classification

• It relies on all attributes being categorical.
• Estimating probabilities by relative frequencies can give a poor estimate if the number of instances with a given attribute/value combination is small. In the extreme case where it is zero, the posterior probability will inevitably be calculated as zero, since a single zero factor makes the whole product zero.
2. k – Nearest Neighbour Classifiers

• Suppose we have a training set with just two instances.
• Given a third instance, what should its classification be?
• k-nearest neighbour classification is mainly used when all attribute values are continuous.
• The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define.
• The classification is based on those of the k nearest neighbours (where k is a small integer such as 3 or 5), not just the nearest one.
• It requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbours to retrieve
Types of Distance Measures

• Euclidean distance
• Squared (or absolute) Euclidean distance
• City-block (Manhattan) distance
• Chebychev distance
• Mahalanobis distance (D2)
Euclidean distance
• To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine
the class label of unknown record (e.g., by taking
majority vote)
• How can we estimate the classification for an ‘unseen’ instance where the first and second attributes are 9.1 and 11.0, respectively? (A sketch of the procedure on hypothetical data follows below.)
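The full training table from the slides is not reproduced here, so the following is a minimal Python sketch on hypothetical two-attribute data; it implements the steps listed above (compute distances, take the k nearest, majority vote) using Euclidean distance d(x, y) = sqrt(sum_i (x_i - y_i)^2).

```python
# Minimal k-nearest-neighbour classification sketch (illustrative, hypothetical data).
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, unseen, k=3):
    """training: list of (attribute_vector, class_label); returns the majority class of the k nearest."""
    neighbours = sorted(training, key=lambda rec: euclidean(rec[0], unseen))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical training set in the spirit of the slides' example.
training = [((0.8, 6.3), "-"), ((9.2, 11.1), "+"), ((8.3, 10.9), "+"),
            ((12.0, 1.1), "-"), ((9.6, 11.6), "+")]
print(knn_classify(training, (9.1, 11.0), k=3))  # "+" : the 3 nearest neighbours are all positive
```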
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
(Figure: neighbourhoods of increasing size around the unseen instance contain 1 –ve; then 2 –ves and 3 +ves; then 8 –ves and 5 +ves, so the majority class changes as k grows.)
• Age: Age at first marriage (male)
• Pop G: Population Growth
• Corr: Corruption Index
• GNP: Gross National Product
• PPP: Purchasing Power Parity per Capita
• Lit: Literacy rate
• UR: Unemployment Rate
• Equal: Income equality
Country    Age   Pop G  Corr  GNP       PPP    Lit   UR    Equal  Class
Finland    32.5  0.08   90    263267    35981  100   7.6   5.6    Developed
Australia  29.6  0.6    85    1515468   40847  99    5.2   12.5   Developed
Jerman     33    -0.21  79    3604061   38077  99    5.4   6.9    Developed
Jepang     30.5  -0.02  74    5870357   34748  99    4.5   4.5    Developed
USA        28.9  0.86   73    14991300  48328  99    7.7   15.9   Developed
China      31.1  0.51   39    7203784   8387   92.2  4.1   21.6   BRICS
Brasil     28    1.13   43    2476651   11769  88.6  4.7   11     BRICS
India      26    1.46   36    1897608   3663   74    3.8   8.6    BRICS
Ghana      30    1.99   45    39200     3113   67.3  3.6   14.1   Developing
Nigeria    29    2.27   27    245229    2582   61.3  23.9  17.8   Developing

Indonesia  25.2  1.16   32    846834    4666   90.4  6.56  7.8    ?

Sorted by GNP:
Country    GNP       Class
USA        14991300  Developed
China      7203784   BRICS
Jepang     5870357   Developed
Jerman     3604061   Developed
Brasil     2476651   BRICS
India      1897608   BRICS
Australia  1515468   Developed
Indonesia  846834    ?
Finland    263267    Developed
Nigeria    245229    Developing
Ghana      39200     Developing

Sorted by PPP:
Country    PPP    Class
USA        48328  Developed
Australia  40847  Developed
Jerman     38077  Developed
Finland    35981  Developed
Jepang     34748  Developed
Brasil     11769  BRICS
China      8387   BRICS
Indonesia  4666   ?
India      3663   BRICS
Ghana      3113   Developing
Nigeria    2582   Developing
Country    Age   Pop G  Corr  GNP       PPP    Lit   UR    Equal  Class       Distance
Finland    32.5  0.08   90    263267    35981  100   7.6   5.6    Developed   584407
Nigeria    29    2.27   27    245229    2582   61.3  23.9  17.8   Developing  601609
Australia  29.6  0.6    85    1515468   40847  99    5.2   12.5   Developed   669612
Ghana      30    1.99   45    39200     3113   67.3  3.6   14.1   Developing  807635
India      26    1.46   36    1897608   3663   74    3.8   8.6    BRICS       1050774
Brasil     28    1.13   43    2476651   11769  88.6  4.7   11     BRICS       1629832
Jerman     33    -0.21  79    3604061   38077  99    5.4   6.9    Developed   2757429
Jepang     30.5  -0.02  74    5870357   34748  99    4.5   4.5    Developed   5023613
China      31.1  0.51   39    7203784   8387   92.2  4.1   21.6   BRICS       6356951
USA        28.9  0.86   73    14991300  48328  99    7.7   15.9   Developed   14144533
Indonesia  25.2  1.16   32    846834    4666   90.4  6.56  7.8    ?

Problem: the raw distances are dominated by the attributes with the largest values (GNP and PPP), a scaling issue, so we need to transform (normalise) the data.


Country    Age   Pop G  Corr  GNP   PPP   Lit   UR    Equal  Class       Distance  Weight
India      0.10  0.67   0.14  0.12  0.02  0.33  0.01  0.24   BRICS       0.48      4.26
Brasil     0.36  0.54   0.25  0.16  0.20  0.71  0.05  0.38   BRICS       0.49      4.15
Ghana      0.62  0.89   0.29  0.00  0.01  0.16  0.00  0.56   Developing  1.03      0.95
China      0.76  0.29   0.19  0.48  0.13  0.80  0.02  1.00   BRICS       1.23      0.66
Jepang     0.68  0.08   0.75  0.39  0.70  0.97  0.04  0.00   Developed   1.33      0.56
Australia  0.56  0.33   0.92  0.10  0.84  0.97  0.08  0.47   Developed   1.35      0.54
Nigeria    0.49  1.00   0.00  0.01  0.00  0.00  1.00  0.78   Developing  1.44      0.48
Finland    0.94  0.12   1.00  0.01  0.73  1.00  0.20  0.06   Developed   1.57      0.41
Jerman     1.00  0.00   0.83  0.24  0.78  0.97  0.09  0.14   Developed   1.58      0.40
USA        0.47  0.43   0.73  1.00  1.00  0.97  0.20  0.67   Developed   1.66      0.36
Indonesia  0.00  0.55   0.08  0.05  0.05  0.75  0.15  0.19   ?
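The table above appears to use min-max normalisation (each attribute rescaled to [0, 1] over the eleven countries) and a neighbour weight of roughly 1/distance². Since that is inferred from the numbers rather than stated on the slides, the following Python sketch of the procedure is an assumption-laden illustration on a tiny hypothetical data set.

```python
# Sketch of the (inferred) preprocessing and weighting steps: min-max normalisation
# followed by distance-weighted k-NN with weight = 1 / distance**2.
# Hypothetical miniature data; the real table has more attributes and rows.
import math

def min_max_normalise(rows):
    """Rescale every numeric column of `rows` (list of lists) to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
            for r in rows]

def weighted_knn(training, labels, query, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbours votes with weight 1/d**2."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))) for x in training]
    nearest = sorted(range(len(training)), key=lambda i: dists[i])[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2) if dists[i] > 0 else float("inf")
        votes[labels[i]] = votes.get(labels[i], 0.0) + w
    return max(votes, key=votes.get)

# Two attributes on very different scales (e.g. GNP and unemployment rate);
# the last row plays the role of the country to classify.
raw = [[263267, 7.6], [1897608, 3.8], [39200, 3.6], [846834, 6.56]]
labels = ["Developed", "BRICS", "Developing"]
*train_rows, query_row = min_max_normalise(raw)  # normalise query together with training data
print(weighted_knn(train_rows, labels, query_row, k=3))  # "Developed" on this toy data
```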
Homework

Given that tomorrow the values of Outlook, Temperature, Humidity and Windy were sunny, 74 °F, 77% and false respectively, what would the decision be?
