Data Mining - Classification
P(Cancelled) = 0.05
P(Very late) = 0.1
P(Late) = 0.15
P(On time) = 0.7
Example
• The flaw in this approach is, of course, that all unseen
instances will be classified in the same way, in this case as on
time.
• Such a method of classification is not necessarily bad: if the
probability of on time is 0.7 and we guess that every unseen
instance should be classified as on time, we could expect to
be right about 70% of the time.
• However, the aim is to make correct predictions as often as
possible, which requires a more sophisticated approach.
• The instances in the training set record not only the
classification but also the values of four attributes: day,
season, wind and rain. Presumably they are recorded because
we believe that in some way the values of the four attributes
affect the outcome.
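The majority-class baseline described above can be sketched in a few lines of Python, using the four outcome probabilities given earlier in the section:

```python
# Majority-class baseline: always predict the most probable class.
# The priors are the four outcome probabilities given earlier.
priors = {
    "On time": 0.7,
    "Late": 0.15,
    "Very late": 0.1,
    "Cancelled": 0.05,
}

# Every unseen instance gets the class with the highest prior.
prediction = max(priors, key=priors.get)
expected_accuracy = priors[prediction]

print(prediction, expected_accuracy)  # On time 0.7
```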
Bayes’ Theorem

Bayes' rule:

    P(A | B) = P(B | A) P(A) / P(B)
• In order to calculate a conditional probability P(A|B)
when we know the other conditional probability
P(B|A), a simple formula known as Bayes’ theorem
is useful
• Let {A1, A2, …, Ak} be a partition of a sample space S.
Let B be some fixed event. Then

    P(Aj | B) = P(Aj ∩ B) / P(B) = P(B | Aj) P(Aj) / Σ(i=1..k) P(B | Ai) P(Ai)
• What does it mean?
• Bayes’ rule allows us to update our probabilities
when new information becomes available
• Usually in applications we are given (or know) a
priori probabilities P(Aj). We go out and collect some
data, which we represent by the event B. We want to
know: how do we update P(Aj) to P(Aj |B)? The
answer: Bayes’ Rule.
Example: Bayes’ theorem
• Jo has a test for a nasty disease. We denote Jo's state of health
by the variable a and the test result by b.
a = 1 Jo has the disease
a = 0 Jo does not have the disease.
• The result of the test is either `positive' (b = 1) or `negative' (b
= 0);
• The test is 95% reliable: in 95% of cases of people who really
have the disease, a positive result is returned, and in 95% of
cases of people who do not have the disease, a negative result
is obtained.
• The final piece of background information is that 1% of
people of Jo's age and background have the disease.
• Jo has the test, and the result is positive. What is the
probability that Jo has the disease?
Solution:
First, we write down all the provided probabilities.
The test reliability specifies the conditional
probability of b given a:
P(b=1 | a=1) = 0.95 P(b=1 | a=0) = 0.05
P(b=0 | a=1) = 0.05 P(b=0 | a=0) = 0.95
and the disease prevalence (1% of people of Jo's age and
background have the disease) tells us the marginal
probability of a:
P(a=1) = 0.01    P(a=0) = 0.99
Jo has received a positive result b=1
and is interested in how plausible it is
that she has the disease (i.e., that
a=1).
The man in the street might be duped
by the statement `the test is 95%
reliable, so Jo's positive result implies
that there is a 95% chance that Jo has
the disease', but this is incorrect.
The correct solution to an inference problem is
found using Bayes' theorem.
    P(a=1 | b=1) = P(b=1 | a=1) P(a=1) / [ P(b=1 | a=1) P(a=1) + P(b=1 | a=0) P(a=0) ]
                 = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
                 ≈ 0.16
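The same calculation, as a short Python sketch using the probabilities given above:

```python
# Bayes' theorem applied to Jo's test result.
p_b1_given_a1 = 0.95  # P(b=1 | a=1): positive result given disease
p_b1_given_a0 = 0.05  # P(b=1 | a=0): positive result given no disease
p_a1 = 0.01           # P(a=1): prevalence of the disease
p_a0 = 1 - p_a1       # P(a=0)

posterior = (p_b1_given_a1 * p_a1) / (
    p_b1_given_a1 * p_a1 + p_b1_given_a0 * p_a0
)
print(round(posterior, 2))  # 0.16
```

Despite the "95% reliable" test, the posterior probability that Jo has the disease is only about 16%, because the disease is rare.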
To obtain a reliable estimate of the four classifications, a more indirect approach is needed. We could start by using conditional probabilities based on a single attribute.
• Euclidean distance
• Squared (or absolute) Euclidean distance
• City-block (Manhattan) distance
• Chebychev distance
• Mahalanobis distance (D2)
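Most of these measures can be written in a few lines. A minimal sketch of the first four follows (Mahalanobis distance also requires the covariance matrix of the data, so it is omitted here):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def squared_euclidean(p, q):
    # Same ranking as Euclidean, but avoids the square root.
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def chebyshev(p, q):
    # Largest coordinate difference.
    return max(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q), chebyshev(p, q))
# 5.0 25.0 7.0 4.0
```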
Euclidean distance
• To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine
the class label of unknown record (e.g., by taking
majority vote)
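The three steps above can be sketched directly; the 2-D training points below are invented purely for illustration:

```python
import math
from collections import Counter

# Hypothetical training records: (attribute values, class label).
train = [
    ((1.0, 1.0), "+"),
    ((1.5, 2.0), "+"),
    ((3.0, 4.0), "-"),
    ((5.0, 7.0), "-"),
]

def knn(query, train, k=3):
    # 1. Compute the distance from the query to every training record,
    # 2. keep the k nearest neighbours,
    nearest = sorted(train, key=lambda rec: math.dist(query, rec[0]))[:k]
    # 3. and take a majority vote over their class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn((1.2, 1.5), train))  # +
```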
• How can we estimate
the classification for
an ‘unseen’ instance
where the first and
second attributes are
9.1 and 11.0,
respectively?
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
(Figure: neighbourhoods of increasing k around the query point: 1 negative; 2 negatives and 3 positives; 8 negatives and 5 positives.)
• Age: Age at first marriage (male)
• Pop G: Population Growth
• Corr: Corruption Index
• GNP: Gross National Product
• PPP: Purchasing Power Parity per Capita
• Lit: Literacy rate
• UR: Unemployment Rate
• Equal: Income equality
Country     Age   Pop G  Corr  GNP       PPP    Lit    UR    Equal  Class
Finland     32.5   0.08   90     263267  35981  100.0   7.6    5.6  Developed
Australia   29.6   0.60   85    1515468  40847   99.0   5.2   12.5  Developed
Germany     33.0  -0.21   79    3604061  38077   99.0   5.4    6.9  Developed
Japan       30.5  -0.02   74    5870357  34748   99.0   4.5    4.5  Developed
USA         28.9   0.86   73   14991300  48328   99.0   7.7   15.9  Developed
China       31.1   0.51   39    7203784   8387   92.2   4.1   21.6  BRICS
Brazil      28.0   1.13   43    2476651  11769   88.6   4.7   11.0  BRICS
India       26.0   1.46   36    1897608   3663   74.0   3.8    8.6  BRICS
Ghana       30.0   1.99   45      39200   3113   67.3   3.6   14.1  Developing
Nigeria     29.0   2.27   27     245229   2582   61.3  23.9   17.8  Developing
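As a sketch, the table above can be fed to a k-nearest-neighbour classifier. Because the attributes have wildly different scales (GNP is in the millions while Pop G is near 1), each attribute is min-max normalised to [0, 1] before computing distances; the query country below is hypothetical:

```python
import math
from collections import Counter

# (country, [Age, Pop G, Corr, GNP, PPP, Lit, UR, Equal], class) -- from the table above.
rows = [
    ("Finland",   [32.5,  0.08, 90,   263267, 35981, 100.0,  7.6,  5.6], "Developed"),
    ("Australia", [29.6,  0.60, 85,  1515468, 40847,  99.0,  5.2, 12.5], "Developed"),
    ("Germany",   [33.0, -0.21, 79,  3604061, 38077,  99.0,  5.4,  6.9], "Developed"),
    ("Japan",     [30.5, -0.02, 74,  5870357, 34748,  99.0,  4.5,  4.5], "Developed"),
    ("USA",       [28.9,  0.86, 73, 14991300, 48328,  99.0,  7.7, 15.9], "Developed"),
    ("China",     [31.1,  0.51, 39,  7203784,  8387,  92.2,  4.1, 21.6], "BRICS"),
    ("Brazil",    [28.0,  1.13, 43,  2476651, 11769,  88.6,  4.7, 11.0], "BRICS"),
    ("India",     [26.0,  1.46, 36,  1897608,  3663,  74.0,  3.8,  8.6], "BRICS"),
    ("Ghana",     [30.0,  1.99, 45,    39200,  3113,  67.3,  3.6, 14.1], "Developing"),
    ("Nigeria",   [29.0,  2.27, 27,   245229,  2582,  61.3, 23.9, 17.8], "Developing"),
]

features = [f for _, f, _ in rows]
lo = [min(col) for col in zip(*features)]
hi = [max(col) for col in zip(*features)]

def normalise(f):
    # Min-max scale each attribute to [0, 1] using the table's ranges.
    return [(v - l) / (h - l) for v, l, h in zip(f, lo, hi)]

def knn_classify(query, k=3):
    q = normalise(query)
    nearest = sorted(rows, key=lambda r: math.dist(q, normalise(r[1])))[:k]
    return Counter(cls for _, _, cls in nearest).most_common(1)[0][0]

# Hypothetical unseen country with Nordic-looking figures.
query = [31.0, 0.30, 82, 500000, 37000, 99.0, 6.0, 7.0]
print(knn_classify(query))  # Developed
```

Without the normalisation step, the GNP column alone would determine every neighbour, which is why scaling matters for distance-based classifiers.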