Supervised Learning
◼ In the information gain computation, the class probabilities satisfy
$$\sum_{j=1}^{|C|} \Pr(c_j) = 1,$$
and the entropy of D after partitioning it with attribute $A_i$ into subsets $D_1, \ldots, D_v$ is
$$entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, entropy(D_j)$$
$$entropy_{Age}(D) = \frac{5}{15}\, entropy(D_1) + \frac{5}{15}\, entropy(D_2) + \frac{5}{15}\, entropy(D_3)$$
$$= \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722 = 0.888$$

Age      Yes   No   entropy(D_i)
young     2     3      0.971
middle    3     2      0.971
old       4     1      0.722
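
As a check on this calculation, here is a small Python sketch (the function and variable names are mine, not from the slides) that reproduces entropy_Age(D) = 0.888 from the Yes/No counts in the table:

from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list/tuple of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (Yes, No) counts for each Age group, taken from the table above
groups = {"young": (2, 3), "middle": (3, 2), "old": (4, 1)}

n = sum(sum(c) for c in groups.values())                      # 15 examples in total
weighted = sum(sum(c) / n * entropy(c) for c in groups.values())
print(round(weighted, 3))                                     # 0.888 = entropy_Age(D)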
◼ Efficiency
❑ time to construct the model
❑ time to use the model
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability:
❑ understandability of, and insight provided by, the model
◼ Compactness of the model: size of the tree, or the
number of rules.
$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}.$$
◼ Precision p is the number of correctly classified
positive examples divided by the total number of
examples that are classified as positive.
◼ Recall r is the number of correctly classified positive
examples divided by the total number of actual
positive examples in the test set.
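
As a minimal illustration of the two formulas, a short Python sketch (the TP/FP/FN counts in the example call are made up, not from the slides):

def precision_recall(tp, fp, fn):
    """Precision p = TP / (TP + FP), recall r = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r = precision_recall(8, 2, 4)
print(p, r)   # 0.8 and roughly 0.667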
An example
◼ Then we have
[Lift chart: percent of total positive cases (y-axis) against percent of testing cases (x-axis), comparing the lift curve with the random (diagonal) baseline.]
◼ We want the class $c_j$ for which $\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|})$ is maximal. By Bayes' rule,
$$\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|}) = \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\sum_{r=1}^{|C|} \Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_r)\,\Pr(C = c_r)}$$
and, with the conditional independence assumption,
$$= \frac{\Pr(C = c_j)\prod_{i=1}^{|A|}\Pr(A_i = a_i \mid C = c_j)}{\sum_{r=1}^{|C|} \Pr(C = c_r)\prod_{i=1}^{|A|}\Pr(A_i = a_i \mid C = c_r)}$$
◼ We are done!
◼ How do we estimate Pr(Ai = ai | C = cj)? Easy!
❑ For class C = f:
$$\Pr(C = f)\prod_{j=1}^{2}\Pr(A_j = a_j \mid C = f) = \frac{1}{2} \times \frac{1}{5} \times \frac{2}{5} = \frac{1}{25}$$
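
To make the count-based estimation concrete, here is a small Python sketch of naive Bayesian training and prediction for categorical attributes (the data layout and function names are my own assumptions, and no smoothing is applied):

from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label) pairs.
    Returns Pr(C=c) and Pr(A_i=a_i | C=c) estimated from counts."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)        # (i, c) -> counts of values of A_i in class c
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, c)][a] += 1
    n = len(examples)
    prior = {c: class_counts[c] / n for c in class_counts}
    cond = {(i, c): {a: cnt / class_counts[c] for a, cnt in ctr.items()}
            for (i, c), ctr in attr_counts.items()}
    return prior, cond

def predict_nb(prior, cond, attrs):
    """Pick the class maximizing Pr(C=c) * prod_i Pr(A_i=a_i | C=c)."""
    def score(c):
        s = prior[c]
        for i, a in enumerate(attrs):
            s *= cond.get((i, c), {}).get(a, 0.0)
        return s
    return max(prior, key=score)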
◼ Each document $d_i$ satisfies $\sum_{t=1}^{|V|} N_{it} = |d_i|$, and the word probabilities within each class sum to one:
$$\sum_{t=1}^{|V|} \Pr(w_t \mid c_j; \Theta) = 1. \qquad (25)$$
◼ The smoothed word probability estimates are
$$\Pr(w_t \mid c_j; \hat{\Theta}) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\Pr(c_j \mid d_i)} \qquad (27)$$
(Lidstone smoothing; $\lambda = 1$ gives Laplace smoothing), and the class prior estimates are
$$\Pr(c_j \mid \hat{\Theta}) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} \qquad (28)$$
◼ A test document $d_i$ is then classified using
$$\Pr(c_j \mid d_i; \hat{\Theta}) = \frac{\Pr(c_j \mid \hat{\Theta})\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j; \hat{\Theta})}{\sum_{r=1}^{|C|}\Pr(c_r \mid \hat{\Theta})\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_r; \hat{\Theta})}$$
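
A rough Python sketch of equations (27) and (28) for fully labeled training documents, where Pr(cj | di) is simply 1 for the document's own class and 0 otherwise (the names and the default smoothing value are my choices):

from collections import Counter
from math import log

def train_multinomial_nb(docs, labels, lam=1.0):
    """docs: list of token lists; labels: list of class names.
    Returns log class priors (eq. 28) and smoothed log word probabilities (eq. 27)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    word_counts = {c: Counter() for c in classes}      # aggregated N_ti per class
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    doc_counts = Counter(labels)
    log_prior = {c: log(doc_counts[c] / len(docs)) for c in classes}              # (28)
    log_cond = {}
    for c in classes:
        denom = lam * len(vocab) + sum(word_counts[c].values())
        log_cond[c] = {w: log((lam + word_counts[c][w]) / denom) for w in vocab}  # (27)
    return classes, log_prior, log_cond

def classify(doc, classes, log_prior, log_cond):
    """argmax_c of Pr(c) * prod_k Pr(w_k | c), computed in log space; unseen words are skipped."""
    return max(classes, key=lambda c: log_prior[c] +
               sum(log_cond[c][w] for w in doc if w in log_cond[c]))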
$$\|w\| = \sqrt{\langle w \cdot w \rangle} = \sqrt{w_1^2 + w_2^2 + \ldots + w_n^2} \qquad (37)$$
$$y_i(\langle w \cdot x_i \rangle + b) \ge 1, \quad i = 1, 2, \ldots, r$$
summarizes
$$\langle w \cdot x_i \rangle + b \ge 1 \quad \text{for } y_i = 1$$
$$\langle w \cdot x_i \rangle + b \le -1 \quad \text{for } y_i = -1.$$
$$L_P = \frac{1}{2}\langle w \cdot w \rangle - \sum_{i=1}^{r} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right] \qquad (41)$$
where $\alpha_i \ge 0$ are the Lagrange multipliers.
◼ Optimization theory says that an optimal
solution to (41) must satisfy certain conditions,
called Kuhn-Tucker conditions, which are
necessary (but not sufficient) conditions.
◼ Kuhn-Tucker conditions play a central role in
constrained optimization.
Subject to: $y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, r$
$\xi_i \ge 0, \quad i = 1, 2, \ldots, r$
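
The slides treat this formulation as a quadratic program; purely as an illustration, the sketch below instead runs batch sub-gradient descent on the equivalent unconstrained hinge-loss objective (1/2)<w . w> + C * sum_i max(0, 1 - y_i(<w . x_i> + b)). The learning rate, epoch count, and names are arbitrary choices of mine:

def svm_subgradient(data, labels, C=1.0, lr=0.001, epochs=200):
    """data: list of feature tuples; labels: +1 / -1 class labels."""
    dim = len(data[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0            # gradient of the (1/2)<w.w> term is w
        for x, y in zip(data, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                       # hinge term active, i.e. slack would be positive
                grad_w = [gw - C * y * xi for gw, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b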
only require dot products $\langle \phi(x) \cdot \phi(z) \rangle$ and never the mapped
vector $\phi(x)$ in its explicit form. This is a crucial point.
◼ Thus, if we have a way to compute the dot product
$\langle \phi(x) \cdot \phi(z) \rangle$ using the input vectors x and z directly,
❑ there is no need to know the feature vector $\phi(x)$ or even the mapping $\phi$ itself.
◼ In SVM, this is done through the use of kernel
functions, denoted by K,
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle \qquad (82)$$
$$\langle x \cdot z \rangle^2 = \langle (x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2) \cdot (z_1^2,\ z_2^2,\ \sqrt{2}z_1z_2) \rangle = \langle \phi(x) \cdot \phi(z) \rangle, \qquad (84)$$
where $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2)$.
◼ This shows that the kernel $\langle x \cdot z \rangle^2$ is a dot product in
a transformed feature space
Kernel trick
◼ The derivation in (84) is only for illustration
purposes.
◼ We do not need to find the mapping function.
◼ We can simply apply the kernel function
directly by
replacing all the dot products $\langle \phi(x) \cdot \phi(z) \rangle$ in (79) and
(80) with the kernel function K(x, z) (e.g., the
polynomial kernel $K(x, z) = \langle x \cdot z \rangle^d$ in (83)).
◼ This strategy is called the kernel trick.
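
A tiny numerical check of the identity behind (84): the kernel K(x, z) = <x . z>^2 equals <phi(x) . phi(z)> with phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2). The two input vectors below are arbitrary:

from math import sqrt

def phi(v):
    """Explicit mapping for the 2-dimensional example above."""
    x1, x2 = v
    return (x1 * x1, x2 * x2, sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(x, z) ** 2)          # kernel value <x . z>^2
print(dot(phi(x), phi(z)))     # same value (up to floating-point rounding)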
[Figure: a new point to be classified, Pr(science | ·)?]
◼ Testing
❑ Classify each new instance by voting of the k
classifiers (equal weights)
Bootstrap training sets drawn with replacement from the original 8 examples:

Original         1  2  3  4  5  6  7  8
Training set 1   2  7  8  3  7  6  3  1
Training set 2   7  8  5  6  4  2  7  1
Training set 3   3  6  2  7  5  6  2  2
Training set 4   4  5  1  4  6  4  3  8
(Source: Bagging Predictors, Leo Breiman, 1996)
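
A minimal Python sketch of the bagging procedure (base_learner stands for any training routine that returns a classifier; the names and the fixed seed are mine):

import random
from collections import Counter

def bagging_train(examples, k, base_learner, seed=0):
    """Train k classifiers, each on a bootstrap sample drawn with replacement
    and of the same size as the original training set."""
    rng = random.Random(seed)
    return [base_learner([rng.choice(examples) for _ in examples]) for _ in range(k)]

def bagging_predict(models, x):
    """Equal-weight majority vote of the k classifiers."""
    return Counter(m(x) for m in models).most_common(1)[0][0]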
Boosting
◼ A family of methods:
❑ We only study AdaBoost (Freund & Schapire, 1996)
◼ Training
❑ Produce a sequence of classifiers (the same base
learner)
❑ Each classifier is dependent on the previous one,
and focuses on the previous one’s errors
❑ Examples that are incorrectly predicted by previous
classifiers are given higher weights
◼ Testing
❑ For a test case, the results of the series of
classifiers are combined to determine the final
class of the test case.
◼ Weighted training set:
(x1, y1, w1)
(x2, y2, w2)
…
(xn, yn, wn)
❑ Non-negative weights that sum to 1
◼ Build a classifier ht whose accuracy on the weighted
training set is > ½ (better than random)
◼ Change the weights and repeat
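
A rough sketch of this loop in the AdaBoost.M1 style of Freund & Schapire (1996); weak_learner is assumed to accept example weights and return a classifier, and the function names are mine:

import math

def adaboost_train(data, labels, k, weak_learner):
    n = len(data)
    w = [1.0 / n] * n                                  # non-negative weights summing to 1
    classifiers, alphas = [], []
    for _ in range(k):
        h = weak_learner(data, labels, w)
        err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
        if err >= 0.5:
            break                                      # no better than random guessing
        err = max(err, 1e-10)                          # guard against a perfect classifier
        beta = err / (1.0 - err)
        # shrink the weights of correctly classified examples, then renormalize
        w = [wi * (beta if h(x) == y else 1.0) for wi, x, y in zip(w, data, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
        classifiers.append(h)
        alphas.append(math.log(1.0 / beta))
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, x):
    """Weighted vote of the classifiers, with weight log(1/beta_t)."""
    votes = {}
    for h, a in zip(classifiers, alphas):
        votes[h(x)] = votes.get(h(x), 0.0) + a
    return max(votes, key=votes.get)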
[Figures: error-rate comparisons of Bagged C4.5 vs. C4.5, Boosted C4.5 vs. C4.5, and Boosting vs. Bagging.]
❑ Genetic algorithms
❑ Fuzzy classification