7 - Decision Tree Learning
Outline
Decision Tree for PlayTennis
[Figure: the PlayTennis decision tree, rooted at Outlook, with Yes/No leaves]
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak
[Figure: root Outlook; the Sunny branch tests Wind (Strong → No, Weak → Yes); the Overcast and Rain branches lead to No]
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak
[Figure: decision tree for the disjunction, rooted at Outlook, with Yes/No leaves]
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak
[Figure: decision tree for the XOR expression, rooted at Outlook]
A decision tree represents a disjunction of conjunctions of constraints on attribute values, e.g.
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
When to consider Decision Trees
Which Attribute is "best"?
Entropy
Information Gain
Gain(S,A): the expected reduction in entropy due to sorting S on attribute A:
Gain(S,A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
[Figure: choosing the best attribute for S=[9+,5-], E=0.940, comparing splits on Outlook (Sunny/Overcast/Rain) and Temperature]
Note: 0 log2 0 = 0
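To make these definitions concrete, here is a minimal Python sketch of Entropy(S) and Gain(S,A). The per-example dictionary format and the "PlayTennis" target name are illustrative assumptions, not something fixed by the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, using the convention 0*log2(0) = 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values() if c > 0)

def information_gain(examples, attribute, target="PlayTennis"):
    """Gain(S,A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    labels = [ex[target] for ex in examples]
    total, n = entropy(labels), len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder
```

For the [9+,5-] sample above, entropy(9*["Yes"] + 5*["No"]) evaluates to about 0.940.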
ID3 Algorithm
[Figure: ID3 on the PlayTennis examples [D1,D2,…,D14] = [9+,5-]; Outlook is selected as the root attribute]
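The ID3 recursion illustrated by this example can be sketched as follows. This is a hedged sketch rather than the exact course implementation: it reuses the entropy/information_gain helpers defined above and represents a tree as a nested dictionary keyed by attribute and attribute value.

```python
from collections import Counter

def id3(examples, attributes, target="PlayTennis"):
    """Grow a decision tree greedily by information gain (ID3 sketch)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples share one class: return a leaf
        return labels[0]
    if not attributes:                 # no attributes left: return the majority class
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```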
Hypothesis Space Search by ID3
Converting a Tree to Rules
Attributes with Cost
Consider:
Medical diagnosis: a blood test costs 1000 SEK (Swedish kronor).
Robotics: width_from_one_feet has a cost of 23 seconds.
How to learn a consistent tree with low expected cost?
Replace Gain by:
Gain²(S,A) / Cost(A)   [Tan and Schlimmer, 1990]
(2^Gain(S,A) − 1) / (Cost(A) + 1)^w, where w ∈ [0,1]   [Nunez, 1988]
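A small sketch of how these two criteria could be evaluated, assuming the gain value comes from a Gain(S,A) routine such as the one sketched earlier; the default w = 0.5 below is an arbitrary illustration, not a recommended setting.

```python
def tan_schlimmer_score(gain, cost):
    """Gain^2(S,A) / Cost(A)  [Tan and Schlimmer, 1990]."""
    return gain ** 2 / cost

def nunez_score(gain, cost, w=0.5):
    """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w, with w in [0,1]  [Nunez, 1988]."""
    return (2 ** gain - 1) / (cost + 1) ** w
```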
Unknown Attribute Values
Occam’s Razor
Overfitting
Overfitting in Decision Tree Learning
Avoid Overfitting
Reduced-Error Pruning
Split the data into a training set and a validation set (a sketch of the pruning procedure follows below).
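Below is a hedged, simplified sketch of reduced-error pruning for the nested-dictionary trees used in the ID3 sketch above: working bottom-up, each subtree is replaced by a majority-class leaf whenever the leaf classifies the validation examples reaching that subtree at least as well as the subtree does. The helper names, the default class, and this bottom-up variant (rather than the greedy whole-tree procedure) are illustrative assumptions.

```python
from collections import Counter

def classify(tree, example, default="No"):
    """Follow attribute tests until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example[attribute], default)
    return tree

def prune(tree, train, validation, target="PlayTennis"):
    if not isinstance(tree, dict) or not train:
        return tree                                   # leaf, or no data left to decide with
    attribute = next(iter(tree))
    for value, child in tree[attribute].items():      # prune the children first (bottom-up)
        tr = [ex for ex in train if ex[attribute] == value]
        va = [ex for ex in validation if ex[attribute] == value]
        tree[attribute][value] = prune(child, tr, va, target)
    # Candidate leaf: majority class of the training examples reaching this node.
    majority = Counter(ex[target] for ex in train).most_common(1)[0][0]
    subtree_correct = sum(classify(tree, ex) == ex[target] for ex in validation)
    leaf_correct = sum(ex[target] == majority for ex in validation)
    return majority if leaf_correct >= subtree_correct else tree
```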
Effect of Reduced Error Pruning
Rule Post-Pruning
Why convert the decision tree to rules before pruning?
Allows distinguishing among the different contexts
in which a decision node is used
Removes the distinction between attribute tests
near the root and those that occur near leaves
Enhances readability
Evaluation
Training accuracy
– How many of the training instances can be correctly classified based on the available data?
– It is high when the tree is deep/large, or when there is little conflict among the training instances.
– However, higher training accuracy does not imply better generalization.
Testing accuracy
– Given a number of new instances, how many of them can we classify correctly?
– Estimated, e.g., by cross-validation (see the sketch below).
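As a concrete illustration of the gap between training and testing accuracy, and of cross-validation as an estimate of the latter, the following sketch uses scikit-learn and its bundled Iris data purely as assumed stand-ins for the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", tree.score(X_train, y_train))  # typically near 1.0 for a deep tree
print("testing accuracy: ", tree.score(X_test, y_test))    # usually lower: the generalization gap

# Cross-validation: average the testing accuracy over several train/test splits.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```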
Strengths
Weaknesses
Not suitable for predicting continuous attributes.
Performs poorly with many classes and small data sets.
Computationally expensive to train.
– At each node, each candidate splitting field must be sorted before its best split can be found.
– In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.
– Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
Does not handle non-rectangular regions well.
Cross-Validation
Holdout Method
Partition the data set D = {(v1,y1), …, (vn,yn)} into a training set Dt and a validation set Dh = D \ Dt.
[Figure: D partitioned into blocks D1, D2, D3, D4]
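A minimal sketch of such a holdout partition in plain Python; the 70/30 ratio and the fixed random seed are illustrative assumptions.

```python
import random

def holdout_split(D, train_fraction=0.7, seed=0):
    """Shuffle D and return (Dt, Dh) with Dh = D \\ Dt."""
    data = list(D)
    random.Random(seed).shuffle(data)
    cut = int(train_fraction * len(data))
    return data[:cut], data[cut:]
```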
Fundamentals of Information Theory
Shannon's formula for the amount of information (self-information) in a message:
I(X) = log2[ 1/p(X) ] = − log2 p(X)   (bit)
[Example] A message X = (X1X2X3), whose symbols occur with probabilities p(X1), p(X2), p(X3). Find I(X).
– Solution: I(X) = I(X1X2X3)
– = −log[ p(X1X2X3) ]
– = −log[ p(X1) p(X2) p(X3) ]   (the Xi are independent)
– = −log p(X1) − log p(X2) − log p(X3)
– = I(X1) + I(X2) + I(X3)
[Example] Two messages of the same form (X1X2X3X4) come from two different sources.
– Source 1: p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8
– Source 2: p(X1) = 1/4, p(X2) = 1/4, p(X3) = 1/4, p(X4) = 1/4
– Solution: Source 1: I(X1X2X3X4) = I(X1) + I(X2) + I(X3) + I(X4)
– = −log(1/2) − log(1/4) − log(1/8) − log(1/8)
– = 1 + 2 + 3 + 3 = 9 (bit)
– Source 2: I(X1X2X3X4) = I(X1) + I(X2) + I(X3) + I(X4)
– = −log(1/4) − log(1/4) − log(1/4) − log(1/4)
– = 2 + 2 + 2 + 2 = 8 (bit)   (Imin)
Fundamentals of Information Theory
Entropy: the average information content per source symbol (its expected value), denoted H(X):
H(X) = Σ_{i=1..n} p(Xi) I(Xi) = − Σ_{i=1..n} p(Xi) log2 p(Xi)
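The two worked examples above can be checked numerically; this small sketch computes both the total self-information of the four-symbol messages and the per-symbol entropy of the two sources.

```python
import math

def self_information(p):
    return -math.log2(p)                              # I(X) = -log2 p(X), in bits

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

source1 = [1/2, 1/4, 1/8, 1/8]
source2 = [1/4, 1/4, 1/4, 1/4]

print(sum(self_information(p) for p in source1))      # 9.0 bits for the message X1X2X3X4
print(sum(self_information(p) for p in source2))      # 8.0 bits
print(entropy(source1))                               # 1.75 bits per symbol
print(entropy(source2))                               # 2.0 bits per symbol (uniform source)
```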