Decision Trees
Decision Tree

"A set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned." (Mitchell, 1997)
Decision Tree - Classification
Dataset

[Figure: the 14-instance golf dataset, with predictors (Outlook, Temp., Humidity, Windy) and the target (Play Golf)]
Decision Tree

[Figure: the full decision tree for the golf dataset, with Outlook at the root]
Entropy

E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
Entropy - Target

Sorting the 14 values of the target column gives 5 "No" (5 / 14 = 0.36) and 9 "Yes" (9 / 14 = 0.64).

Entropy(PlayGolf) = Entropy(5, 9)
                  = Entropy(0.36, 0.64)
                  = -(0.36 log2 0.36) - (0.64 log2 0.64)
                  = 0.94
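A minimal sketch of this calculation in Python (the helper name `entropy` is my own; the slide shows only the arithmetic):

```python
import math

def entropy(counts):
    """Entropy of a class distribution given raw counts, in bits."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip zero counts: 0*log(0) = 0
    return -sum(p * math.log2(p) for p in probs)

# Target column: 5 "No" and 9 "Yes" out of 14 instances
print(entropy([5, 9]))  # ~0.940
```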
Frequency Tables
Entropy – Frequency Table

                    Play Golf
                    Yes  No
          Sunny      3    2    5
Outlook   Overcast   4    0    4
          Rainy      2    3    5
                              14

E(T, X) = \sum_{c \in X} P(c) E(c)

E(PlayGolf, Outlook) = P(Sunny)·E(3,2) + P(Overcast)·E(4,0) + P(Rainy)·E(2,3)
                     = (5/14)·0.971 + (4/14)·0.000 + (5/14)·0.971
                     = 0.693
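A sketch of the weighted entropy and the resulting gain, reusing the `entropy` helper from the sketch above (the counts come from the frequency table):

```python
def weighted_entropy(partition):
    """E(T, X): entropy after splitting T on X; `partition` maps each
    value of X to its class counts. Reuses entropy() from above."""
    total = sum(sum(counts) for counts in partition.values())
    return sum(sum(counts) / total * entropy(counts)
               for counts in partition.values())

outlook = {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]}
print(weighted_entropy(outlook))                 # ~0.693
print(entropy([9, 5]) - weighted_entropy(outlook))  # information gain ~0.247
```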
Information Gain

Gain(T, X) = E(T) - E(T, X)

The gain of a split on X is the decrease in entropy: the entropy of the parent set minus the weighted entropy of the subsets produced by X.
Information Gain – the best predictor?

Computed from the frequency tables: Gain(Outlook) = 0.247, Gain(Humidity) = 0.152, Gain(Windy) = 0.048, Gain(Temp.) = 0.029. Outlook has the largest information gain, so it is selected as the root node.
Decision Tree – Root Node

[Figure: tree with root node Outlook and branches Sunny, Overcast, Rainy]
Dataset – Sorted by Outlook
Subset (Outlook = Overcast)

Temp.  Humidity  Windy  Play Golf
Hot    High      FALSE  Yes
Cool   Normal    TRUE   Yes
Mild   High      TRUE   Yes
Hot    Normal    FALSE  Yes

All four Overcast instances are Yes, so this branch becomes a leaf: Play=Yes.
Subset (Outlook = Sunny)

Temp.  Humidity  Windy  Play Golf
Mild   High      FALSE  Yes
Cool   Normal    FALSE  Yes
Cool   Normal    TRUE   No
Mild   Normal    FALSE  Yes
Mild   High      TRUE   No
Subset (Outlook = Sunny)

Sorted by Windy:

Temp.  Humidity  Windy  Play Golf
Mild   High      FALSE  Yes
Cool   Normal    FALSE  Yes
Mild   Normal    FALSE  Yes
Cool   Normal    TRUE   No
Mild   High      TRUE   No

Windy separates the Sunny subset perfectly, so the Sunny branch splits on Windy:

[Figure: Outlook → Sunny → Windy (FALSE → Play=Yes, TRUE → Play=No); Overcast → Play=Yes]
Subset (Outlook = Rainy)

Temp.  Humidity  Windy  Play Golf
Hot    High      FALSE  No
Hot    High      TRUE   No
Mild   High      FALSE  No
Cool   Normal    FALSE  Yes
Mild   Normal    TRUE   Yes

Frequency tables for the Rainy subset:

Temp. (Yes / No): Hot 0/2, Mild 1/1, Cool 1/0; Gain = 0.57
Humidity (Yes / No): High 0/3, Normal 2/0; Gain = 0.97
Windy (Yes / No): FALSE 1/2, TRUE 1/1; Gain = 0.02

Humidity gives the highest gain, so the Rainy branch splits on Humidity (see the sketch below).
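The same two helpers from earlier can rank the three candidate splits for this subset, using the counts from the tables above:

```python
rainy = {
    "Temp.":    {"Hot": [0, 2], "Mild": [1, 1], "Cool": [1, 0]},
    "Humidity": {"High": [0, 3], "Normal": [2, 0]},
    "Windy":    {"FALSE": [1, 2], "TRUE": [1, 1]},
}
subset_entropy = entropy([2, 3])  # 2 Yes, 3 No in the Rainy subset
for attr, part in rainy.items():
    print(attr, round(subset_entropy - weighted_entropy(part), 2))
# Temp. 0.57, Humidity 0.97, Windy 0.02 -> split on Humidity
```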
Subset (Outlook = Rainy)

[Figure: the finished tree: Outlook → Sunny → Windy (FALSE → Yes, TRUE → No); Overcast → Yes; Rainy → Humidity (High → No, Normal → Yes)]
Decision Rules

The tree can be read off as a set of rules, one per path from root to leaf:

R1: IF (Outlook=Sunny) AND (Windy=FALSE) THEN Play=Yes
R2: IF (Outlook=Sunny) AND (Windy=TRUE) THEN Play=No
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rainy) AND (Humidity=High) THEN Play=No
R5: IF (Outlook=Rainy) AND (Humidity=Normal) THEN Play=Yes
Decision Tree - Issues
Working with Continuous Attributes
Overfitting and Pruning
Super Attributes (attributes with many values)
Working with Missing Values
Attributes with Different Costs
Numeric Variables - Binning

Temp  B_Temp  Play Golf
85    80-90   No
80    80-90   No
83    80-90   Yes
70    70-80   Yes
68    60-70   Yes
65    60-70   No
64    60-70   Yes
72    70-80   No
69    60-70   Yes
75    70-80   Yes
75    70-80   Yes
72    70-80   Yes
81    80-90   Yes
71    70-80   No

Frequency table for the binned attribute:

                Play Golf
                Yes  No
        60-70    3    1
B_Temp  70-80    4    2
        80-90    2    2
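A sketch of this binning with pandas; the bin edges 60/70/80/90 and `right=False` are my reading of the B_Temp column, not stated on the slide:

```python
import pandas as pd

temp = pd.Series([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71])
play = pd.Series(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
                  "Yes", "Yes", "Yes", "Yes", "Yes", "No"])

# Left-closed bins [60,70), [70,80), [80,90) so that 70 lands in 70-80
b_temp = pd.cut(temp, bins=[60, 70, 80, 90], right=False,
                labels=["60-70", "70-80", "80-90"])
print(pd.crosstab(b_temp, play))  # reproduces the frequency table above
```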
Continuous Attributes - Discretization

Equal frequency
This strategy creates a set of N intervals, each containing the same number of elements.

Equal width
The original range of values is divided into N intervals of the same width.

Entropy based
For each numeric attribute, instances are sorted and, for each possible threshold, a binary (<, >=) test is considered and evaluated in exactly the same way as a categorical attribute would be (see the sketch below).
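A sketch of the entropy-based strategy, reusing the `entropy` helper and the `temp`/`play` data from the earlier sketches; `best_threshold` is a hypothetical helper name:

```python
def best_threshold(values, labels):
    """Try a binary (<, >=) split at each candidate threshold and keep
    the one with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < thr]
        right = [l for v, l in pairs if v >= thr]
        e = sum(len(side) / len(pairs)
                * entropy([side.count(c) for c in classes])
                for side in (left, right))
        best = min(best, (e, thr))
    return best  # (weighted entropy, threshold)

print(best_threshold(list(temp), list(play)))
```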
Avoid Overfitting

Overfitting occurs when the learning algorithm continues to develop hypotheses that reduce training-set error at the cost of increased test-set error. Strategies:

Stop growing when the data split is not statistically significant (Chi² test, as sketched below)
Grow the full tree, then post-prune
Minimum description length (MDL):
Minimize: size(tree) + size(misclassifications(tree))
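A sketch of the significance check mentioned above, applied to the Outlook split; the slide names only the test, so the choice of SciPy's `chi2_contingency` is my own:

```python
from scipy.stats import chi2_contingency

# Contingency table for the Outlook split
# (rows: Sunny/Overcast/Rainy; columns: Yes/No counts)
table = [[3, 2], [4, 0], [2, 3]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p_value, 3))
# Keep the split only if p_value is below the chosen significance level
```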
Avoid Overfitting - Post-Pruning

o First build the full tree, then prune it.
  A fully grown tree shows all attribute interactions.
  Problem: some subtrees might be due to chance effects.
o Two pruning operations:
  Subtree replacement
  Subtree raising
o Possible strategies:
  Error estimation
  Significance testing
  MDL principle
Error Estimation

• Transformed value for f:  \frac{f - p}{\sqrt{p(1-p)/N}}
  (i.e., subtract the mean and divide by the standard deviation)

• Resulting equation:  \Pr\left[-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c

• Solving for p:  p = \left(f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}\right) \Big/ \left(1 + \frac{z^2}{N}\right)
(Witten & Eibe)
Error Estimation

• The error estimate for a subtree is the weighted sum of the error estimates for all its leaves.
• Error estimate for a node (upper bound):

  e = \left(f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}\right) \Big/ \left(1 + \frac{z^2}{N}\right)

• If c = 25%, then z = 0.69 (from the normal distribution).
• f is the error on the training data.
• N is the number of instances covered by the leaf.
(Witten & Eibe)
Error Estimation

Worked example: for the node itself, f = 5/14 gives e = 0.46. The combined (weighted) error estimate of its leaves is 0.51. Since e = 0.46 < 0.51, prune the subtree!
(Witten & Eibe)
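A direct transcription of the bound into Python (the function name is my own). Evaluating it at f = 5/14, N = 14, z = 0.69 gives about 0.45; the slide reports 0.46:

```python
import math

def pessimistic_error(f, N, z=0.69):
    """Upper confidence bound on the error rate (c = 25% -> z = 0.69)."""
    return (f + z**2 / (2 * N)
            + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2))) \
           / (1 + z**2 / N)

print(pessimistic_error(5 / 14, 14))  # ~0.45 with this exact formula
```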
Super Attributes

The information gain measure G(T, X) is biased toward attributes with a large number of values over attributes with a smaller number of values. These 'super attributes' are easily selected as the root, resulting in a broad tree that classifies the training data perfectly but performs poorly on unseen instances. We can penalize attributes with large numbers of values by using an alternative selection measure, the gain ratio.
Super Attributes

                    Play Golf
                    Yes  No  Total
          Sunny      3    2    5
Outlook   Overcast   4    0    4
          Rainy      2    3    5

Gain = 0.247
Super Attributes

ID      Play Golf (Yes / No)  Total
id1     1 / 0                 1
id2     0 / 1                 1
id3     1 / 0                 1
id4     1 / 0                 1
id5     0 / 1                 1
id6     0 / 1                 1
id7     1 / 0                 1
id8     1 / 0                 1
id9     0 / 1                 1
id10    1 / 0                 1
id11    1 / 0                 1
id12    0 / 1                 1
id13    1 / 0                 1
id14    1 / 0                 1

Entropy(Play, ID) = 0
Gain(Play, ID) = 0.94
Split(Play, ID) = -(1/14 · log2(1/14)) · 14 = 3.81
GainRatio(Play, ID) = 0.94 / 3.81 = 0.247
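A sketch of the split information and gain ratio, reusing the `entropy` helper from earlier. The Outlook comparison on the last line is not on the slide, but follows from its numbers (branch sizes 5/4/5, gain 0.247):

```python
def split_info(sizes):
    """Intrinsic information of a split: the entropy of the branch sizes."""
    return entropy(sizes)

si_id = split_info([1] * 14)   # 14 one-instance branches for the ID attribute
print(si_id)                    # ~3.81
print(0.94 / si_id)             # gain ratio for ID: ~0.247
print(0.247 / split_info([5, 4, 5]))  # gain ratio for Outlook: ~0.157
```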
Attributes with Different Costs

G'(T, X) = G(T, X) / Cost(X)

Dividing the gain by the cost of measuring the attribute biases the selection toward cheaper attributes.
Numeric Variables and Missing Values
Outlook Temp Humidity Windy Play Golf
Rainy 85 High False No
Rainy 80 High True No
Overcast ? High False Yes
Sunny 70 High False Yes
Sunny 68 ? False Yes
Sunny 65 Normal True No
Overcast 64 Normal True Yes
Rainy 72 High ? No
Rainy 69 Normal False Yes
Sunny ? Normal False Yes
Rainy 75 Normal True Yes
? 72 High True Yes
Overcast 81 Normal False Yes
Sunny 71 High True No
Missing Values

• For numerical variables, replace the missing value with the average or the median.
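A minimal imputation sketch with pandas; the toy frame and the mode-based fill for categorical columns are my own additions, since the slide only prescribes the average or median for numeric values:

```python
import pandas as pd

df = pd.DataFrame({"Temp": [85, 80, None, 70, 68],
                   "Humidity": ["High", "High", "High", None, "Normal"]})

# Numerical: fill with the column median (or mean)
df["Temp"] = df["Temp"].fillna(df["Temp"].median())
# Categorical: one common choice is the most frequent value
df["Humidity"] = df["Humidity"].fillna(df["Humidity"].mode()[0])
print(df)
```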
Dataset

[Figure: the golf dataset for regression, with predictors (Outlook, Temp., Humidity, Windy) and the numeric target (Golf Players)]
Decision Tree - Regression

[Figure: regression tree with root node Outlook and branches Sunny, Overcast, Rainy]
Entropy versus Standard Deviation

Classification:

Entropy:  E = -\sum_{i=1}^{c} p_i \log_2 p_i

Chi² test:  \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

Regression:

StDev:  S = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}
Target – Standard Deviation & Average

Golf Players: 25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30

StDev = 9.32
Avg = 39.79
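Both figures can be checked with the standard library; note that `pstdev`, the population standard deviation, is what matches the slide's 9.32:

```python
import statistics

players = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]
print(round(statistics.mean(players), 2))    # 39.79
print(round(statistics.pstdev(players), 2))  # 9.32 (population standard deviation)
```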
Standard Deviation Tables
Standard Deviation

                    Golf Players (StDev)  Count
          Overcast    3.49                  4
Outlook   Rainy       7.78                  5
          Sunny      10.87                  5
                                           14

S(T, X) = \sum_{c \in X} P(c) S(c) = 7.66
Standard Deviation Reduction (SDR)

SDR(T, X) = S(T) - S(T, X)
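A sketch of S(T, X) and the resulting SDR, using the per-branch standard deviations and counts from the table above (the helper name is my own):

```python
def weighted_stdev(groups, total):
    """S(T, X): weighted standard deviation after splitting T on X.
    `groups` maps each value of X to (stdev, count)."""
    return sum(count / total * stdev for stdev, count in groups.values())

outlook = {"Overcast": (3.49, 4), "Rainy": (7.78, 5), "Sunny": (10.87, 5)}
s_tx = weighted_stdev(outlook, 14)
print(round(s_tx, 2))         # 7.66
print(round(9.32 - s_tx, 2))  # SDR = 1.66
```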
Standard Deviation Reduction – the best predictor?

Outlook (StDev of Golf Players): Overcast 3.49, Rainy 7.78, Sunny 10.87; SDR = 1.66
Temp. (StDev of Golf Players): Cool 10.51, Hot 8.95, Mild 7.65; SDR = 0.17

Outlook gives the largest SDR, so it is selected as the root node.
Decision Tree – Root Node

[Figure: regression tree with root node Outlook]
Dataset – Sorted by Outlook
Subset (Outlook = Sunny)

Temp.  Humidity  Windy  Golf Players
Mild   High      FALSE  45
Cool   Normal    FALSE  52
Mild   Normal    FALSE  46
Cool   Normal    TRUE   23
Mild   High      TRUE   30

The Sunny branch splits on Windy; each leaf predicts the subset average:

[Figure: Outlook → Sunny → Windy (FALSE → 47.7, TRUE → 26.5)]
Subset (Outlook = Overcast)

The Overcast branch becomes a leaf predicting the subset average, 46.3:

[Figure: Outlook → Sunny → Windy (FALSE → 47.7, TRUE → 26.5); Overcast → 46.3]
Subset (Outlook = Rainy)

Temp.  Humidity  Windy  Golf Players
Hot    High      FALSE  25
Hot    High      TRUE   30
Mild   High      FALSE  35
Cool   Normal    FALSE  38
Mild   Normal    TRUE   48

StDev = 7.78

Per-branch standard deviations (and counts) inside the Rainy subset:

Temp.: Hot 2.5 (2), Mild 6.5 (2), Cool 0 (1)
SDR = 7.78 - ((2/5)·2.5 + (2/5)·6.5) = 4.18

Humidity: High 4.1 (3), Normal 5.0 (2)
SDR = 7.78 - ((3/5)·4.1 + (2/5)·5.0) = 3.32

Windy: FALSE 5.6 (3), TRUE 9.0 (2)
SDR = 7.78 - ((3/5)·5.6 + (2/5)·9.0) = 0.82

Temp. gives the largest SDR, so the Rainy branch splits on Temp. (see the sketch below).
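Reusing `weighted_stdev` from the earlier sketch to rank the three candidate splits, with the per-branch values from the tables above:

```python
rainy = {
    "Temp.":    {"Hot": (2.5, 2), "Mild": (6.5, 2), "Cool": (0.0, 1)},
    "Humidity": {"High": (4.1, 3), "Normal": (5.0, 2)},
    "Windy":    {"FALSE": (5.6, 3), "TRUE": (9.0, 2)},
}
for attr, groups in rainy.items():
    print(attr, round(7.78 - weighted_stdev(groups, 5), 2))
# Temp. 4.18, Humidity 3.32, Windy 0.82 -> split on Temp.
```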
Subset (Outlook = Rainy)

Sorted by Temp.:

Temp.  Humidity  Windy  Golf Players
Cool   Normal    FALSE  38
Hot    High      FALSE  25
Hot    High      TRUE   30
Mild   High      FALSE  35
Mild   Normal    TRUE   48

[Figure: the finished regression tree: Outlook → Sunny → Windy (FALSE → 47.7, TRUE → 26.5); Overcast → 46.3; Rainy → Temp. (Cool → 38, Hot → 27.5, Mild → 41.5), each leaf predicting the subset average]