Enhancements To Basic Decision Tree Induction, C4.5
p(risk is low)=5/14
I(credit_table) = -(6/14) log2(6/14) - (3/14) log2(3/14) - (5/14) log2(5/14)
I(credit_table) = 1.531 bits
gain(income)=I(credit_table)-E(income)
gain(income)=1.531-0.564
gain(income)=0.967 bits
gain(credit history)=0.266
gain(debt)=0.581
gain(collateral)=0.756
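These numbers are easy to check in a few lines of Python. The sketch below is illustrative (the helper name info is not from the slides); it uses the three class counts 6, 3 and 5 from the entropy formula above together with the E(income) = 0.564 value quoted above.

import math

def info(counts):
    # Information (entropy) of a class distribution, in bits.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

i_table = info([6, 3, 5])                                # class counts 6, 3 and 5 from the formula above
print(f"I(credit_table) = {i_table:.3f} bits")           # 1.531
print(f"gain(income)    = {i_table - 0.564:.3f} bits")   # 0.967, using E(income) = 0.564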
Overfitting
Reduced-Error Pruning
C4.5
From Trees to Rules
Contingency table (statistics)
Continuous/unknown attributes
Cross-validation
Overfitting
The ID3 algorithm grows each branch of the
tree just deeply enough to perfectly classify the
training examples
Difficulties may arise:
When there is noise in the data
When the number of training examples is too small
to produce a representative sample of the true target
function
The ID3 algorithm can produce trees that
overfit the training examples
We will say that a hypothesis overfits the
training examples if some other
hypothesis that fits the training examples
less well actually performs better over the
entire distribution of instances (including
instances beyond the training set)
Overfitting
Consider error of hypothesis h over
Training data: error_train(h)
Entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits training data if there is
an alternative hypothesis h' ∈ H such that
error_train(h) < error_train(h')
and
error_D(h) > error_D(h')
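As a concrete illustration of this definition (not from the slides), the sketch below assumes scikit-learn is available and compares an unpruned tree h with a depth-limited tree h' on noisy data; the full tree typically shows the lower training error but the higher error on held-out data standing in for the distribution D.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data; the held-out split stands in for the distribution D.
X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

h = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                    # fits training data (almost) perfectly
h_prime = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train) # simpler hypothesis

for name, model in [("h (full tree)", h), ("h' (depth 3)", h_prime)]:
    err_train = 1 - model.score(X_train, y_train)
    err_D = 1 - model.score(X_test, y_test)      # estimate of error over D
    print(f"{name}: error_train = {err_train:.2f}, error_D ≈ {err_D:.2f}")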
Overfitting
How can it be possible for a tree h to fit
the training examples better than h',
yet perform more poorly over
subsequent examples?
One way this can occur is when the training
examples contain random errors or noise
Training Examples
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Tree for PlayTennis
[Figure: decision tree with root test Outlook; the Sunny branch tests Humidity (High → No, Normal → Yes), Overcast → Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes)]
Consider adding the following positive
training example, incorrectly labeled as
negative:
Outlook=Sunny, Temperature=Hot,
Humidity=Normal, Wind=Strong, PlayTennis=No
The addition of this incorrect example will now
cause ID3 to construct a more complex tree
Because the new example is labeled as a
negative example, ID3 will search for further
refinements to the tree
As long as the new erroneous example differs from
the other examples in some of its attribute values,
ID3 will succeed in finding a tree that fits it
ID3 will output a decision tree (h) that is more
complex than the original tree (h')
Because the new tree h is simply a consequence
of fitting noisy training examples,
h' will outperform h on the test set
Avoid Overfitting
How can we avoid overfitting?
Stop growing the tree when a data split is not
statistically significant
Grow the full tree, then post-prune
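A minimal sketch of reduced-error post-pruning under assumed data structures (the Node class and its fields are illustrative, not from the slides): walk the tree bottom-up, tentatively replace each internal node with a majority-class leaf, and keep the change only if accuracy on a separate validation set does not drop.

class Node:
    # Illustrative node type: a leaf if `label` is set, otherwise a test on `attr`.
    # `majority` holds the majority training label of the examples that reached this node.
    def __init__(self, attr=None, children=None, label=None, majority=None):
        self.attr, self.children = attr, children or {}
        self.label, self.majority = label, majority

def classify(node, example):
    if node.label is not None:
        return node.label
    child = node.children.get(example.get(node.attr))
    return classify(child, example) if child else node.majority

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def reduced_error_prune(node, root, validation):
    # Bottom-up: prune the children first, then try pruning this node.
    if node.label is not None:
        return
    for child in node.children.values():
        reduced_error_prune(child, root, validation)
    before = accuracy(root, validation)
    saved = (node.attr, node.children, node.label)
    node.attr, node.children, node.label = None, {}, node.majority   # tentative prune
    if accuracy(root, validation) < before:                          # pruning hurt: undo it
        node.attr, node.children, node.label = saved

Calling reduced_error_prune(root, root, validation_examples) prunes the whole tree in place.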
[Figure: sunburn decision tree; the root tests hair colour (Blond, Brown, Red), the Blond branch tests Lotion Used (No, Yes), and the leaves include Alex, Emily, Pete and John]
                                    No change   Sunburned
Person is blond (uses lotion)           2           0
Person is not blond (uses lotion)       1           0
Check for lotion for the same rule
                                    No change   Sunburned
Person uses lotion                      2           0
Person uses no lotion                   0           2
χ² = Σ_i Σ_j (o_ij - e_ij)² / e_ij
if the person's hair color is blond
and the person uses no lotion
then the person turns red
[Figures: the actual and the expected contingency tables]
Sample degrees of freedom calculation:
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
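The statistic and degrees of freedom above can be checked with a few lines of Python (an illustrative sketch, not from the slides). Cells whose expected count is zero are skipped, so the first table gives χ² = 0 (no association), while the second gives χ² = 4.0, which exceeds the 3.84 critical value for df = 1 at the 0.05 level.

def chi_square(observed):
    # Chi-square statistic for a contingency table given as a list of rows.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count
            chi2 += (o - e) ** 2 / e if e else 0.0
    return chi2

print(chi_square([[2, 0], [1, 0]]))   # hair colour vs. outcome among lotion users: 0.0
print(chi_square([[2, 0], [0, 2]]))   # lotion vs. outcome: 4.0, df = (2-1)(2-1) = 1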
gain_ratio(P) = gain(P) / split(P)

split(P) = - Σ_{i=1..n} (|Ci| / |C|) log2(|Ci| / |C|)

where C1, ..., Cn are the subsets into which attribute P partitions the example set C
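A short Python sketch of these two formulas (names are illustrative; it reuses the entropy idea from the earlier info() sketch). The partition sizes 4, 4 and 6 assume the usual split of the 14 credit examples by income, which is consistent with the E(income) = 0.564 figure quoted earlier.

import math

def split_info(partition_sizes):
    # split(P) = - sum_i |Ci|/|C| * log2(|Ci|/|C|)
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# income splits the 14 credit examples into subsets of sizes 4, 4 and 6
print(gain_ratio(0.967, [4, 4, 6]))   # ≈ 0.62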
Other Enhancements
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign a probability to each of the possible values (see the sketch after this list)
Attribute construction
Create new attributes based on existing ones that are sparsely
represented
This reduces fragmentation, repetition, and replication
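A brief sketch of the two missing-value strategies listed above (function names are illustrative): substitute the attribute's most common value, or spread the example over all possible values weighted by their observed probabilities.

from collections import Counter

def most_common_value(examples, attr):
    # Strategy 1: substitute the attribute's most common observed value.
    values = [x[attr] for x in examples if x.get(attr) is not None]
    return Counter(values).most_common(1)[0][0]

def value_weights(examples, attr):
    # Strategy 2: weight each possible value by its observed probability.
    values = [x[attr] for x in examples if x.get(attr) is not None]
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

data = [{"Outlook": "Sunny"}, {"Outlook": "Rain"}, {"Outlook": "Sunny"}, {"Outlook": None}]
print(most_common_value(data, "Outlook"))   # Sunny
print(value_weights(data, "Outlook"))       # {'Sunny': 2/3, 'Rain': 1/3}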
Continuous Valued Attributes
Create a discrete attribute to test a continuous attribute
Temperature = 24.5 °C
(Temperature > 20.0 °C) = {true, false}
Where to set the threshold?
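The usual C4.5-style answer, sketched below in illustrative Python: sort the examples by the continuous value, consider the midpoints between adjacent examples with different class labels, and keep the threshold with the highest information gain.

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * math.log2(labels.count(v) / total)
                for v in set(labels))

def best_threshold(pairs):
    # pairs: list of (continuous value, class label); returns (threshold, gain).
    pairs = sorted(pairs)
    base = entropy([label for _, label in pairs])
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][1] == pairs[i][1]:
            continue                                  # only class boundaries can be optimal
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # midpoint threshold
        left = [label for v, label in pairs if v <= t]
        right = [label for v, label in pairs if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best[1]:
            best = (t, base - remainder)
    return best

print(best_threshold([(18.0, "No"), (19.5, "No"), (21.0, "Yes"), (24.5, "Yes")]))   # (20.25, 1.0)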
Problems with a single train/validation split:
• makes insufficient use of the data
• training and validation sets are correlated
Cross-Validation
k-fold cross-validation splits the data set D into
k mutually exclusive subsets D1,D2,…,Dk
[Figure: the data set partitioned into folds D1, D2, D3, D4; in each round one fold is held out for testing and the remaining folds are used for training]
Cross-Validation
Uses all the data for training and testing
Complete k-fold cross-validation splits the
dataset of size m in all (m over m/k) possible
ways (choosing m/k instances out of m)
Leave-n-out cross-validation sets n instances
aside for testing and uses the remaining ones
for training (leave-one-out is equivalent to
m-fold cross-validation, where m is the dataset size)
In stratified cross-validation, the folds are
stratified so that they contain approximately
the same proportion of labels as the original
data set
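A plain-Python sketch of the k-fold splitting described above (names are illustrative); setting k equal to the number of instances gives leave-one-out. For stratified folds one would in practice typically reach for a library routine such as scikit-learn's StratifiedKFold.

import random

def k_fold_splits(data, k, seed=0):
    # Each of the k mutually exclusive folds is used once for testing,
    # while the remaining k-1 folds are used for training.
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

for train, test in k_fold_splits(range(12), k=4):
    print(len(train), len(test))   # 9 3, printed four times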
Overfitting
Reduced-Error Pruning
C4.5
From Trees to Rules
Contingency table (statistics)
Continuous/unknown attributes
Cross-validation
Neural Networks
Perceptron