Class 16 Decision Tree
When to use Decision Trees
Problem characteristics:
Instances can be described by attribute–value pairs
The target function is discrete valued
A disjunctive hypothesis may be required
The training data may be noisy (decision tree learning is robust to such errors)
The training data may contain missing attribute values
Different classification problems:
Equipment or medical diagnosis
Credit risk analysis
Several tasks in natural language processing
Top-down induction of Decision Trees
ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
Given a training set of examples, the algorithm builds a decision tree by searching the space of possible decision trees
The construction of the tree is top-down and the algorithm is greedy
The fundamental question is: “which attribute should be tested next? Which question gives us more information?”
Select the best attribute
A descendant node is then created for each possible value of this attribute, and the examples are partitioned according to these values
The process is repeated for each successor node until all the examples are classified correctly or there are no attributes left
Which attribute is the best classifier?
(Figure: candidate attribute splits; the branches partition the examples into {D1, D2, D8}, {D9, D11}, {D4, D5, D10}, {D6, D14}, with classes No, Yes, Yes, No respectively.)
ID3: algorithm
ID3(X, T, Attrs)
X: training examples,
T: target attribute (e.g. PlayTennis),
Attrs: other attributes, initially all attributes
Create a Root node
If all examples in X are +, return Root with class +
If all examples in X are –, return Root with class –
If Attrs is empty, return Root with class the most common value of T in X
else
A ← the best attribute; set the decision attribute for Root to A
For each possible value vi of A:
- add a new branch below Root for the test A = vi
- Xi ← the subset of X with A = vi
- If Xi is empty, add a new leaf with class the most common value of T in X
  else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
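A minimal runnable sketch of this procedure in Python (the dict-shaped examples, the nested-dict tree representation, the helper names, and the explicit domains mapping from each attribute to its possible values are assumptions made for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(examples, attr, target):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
    total = len(examples)
    remainder = 0.0
    for v in set(x[attr] for x in examples):
        subset = [x[target] for x in examples if x[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy([x[target] for x in examples]) - remainder

def most_common(labels):
    return Counter(labels).most_common(1)[0][0]

def id3(examples, target, attrs, domains):
    # examples: list of dicts, target: class attribute name,
    # attrs: remaining attributes, domains: attribute -> list of its possible values
    labels = [x[target] for x in examples]
    if len(set(labels)) == 1:          # all examples share one class: leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority-class leaf
        return most_common(labels)
    best = max(attrs, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for v in domains[best]:            # one branch per possible value of the attribute
        subset = [x for x in examples if x[best] == v]
        if not subset:                 # empty branch: majority class of the parent
            tree[best][v] = most_common(labels)
        else:
            tree[best][v] = id3(subset, target,
                                [a for a in attrs if a != best], domains)
    return tree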
Inductive bias in decision tree learning
ID3's inductive bias is a preference for shorter trees, and for trees that place attributes with high information gain close to the root.
From trees to rules: each path from the root to a leaf can be written as a rule, e.g.
(Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why convert to rules?
Each distinct path produces a different rule, so a condition can be removed on the basis of a local (contextual) criterion; pruning a node in the tree, by contrast, is global and affects all the rules that pass through it
In rule form, the tests are not ordered and no bookkeeping is needed when conditions (nodes) are removed
Converting to rules improves readability for humans
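A small sketch of the path-to-rule conversion, assuming the nested-dict tree representation used in the ID3 sketch above (the function name and the (conditions, class) output format are illustrative):

def tree_to_rules(tree, conditions=None):
    # Walk the nested-dict tree; each root-to-leaf path becomes one rule
    conditions = conditions or []
    if not isinstance(tree, dict):                 # leaf node: class label
        return [(conditions, tree)]
    rules = []
    (attr, branches), = tree.items()               # internal node tests one attribute
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + [(attr, value)]))
    return rules

# Example output element:
# ([("Outlook", "Sunny"), ("Humidity", "High")], "No")
#   i.e. (Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)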
Dealing with continuous-valued attributes
So far we have assumed discrete values for the attributes and for the outcome.
Given a continuous-valued attribute A, dynamically create a new
attribute Ac
Ac = True if A < c, False otherwise
How to determine threshold value c ?
Example. Temperature in the PlayTennis example
Sort the examples according to Temperature
Temperature   40   48   60   72   80   90
PlayTennis    No   No   Yes  Yes  Yes  No
Determine candidate thresholds by averaging consecutive values where
there is a change in classification: (48+60)/2=54 and (80+90)/2=85
Evaluate the candidate thresholds (as new attributes) according to information gain.
The best one here is Temperature > 54; the new attribute then competes with the other attributes.
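A sketch of this threshold-selection step (the function name and the parallel-list inputs are assumptions; the commented call reproduces the Temperature example above):

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Sort the examples by the continuous attribute, collect candidate thresholds
    # at midpoints where the class label changes, and keep the one with highest gain.
    # Assumes the class label changes at least once along the sorted values.
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    base = entropy(labels)
    def gain(c):
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        return base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
    return max(candidates, key=gain)

# best_threshold([40, 48, 60, 72, 80, 90], ["No", "No", "Yes", "Yes", "Yes", "No"])  -> 54.0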
Problems with information gain
Natural bias of information gain: it favours attributes with
many possible values.
Consider the attribute Date in the PlayTennis example.
Date would have the highest information gain since it perfectly
separates the training data.
It would be selected at the root, resulting in a very broad tree
Although such a tree fits the training data perfectly, it would perform poorly on unseen instances: this is overfitting
The problem is that the partition is too specific: too many small subsets are generated
We need to look at alternative measures …
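Concretely, using the definition of information gain, with |S| = n and every value of Date identifying exactly one example (so each subset Si is a singleton with entropy 0):
Gain(S, Date) = Entropy(S) − Σi (1/n)·Entropy(Si) = Entropy(S) − 0 = Entropy(S)
No other attribute can achieve a higher gain, so Date always wins the comparison.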
An alternative measure: gain ratio
SplitInformation(S, A) = − Σi=1..c (|Si|/|S|) log2 (|Si|/|S|)
where S1, …, Sc are the subsets obtained by partitioning S on the c values of A
SplitInformation measures the entropy of S with respect to the values of A: the more uniformly the examples are spread across the values of A, the higher it is.
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
GainRatio penalizes attributes that split the examples into many small classes, such as Date. Let |S| = n; Date splits the examples into n classes:
SplitInformation(S, Date) = −[(1/n) log2 (1/n) + … + (1/n) log2 (1/n)] = −log2 (1/n) = log2 n
Compare with an attribute A which splits the data into two classes of equal size:
SplitInformation(S, A) = −[(1/2) log2 (1/2) + (1/2) log2 (1/2)] = −[−1/2 − 1/2] = 1
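A sketch of both quantities (the entropy and gain helpers are repeated from the earlier sketch so the snippet is self-contained; examples are again assumed to be dicts):

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def split_information(examples, attr):
    # Entropy of S with respect to the values of A (not the target class)
    return entropy([x[attr] for x in examples])

def gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for v in set(x[attr] for x in examples):
        subset = [x[target] for x in examples if x[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy([x[target] for x in examples]) - remainder

def gain_ratio(examples, attr, target):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
    return gain(examples, attr, target) / split_information(examples, attr)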
Adjusting gain-ratio
Problem: SplitInformation(S, A) can be zero or very small when |Si| ≈ |S| for some value i
To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
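A sketch of this heuristic as an attribute-selection function (gain_of and gain_ratio_of are assumed callables that return Gain(S, A) and GainRatio(S, A) for an attribute):

def select_attribute(attrs, gain_of, gain_ratio_of):
    # Only attributes whose Gain is at least average compete on GainRatio
    gains = {a: gain_of(a) for a in attrs}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attrs if gains[a] >= average]
    return max(candidates, key=gain_ratio_of)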
Handling incomplete training data
How to cope with the problem that the value of some
attribute may be missing?
Example: Blood-Test-Result in a medical diagnosis problem
The strategy: use the other examples to estimate the missing value
1. Assign the value that is most common among the training examples at the node
2. Assign a probability to each value, based on the observed frequencies, and assign a value to the missing attribute according to this probability distribution
Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
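Two small sketches of these strategies (the function names are illustrative, examples are assumed to be dicts, and a missing value is assumed to be represented as None):

from collections import Counter
import random

def fill_with_most_common(examples, attr):
    # Strategy 1: assign the most common observed value of the attribute
    observed = [x[attr] for x in examples if x.get(attr) is not None]
    mode = Counter(observed).most_common(1)[0][0]
    filled = []
    for x in examples:
        if x.get(attr) is None:
            x = {**x, attr: mode}      # copy with the missing value replaced
        filled.append(x)
    return filled

def fill_by_frequency(examples, attr):
    # Strategy 2: draw a value according to the observed frequency distribution
    observed = [x[attr] for x in examples if x.get(attr) is not None]
    values, weights = zip(*Counter(observed).items())
    filled = []
    for x in examples:
        if x.get(attr) is None:
            x = {**x, attr: random.choices(values, weights=weights)[0]}
        filled.append(x)
    return filled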
Handling attributes with different costs
Instance attributes may have an associated cost: we would
prefer decision trees that use low-cost attributes
ID3 can be modified to take costs into account by replacing Gain in the attribute-selection step, e.g.:
1. Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
2. Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the relative importance of the cost
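A sketch of cost-sensitive attribute selection with these two criteria (the callables gain_of and cost_of are assumed inputs; the Nunez denominator (Cost(A) + 1)^w and the weight w follow the usual statement of that criterion and are assumptions here):

def tan_schlimmer_score(gain_value, cost):
    # Gain(S, A)^2 / Cost(A)
    return gain_value ** 2 / cost

def nunez_score(gain_value, cost, w=0.5):
    # (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] weighting the cost
    return (2 ** gain_value - 1) / (cost + 1) ** w

def select_attribute(attrs, gain_of, cost_of, score=tan_schlimmer_score):
    # gain_of / cost_of: callables giving Gain(S, A) and Cost(A) for an attribute
    return max(attrs, key=lambda a: score(gain_of(a), cost_of(a)))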