Unit II Part 1
• Prone to Over-fitting
[Figure: the general classification framework. A model is induced (learned) from the Training Set, whose records (Tid, Attrib1, Attrib2, Attrib3) carry known Class labels, and the learned model is then applied to the Test Set, whose records (e.g., Tid 11 and 15) have unknown class labels to be predicted.]
Tree Uses Nodes and Leaves
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Starting at the root: Refund = No, so follow the "No" branch to the MarSt node. MarSt = Married, so follow the "Married" branch, which leads to the leaf NO. Assign Cheat to "No".
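As an illustration, this lookup can be written as a short function. This is a minimal sketch, assuming records are plain dictionaries; the function name and key names are illustrative, not from the slides.

```python
def classify(record):
    """Traverse the example tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"                                  # Refund = Yes is a NO leaf
    if record["MarSt"] == "Married":
        return "No"                                  # Married branch is a NO leaf
    # Single or Divorced: compare taxable income (in thousands) with the 80K split.
    # The slide does not specify the boundary case; >= 80K is treated as YES here.
    return "No" if record["TaxIncK"] < 80 else "Yes"

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(classify({"Refund": "No", "MarSt": "Married", "TaxIncK": 80}))   # -> No
```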
When to consider Decision Trees
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Missing attribute values
• Examples:
– Medical diagnosis
– Credit risk analysis
– Object classification for robot manipulator (Tan 1993)
Top-Down Induction of Decision Trees
ID3 (Iterative Dichotomiser 3)
In decision tree learning (DTL), ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan to generate a decision tree from a dataset. ID3 is the precursor of the C4.5 algorithm and is typically used in the machine learning and natural language processing domains.
Pseudocode-ID3
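The procedure can be summarized as the following minimal Python sketch, assuming each example is a dictionary of nominal attribute values. The nested-dict tree representation and the helper names (entropy, information_gain, id3) are illustrative choices, not part of the algorithm's definition.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Gain(S, F) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, features, target):
    """Grow a decision tree top-down; returns a class label or a nested dict."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples share one class -> leaf
        return labels[0]
    if not features:                     # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        branch_rows = [r for r in rows if r[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(branch_rows, remaining, target)
    return tree
```

On the PlayTennis examples analysed below (features Outlook, Temperature, Humidity, Wind), the gain-based selection picks Outlook at the root, matching the hand computation that follows.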
For the PlayTennis training set S = [9+, 5-] (9 positive and 5 negative examples out of 14), the entropy is

H(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940
Information Gain

Information Gain is a statistical measure that indicates how well a given feature F separates (discriminates) the instances of an arbitrary collection of examples S according to the target classes. |S| denotes the cardinality of S, i.e., the number of elements in the set.

Gain(S, F) = H(S) - Σ_{v ∈ Values(F)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of S for which feature F has value v.

Entropy is a statistical measure from information theory that characterizes the impurity of an arbitrary collection of examples S.
Information Gain for the Wind feature

S_Weak   (Wind = Weak)   = {D1, D3, D4, D5, D8, D9, D10, D13} = [6+, 2-]
S_Strong (Wind = Strong) = {D2, D6, D7, D11, D12, D14}        = [3+, 3-]

Gain(S, Wind) = H(S) - (|S_Weak| / |S|)·H(S_Weak) - (|S_Strong| / |S|)·H(S_Strong)
              = 0.940 - (8/14)·0.811 - (6/14)·1
              = 0.940 - [0.463 + 0.428] = 0.048
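A quick standalone numerical check of this result; the two lists encode the Wind and PlayTennis columns of the Day/Wind table shown in the C4.5 section below.

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]   # D1..D14
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]                  # D1..D14

gain = H(play) - sum(
    wind.count(v) / len(wind) * H([p for w, p in zip(wind, play) if w == v])
    for v in set(wind)
)
print(round(gain, 3))   # -> 0.048
```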
Similarly compute for the other features. The information gains for the four features are:

Gain(S, Wind) = 0.048;  Gain(S, Outlook) = 0.246;  Gain(S, Humidity) = 0.151;  Gain(S, Temperature) = 0.029

Outlook has the highest information gain and is therefore the preferred feature to discriminate among the data items: it provides the best prediction of the target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain).
[Partial tree: Outlook at the root, with branches Sunny -> ?, Overcast -> Yes, Rain -> ?]

The Overcast descendant has only positive examples and therefore becomes a leaf node with classification Yes. The other two nodes (Sunny and Rain) will be further expanded by selecting the attribute with the highest information gain relative to the new subsets of examples.
H(S_Sunny) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.4·(1.32) + 0.6·(0.73) = 0.970

For the Sunny subset, Humidity splits the 5 examples as: High -> 3 examples (0+, 3-); Normal -> 2 examples (2+, 0-).

H(S_High)   = -(3/3)·log2(3/3) = 0
H(S_Normal) = -(2/2)·log2(2/2) = 0

Gain(S_Sunny, Humidity) = H(S_Sunny) - (|S_High| / |S_Sunny|)·H(S_High) - (|S_Normal| / |S_Sunny|)·H(S_Normal)
                        = 0.970 - (3/5)·0 - (2/5)·0 = 0.970
For the Sunny subset, Temperature splits the 5 examples as: Hot -> 2 examples (0+, 2-); Mild -> 2 examples (1+, 1-); Cool -> 1 example (1+, 0-).

H(S_Hot)  = -(2/2)·log2(2/2) = 0
H(S_Mild) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 0.5·(1) + 0.5·(1) = 1
H(S_Cool) = -(1/1)·log2(1/1) = 0

Gain(S_Sunny, Temperature) = H(S_Sunny) - (|S_Hot| / |S_Sunny|)·H(S_Hot) - (|S_Mild| / |S_Sunny|)·H(S_Mild) - (|S_Cool| / |S_Sunny|)·H(S_Cool)
                           = 0.970 - (2/5)·0 - (2/5)·1 - (1/5)·0 = 0.570

Hence, Humidity has the highest information gain (0.970 vs. 0.570 for Temperature) and is the preferred feature to discriminate among the Sunny examples.
[Partial tree: Outlook at the root; Sunny -> Humidity (High -> No [0+, 3-], Normal -> Yes [2+, 0-]); Overcast -> Yes; Rain -> ?]
A node becomes a leaf (and is not expanded further) when either:
• Every attribute has already been included along this path through the tree, or
• The training examples associated with the node all have the same target attribute value (i.e., their entropy is zero).
Hence, the final decision tree for the concept PlayTennis is given below.
[Final tree: Outlook at the root; Sunny -> Humidity (High -> No [0+, 3-], Normal -> Yes [2+, 0-]); Overcast -> Yes; Rain -> Wind (Strong -> No [0+, 2-], Weak -> Yes [3+, 0-])]
C4.5 - Gain Ratio

Wind Attribute

GainRatio(S, A) = Gain(S, A) / SplitInf(S, A)

SplitInf(S, A) = - Σ_{i=1..c} (|S_i| / |S|)·log2(|S_i| / |S|)

Gain(S, Wind) = 0.048;  Gain(S, Outlook) = 0.246;  Gain(S, Humidity) = 0.151;  Gain(S, Temperature) = 0.029

S_Weak   = {D1, D3, D4, D5, D8, D9, D10, D13} = [6+, 2-]
S_Strong = {D2, D6, D7, D11, D12, D14}        = [3+, 3-]

There are 8 decisions for Weak and 6 decisions for Strong.

SplitInf(S, Wind) = -(8/14)·log2(8/14) - (6/14)·log2(6/14) = 0.985

GainRatio(S, Wind) = Gain(S, Wind) / SplitInf(S, Wind) = 0.048 / 0.985 = 0.049

Day  Wind    Play Tennis
D1   Weak    No
D2   Strong  No
D3   Weak    Yes
D4   Weak    Yes
D5   Weak    Yes
D6   Strong  No
D7   Strong  Yes
D8   Weak    No
D9   Weak    Yes
D10  Weak    Yes
D11  Strong  Yes
D12  Strong  Yes
D13  Weak    Yes
D14  Strong  No
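A standalone sketch of this split-information and gain-ratio computation; the helper name split_info is an illustrative choice.

```python
from math import log2

def split_info(sizes, total):
    """SplitInf(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    return -sum(s / total * log2(s / total) for s in sizes)

# Wind partitions the 14 examples into Weak (8) and Strong (6)
si_wind = split_info([8, 6], 14)
print(round(si_wind, 3))            # -> 0.985
print(round(0.048 / si_wind, 3))    # GainRatio(S, Wind) -> 0.049
```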
C4.5-Training Examples
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny 85 85 Weak No
D2 Sunny 80 90 Strong No
D3 Overcast 83 78 Weak Yes
D4 Rain 70 96 Weak Yes
D5 Rain 68 80 Weak Yes
D6 Rain 65 70 Strong No
D7 Overcast 64 65 Strong Yes
D8 Sunny 72 95 Weak No
D9 Sunny 69 70 Weak Yes
D10 Rain 75 80 Weak Yes
D11 Sunny 75 70 Strong Yes
D12 Overcast 72 90 Strong Yes
D13 Overcast 81 75 Weak Yes
D14 Rain 71 80 Strong No
Outlook Attribute

Gain(S, Outlook) = 0.246

S_Sunny    = {D1, D2, D8, D9, D11}   = [2+, 3-]
S_Overcast = {D3, D7, D12, D13}      = [4+, 0-]
S_Rain     = {D4, D5, D6, D10, D14}  = [3+, 2-]

SplitInf(S, Outlook) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.577

GainRatio(S, Outlook) = Gain(S, Outlook) / SplitInf(S, Outlook) = 0.246 / 1.577 = 0.155

Day  Outlook   Play Tennis
D1   Sunny     No
D2   Sunny     No
D3   Overcast  Yes
D4   Rain      Yes
D5   Rain      Yes
D6   Rain      No
D7   Overcast  Yes
D8   Sunny     No
D9   Sunny     Yes
D10  Rain      Yes
D11  Sunny     Yes
D12  Overcast  Yes
D13  Overcast  Yes
D14  Rain      No
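The same check for Outlook's three-way partition (standalone sketch):

```python
from math import log2

# Outlook partitions the 14 examples into Sunny (5), Overcast (4) and Rain (5)
sizes, total = [5, 4, 5], 14
si_outlook = -sum(s / total * log2(s / total) for s in sizes)
print(round(si_outlook, 3))           # -> 1.577
print(round(0.246 / si_outlook, 3))   # GainRatio(S, Outlook) -> 0.156 (0.155 above, up to rounding)
```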
Humidity Attribute
Humidity is a continuous attribute. We need to convert continuous values to nominal ones. C4.5 proposes to perform a binary split based on a threshold value. The threshold should be the value that offers maximum gain for that attribute. Sort the humidity values from smallest to largest.

Day  Humidity  Play Tennis
D7 65 Yes
D6 70 No
D9 70 Yes
D11 70 Yes
D13 75 Yes
D3 78 Yes
D5 80 Yes
D10 80 Yes
D14 80 No
D01 85 No
D02 90 No
D12 90 Yes
D8 95 No
D4 96 Yes
Separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value. Calculate the gain (or gain ratio) for every candidate split; the value that maximizes it is chosen as the threshold.
Step 1: Consider threshold 65
<= 65 : 1 instance (1+, 0-)
>  65 : 13 instances (8+, 5-)

H(S_<=65) = -(1/1)·log2(1/1) - 0 = 0
H(S_>65)  = -(8/13)·log2(8/13) - (5/13)·log2(5/13) = 0.961

Gain(S, Humidity_65) = H(S) - (|S_<=65| / |S|)·H(S_<=65) - (|S_>65| / |S|)·H(S_>65)
                     = 0.940 - (1/14)·0 - (13/14)·0.961 = 0.048

SplitInf(S, Humidity_65) = -(1/14)·log2(1/14) - (13/14)·log2(13/14) = 0.371

GainRatio(S, Humidity_65) = Gain(S, Humidity_65) / SplitInf(S, Humidity_65) = 0.048 / 0.371 = 0.126
Step 2: Consider threshold 70
<= 70 : 4 instances (3+, 1-)
>  70 : 10 instances (6+, 4-)

H(S_<=70) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) = 0.811
H(S_>70)  = -(6/10)·log2(6/10) - (4/10)·log2(4/10) = 0.970

Gain(S, Humidity_70) = H(S) - (|S_<=70| / |S|)·H(S_<=70) - (|S_>70| / |S|)·H(S_>70)
                     = 0.940 - (4/14)·0.811 - (10/14)·0.970 = 0.014

SplitInf(S, Humidity_70) = -(4/14)·log2(4/14) - (10/14)·log2(10/14) = 0.863

GainRatio(S, Humidity_70) = Gain(S, Humidity_70) / SplitInf(S, Humidity_70) = 0.014 / 0.863 = 0.016
The above procedure is applied to all candidate thresholds.
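A standalone sketch of this threshold search using information gain; the gain-ratio variant would additionally divide by the split information of each binary partition. Function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Evaluate a binary split (<= t vs > t) at every distinct value; return the best."""
    base = entropy(labels)
    best = None
    for t in sorted(set(values))[:-1]:        # splitting at the maximum value is useless
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = (base
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]   # D1..D14
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
print(best_threshold(humidity, play))   # for these values the gain peaks at threshold 80
```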
ID3 uses information gain for splitting. C4.5 uses gain ratio for splitting. CART can handle both classification and regression tasks; it uses the Gini index to create decision points for splitting.

Gini = 1 - Σ_{i=1..M} P_i²

where i = 1 to M indexes the classes.
Outlook Attribute:

Outlook   Yes  No  Number of Instances
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5

Gini(Sunny)    = 1 - (2/5)² - (3/5)² = 0.48
Gini(Overcast) = 1 - (4/4)² - (0/4)² = 0
Gini(Rain)     = 1 - (3/5)² - (2/5)² = 0.48

Gini(Outlook) = Σ_i P(Outlook_i)·Gini(Outlook_i) = (5/14)·0.48 + (4/14)·0 + (5/14)·0.48 = 0.342

Similarly, for the Wind attribute (Weak = [6+, 2-], Strong = [3+, 3-]):

Gini(Wind) = Σ_i P(Wind_i)·Gini(Wind_i) = (8/14)·0.375 + (6/14)·0.5 = 0.428
The consolidated table is:

Feature      Gini index
Outlook      0.342 (lowest)
Temperature  0.439
Humidity     0.367
Wind         0.428

The winner is the Outlook feature because its cost (Gini index) is the lowest.
[Tree: Outlook at the root, with branches Sunny, Overcast, and Rainy]
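A standalone sketch of the weighted Gini computation, using the class-count pairs from the Outlook and Wind tables above; helper names are illustrative.

```python
def gini(counts):
    """Gini = 1 - sum_i p_i^2 over the class proportions at a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weight each branch's Gini index by the fraction of examples it receives."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
print(round(weighted_gini([(2, 3), (4, 0), (3, 2)]), 3))   # -> 0.343 (0.342 in the table)
# Wind: Weak [6+,2-], Strong [3+,3-]
print(round(weighted_gini([(6, 2), (3, 3)]), 3))           # -> 0.429 (0.428 in the table)
```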
Important Algorithm Types
Datasets may have missing values, and this can cause problems
for many machine learning algorithms. As such, it is good practice
to identify and replace missing values for each column in your
input data prior to modeling your prediction task. This is called
missing data imputation, or imputing for short.
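As a sketch of one common approach (mean imputation with scikit-learn, assuming it is installed; the toy matrix is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy feature matrix: the second row has a missing value (np.nan)
X = np.array([[85.0, 85.0],
              [80.0, np.nan],
              [83.0, 78.0]])

imputer = SimpleImputer(strategy="mean")   # replace NaNs with the column mean
print(imputer.fit_transform(X))
```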
EXAMPLE - Construct a DT using Information Gain

Channel     Variance  Image Type
Monochrome  Low       BW
RGB         Low       BW
RGB         High      Color

Solution:

H(S_High) = -(1/1)·log2(1/1) = 0   (the Variance = High subset contains a single Color example)

InformationGain(S, Channel) = 0.256;  InformationGain(S, Variance) = 0.9183

Variance has the higher information gain and is therefore selected as the root attribute.
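A quick standalone check of the Variance gain for this three-row example; binary entropy is computed from class counts, and the function name is illustrative.

```python
from math import log2

def H(pos, neg):
    """Binary entropy from class counts."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

h_s = H(2, 1)                                              # S = {BW, BW, Color}
gain_variance = h_s - (2/3) * H(2, 0) - (1/3) * H(0, 1)    # Low -> [2 BW], High -> [1 Color]
print(round(h_s, 4), round(gain_variance, 4))              # -> 0.9183 0.9183
```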
Satellite Image Classification using Decision Tree
High Resolution Images