07_Decision tree
Decision Tree
Leaf node
Choosing a good attribute
• Would we prefer to split on X1 or X2?
$IG(S, A) = H(S) - \sum_{t \in T} P(t)\, H(t)$
$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$
• $H(\text{overcast}) = -\frac{4}{4}\log_2\frac{4}{4} - 0 = 0$ (4 out of 4 overcast days are “yes”)
• $H(\text{rainy}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$ (3 out of 5 rainy days are “yes” and 2 out of 5 are “no”)
• $H(\text{cold}) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.811$ (3 out of 4 cold days are “yes” and 1 out of 4 is “no”)
• $H(\text{mild}) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} = 0.918$ (4 out of 6 mild days are “yes” and 2 out of 6 are “no”)
• $H(\text{normal}) = -\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7} = 0.592$ (6 out of 7 normal days are “yes” and 1 out of 7 is “no”)
• $H(\text{strong}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$ (3 out of 6 strong days are “yes” and 3 out of 6 are “no”)
[Figure: partial tree. Root: Outlook. Sunny -> ?; Overcast -> Yes; Rainy -> ?]
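The entropy values above, and the choice of outlook as the root, can be checked with a short Python sketch. The helper names `entropy` and `information_gain` are illustrative and not from the slides. The (yes, no) counts for overcast, rainy, cold, mild, normal, and strong come from the bullets above; the counts for sunny, hot, high, and weak are not listed there and are derived here by subtracting from the 9 yes / 5 no totals, so treat them as an assumption.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with a zero count are skipped (0 * log2(0) is taken as 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """IG(S, A) = H(S) - sum over values t of P(t) * H(t)."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

PARENT = (9, 5)  # (yes, no) over all 14 days

# (yes, no) counts per feature value; values marked "derived" are assumptions.
SPLITS = {
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},   # sunny derived
    "temp": {"hot": (2, 2), "mild": (4, 2), "cold": (3, 1)},             # hot derived
    "humidity": {"high": (3, 4), "normal": (6, 1)},                      # high derived
    "wind": {"weak": (6, 2), "strong": (3, 3)},                          # weak derived
}

gains = {
    feature: information_gain(PARENT, list(children.values()))
    for feature, children in SPLITS.items()
}

for feature, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"IG(S, {feature}) = {gain:.3f}")
```

Under these counts outlook has the largest gain, which is consistent with the tree shown on this slide.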
Solution
• Next, from the remaining three features temp., humidity, and wind, we decide which
one is the best for the left branch of outlook.
• Since the left branch of outlook denotes sunny, we will work with the set of rows
having sunny as the value in the outlook column.
• Calculate entropy for this subset (outlook = sunny):
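• $H(\text{cold}) = -1\log_2 1 - 0 = 0$
• $H(\text{mild}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
• $H(\text{normal}) = -\frac{2}{2}\log_2\frac{2}{2} - 0 = 0$
• $H(\text{strong}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$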
[Figure: partial tree. Root: Outlook. Sunny -> Humidity (High: No, Normal: Yes); Overcast -> Yes; Rainy -> ?]
Solution
• Next, from the remaining two features temp. and wind, we decide which one is the
best for splitting the data.
• Since the remaining branch of outlook denotes rainy, we will work with the set of
rows having rainy as the value in the outlook column.
• Calculate entropy for this subset (outlook = rainy):
• $H(\text{mild}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$
• $H(\text{strong}) = 0 - \frac{2}{2}\log_2\frac{2}{2} = 0$
[Figure: final decision tree rooted at Outlook, with leaf labels No, Yes, Yes, No]
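The recursive procedure followed in this solution (pick the feature with the largest IG, split, repeat on each impure branch) can be sketched in Python as below. This is a minimal ID3-style sketch, not the slides' own implementation. The function names are illustrative, and the rows are an assumption: the categorical feature values are taken from the Example 3 table and the play-tennis labels from the Example 2 table, assuming both describe the same 14 days.

```python
from collections import Counter
from math import log2

FEATURES = ["outlook", "temp", "humidity", "wind"]

# Assumed dataset: (outlook, temp, humidity, wind, play tennis) for days 1..14.
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rainy", "mild", "high", "weak", "yes"),
    ("rainy", "cold", "normal", "weak", "yes"),
    ("rainy", "cold", "normal", "strong", "no"),
    ("overcast", "cold", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cold", "normal", "weak", "yes"),
    ("rainy", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rainy", "mild", "high", "strong", "no"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, idx):
    """Information gain of splitting `rows` on the feature at column `idx`."""
    remainder = 0.0
    for value in {r[idx] for r in rows}:
        subset = [r[-1] for r in rows if r[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

def build(rows, feature_idxs):
    """Return a nested dict for internal nodes and a label string for leaves."""
    labels = [r[-1] for r in rows]
    # Stop when the branch is pure or no features remain (majority vote).
    if len(set(labels)) == 1 or not feature_idxs:
        return Counter(labels).most_common(1)[0][0]
    best = max(feature_idxs, key=lambda i: info_gain(rows, i))
    rest = [i for i in feature_idxs if i != best]
    return {
        FEATURES[best]: {
            value: build([r for r in rows if r[best] == value], rest)
            for value in sorted({r[best] for r in rows})
        }
    }

tree = build(ROWS, [0, 1, 2, 3])
print(tree)
```

With these rows the sketch selects outlook at the root, humidity under sunny, and wind under rainy, which agrees with the branch-by-branch entropies worked out above.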
Real-valued features
• Real-life data often contain numeric values or a mixture of feature types, whereas decision trees, as constructed above, work with categorical values.
• Discretization is a pre-processing step that changes numeric values to categorical
ones by finding sub-intervals.
• Binary split is a discretization method based on a threshold value (“greater than or
equal to” and “less than”).
• Splitting on feature 𝑥 at value 𝑡:
• One branch: 𝑥 ≥ 𝑡
• Other branch: 𝑥 < 𝑡
• In a binary split, the aim is to maximize $IG(S \mid x : t)$,
• i.e. the threshold $t$ should maximize the information gain for feature $x$ in dataset $S$.
Example 2
• For the shown dataset ($S$), it is required to construct a decision tree to decide whether to play tennis or not based on the weather conditions.
Day  Outlook   Temp.  Humidity  Wind    Play tennis
1    Sunny     85     85        Weak    No
2    Sunny     80     90        Strong  No
3    Overcast  83     78        Weak    Yes
4    Rainy     70     96        Weak    Yes
5    Rainy     68     80        Weak    Yes
6    Rainy     65     70        Strong  No
7    Overcast  64     65        Strong  Yes
8    Sunny     72     95        Weak    No
9    Sunny     69     70        Weak    Yes
10   Rainy     75     80        Weak    Yes
11   Sunny     75     70        Strong  Yes
12   Overcast  72     90        Strong  Yes
13   Overcast  81     75        Weak    Yes
14   Rainy     71     80        Strong  No
Solution
• The continuous values of the humidity and temp. features need to be converted to categorical ones.
• We will convert the humidity values using binary discretization.
• Binary discretization steps:
1. Sort the values from smallest to largest.
2. Iterate over all values and, at each one, separate the dataset into two parts.
3. Calculate the gain for every step (value).
4. The value that maximizes the gain is the threshold.
Humidity values sorted in ascending order (the first column is the position in the sorted list, not the original day number):
No.  Humidity  Play tennis
1    65        Yes
2    70        No
3    70        Yes
4    70        Yes
5    75        Yes
6    78        Yes
7    80        Yes
8    80        Yes
9    80        No
10   85        No
11   90        No
12   90        Yes
13   95        No
14   96        Yes
Solution
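$H(\text{humidity} \le 65) = -p(\text{no})\log_2 p(\text{no}) - p(\text{yes})\log_2 p(\text{yes}) = -\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1} = 0$
$H(\text{humidity} > 65) = -\frac{5}{13}\log_2\frac{5}{13} - \frac{8}{13}\log_2\frac{8}{13} = 0.53 + 0.431 = 0.961$
$IG(\text{humidity}, 65) = 0.94 - \left(\frac{1}{14} \times 0 + \frac{13}{14} \times 0.961\right) = 0.94 - 0.892 = 0.048$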
Solution
• IG is maximized when the humidity threshold is 80.
• Hence, the threshold is equal to 80.
Humidity  IG
65        0.048
70        0.014
75        0.045
78        0.090
80        0.101
85        0.024
90        0.010
95        0.048
96        (not evaluated: humidity cannot be greater than this value)
Dataset after discretizing humidity:
Day  Outlook   Temp.  Humidity>80  Wind    Play tennis
1    Sunny     85     yes          Weak    No
2    Sunny     80     yes          Strong  No
3    Overcast  83     no           Weak    Yes
4    Rainy     70     yes          Weak    Yes
5    Rainy     68     no           Weak    Yes
6    Rainy     65     no           Strong  No
7    Overcast  64     no           Strong  Yes
8    Sunny     72     yes          Weak    No
9    Sunny     69     no           Weak    Yes
10   Rainy     75     no           Weak    Yes
11   Sunny     75     no           Strong  Yes
12   Overcast  72     yes          Strong  Yes
13   Overcast  81     no           Weak    Yes
14   Rainy     71     no           Strong  No
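The threshold search for humidity can be reproduced with the Python sketch below. The function names are illustrative. The split convention used here (one branch `x <= t`, the other `x > t`) is an assumption chosen to match the worked calculation for 65 and the "Humidity>80" column, rather than the "greater than or equal to" wording used earlier. Values are computed at full precision, so some third decimals may differ slightly from the table.

```python
from math import log2

# Humidity and play-tennis columns of the Example 2 dataset, days 1..14.
HUMIDITY = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

def entropy(labels):
    """Entropy in bits of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gain_at(values, labels, t):
    """Information gain of the binary split x <= t versus x > t."""
    left = [y for x, y in zip(values, labels) if x <= t]
    right = [y for x, y in zip(values, labels) if x > t]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try every distinct value except the largest and keep the best one."""
    # The largest value is skipped: nothing can fall on the "greater" side.
    candidates = sorted(set(values))[:-1]
    gains = {t: gain_at(values, labels, t) for t in candidates}
    return max(gains, key=gains.get), gains

threshold, gains = best_threshold(HUMIDITY, PLAY)
for t, g in gains.items():
    print(f"humidity <= {t}: IG = {g:.3f}")
print("best threshold:", threshold)
```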
Solution
• If you change the continuous values of temp. to categorical values and continue
solving, you will get:
[Figure: final decision tree rooted at Outlook, with leaf labels No, Yes, Yes, No]
Regression tree
• Standard deviation reduction (𝑆𝐷𝑅) is used instead of IG for constructing a
regression decision tree.
• It involves partitioning the data into subsets that contain instances with similar values (homogeneous).
• Standard deviation ($SD$) is used to measure the homogeneity of numerical samples: $SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
• If the numerical samples are completely homogeneous, their standard deviation is zero.
• Branching termination criteria are:
• when the coefficient of variation ($CV$) for a branch becomes smaller than a certain threshold, where $CV = \frac{SD}{\bar{x}} \times 100\%$
• when too few instances ($n$) remain in the branch.
Standard deviation
• For each feature of the dataset, calculate 𝑆𝐷 for all its values then calculate 𝑆𝐷𝑅
for the feature.
$SD(S, A) = \sum_{t \in T} P(t)\, SD(t)$
$SDR(S, A) = SD(S) - SD(S, A)$
Example 3
• For the shown dataset ($S$), it is required to construct a regression tree to decide the hours to play tennis based on the weather conditions.
Day  Outlook   Temp.  Humidity  Wind    Hours played
1    Sunny     Hot    High      Weak    25
2    Sunny     Hot    High      Strong  30
3    Overcast  Hot    High      Weak    46
4    Rainy     Mild   High      Weak    45
5    Rainy     Cold   Normal    Weak    52
6    Rainy     Cold   Normal    Strong  23
7    Overcast  Cold   Normal    Strong  43
8    Sunny     Mild   High      Weak    35
9    Sunny     Cold   Normal    Weak    38
10   Rainy     Mild   Normal    Weak    46
11   Sunny     Mild   Normal    Strong  48
12   Overcast  Mild   High      Strong  52
13   Overcast  Hot    Normal    Weak    44
14   Rainy     Mild   High      Strong  30
Solution
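$SD(\text{hours}, \text{outlook}) = P(\text{sunny}) \times SD(\text{sunny}) + P(\text{overcast}) \times SD(\text{overcast}) + P(\text{rainy}) \times SD(\text{rainy})$
$SD(\text{hours}, \text{outlook}) = \frac{5}{14} \times 7.78 + \frac{4}{14} \times 3.49 + \frac{5}{14} \times 10.87 = 7.66$
$SD(\text{hours}) = 9.32$
$SDR(\text{hours}, \text{outlook}) = SD(\text{hours}) - SD(\text{hours}, \text{outlook})$
$SDR(\text{hours}, \text{outlook}) = 9.32 - 7.66 = 1.66$
Outlook   SD (hours played)  n
Sunny     7.78               5
Overcast  3.49               4
Rainy     10.87              5
Note: it is assumed that you know how to calculate the SD of a subset, for example sunny.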
Solution
Outlook   SD(hours)  n
Sunny     7.78       5
Overcast  3.49       4
Rainy     10.87      5
SD(hours, outlook) = 7.66
SDR(hours, outlook) = 9.32 - 7.66 = 1.66

Temp.     SD(hours)  n
Cold      10.51      4
Hot       8.95       4
Mild      7.65       6
SD(hours, temp.) = 8.84
SDR(hours, temp.) = 9.32 - 8.84 = 0.48

Humidity  SD(hours)  n
High      9.36       7
Normal    8.73       7
SD(hours, humidity) = 9.05
SDR(hours, humidity) = 9.32 - 9.05 = 0.27

Wind      SD(hours)  n
Weak      7.87       8
Strong    10.59      6
SD(hours, wind) = 9.04
SDR(hours, wind) = 9.32 - 9.04 = 0.28
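The per-feature SDR values can be recomputed from the Example 3 table with the Python sketch below. The function names `sd` and `sdr` are illustrative. The sketch assumes the population form of the standard deviation (division by $n$), which is the form that reproduces $SD(\text{hours}) = 9.32$.

```python
from math import sqrt

FEATURES = ["outlook", "temp", "humidity", "wind"]

# Example 3 dataset: (outlook, temp, humidity, wind, hours played), days 1..14.
ROWS = [
    ("sunny", "hot", "high", "weak", 25),
    ("sunny", "hot", "high", "strong", 30),
    ("overcast", "hot", "high", "weak", 46),
    ("rainy", "mild", "high", "weak", 45),
    ("rainy", "cold", "normal", "weak", 52),
    ("rainy", "cold", "normal", "strong", 23),
    ("overcast", "cold", "normal", "strong", 43),
    ("sunny", "mild", "high", "weak", 35),
    ("sunny", "cold", "normal", "weak", 38),
    ("rainy", "mild", "normal", "weak", 46),
    ("sunny", "mild", "normal", "strong", 48),
    ("overcast", "mild", "high", "strong", 52),
    ("overcast", "hot", "normal", "weak", 44),
    ("rainy", "mild", "high", "strong", 30),
]

def sd(values):
    """Population standard deviation (divides by n, not n - 1)."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, idx):
    """SDR(S, A) = SD(S) - sum over values t of P(t) * SD(t)."""
    weighted = 0.0
    for value in {r[idx] for r in rows}:
        subset = [r[-1] for r in rows if r[idx] == value]
        weighted += len(subset) / len(rows) * sd(subset)
    return sd([r[-1] for r in rows]) - weighted

reductions = {name: sdr(ROWS, i) for i, name in enumerate(FEATURES)}
for name, value in sorted(reductions.items(), key=lambda kv: -kv[1]):
    print(f"SDR(hours, {name}) = {value:.2f}")
```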
Solution
• The feature with the largest SDR is outlook which is selected to be the root for the
tree.
• The dataset is divided based on the values of the selected feature.
• This process is run recursively on the non-leaf branches until all data is processed.
• Termination criteria are:
• 𝐶𝑉 ≤ 10%
• and/or 𝑛 ≤ 3.
Solution
• Calculate the average of hours (AVG) and CV for all values (sunny, overcast, rainy) of the
outlook feature.
• Overcast subset does not need splitting because its 𝐶𝑉 = 8% is less than the threshold
(10%).
• The related leaf node of the overcast gets the average of the overcast subset.
Outlook   SD (hours)  AVG (hours)  CV (hours)  n
Sunny     7.78        35.2         22%         5
Overcast  3.49        46.3         8%          4
Rainy     10.87       39.2         28%         5
[Figure: initial tree. Root: Outlook. Rainy -> ?; Overcast -> 46.3; Sunny -> ?]
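The AVG and CV columns, and the decision about which branch becomes a leaf, can be checked with a few lines of Python. The names below are illustrative. The thresholds are the ones stated in the termination criteria (CV of at most 10%, or at most 3 instances), and combining them with "or" is an assumption about how the "and/or" is meant.

```python
from math import sqrt

# Hours played, grouped by outlook value (from the Example 3 table).
HOURS_BY_OUTLOOK = {
    "sunny": [25, 30, 35, 38, 48],
    "overcast": [46, 43, 52, 44],
    "rainy": [45, 52, 23, 46, 30],
}
CV_LIMIT = 10.0   # percent
MIN_ROWS = 3      # branches with this many rows or fewer are not split

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def cv(values):
    """Coefficient of variation in percent."""
    return sd(values) / mean(values) * 100

summary = {}
for value, hours in HOURS_BY_OUTLOOK.items():
    summary[value] = {
        "avg": mean(hours),
        "cv": cv(hours),
        # A branch becomes a leaf when either stopping rule fires.
        "leaf": cv(hours) <= CV_LIMIT or len(hours) <= MIN_ROWS,
    }
    print(value, f"AVG={summary[value]['avg']:.1f}",
          f"CV={summary[value]['cv']:.0f}%", "leaf" if summary[value]["leaf"] else "split")
```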
Solution
• Rainy branch has 𝐶𝑉 = 28% which is greater than the given threshold (10%).
Hence, this branch needs further splitting.
• $SD(\text{hours}, \text{rainy}) = 10.87$; this is the SD of the remaining sub-dataset when outlook = rainy.
• Then, calculate SDR for each of the features temp., humidity, and wind.
Outlook Temp. Humidity Wind Hours played
Rainy Mild High Weak 45
Rainy Cold Normal Weak 52
Rainy Cold Normal Strong 23
Rainy Mild Normal Weak 46
Rainy Mild High Strong 30
Solution
Temp.     SD(hours)  n
Cold      14.5       2
Mild      7.32       3
SD(hours, temp.) = 10.19
SDR(hours, temp.) = 10.87 - 10.19 = 0.68

Humidity  SD(hours)  n
High      7.5        2
Normal    12.5       3
SD(hours, humidity) = 10.5
SDR(hours, humidity) = 10.87 - 10.5 = 0.37

Wind      SD(hours)  n
Weak      3.09       3
Strong    3.5        2
SD(hours, wind) = 3.25
SDR(hours, wind) = 10.87 - 3.25 = 7.62
Solution
• Wind has the largest SDR.
• Because both branches (weak and strong) contain 3 or fewer instances ($n \le 3$), we stop further branching and assign the average of each branch to the related leaf node.
• $AVG(\text{wind} = \text{weak}) = 47.7$
• $AVG(\text{wind} = \text{strong}) = 26.5$
[Figure: tree so far. Root: Outlook. Rainy -> Wind (Weak: 47.7, Strong: 26.5); Overcast -> 46.3; Sunny -> ?]
Solution
• Sunny branch has 𝐶𝑉 = 22% which is greater than the given threshold (10%). Hence,
this branch needs further splitting.
• $SD(\text{hours}, \text{sunny}) = 7.78$; this is the SD of the remaining sub-dataset when outlook = sunny.
• Then, calculate SDR for each of the features temp., humidity, and wind.
Wind      SD(hours)  n
Weak      5.6        3
Strong    9          2
SD(hours, wind) = 6.96
SDR(hours, wind) = 7.78 - 6.96 = 0.82
Solution
• Temp. has the largest SDR.
• Because all of temp.'s branches (cold, hot, and mild) contain 3 or fewer instances ($n \le 3$), we stop further branching and assign the average of each branch to the related leaf node.
• 𝐴𝑉𝐺(𝑡𝑒𝑚𝑝. = 𝑐𝑜𝑙𝑑) = 38
• 𝐴𝑉𝐺(𝑡𝑒𝑚𝑝. = ℎ𝑜𝑡) = 27.5
• 𝐴𝑉𝐺(𝑡𝑒𝑚𝑝. = 𝑚𝑖𝑙𝑑) = 41.5
Solution
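[Figure: final regression tree. Root: Outlook. Rainy -> Wind (Weak: 47.7, Strong: 26.5); Overcast -> 46.3; Sunny -> Temp. (Cold: 38, Hot: 27.5, Mild: 41.5)]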
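The whole construction for Example 3 can be sketched as one recursive Python function. This is a minimal sketch under stated assumptions, not a reference implementation: the function names are illustrative, the SD is the population form, a branch stops when CV is at most 10% or when it holds at most 3 rows (reading "and/or" as "or"), and leaf averages are rounded to two decimals.

```python
from math import sqrt

FEATURES = ["outlook", "temp", "humidity", "wind"]
CV_LIMIT = 10.0   # percent
MIN_ROWS = 3

# Example 3 dataset: (outlook, temp, humidity, wind, hours played), days 1..14.
ROWS = [
    ("sunny", "hot", "high", "weak", 25),
    ("sunny", "hot", "high", "strong", 30),
    ("overcast", "hot", "high", "weak", 46),
    ("rainy", "mild", "high", "weak", 45),
    ("rainy", "cold", "normal", "weak", 52),
    ("rainy", "cold", "normal", "strong", 23),
    ("overcast", "cold", "normal", "strong", 43),
    ("sunny", "mild", "high", "weak", 35),
    ("sunny", "cold", "normal", "weak", 38),
    ("rainy", "mild", "normal", "weak", 46),
    ("sunny", "mild", "normal", "strong", 48),
    ("overcast", "mild", "high", "strong", 52),
    ("overcast", "hot", "normal", "weak", 44),
    ("rainy", "mild", "high", "strong", 30),
]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def weighted_sd(rows, idx):
    """SD(S, A): SD of each subset weighted by its share of the rows."""
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r[-1])
    return sum(len(g) / len(rows) * sd(g) for g in groups.values())

def build(rows, feature_idxs):
    """Return a nested dict for internal nodes and a mean value for leaves."""
    hours = [r[-1] for r in rows]
    avg = mean(hours)
    cv = sd(hours) / avg * 100
    if len(rows) <= MIN_ROWS or cv <= CV_LIMIT or not feature_idxs:
        return round(avg, 2)
    # SD(S) is fixed for the branch, so the smallest weighted SD gives the largest SDR.
    best = min(feature_idxs, key=lambda i: weighted_sd(rows, i))
    rest = [i for i in feature_idxs if i != best]
    return {
        FEATURES[best]: {
            value: build([r for r in rows if r[best] == value], rest)
            for value in sorted({r[best] for r in rows})
        }
    }

tree = build(ROWS, [0, 1, 2, 3])
print(tree)
```

Under these assumptions the sketch splits on outlook at the root, on wind under rainy, and on temp. under sunny, and leaves overcast as a leaf, which agrees with the step-by-step solution.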