06 - Decision Trees
2022/23

AGENDA
- Decision Trees
- Classification Trees
- Regression Trees
Interpretability
- In some problems we are only interested in achieving the best possible predictive performance. In others we are more interested in understanding the results and the way the model produces its estimates.
- Sometimes the reasons that underlie certain decisions are of paramount importance.
1 Decision Trees
[Figure: training examples plotted over Income and Age, and the decision tree that partitions them. The root tests Age ≥ 3; its No branch is a leaf with class counts (4,0); its Yes branch tests Income ≥ 3 and then Age ≥ 6 or Age ≥ 5. Terminal nodes are called leaf nodes.]
1.1 Decision Trees
Classification Trees vs Regression Trees
[Figure: a classification tree (root Age ≥ 3, then Income ≥ 3, Age ≥ 6 and Age ≥ 5, with class counts such as (4,0) in the leaves) shown next to a regression tree (root x ≥ 3.2, then x ≥ 5.1 and x ≥ 6.1, with numeric predictions 2.0, 3.5, 4.5 and 5.0 in the leaves), both over Income and Age data.]
1.2 Decision Trees
Rule extraction from trees
- Each edge adds a conjunction (∧)
- Each new leaf adds a disjunction (∨)
Each root-to-leaf path of the tree therefore corresponds to one rule (a sketch of this extraction follows below):
- age < 3
- age ≥ 3 ∧ income < 3 ∧ age < 6
- age ≥ 3 ∧ income < 3 ∧ age ≥ 6
- age ≥ 3 ∧ income ≥ 3 ∧ age < 5
- age ≥ 3 ∧ income ≥ 3 ∧ age ≥ 5
Joining the paths that lead to the same class gives a disjunction of conjunctions:
(age < 3) ∨ (age ≥ 3 ∧ income < 3 ∧ age < 6) ∨ (age ≥ 3 ∧ income ≥ 3 ∧ age ≥ 5)
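Below is a minimal Python sketch of this rule extraction, assuming a small hand-built node structure; the class labels A and B and the exact thresholds are illustrative stand-ins for the age/income tree above, not the course's implementation.

```python
# Sketch: turn every root-to-leaf path of a tree into an IF-THEN rule.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None      # attribute tested at this node (None for a leaf)
    threshold: Optional[float] = None  # test applied: feature >= threshold
    left: Optional["Node"] = None      # branch taken when the test is False
    right: Optional["Node"] = None     # branch taken when the test is True
    label: Optional[str] = None        # class label if this node is a leaf

def extract_rules(node, conditions=()):
    """Each edge adds a conjunct; each leaf closes one rule (one disjunct per leaf)."""
    if node.label is not None:
        return [(" ∧ ".join(conditions) or "TRUE", node.label)]
    no_side  = extract_rules(node.left,  conditions + (f"{node.feature} < {node.threshold}",))
    yes_side = extract_rules(node.right, conditions + (f"{node.feature} ≥ {node.threshold}",))
    return no_side + yes_side

# Hypothetical tree mirroring the example: age ≥ 3, then income ≥ 3, then age ≥ 6 / age ≥ 5.
tree = Node("age", 3,
            left=Node(label="A"),
            right=Node("income", 3,
                       left=Node("age", 6, left=Node(label="A"), right=Node(label="B")),
                       right=Node("age", 5, left=Node(label="B"), right=Node(label="A"))))

for conjunction, label in extract_rules(tree):
    print(f"IF {conjunction} THEN class = {label}")
```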
Interpretation
- Easily understand the underlying reason for each decision
- Automatic identification of the attributes that are most relevant in each case
- The most relevant attributes appear in the top part of the tree
1.4 Decision Trees
Decision tree induction (or how to build trees)
A training set is used to induce the tree. Two key questions arise: which variable to query at each node, and when to stop?
DECISION TREES
- DDT: Divisive Decision Tree (Hunt, 1962)
- ID3, C4.5, C5: Iterative Dichotomizer 3 and its successors (Quinlan, 1986, 1993)
- CART: Classification and Regression Trees (Breiman, 1984)
- CHAID: Chi-Squared Automatic Interaction Detection (Hartigan, 1975)
- …
2 Classification Trees
What are Classification Trees?
Tree models where the target variable can take a discrete set of values.
2.1 Classification Trees DDT
DDT
Major characteristics
- It is a greedy search
- There is no “backtracking” (once a partition is done there is no re-evaluation)
- It can become stuck in a local minimum
- It uses discriminative power as the selection measure
[Figure: example data set of cells. The class (dependent variable) is the diagnosis, e.g. Lethargia or Burpoma; the independent variables describe each cell, e.g. the quantity of nuclei (1 or 2), the number of tails and the body colour.]

$$\text{Discriminative power} = \frac{1}{n}\sum_{i} |C_i|$$

where $|C_i|$ can be read as the number of correctly discriminated cases (the majority class) in partition $i$ and $n$ is the total number of cases; a sketch of this computation follows the example below.
Choice: # tails
2.1 Classification Trees DDT
DDT
Splitting on the number of tails gives the following class distribution:

            Tails = one   Tails = two
Lethargia        3             2
Healthy          1             1

[Figure: the tree is grown step by step. The root splits on Tails (one / two); its branches are then split further, e.g. on body colour (Light / Dark), giving leaves such as Lethargia (1) and Healthy (2), or on Nuclei (one / two).]
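The slides do not spell out exactly how $|C_i|$ is counted; below is a minimal sketch under the assumption that it is the number of majority-class cases in each branch produced by the split, with toy label lists that mirror the counts in the table above.

```python
# Sketch: discriminative power of a candidate split = (sum of majority-class counts per branch) / n.
from collections import Counter

def discriminative_power(partitions):
    """partitions: one list of class labels per branch created by the split."""
    n = sum(len(p) for p in partitions)                            # total number of cases
    majority = sum(max(Counter(p).values()) for p in partitions)   # sum of |C_i| per branch
    return majority / n

# Splitting on "# tails": branch 'one' and branch 'two', labels taken from the table above.
tails_split = [
    ["Lethargia", "Lethargia", "Lethargia", "Healthy"],  # tails == one
    ["Lethargia", "Lethargia", "Healthy"],               # tails == two
]
print(discriminative_power(tails_split))  # 5/7 ≈ 0.71
```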
Major characteristics
- It uses entropy to measure the “disorder” in each independent variable
- From entropy we can calculate the information gain, which is used as the selection measure
- ID3 handles only categorical attributes, while C4.5 can also deal with numeric values
The nomenclature
Let $D$ be a training set of class-labeled tuples.
Let $C_{i,D}$ be the set of tuples of class $C_i$ in $D$, and let $|C_{i,D}|$ and $|D|$ be the number of tuples in $C_{i,D}$ and in $D$, respectively.
Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}| / |D|$.
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5
$$Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

If we choose attribute A to split the node containing the tuples in D into v partitions, then the information still needed to classify D is:

$$Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$$

The information gain of A is the reduction in entropy obtained by splitting on A: $Gain(A) = Entropy(D) - Entropy_A(D)$.

Example (AllElectronics data set, target Buy Computer: 9 Yes, 5 No):

$$Entropy(D) = E(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
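A minimal Python sketch of this entropy computation (NumPy is assumed to be available):

```python
import numpy as np

def entropy(counts):
    """Entropy(D) = -sum_i p_i * log2(p_i), given the class counts of a node."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()   # class probabilities, skipping empty classes
    return float(-(p * np.log2(p)).sum())

print(f"{entropy([9, 5]):.3f}")  # 0.940, i.e. E(9, 5) for the Buy Computer example
```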
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5
Partitioning D by age:

age         yes   no   total   entropy
<=30          2    3       5    0.971
[31..40]      4    0       4    0.000
>40           3    2       5    0.971

$$Entropy_{age<=30}(D) = E(2,3) = -\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right) = 0.971$$

$$Entropy_{age\,31..40}(D) = E(4,0) = -\frac{4}{4}\log_2\left(\frac{4}{4}\right) - \frac{0}{4}\log_2\left(\frac{0}{4}\right) = 0$$

$$Entropy_{age>40}(D) = E(3,2) = -\frac{3}{5}\log_2\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\left(\frac{2}{5}\right) = 0.971$$

Applying $Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$:

$$Entropy_{age}(D) = \frac{5}{14}E(2,3) + \frac{4}{14}E(4,0) + \frac{5}{14}E(3,2) = 0.694$$

Here $\frac{5}{14}E(2,3)$ means that “age <= 30” has 5 out of the 14 samples, with 2 yes and 3 no. The gain is therefore $Gain(age) = 0.940 - 0.694 = 0.246$.
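The same computation as a sketch, reusing the entropy() helper from the previous snippet; the class counts are the ones in the table above.

```python
def information_gain(parent_counts, partition_counts):
    """Gain(A) = Entropy(D) - sum_j |D_j|/|D| * Entropy(D_j)."""
    n = sum(sum(c) for c in partition_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in partition_counts)
    return entropy(parent_counts) - weighted

# (yes, no) counts per age partition: <=30, [31..40], >40
age_partitions = [(2, 3), (4, 0), (3, 2)]
print(f"{information_gain((9, 5), age_partitions):.3f}")  # 0.247 (the slides' 0.246 uses rounded intermediates)
```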
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5
Since age has the highest information gain, it is the selected attribute. Splitting on Age partitions the training set, for example:

age <= 30:
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
<=30    medium   yes       excellent       yes

age > 40:
age     income   student   credit_rating   buys_computer
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
>40     medium   yes       fair            yes
>40     medium   no        excellent       no
But what if I want to use the original age values (algorithm C4.5)?
Originally, Age is a continuous-valued attribute (sorted values: 18, 20, 21, 22, 26, 27, 29, 31, 35, 41, 42, 45, 48, 49, …). We must determine the best split point for Age:
- Sort the values of Age
- Every midpoint between each pair of adjacent values is a possible split point
- Evaluate $Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$ with two partitions (Partition 1 < split point < Partition 2)
- The point with the minimum entropy for Age is selected as the split point (see the sketch below)
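A minimal sketch of this split-point search, reusing entropy() from above; the buys_computer labels paired with the sorted ages are hypothetical placeholders, since the slide does not list them.

```python
from collections import Counter

def best_split_point(values, labels):
    """Try every midpoint between adjacent sorted values; keep the one with minimum weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        split = (v1 + v2) / 2                                   # candidate midpoint
        left  = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        weighted = (len(left) / len(pairs)) * entropy(list(Counter(left).values())) \
                 + (len(right) / len(pairs)) * entropy(list(Counter(right).values()))
        if weighted < best[0]:
            best = (weighted, split)
    return best  # (weighted entropy, split point)

ages   = [18, 20, 21, 22, 26, 27, 29, 31, 35, 41, 42, 45, 48, 49]
labels = ["no", "no", "no", "yes", "yes", "no", "yes",
          "yes", "yes", "yes", "yes", "no", "yes", "no"]  # hypothetical labels
print(best_split_point(ages, labels))
```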
2.3 Classification Trees CART
Gini Index - CART
- If a data set D contains examples from n classes, the Gini index is defined as:

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

- If a data set D is split on Age into two subsets D1 and D2, the Gini index of the split is defined as:

$$gini_{Age}(D) = \frac{|D_1|}{|D|}gini(D_1) + \frac{|D_2|}{|D|}gini(D_2)$$

- The attribute that maximizes the reduction in impurity is selected as the splitting attribute.

For the AllElectronics example (Buy Computer: 9 Yes, 5 No):

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Candidate binary splits on Age are then compared with $gini_{Age}(D)$ as defined above (a sketch follows below).
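A minimal sketch of both Gini formulas; the binary split counts on the last line are hypothetical, only to show how a candidate split would be scored.

```python
import numpy as np

def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts of a node."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_split(partition_counts):
    """Weighted Gini of a split: sum_k |D_k|/|D| * gini(D_k)."""
    n = sum(sum(c) for c in partition_counts)
    return sum(sum(c) / n * gini(c) for c in partition_counts)

print(f"{gini([9, 5]):.3f}")                  # 0.459 for the AllElectronics node
print(f"{gini_split([(6, 1), (3, 4)]):.3f}")  # hypothetical split into D1=(6 yes, 1 no), D2=(3 yes, 4 no)
```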
In a regression problem…
Some problems are easily fitted with a linear regression.
[Figure: scatter plot of Money Spent (€) versus Age with a fitted regression line.]
3 Regression Trees
As an example, if someone is 64 years old, we will predict that the money spent would be around 50€.
- In some data sets we should use methods other than straight lines to make predictions.
- A Regression tree is a type of Decision tree where each leaf represents a numeric value, and not a discrete category like in Classification Decision Trees.
3 Regression Trees
However, I could obtain the same answer just by looking at the plot! With more than one predictor, though, a tree becomes genuinely useful:
[Figure: a regression tree whose root tests Age < 24; the Yes branch tests #Kids < 2 and then #months < 7, ending in a leaf that predicts 6€ spent; the No branches continue with further splits.]
3 Regression Trees
[Figure: scatter plot of Money Spent versus Age (ages roughly 5 to 65) used to grow the regression tree, with a first candidate split, Age < 20, drawn on the data.]
3.1 Regression Trees MSE
Using MSE
Consider the split Age < 20. The 11 samples in the No branch (Age ≥ 20) have ages 22, 23, …, 64 and amounts spent 7, 6, …, 14. Their mean is:

$$\bar{y} = \frac{7 + 6 + \dots + 14}{11} = 39.7$$

$$MSE(Age \ge 20) = \frac{1}{11} \times \left((7 - 39.7)^2 + (6 - 39.7)^2 + \dots + (14 - 39.7)^2\right) = 733.3$$

A sketch of this computation is given below.
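A minimal sketch of these two quantities: a regression-tree node predicts the mean of its samples, and its quality is the MSE around that mean. The amounts below are hypothetical stand-ins for the 11 samples in the Age ≥ 20 branch.

```python
import numpy as np

def node_mse(y):
    """Mean squared error of a node that predicts the mean of its targets."""
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean()) ** 2).mean())

spent_right = [7, 6, 5, 38, 42, 50, 65, 70, 62, 30, 14]  # hypothetical Age >= 20 amounts
print(round(np.mean(spent_right), 1), round(node_mse(spent_right), 1))
```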
[Figure: the same scatter plot with the next candidate split, Age < 22.5, drawn on the data.]
3.1 Regression Trees MSE
Using MSE
[Figure: with the split Age < 25 at the root, the Yes branch predicts 6€ spent and the No branch predicts 47.1€ spent, shown as two prediction levels over the scatter plot.]
3.1 Regression Trees MSE
Using MSE
At the root split Age < 25, the Yes branch has only 3 samples, so we stop there (leaf: 6€ spent). The No branch still has 9 samples, so we need to check the MSE for each candidate split (a sketch of this search follows the table):

Splitting criterion   Total MSE
Age < 30.5              609.5
Age < 36.0              577.1
Age < 40.5              514.4
Age < 44.0              219.3
Age < 46.5              226.9
Age < 50.5              169.9
Age < 55.0              414.0
Age < 60.5              516.7
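A minimal sketch of this threshold search; the data, the "total MSE" formulation (sum of squared errors around each side's mean) and the minimum-samples stopping rule are assumptions made for illustration.

```python
import numpy as np

def best_mse_split(x, y, min_samples=2):
    """Scan midpoints between adjacent x values; return the split with the lowest total squared error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, None)
    for t in (x[:-1] + x[1:]) / 2:                      # candidate thresholds
        left, right = y[x < t], y[x >= t]
        if len(left) < min_samples or len(right) < min_samples:
            continue                                    # too few samples on one side: skip
        total = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if total < best[0]:
            best = (total, t)
    return best  # (total squared error, threshold)

ages  = [26, 29, 33, 38, 43, 47, 52, 58, 62]            # hypothetical samples in the Age >= 25 branch
spent = [38, 45, 52, 60, 68, 66, 20, 17, 14]
print(best_mse_split(ages, spent))
```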
3.1 Regression Trees MSE
Using MSE
[Figure: the resulting tree. Root: Age < 25; the Yes leaf predicts 6€ spent; the No branch tests Age < 50.5, whose Yes leaf predicts 62.7€ spent and whose No leaf predicts 16€ spent. The predictions are shown as step levels over the scatter plot.]
3.1 Regression Trees MSE
Using MSE
MSE is not the only splitting criterion that can be used; alternatives include:
- MSE
- MAE
- Friedman MSE
- Sum of residuals
- …
With MAE, the value predicted in a node is the median of its targets. For the Age ≥ 20 branch (ages 22, 23, …, 64 with amounts spent 7, 6, …, 14), Median(y) = 42 and:

$$MAE(Age \ge 20) = \frac{1}{11} \times \left(|7 - 42| + |6 - 42| + \dots + |14 - 42|\right) = 24.8$$

[Figure: a two-leaf tree illustrating the median-based predictions: the Yes branch predicts 5€ spent and the No branch predicts 42€ spent.]
A sketch of this computation is given below.
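A minimal sketch of the MAE criterion: the node predicts the median of its targets and its quality is the mean absolute deviation around that median. The amounts below are hypothetical.

```python
import numpy as np

def node_mae(y):
    """Mean absolute error of a node that predicts the median of its targets."""
    y = np.asarray(y, dtype=float)
    return float(np.abs(y - np.median(y)).mean())

spent_right = [7, 6, 16, 18, 42, 50, 65, 70, 62, 30, 14]  # hypothetical Age >= 20 amounts
print(np.median(spent_right), round(node_mae(spent_right), 1))
```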
3.2 Regression Trees MAE
Using MAE
As before, at the root split Age < 25 the Yes branch has only 3 samples, so it becomes a leaf (6€ spent), while the No branch has 9 samples and we check the MAE for each candidate split:

Splitting criterion   Total MAE
Age < 30.5               22.0
Age < 36.0               22.9
Age < 40.5               21.2
Age < 44.0               15.0
Age < 46.5               14.1
Age < 50.5               11.3
Age < 55.0               18.4
Age < 60.5               20.3
3.2 Regression Trees MAE
Using MAE
[Figure: the tree obtained with MAE. The Yes leaf of the root predicts 6€ spent; the No branch tests Age < 50.5, whose Yes leaf predicts 68.5€ spent and whose No leaf predicts 16€ spent.]
4 Overfitting in Decision Trees