06 - Decision Trees

This document provides an overview of decision trees, including classification trees and regression trees. It discusses key algorithms for building decision trees such as ID3, C4.5, CART, and CHAID. The advantages of decision trees include their interpretability, ability to handle different data types, insensitivity to scale factors, and ability to automatically determine relevant attributes. Disadvantages include sensitivity to small data variations and worse performance with many classes. Decision trees are constructed recursively in a top-down manner by selecting attributes to partition the data at each node until reaching pure leaf nodes.

Master in Information Management

2022/23

06
Decision Trees
AGENDA

Decision Trees

Classification Trees

Regression Trees

Overfitting in Decision Trees


1 Decision Trees

What are Decision Trees?


Non-parametric supervised learning algorithm used for classification and regression
1 Decision Trees

- Decision trees can be thought of as classification and estimation tools


- One of their major advantages is that they represent rules, which are fairly
simple to interpret

- In some problems we are only interested in achieving the best possible precision. In others we are
more interested in understanding the results and the way the model produces its estimates
- Sometimes the reasons that underlie certain decisions are of paramount importance.

Interpretability
1 Decision Trees

[Scatter plot of the example data: Income vs. Age]
1 Decision Trees

Age≥3

No Yes

(4,0) Income ≥ 3

No Yes

Age ≥ 6 Age ≥ 5

No Yes No Yes

(4,1) (0,6) (0,4) (4,1)


1 Decision Trees

What if I have a new observation where age is equal to 5 and income equal to 1?
Age≥3

No Yes

(4,0) Income ≥ 3

No Yes

Age ≥ 6 Age ≥ 5

No Yes No Yes

(4,1) (0,6) (0,4) (4,1)


1 Decision Trees
The tree below is annotated with its anatomy: root node, internal nodes, branches and leaf nodes.

The objective is to:
- Discriminate between classes
- Obtain leaves that are as pure as possible; hopefully each leaf only represents individuals of a particular class

Age≥3

No Yes

(4,0) Income ≥ 3

No Yes

Age ≥ 6 Age ≥ 5

No Yes No Yes

(4,1) (0,6) (0,4) (4,1)
1.1 Decision Trees Classification Tree
Classification Trees vs Regression Trees

Age≥3

No Yes

(4,0) Income ≥ 3

No Yes

Age ≥ 6 Age ≥ 5

No Yes No Yes

(4,1) (0,6) (0,4) (4,1)


1.1 Decision Trees Regression Tree
Classification Trees vs Regression Trees

x ≥ 3.2

No Yes

2.0 x ≥ 5.1

No Yes

3.5 x ≥ 6.1

No Yes

4.5 5.0

[Accompanying scatter plot: Income vs. Age]
1.2 Decision Trees
Rule extraction from trees
- Each edge adds a conjunction (∧)
- Each new leaf adds a disjunction (∨)

Age≥3

No Yes

(4,0) Income ≥ 3

No Yes

Age ≥ 6 Age ≥ 5

No Yes No Yes

(4,1) (0,6) (0,4) (4,1)

One rule per leaf (left to right):
⇔ (age < 3)
⇔ (age ≥ 3) ∧ (income < 3) ∧ (age < 6)
⇔ (age ≥ 3) ∧ (income < 3) ∧ (age ≥ 6)
⇔ (age ≥ 3) ∧ (income ≥ 3) ∧ (age < 5)
⇔ (age ≥ 3) ∧ (income ≥ 3) ∧ (age ≥ 5)

⇔ (age < 3) ∨ ((age ≥ 3) ∧ (income < 3) ∧ (age < 6)) ∨ ((age ≥ 3) ∧ (income ≥ 3) ∧ (age ≥ 5))

⇔ ((age ≥ 3) ∧ (income ≥ 3) ∧ (age < 5)) ∨ ((age ≥ 3) ∧ (income ≥ 3) ∧ (age ≥ 5))
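In scikit-learn, a fitted tree can be dumped in exactly this rule-like form. A minimal sketch (the two-feature dataset below is hypothetical, not the slide's data):

```python
# Minimal sketch: fit a small classification tree on a made-up age/income
# dataset and print its decision rules with scikit-learn's export_text.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[2, 1], [4, 2], [7, 1], [4, 4], [6, 4], [1, 3]])  # [age, income], made up
y = np.array([0, 0, 1, 1, 0, 0])                                # class labels, made up

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Each printed root-to-leaf path is one conjunction of tests; the set of
# leaves predicting the same class forms the disjunction discussed above.
print(export_text(clf, feature_names=["age", "income"]))
```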


1.3 Decision Trees
Advantages of using Decision Trees

Interpretation
- Easy to understand the underlying reasons for the decision

No problems in dealing with different types of data


- Interval, ordinal, nominal, etc.
- Not necessary to define the relative importance of the variables

Insensitive to scale factors


- Different types of measurements can be used without the need for normalization

Automatic definition of the attributes that are more relevant in each case
- The most relevant attributes appear in the top part of the tree

Can be adapted to regression


- Linear local models in the leaves

Decision trees are considered a nonparametric method


- No assumptions about the space distribution and the classifier structure
1.3 Decision Trees
Disadvantages of using Decision Trees

- Most of the algorithms (ID3 and C4.5) require a discrete target


- Small variations in the data can result in very different trees
- Sub-trees can be replicated several times
- Worse results when dealing with many classes
- Decision boundaries are linear and perpendicular to the axes
1.4 Decision Trees
Decision Trees induction (or how to build trees)

Training Set
1.4 Decision Trees
Decision Trees induction (or how to build trees)

“PROBLEMS”: Where should we “cut”?

Training Set Age≥3

How many edges per node?

When to stop?
Which variable to
query?
1.4 Decision Trees
Decision Trees induction (or how to build trees)

BASIC ALGORITHM (a greedy algorithm)


- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training observations are at the root
- If attributes are continuous-valued, they are discretized in advance (see the split-point selection in the C4.5 section below)
- Observations are partitioned recursively based on selected attributes

Conditions for stopping partitioning


- All observations for a given node belong to the same class
- There are no remaining attributes for further partitioning – majority voting is employed to classify
the leaf
- There are no samples left
1.5 Decision Trees
The algorithms

DECISION TREES

- DDT – Divisive decision tree (Hunt 62)
- ID3, C4.5, C5 – Iterative Dichotomizer 3 (Quinlan 86, 93)
- CART – Classification and Regression Trees (Breiman 84)
- CHAID – Chi-Squared Automatic Interaction Detection (Hartigan 75)
- …
2 Classification Trees

What are
Classification Trees?
Tree models where the target variable can take
a discrete set of values
2.1 Classification Trees DDT
DDT

Major characteristics
- It is a greedy search
- There is no "backtracking" (once a partition is done there is no re-evaluation)
- It can become stuck in a local minimum
- It uses discriminative power as the selection measure

The algorithm (continue on the next slide)


- Start with a dataset of pre-classified individuals (examples or instances)
- Each node specifies an attribute (independent variables) used as a test
- N – is node N
- ASET – attribute set
- ISET – instance set (individuals or examples)
2.1 Classification Trees DDT
DDT

DDT(N, ASET, ISET)

If ISET is empty then
    the terminal node N is an unknown class
elseif all the examples of ISET are of the same class or ASET is empty then
    the terminal node N takes the name of the (majority) class
else
    For each attribute A of the attribute set ASET
        Evaluate A according to its capability to discriminate a class
    Select the attribute B which has the best discriminative value
    For each value V of the best attribute B
        Create a new node C from node N
        Place the attribute-value pair (B, V) in C
        Let JSET be the set of examples of ISET with value V in B
        Let KSET be the set of attributes of ASET with B removed
        DDT(C, KSET, JSET)
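The same procedure as a minimal Python sketch (names are mine; `score` stands for the attribute-selection measure, which for DDT is the discriminative power defined in the next slides):

```python
# Minimal Python transcription of the DDT pseudocode above.
from collections import Counter

def ddt(aset, iset, score):
    """aset: list of attribute names; iset: list of (attributes_dict, class_label)."""
    if not iset:
        return "unknown"                                   # empty ISET -> unknown class
    classes = [c for _, c in iset]
    if len(set(classes)) == 1 or not aset:
        return Counter(classes).most_common(1)[0][0]       # pure node, or no attributes left
    best = max(aset, key=lambda a: score(a, iset))         # attribute with best discrimination
    node = {"attribute": best, "children": {}}
    for value in {attrs[best] for attrs, _ in iset}:       # one branch per value of `best`
        jset = [(attrs, c) for attrs, c in iset if attrs[best] == value]
        kset = [a for a in aset if a != best]
        node["children"][value] = ddt(kset, jset, score)
    return node
```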

Let’s see an example…


2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

The class (dependent variable): Lethargia, Burpoma or Healthy

Independent variables:
- The quantity of nuclei (1 or 2)
- The number of tails (1 or 2)
- The color (gray or white)
- The membrane (thin or thick)
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

# nuclei # tails Color Membrane Class


1 1 Light Thin Lethargia
2 1 Light Thin Lethargia
1 1 Light Thick Lethargia
1 1 Dark Thin Lethargia
1 1 Dark Thick Lethargia
2 2 Light Thin Burpoma
2 2 Dark Thin Burpoma
2 2 Dark Thick Burpoma
2 1 Dark Thin Healthy
2 1 Dark Thick Healthy
1 2 Light Thin Healthy
1 2 Light Thick Healthy
2.1 Classification Trees DDT
DDT

The discriminative metric:

Measures an attribute's ability to discriminate between classes:

$$\text{Discriminative power} = \frac{1}{n}\sum_{i=1}^{v} C_i$$

- where n is the total number of examples
- v is the number of values of the attribute
- C_i is the number of examples in partition i correctly classified by its most frequent class

This is a measure of "dominance" or "purity"


2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Using the number of nuclei:

# nuclei    1    2
Lethargia   4    1
Burpoma     0    3
Healthy     2    2

Discriminative power: (4 + 3) / 12 = 0.58
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Using the number of tails:

# tails     1    2
Lethargia   5    0
Burpoma     0    3
Healthy     2    2

Discriminative power: (5 + 3) / 12 = 0.67
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Using color:

color       Light    Dark
Lethargia   3        2
Burpoma     1        2
Healthy     2        2

Discriminative power: (3 + 2) / 12 = 0.41
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Using membrane:

membrane    Thin    Thick
Lethargia   3       2
Burpoma     2       1
Healthy     2       2

Discriminative power: (3 + 2) / 12 = 0.41
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Summary of discriminative power:

Attribute   # nuclei   # tails   color   membrane
D.P.        0.58       0.67      0.41    0.41

Choice: # tails
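A small sketch that computes this measure for each attribute of the cell data. It reproduces the values above (0.583, 0.667, 0.417, 0.417, which the slides round to 0.58, 0.67, 0.41, 0.41):

```python
# Discriminative power of each attribute of the cell-analysis data.
from collections import Counter

rows = [  # (#nuclei, #tails, color, membrane, class)
    (1, 1, "light", "thin",  "Lethargia"), (2, 1, "light", "thin",  "Lethargia"),
    (1, 1, "light", "thick", "Lethargia"), (1, 1, "dark",  "thin",  "Lethargia"),
    (1, 1, "dark",  "thick", "Lethargia"), (2, 2, "light", "thin",  "Burpoma"),
    (2, 2, "dark",  "thin",  "Burpoma"),   (2, 2, "dark",  "thick", "Burpoma"),
    (2, 1, "dark",  "thin",  "Healthy"),   (2, 1, "dark",  "thick", "Healthy"),
    (1, 2, "light", "thin",  "Healthy"),   (1, 2, "light", "thick", "Healthy"),
]

def discriminative_power(attr_index, data):
    # Sum, over the attribute's values, the count of the majority class in that
    # partition, then divide by the total number of examples.
    correct = 0
    for value in {r[attr_index] for r in data}:
        partition = [r[-1] for r in data if r[attr_index] == value]
        correct += Counter(partition).most_common(1)[0][1]
    return correct / len(data)

for i, name in enumerate(["# nuclei", "# tails", "color", "membrane"]):
    print(f"{name}: {discriminative_power(i, rows):.3f}")
# # nuclei: 0.583, # tails: 0.667, color: 0.417, membrane: 0.417
```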
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Tails = one:

# Nuclei   Color   Membrane   Class
1          Light   Thin       Lethargia
2          Light   Thin       Lethargia
1          Light   Thick      Lethargia
1          Dark    Thin       Lethargia
1          Dark    Thick      Lethargia
2          Dark    Thin       Healthy
2          Dark    Thick      Healthy

Tails = two:

# Nuclei   Color   Membrane   Class
2          Light   Thin       Burpoma
2          Dark    Thin       Burpoma
2          Dark    Thick      Burpoma
1          Light   Thin       Healthy
1          Light   Thick      Healthy
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

For the branch Tails = one, the remaining examples are:

# Nuclei   Color   Membrane   Class
1          Light   Thin       Lethargia
2          Light   Thin       Lethargia
1          Light   Thick      Lethargia
1          Dark    Thin       Lethargia
1          Dark    Thick      Lethargia
2          Dark    Thin       Healthy
2          Dark    Thick      Healthy

# Nuclei    1    2
Lethargia   4    1
Burpoma     0    0
Healthy     0    2        D.P. = 0.86

Color       Light    Dark
Lethargia   3        2
Burpoma     0        0
Healthy     0        2    D.P. = 0.71

Membrane    Thin    Thick
Lethargia   3       2
Burpoma     0       0
Healthy     1       1     D.P. = 0.71
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Tails
one two

Nuclei # Nuclei Color Membrane Classe


2 Light Thin Burpoma
one two 2 Dark Thin Burpoma
2 Dark Thick Burpoma
1 Light Thin Healthy
Color Membrane Classe Color Membrane Classe
1 Light Thick Healthy
Light Thin Lethargia Light Thin Lethargia
Light Thick Lethargia Dark Thin Healthy
Dark Thin Lethargia Dark Thick Healthy
Dark Thick Lethargia
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Tails
one two

Nuclei # Nuclei Color Membrane Classe


2 Light Thin Burpoma
one two 2 Dark Thin Burpoma
2 Dark Thick Burpoma
1 Light Thin Healthy
Lethargia (4) Color Membrane Classe
1 Light Thick Healthy
Light Thin Lethargia
Dark Thin Healthy
Dark Thick Healthy
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Tails
one two

Nuclei # Nuclei Color Membrane Classe


2 Light Thin Burpoma
one two 2 Dark Thin Burpoma
2 Dark Thick Burpoma
1 Light Thin Healthy
Lethargia (4)
Color 1 Light Thick Healthy

Light Dark
Lethargia (1) Healthy (2)
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

The final tree:

Tails = one:
    Nuclei = one → Lethargia (4)
    Nuclei = two:
        Color = Light → Lethargia (1)
        Color = Dark → Healthy (2)
Tails = two:
    Nuclei = one → Healthy (2)
    Nuclei = two → Burpoma (3)
2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)

Lethargia (4) Healthy (2) Burpoma (3)

Lethargia (1) Healthy(2)


2.1 Classification Trees DDT
DDT

Example - Cell analysis (Langley 96)


(tails = 1 ∧ nuclei = 1) ∨ (tails = 1 ∧ nuclei = 2 ∧ color = light) ⇒ Lethargia

(tails = 2 ∧ nuclei = 1) ∨ (tails = 1 ∧ nuclei = 2 ∧ color = dark) ⇒ Healthy

(tails = 2 ∧ nuclei = 2) ⇒ Burpoma
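The extracted rules can be written directly as code. A minimal sketch (names are mine) that reproduces the class of all 12 training examples:

```python
# The induced tree expressed as nested rules.
def classify(tails, nuclei, color):
    if tails == 1:
        if nuclei == 1:
            return "Lethargia"
        return "Lethargia" if color == "light" else "Healthy"
    else:  # tails == 2
        return "Healthy" if nuclei == 1 else "Burpoma"

print(classify(tails=1, nuclei=2, color="dark"))  # Healthy
```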


2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

Major characteristics
- It uses entropy to measure the “disorder” in each independent variable
- From entropy, we can calculate the information gain, the selection measure
- ID3 handles only categorical attributes while C4.5 is able to deal also with numeric values

The nomenclature
Let D be a training set of class-labeled tuples
Let C_{i,D} be the set of tuples of class C_i in D
Let |D| and |C_{i,D}| be the number of tuples in D and in C_{i,D}, respectively
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

Expected information (entropy) needed to classify a tuple in D:

$$Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

If we choose attribute A to split the node with the tuples in D into v partitions, then the information needed to classify D is:

$$Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$$

The information gained by branching on attribute A:

$$Gain(A) = Entropy(D) - Entropy_A(D)$$


2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

AllElectronics example

Our dependent variable has two classes: buys_computer = "yes" or "no"
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
[31…40] high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
[31…40] low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
[31…40] medium no excellent yes
[31…40] high yes fair yes
>40 medium no excellent no
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

$$Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Buy Computer: 9 Yes, 5 No

$$Entropy(D) = E(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

We are going to check the entropy associated with the variable age:

Age        Yes   No   Total   E
<=30       2     3    5       0.971
[31…40]    4     0    4       0
>40        3     2    5       0.971

$$Entropy_{Age<=30}(D) = E(2,3) = -\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right) = 0.971$$

$$Entropy_{Age\,31..40}(D) = E(4,0) = -\frac{4}{4}\log_2\left(\frac{4}{4}\right) - \frac{0}{4}\log_2\left(\frac{0}{4}\right) = 0$$

$$Entropy_{Age>40}(D) = E(3,2) = -\frac{3}{5}\log_2\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\left(\frac{2}{5}\right) = 0.971$$

$$Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$$

$$Entropy_{age}(D) = \frac{5}{14}E(2,3) + \frac{4}{14}E(4,0) + \frac{5}{14}E(3,2) = 0.694$$

$\frac{5}{14}E(2,3)$ means "age <=30" has 5 out of 14 samples, with 2 yes and 3 no
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

𝐺𝑎𝑖𝑛(𝐴) = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷) − 𝐸𝑛𝑡𝑟𝑜𝑝𝑦𝐴 (𝐷)

Thus the information gain for age is:

𝐺𝑎𝑖𝑛(𝑎𝑔𝑒) = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝐷) − 𝐸𝑛𝑡𝑟𝑜𝑝𝑦𝑎𝑔𝑒 (𝐷) = 0.940 − 0.694 = 0.246

Similarly for the variables income, student and credit:


𝐺𝑎𝑖𝑛(𝑖𝑛𝑐𝑜𝑚𝑒) = 0.029
𝐺𝑎𝑖𝑛(𝑠𝑡𝑢𝑑𝑒𝑛𝑡) = 0.151
𝐺𝑎𝑖𝑛(𝑐𝑟𝑒𝑑𝑖𝑡) = 0.048

Since age has the highest information gain it is the selected attribute
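A minimal sketch that reproduces these gains for the AllElectronics data (tiny differences in the third decimal come from the slides rounding intermediate entropies):

```python
# Information gain of each attribute of the AllElectronics data.
from collections import Counter
from math import log2

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attr_index, rows):
    # Entropy(D) minus the size-weighted entropy of the partitions induced by the attribute.
    parts = [[r for r in rows if r[attr_index] == v] for v in {r[attr_index] for r in rows}]
    split_entropy = sum(len(p) / len(rows) * entropy([r[-1] for r in p]) for p in parts)
    return entropy([r[-1] for r in rows]) - split_entropy

for i, name in enumerate(["age", "income", "student", "credit"]):
    print(f"Gain({name}) = {gain(i, data):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152, Gain(credit) = 0.048
```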
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

Age

age income student credit_rating buys_computer age income student credit_rating buys_computer
<=30 high no fair no >40 medium no fair yes
<=30 high no excellent no >40 low yes fair yes
<=30 medium no fair no >40 low yes excellent no
<=30 low yes fair yes >40 medium yes fair yes
<=30 medium yes excellent yes >40 medium no excellent no

age income student credit_rating buys_computer


[31…40] high no fair yes
[31…40] low yes excellent yes
[31…40] medium no excellent yes
[31…40] high yes fair yes
2.2 Classification Trees ID3 / C4.5
Information Gain – ID3 / C4.5

But what if I want to use the original age (algorithm C4.5)?

Originally, Age is a continuous-valued attribute
(sorted values: 18, 20, 21, 22, 26, 27, 29, 31, 35, 41, 42, 45, 48, 49, …)

Must determine the best split point for Age
- Sort the values of Age
- Every midpoint between each pair of adjacent values is a possible split point
- Evaluate $Entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Entropy(D_j)$ with two partitions
  (Partition 1 < split point < Partition 2)
- The point with the minimum entropy for Age is selected as the split point
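A minimal sketch of this split-point search (the labelled ages below are hypothetical, not the column on the slide):

```python
# C4.5-style search for a numeric split point: sort the values, try every
# midpoint between adjacent distinct values, keep the lowest weighted entropy.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = (float("inf"), None)
    for lo, hi in zip(xs, xs[1:]):
        split = (lo + hi) / 2                      # midpoint between adjacent values
        left = [c for v, c in pairs if v < split]
        right = [c for v, c in pairs if v >= split]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (w, split))
    return best                                    # (weighted entropy, split point)

ages = [18, 20, 21, 22, 26, 27, 29, 31, 35, 41]    # hypothetical
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no"]
print(best_numeric_split(ages, buys))
```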
2.3 Classification Trees CART
Gini Index - CART

- If a data set D contains examples from n classes, the Gini index is defined as:

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in D

- Similar to entropy, but without the log (FASTER!)


2.3 Classification Trees CART
Gini Index - CART

- If a data set D is split on Age into two subsets D1 and D2, the Gini index of the split is defined as:

$$gini_{Age}(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$

- The reduction in impurity is given by:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

- The attribute that maximizes the reduction in impurity is selected as splitting attribute
2.3 Classification Trees CART
Gini Index - CART

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

For the AllElectronics example (Buy Computer: 9 Yes, 5 No):

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

For the variable Age:

Age        Yes   No   Total   Gini
<=30       2     3    5       0.48
[31…40]    4     0    4       0
>40        3     2    5       0.48

$$Gini(D_{age<=30}) = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 0.48$$
2.3 Classification Trees CART
Gini Index - CART

$$gini_{Age}(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$

Age        Yes   No   Total   Gini
<=30       2     3    5       0.48
[31…40]    4     0    4       0
>40        3     2    5       0.48

$$Gini_{age}(D) = \frac{5}{14}Gini(2,3) + \frac{4}{14}Gini(4,0) + \frac{5}{14}Gini(3,2) = 0.343$$

$$\Delta gini(A) = gini(D) - gini_A(D)$$

$$\Delta gini(Age) = gini(D) - gini_{Age}(D) = 0.459 - 0.343 = 0.116$$
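A minimal sketch that reproduces these numbers (following the worked example, which weights the three age partitions by their sizes):

```python
# Gini index of the AllElectronics data and of the split on age.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

gini_D = gini([9, 5])                                   # 9 "yes", 5 "no"
parts = [[2, 3], [4, 0], [3, 2]]                        # age <=30, 31..40, >40
gini_age = sum(sum(p) / 14 * gini(p) for p in parts)    # weighted by partition size
print(round(gini_D, 3), round(gini_age, 3), round(gini_D - gini_age, 3))
# 0.459 0.343 0.116
```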


3 Regression Trees

What are Regression Trees?


Tree models where the target variable can take a continuous set of values
3 Regression Trees

In a regression problem…
Some problems are easily fitted with a linear regression…

[Scatter plot: Money Spent vs. Age, with a fitted straight line]
3 Regression Trees

Age   Money Spent
18    5
22    7
23    6
27    68
34    72
38    69
43    77
45    48
48    42
53    16
57    18
64    14

[Scatter plot: Money Spent (€) vs. Age, annotated:]
- Middle-aged people spend the most
- Some middle-aged people spend some money
- Younger people don't spend much money
- Older people don't spend much money
3 Regression Trees

A problem with one variable…

[Scatter plot: Money Spent (€) vs. Age]

Fitting a straight line to the data does not seem very useful!

As an example, if someone has the age of 64…
we would predict that the money spent would be around 50€…

The observed data however suggests it should be around 15€…

Very far from reality!
3 Regression Trees

- In some datasets we should use methods other than straight lines to make predictions

- One option is the Regression Tree

- A Regression tree is a type of Decision tree where each leaf represents a numeric value, rather than a
discrete category as in Classification trees.
3 Regression Trees

Regression tree with one variable…


Age < 25

Yes No

6€ spent Age < 44

Yes No

71.5€ spent Age < 50.5

Yes No

45€ spent 16€ spent

However, I could obtain the same answer just by looking at the plot!
3 Regression Trees

Regression tree with more variables…


When we have more than 2 predictors, such as Age, Number of Kids and Number of Months as customer,
predicting the money spent by drawing a plot is very difficult, if not impossible

Age #Kids #Months as customer … Money Spent


18 0 1 … 5
22 1 3 … 7
23 0 6 … 6
27 2 22 … 68
34 3 27 … 72
38 2 29 … 69
43 1 15 … 77
45 0 12 … 48
48 3 17 … 42
53 2 18 … 16
57 3 43 … 18
64 1 32 … 14
3 Regression Trees

Regression tree with more variables…


A regression tree easily supports more than 2 predictors
Age #Kids #Months as … Money Spent
customer
18 0 1 … 5

Age < 24
Yes No

#Kids < 2 …
Yes No

#months < 7 …

Yes No

6€ spent …
3 Regression Trees

Choose the best split

[Scatter plot: Money Spent (€) vs. Age]

Why do we start with age < 25?
What is the best splitting point?
3 Regression Trees

Choose the best split

[Scatter plot: Money Spent (€) vs. Age, with a first candidate split at Age = 20]

Age < 20
Yes: 5€ spent    No: 39.7€ spent
3.1 Regression Trees  MSE
Using MSE

Choose the best split using MSE:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Candidate split: Age < 20 (Yes: 5€ spent, No: 39.7€ spent)

For Age < 20:  $\bar{y} = 5$,  $MSE(Age < 20) = (5 - 5)^2 = 0$

For Age ≥ 20:  $\bar{y} = \frac{7 + 6 + … + 14}{11} = 39.7$

$MSE(Age ≥ 20) = \frac{1}{11} \times ((7 - 39.7)^2 + (6 - 39.7)^2 + … + (14 - 39.7)^2) = 733.3$

$TOTAL\ MSE = 0 + 733.3 = 733.3$


3.1 Regression Trees  MSE
Using MSE

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Split: Age < 20 (Yes: 5€ spent, No: 39.7€ spent)

Age   Money Spent (y)   ŷ      MSE
18    5                 5      0.0
22    7                 39.7
23    6                 39.7
27    68                39.7
34    72                39.7
38    69                39.7
43    77                39.7   733.3
45    48                39.7
48    42                39.7
53    16                39.7
57    18                39.7
64    14                39.7
                        TOTAL MSE   733.3
3.1 Regression Trees  MSE
Using MSE

[Scatter plot: Money Spent (€) vs. Age, with a candidate split at Age = 22.5]

Age < 22.5
Yes: 6€ spent    No: 43€ spent
3.1 Regression Trees  MSE
Using MSE

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Split: Age < 22.5 (Yes: 6€ spent, No: 43€ spent)

Age   Money Spent (y)   ŷ     MSE
18    5                 6
22    7                 6     1.0
23    6                 43
27    68                43
34    72                43
38    69                43
43    77                43
45    48                43    688.8
48    42                43
53    16                43
57    18                43
64    14                43
                        TOTAL MSE   689.8
3.1 Regression Trees  MSE
Using MSE

[Scatter plot: Money Spent (€) vs. Age, with a candidate split at Age = 25]

Age < 25
Yes: 6€ spent    No: 47.1€ spent
3.1 Regression Trees  MSE
Using MSE

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Split: Age < 25 (Yes: 6€ spent, No: 47.1€ spent)

Age   Money Spent (y)   ŷ      MSE
18    5                 6
22    7                 6      0.7
23    6                 6
27    68                47.1
34    72                47.1
38    69                47.1
43    77                47.1
45    48                47.1   596.3
48    42                47.1
53    16                47.1
57    18                47.1
64    14                47.1
                        TOTAL MSE   597.0
3.1 Regression Trees MSE
Using MSE

Splitting criteria for age    TOTAL MSE

< 20.0    733.3
< 22.5    689.8
< 25.0    597.0
< 30.5    1330.8
< 36.0    1558.1
< 40.5    1526.6
< 44.0    1265.0
< 46.5    1056.8
< 50.5    828.0
< 55.0    816.2
< 60.5    782.1

This is my first splitting criterion! Age lower than 25 (the lowest total MSE)
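A minimal sketch of this threshold scan. Following the slides, the two leaf MSEs are simply added (scikit-learn weights them by the number of samples instead); it reproduces the TOTAL MSE column above:

```python
# Scan the candidate age thresholds (midpoints between adjacent ages).
ages  = [18, 22, 23, 27, 34, 38, 43, 45, 48, 53, 57, 64]
spent = [5, 7, 6, 68, 72, 69, 77, 48, 42, 16, 18, 14]

def mse(ys):
    m = sum(ys) / len(ys)                        # leaf prediction = mean of the leaf
    return sum((y - m) ** 2 for y in ys) / len(ys)

for lo, hi in zip(ages, ages[1:]):
    t = (lo + hi) / 2
    left  = [y for a, y in zip(ages, spent) if a < t]
    right = [y for a, y in zip(ages, spent) if a >= t]
    print(f"< {t:<5} total MSE = {mse(left) + mse(right):.1f}")
# e.g. < 20.0: 733.3, < 22.5: 689.8, < 25.0: 597.0 (the minimum), ...
```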
3.1 Regression Trees MSE
Using MSE

And my second splitting criteria? (MSE)


Age < 25
Yes No

Age < 25, Yes branch (Age, Money Spent): (18, 5), (22, 7), (23, 6)
Age < 25, No branch (Age, Money Spent): (27, 68), (34, 72), (38, 69), (43, 77), (45, 48), (48, 42), (53, 16), (57, 18), (64, 14)

If you continue this process, you will end up with one observation in each leaf
(since all values of money spent are different)…

OVERFITTING!!

The need to define stopping criteria!
3.1 Regression Trees MSE
Using MSE

And my second splitting criteria? (MSE)


If I define, for example, the following stopping criterion:
- Minimum samples to split = 7

Age < 25
Yes: only 3 samples, so I finish here (6€ spent)
No: 9 samples, so I need to check the MSE for each split

Splitting criteria    Total MSE
< 30.5                609.5
< 36.0                577.1
< 40.5                514.4
< 44.0                219.3
< 46.5                226.9
< 50.5                169.9
< 55.0                414.0
< 60.5                516.7
3.1 Regression Trees  MSE
Using MSE

My final tree (MSE)

Age < 25
Yes: 6€ spent
No:  Age < 50.5
     Yes: 62.7€ spent
     No:  16€ spent

[Scatter plot: Money Spent (€) vs. Age, partitioned at 25 and 50.5]
3.1 Regression Trees MSE
Using MSE

Using more than two predictors

Age   #Kids   #Months as customer   …   Money Spent
18    0       1                     …   5
22    1       3                     …   7
23    0       6                     …   6
27    2       22                    …   68
34    3       27                    …   72
38    2       29                    …   69
43    1       15                    …   77
45    0       12                    …   48
48    3       17                    …   42
53    2       18                    …   16
57    3       43                    …   18
64    1       32                    …   14

Just like before, we will try different thresholds for Age and calculate the MSE (or other
measure) at each step, and pick the threshold that gives us the minimum MSE.

The best threshold becomes a candidate for the root: Age < 25
3.1 Regression Trees MSE
Using MSE

Using more than two predictors


(The data table repeats as above.)

Splitting criteria for #Kids    TOTAL MSE
< 1                             1155.8
< 2                             1301.1
< 3                             1321.6
3.1 Regression Trees MSE
Using MSE

Using more than two predictors


Age   #Kids   #Months as customer   …   Money Spent
18    0       1                     …   5
22    1       1                     …   7
23    0       3                     …   6
27    2       2                     …   68
34    3       2                     …   72
38    2       3                     …   69
43    1       2                     …   77
45    0       1                     …   48
48    3       2                     …   42
53    2       1                     …   16
57    3       3                     …   18
64    1       2                     …   14

Splitting criteria for #Months    TOTAL MSE
< 2                               1056.7
< 3                               1501.3
3.1 Regression Trees MSE
Using MSE

Using more than two predictors


Age      TOTAL MSE
< 20.0   733.3
< 22.5   689.8
< 25.0   597.0
< 30.5   1330.8
< 36.0   1558.1
< 40.5   1526.6
< 44.0   1265.0
< 46.5   1056.8
< 50.5   828.0
< 55.0   816.2
< 60.5   782.1

#Kids    TOTAL MSE
< 1      1155.8
< 2      1301.1
< 3      1321.6

#Months  TOTAL MSE
< 2      1056.7
< 3      1501.3

Candidate root splits:
Age < 25    → Yes: 6€ spent,    No: 47.1€ spent
#Kids < 1   → Yes: 19.7€ spent, No: 42.6€ spent
#Months < 2 → Yes: 19€ spent,   No: 45.8€ spent


3 Regression Trees
Other measures

What about other measures?


- Similarly to classification trees, where Gini or entropy can be used to measure the
quality of a split, in regression trees we also have several alternatives…

- MSE

- MAE

- Friedman MSE

- Sum of residuals

- …

- In sklearn, the first three options are available (see the snippet below)
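For instance, in recent scikit-learn versions these map onto the `criterion` parameter of `DecisionTreeRegressor` (parameter values here are illustrative):

```python
# "squared_error" (MSE), "absolute_error" (MAE) and "friedman_mse" are the
# criteria available for regression trees in recent scikit-learn versions.
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(criterion="absolute_error", min_samples_split=7)
```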


3.2 Regression Trees  MAE
Using MAE

Candidate split: Age < 20 (Yes: 5€ spent, No: 42€ spent)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

For Age < 20:  Median(y) = 5,  $MAE(Age < 20) = |5 - 5| = 0$

For Age ≥ 20:  Median(y) = 42

$MAE(Age ≥ 20) = \frac{1}{11} \times (|7 - 42| + |6 - 42| + … + |14 - 42|) = 24.8$

$TOTAL\ MAE = 0 + 24.8 = 24.8$
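A minimal sketch of this evaluation (under MAE the leaf prediction is the median):

```python
# MAE of the Age < 20 split, as in the slide: the two leaf MAEs are added.
from statistics import median

spent = [5, 7, 6, 68, 72, 69, 77, 48, 42, 16, 18, 14]
left, right = spent[:1], spent[1:]                  # age 18 vs the rest (age >= 20)

def mae(ys):
    m = median(ys)
    return sum(abs(y - m) for y in ys) / len(ys)

print(round(mae(left) + mae(right), 1))             # 24.8
```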


3.2 Regression Trees  MAE
Using MAE

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Split: Age < 20 (Yes: 5€ spent, No: 42€ spent)

Age   Money Spent (y)   ŷ     MAE
18    5                 5     0.0
22    7                 42
23    6                 42
27    68                42
34    72                42
38    69                42
43    77                42    24.8
45    48                42
48    42                42
53    16                42
57    18                42
64    14                42
                        TOTAL MAE   24.8
3.2 Regression Trees MAE
Using MAE

Splitting criteria for age    TOTAL MAE

< 20.0    24.8
< 22.5    22.7
< 25.0    18.9
< 30.5    23.2
< 36.0    26.8
< 40.5    57.7
< 44.0    80.9
< 46.5    70.9
< 50.5    60.0
< 55.0    60.2
< 60.5    59.3

This is my first splitting criterion! Age lower than 25 (the lowest total MAE)
3.2 Regression Trees MAE
Using MAE

And my second splitting criteria? (MAE)


If I define, for example, the following stopping criterion:
- Minimum samples to split = 7

Age < 25
Yes: only 3 samples, so I finish here (6€ spent)
No: 9 samples, so I need to check the MAE for each split

Splitting criteria    Total MAE
< 30.5                22.0
< 36.0                22.9
< 40.5                21.2
< 44.0                15.0
< 46.5                14.1
< 50.5                11.3
< 55.0                18.4
< 60.5                20.3
3.2 Regression Trees MAE
Using MAE

My final tree (MAE)

Age < 25
Yes: 6€ spent
No:  Age < 50.5
     Yes: 68.5€ spent
     No:  16€ spent

[Scatter plot: Money Spent (€) vs. Age, partitioned at 25 and 50.5]
4 Overfitting in Decision Trees

How to avoid overfitting in DT?


Two main approaches: prepruning and postpruning
4 Overfitting in DT

How to avoid overfitting in decision trees?


- An induced tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples

Two approaches to avoid overfitting


- Prepruning:
- do not split a node if this would result in the goodness measure falling below a threshold
- Difficult to choose an appropriate threshold
- Postpruning:
- Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the "best pruned tree" (see the sketch below)
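A minimal scikit-learn sketch of both ideas (the dataset and parameter values are illustrative only): prepruning via stopping criteria, postpruning via cost-complexity pruning, with a validation set choosing the final tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Prepruning: stop splitting early with thresholds on depth / node size.
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
pre.fit(X_tr, y_tr)

# Postpruning: grow the full tree, then pick the ccp_alpha (cost-complexity
# pruning strength) whose pruned tree does best on the validation set.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
post = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in alphas),
    key=lambda m: m.score(X_val, y_val),
)
print(pre.score(X_val, y_val), post.score(X_val, y_val))
```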
4 Overfitting in DT

How to avoid overfitting in decision trees?

Simple to complex trees

[Plot: training and validation error vs. tree complexity; "Prune here" marks the complexity where the validation error starts to increase]

- Training set – used to develop the tree

- Validation set – used to assess the generalization ability of the tree
References
- Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
- Langley, P. (1996). Elements of Machine Learning. Morgan Kaufmann Publishers.
- Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall.
- http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
- http://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf
Thank you!

Address: Campus de Campolide, 1070-312 Lisboa, Portugal

Tel: +351 213 828 610 | Fax: +351 213 828 611
