Decision Trees - Detailed Notes
Prerequisite-
- Probability and statistics.
Objectives-
- Understand prerequisite terms such as Gini index, entropy, information gain, and pruning.
- Understand the decision tree as a CART algorithm.
- Tuning parameters of the decision tree.
- Advantages and disadvantages of decision trees.
Decision Tree
A Decision Tree is one of the most popular and effective supervised learning techniques for classification problems, and it works equally well with both categorical and quantitative variables. It is a graphical representation of all the possible solutions to a decision based on certain conditions. In this algorithm, the training sample points are split into two or more sets based on a split condition over the input variables. A simple example of a decision tree: a person has to decide whether to go to sleep or to a restaurant based on parameters such as whether he is hungry or has $25 in his pocket.
Terminology
Before moving into the details of the decision tree and its working, let's understand the meaning of some terminology associated with it.
Root node- Represents the entire population, which gets further divided into sets based on splitting decisions.
Decision node- These are the internal nodes of the tree; they are expressed through conditional expressions over the input attributes.
Leaf node- Nodes which do not split further are known as leaf nodes or terminal nodes.
Splitting- The process of dividing a node into two or more sub-nodes.
Pruning- The reverse process of splitting, where sub-nodes are removed.
The accuracy of the tree is heavily affected by the split point chosen at a decision node. Decision trees use different criteria to decide the split of a decision node into two or more sub-nodes. The resulting sub-nodes should increase the homogeneity of the data points, also known as the purity of the nodes, with respect to the target variable. The split decision is tested on all available variables, and the split that produces the purest sub-nodes is selected.
Measures of Impurity:
Decision trees recursively split features with respect to the purity of the target variable. The algorithm is designed to optimize each split so that purity is maximized. Impurity can be measured in several ways, such as Gini impurity, entropy, and information gain.
Gini Impurity-
$\text{Gini impurity} = 1 - \sum_i p_i^2$
where $p_i$ represents the probability of randomly selecting an observation of class $i$.
Consider a simple example of a bag containing some balls (4 red balls and 0 blue balls). The Gini impurity would be $1 - \left[ \left(\tfrac{4}{4}\right)^2 + \left(\tfrac{0}{4}\right)^2 \right] = 0$, i.e. the node is perfectly pure.
Example- Suppose we are given data on 10 students with a target variable indicating whether a student is an athlete or not. The input attributes are gender (Male/Female) and education (UG/PG).
Case 1- First, make a split with respect to the gender variable (figure: split on gender).
Step 1- Calculate the Gini impurity of each sub-node.
$gini_{female} = 1 - \left[ P(Yes)^2 + P(No)^2 \right] = 1 - \left[ \left(\tfrac{3}{6}\right)^2 + \left(\tfrac{3}{6}\right)^2 \right] = 0.5$
$gini_{male} = 1 - \left[ \left(\tfrac{3}{4}\right)^2 + \left(\tfrac{1}{4}\right)^2 \right] = 0.375$
Step 2- Calculate the Gini for the split using the weights of each Gini score.
$gini_{gender} = \tfrac{4}{10}(gini_{male}) + \tfrac{6}{10}(gini_{female}) = 0.4(0.375) + 0.6(0.5) = 0.45$
Case 2- Now make the second split with respect to the education variable (figure: split on education).
Step 1- Calculate the Gini impurity of each sub-node.
$gini_{UG} = 0$ (all UG students belong to the same class)
$gini_{PG} = 1 - \left[ P(Yes)^2 + P(No)^2 \right] = 1 - \left[ \left(\tfrac{1}{5}\right)^2 + \left(\tfrac{4}{5}\right)^2 \right] = 0.32$
Step 2- Calculate the Gini for the split using the weights of each Gini score.
$gini_{education} = \tfrac{5}{10}(gini_{UG}) + \tfrac{5}{10}(gini_{PG}) = 0.5(0.0) + 0.5(0.32) = 0.16$
As is clear from the calculations, the Gini score for the education split is lower than that for the gender split, so education is the best variable for making the split (it gives purer sub-nodes).
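As an illustration, the calculation above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the helper functions gini_impurity and gini_of_split are hypothetical names, and the per-node class counts are inferred from the fractions used in the worked example.

```python
# Minimal sketch of the Gini calculation from the worked example above.
# Class counts per sub-node are inferred from the fractions in the example
# (10 students in total: 6 athletes, 4 non-athletes).

def gini_impurity(counts):
    """Gini impurity of a node, given its class counts, e.g. [3, 3]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_of_split(groups):
    """Weighted Gini of a split; `groups` is a list of class-count lists."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini_impurity(g) for g in groups)

# Split on gender: males -> [3 athletes, 1 non-athlete], females -> [3, 3]
print(round(gini_of_split([[3, 1], [3, 3]]), 3))   # 0.45
# Split on education: UG -> [5, 0] (pure), PG -> [1, 4]
print(round(gini_of_split([[5, 0], [1, 4]]), 3))   # 0.16 -> education wins
```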
Entropy- In layman's terms, entropy is simply a measure of disorder. We can also think of it as a measure of impurity. The mathematical formula for entropy is as follows:
$E = \sum_{i=1}^{c} -p_i \log_2 p_i$
where $p_i$ is the probability of class $i$.
For the given example we have two labels for the athlete class (Yes/No), so the target can be either Yes or No. The entropy of the given set is therefore:
$P(Yes) = \tfrac{6}{10} \quad \text{and} \quad P(No) = \tfrac{4}{10}$
$Entropy = -\tfrac{6}{10}\log_2\!\left(\tfrac{6}{10}\right) - \tfrac{4}{10}\log_2\!\left(\tfrac{4}{10}\right) = -(-0.442 - 0.529) = 0.971$
Information Gain
Information gain measures the reduction in entropy and decides which attribute should be selected as a decision node. It is calculated as the entropy of the decision node minus the weighted average of the entropies of its children. That is, for m points in the first child node and n points in the second child node, the information gain is:
$IG = Entropy(Decision\ node) - \tfrac{m}{m+n}\,Entropy(First\ child) - \tfrac{n}{m+n}\,Entropy(Second\ child)$
A simple graph of entropy with respect to the class proportion would look like this:
Figure 3- Entropy graph w.r.t. impurity
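To make the entropy and information gain calculations concrete, here is a minimal Python sketch. It is not from the original notes; the function names entropy and information_gain are illustrative, and the class counts come from the worked student example.

```python
# Minimal sketch of entropy and information gain for the student example
# (10 students: 6 athletes, 4 non-athletes).
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [6, 4]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the decision node minus the weighted entropy of its children."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

print(round(entropy([6, 4]), 3))                               # 0.971
# Information gain of the education split (UG: [5, 0], PG: [1, 4])
print(round(information_gain([6, 4], [[5, 0], [1, 4]]), 3))    # about 0.61
```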
Gini vs Entropy- Both are measures of node impurity and usually lead to very similar splits. Gini impurity is slightly cheaper to compute because it avoids the logarithm, while entropy (used through information gain) penalizes mixed nodes a little more strongly; in practice the choice rarely changes the resulting tree much.
Pruning-
Pruning is very useful in decision trees because a decision tree may fit the training data very well but perform very poorly on test or new data. By removing branches we can reduce the complexity of the tree, which helps in reducing overfitting.
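As a practical illustration (not part of the original notes), scikit-learn supports post-pruning through cost-complexity pruning via the ccp_alpha parameter of DecisionTreeClassifier. The sketch below uses the Iris dataset purely as a stand-in, and the ccp_alpha value is arbitrary.

```python
# Minimal sketch of post-pruning with cost-complexity pruning in scikit-learn.
# ccp_alpha controls how aggressively low-value branches are removed
# (0.0 means no pruning); the value 0.02 here is only an illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print("unpruned:", unpruned.get_n_leaves(), "leaves, test accuracy", unpruned.score(X_test, y_test))
print("pruned:  ", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_test, y_test))
```

A smaller tree with comparable test accuracy indicates that the removed branches were mostly fitting noise.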
To generate decision trees that generalize well to new problems, we can tune different aspects of the tree. We call these aspects of the decision tree "hyperparameters". Some of the important hyperparameters used in decision trees are as follows:
Maximum Depth- The maximum depth of the decision tree is simply the length of the longest path from the root to a leaf. A tree of maximum depth k can have at most 2**k leaves.
Maximum number of features- We may have too many features to build a decision tree efficiently. At every split we would otherwise have to evaluate the entire data-set on each of the features, which can be very expensive. A solution is to limit the number of features considered for each split. If this number is large enough, we are very likely to find a good feature among the ones we look at (although maybe not the perfect one), and if it is smaller than the total number of features, it speeds up the calculations significantly.
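A minimal sketch of tuning these hyperparameters with scikit-learn is shown below (the parameter names max_depth and max_features match the library's DecisionTreeClassifier API; the grid values and the Iris dataset are only illustrative).

```python
# Minimal sketch: searching over max_depth and max_features for a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 4, None],       # None lets the tree grow fully
    "max_features": [1, 2, None],       # features considered at each split
    "criterion": ["gini", "entropy"],   # impurity measure discussed above
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```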
Example-
The decision tree for the example discussed in the Gini calculation would be as shown in the figure (decision tree for the student data, with education as the first split).
Disadvantages of Decision Tree-
- A small change in the data-set can result in a large change in the structure of the decision tree, causing instability in the model.
- Training a decision tree model can take more time than many other algorithms.
- Decision tree calculations can be far more expensive than those of other algorithms.
- It is generally not advised to apply a decision tree for regression, i.e. predicting continuous values.
*********