Decision Trees
Hello everyone. I am a self-taught data scientist, and today's topic is decision trees. I will share what I
have learned, and I hope you can also gain some insights along with me.
Note: I would like to thank Onur Koç, who has played a significant role in helping me understand the
topic as I studied.
Decision trees are non-parametric supervised machine learning algorithms that can be employed for
both classification and regression tasks. They are among the most widely used and robust machine
learning algorithms today. Visually, they can be represented as upside-down trees.
Before we start, I would like to provide information about the terms used in the diagram on the right:
Root node:
• It tests the dataset on a specific feature and divides the data into two or more subsets based on
the test result.
• Each subset can be further divided into more subsets with another test at the next node.
• Decision nodes represent the decision rules used for classifying or regressing the dataset.
Terminal/Leaf node:
• After the dataset is divided based on a specific rule or condition, classification or regression
results are obtained in these terminal nodes.
• Leaf nodes are the bottommost nodes of the tree and produce the final outcomes. They contain
a class or regression value.
Let’s build a decision tree and visualize it to understand the process. We are going to use one of the
most popular datasets, the iris dataset.
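Below is a minimal sketch of how such a tree might be built and drawn with scikit-learn. The exact parameter values (for example max_depth=3) and the use of plot_tree are my own illustrative choices, not something fixed by the original text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Load the iris dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()
X, y = iris.data, iris.target

# Fit a small tree; max_depth=3 is an illustrative choice that keeps the diagram readable.
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X, y)

# Draw the fitted tree: each box shows the split rule, gini, samples, value and class.
plt.figure(figsize=(12, 7))
plot_tree(tree_clf,
          feature_names=iris.feature_names,
          class_names=list(iris.target_names),
          filled=True)
plt.show()
```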
To see the decision tree's working principle in action, we begin at the root node (depth 0): this node
checks whether the petal length (cm) feature is less than or equal to 2.45. If so, we move to the left
child node of the root (depth 1, left). In this case, that node serves as a leaf node, meaning it asks no
further questions and simply produces a result: every sample with a petal length (cm) value of 2.45 or
less is classified as the setosa type.
Exploring data with petal length (cm) values greater than 2.45, we examine the right child node of the
root (depth 1, right), which is a decision node introducing a new question. Does our petal width (cm)
value exceed 1.75? This question leads us to new decision nodes (depth 2) that, in turn, ask more
questions, eventually reaching leaf nodes to classify all our data.
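To make this walk-through concrete, here is a tiny hand-written version of the first two splits described above. The thresholds 2.45 and 1.75 come from the text; the labels returned for the deeper branches are simplified, since the text only follows the tree down to depth 2.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Follow the first two decision rules described in the text."""
    if petal_length_cm <= 2.45:
        return "setosa"              # left child of the root: a pure leaf
    if petal_width_cm <= 1.75:
        return "mostly versicolor"   # deeper splits would refine this branch
    return "mostly virginica"        # deeper splits would refine this branch

print(classify_iris(1.4, 0.2))  # a typical setosa measurement -> "setosa"
print(classify_iris(5.1, 2.0))  # a typical virginica measurement -> "mostly virginica"
```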
The 'samples' value indicates how many training examples fall into that node, while the 'value' list
shows how those examples are distributed across the classes. For example, the depth 1, left node tells
us that 50 samples reach it, and all 50 belong to the first class (as confirmed by the 'class' field).
Now that we understand the basic logic of the decision tree, let's address some questions that naturally
arise; answering them will take us deeper into how the algorithm actually works.
• When asking questions, how does it decide which feature to select? For example,
o Why did it choose the petal length feature at the root node instead of sepal width or
petal width?
• When asking questions, how does it decide which feature value to choose? For example,
o Why did it not choose other values like 1.7 or 2.3 instead of the value 2.45?
• What is the Gini value, and why is it important?
The answer to all three questions comes down to impurity: at each node, the algorithm tries every candidate feature and threshold and keeps the split that reduces impurity (measured by Gini or entropy) the most.

Gini = 1 − Σ_{i=1}^{n} p_i²

where:
• n: the number of classes
• p_i: the proportion of each class in the node

Gain(S, A) = Entropy(S) − Σ_{i ∈ values(A)} (|S_i| / |S|) × Entropy(S_i)

Plugging in the example split (a parent node with entropy 1, divided into child nodes of 54 and 46 samples with entropies 0.445 and 0.151):

Information Gain = 1 − [(54/100) × 0.445 + (46/100) × 0.151] ≈ 0.69
• 0: the classes in the node are completely homogeneous (pure), indicating a clean separation.
• Values approaching 1: the classes in the node are thoroughly mixed, not homogeneous at all.
An Information Gain of 0.69 means the split removed a substantial part of the uncertainty in the parent
node, although the resulting child nodes are still not entirely homogeneous.
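As a sanity check on the formulas, the short sketch below computes Gini impurity and entropy for a node and then reproduces the information-gain arithmetic above; the class counts (54 and 46) and the child entropies (0.445 and 0.151) are simply taken from the worked example.

```python
import math

def gini(proportions):
    # Gini = 1 - sum(p_i^2): 0 for a pure node, close to 1 for heavily mixed classes.
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# A perfectly balanced binary node:
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0

# Reproduce the worked example: parent entropy 1, children with
# 54 and 46 samples and entropies 0.445 and 0.151.
info_gain = 1.0 - (54 / 100 * 0.445 + 46 / 100 * 0.151)
print(round(info_gain, 2))  # ~0.69
```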
2. max_depth
As you may recall, we used the max_depth parameter at the beginning of the text. This parameter
determines the maximum depth of the decision tree. It controls how deep the tree can grow. A larger
max_depth results in a more complex and detailed tree, but it may increase the risk of overfitting.
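A quick way to see this trade-off is to compare the training and test accuracy of a shallow tree and an unconstrained one. The split, random_state and depth values below are arbitrary choices for illustration; on a dataset as small as iris the gap will be modest.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in (2, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={clf.score(X_train, y_train):.2f}, "
          f"test accuracy={clf.score(X_test, y_test):.2f}")
```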
3. min_samples_split
It sets the minimum number of samples required to split a node. This parameter restricts further
divisions in the tree and can help reduce the risk of overfitting.
(Figure: with min_samples_split = 10, a node holding fewer than 10 samples is not split further and becomes a leaf.)
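Here is a minimal sketch of the effect, again on iris; the value 10 matches the figure above, everything else is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Any node holding fewer than 10 samples becomes a leaf instead of being split further.
restricted = DecisionTreeClassifier(min_samples_split=10, random_state=42).fit(X, y)
unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)

print("leaves with min_samples_split=10:", restricted.get_n_leaves())
print("leaves with the default setting: ", unrestricted.get_n_leaves())
```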
4. max_features
It determines the maximum number of features the model considers at each split of the decision tree,
which is particularly useful for large datasets. Suppose our dataset has 50 different features and we set
max_features = 10: before each split, the model randomly selects 10 features and chooses the best split
among those 10. Like the other parameters above, it can be adjusted to help prevent overfitting.
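Here is a small sketch of the 50-feature scenario from the paragraph above, using a synthetic dataset generated with make_classification (the dataset itself is an assumption made purely for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A synthetic dataset with 50 features, mirroring the example in the text.
X, y = make_classification(n_samples=500, n_features=50, random_state=42)

# At every split the tree examines a random subset of 10 features
# and picks the best split only among those 10.
clf = DecisionTreeClassifier(max_features=10, random_state=42)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```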
5. class_weight (for classification problems)
The main reasons for using `class_weight` are as follows:
• Balancing classes in imbalanced datasets: If some classes in your dataset have fewer examples
than others, you can use class weights to assign more weight to minority classes, allowing the
model to better learn these classes.
• Giving more importance to specific classes: If you want certain classes to have a greater impact on
the model's learning, you can assign higher weights to these classes.
Typically, the `class_weight` parameter is used in two ways:
• `class_weight="balanced"`: This option automatically determines class weights. The weights are
calculated inversely proportional to the frequency of each class in the dataset. This ensures the
automatic assignment of appropriate weights when there is an imbalance among classes.
• Manual Specification (`class_weight={0: 1, 1: 2}`): Users can manually set the weights of classes.
This is useful, especially when prioritizing a specific class or correcting an imbalance situation.
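Both usages can be sketched like this; the imbalanced dataset is synthetic, and the manual weights are just the ones used as an example above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# An imbalanced binary problem: roughly 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# "balanced": weights are computed inversely proportional to class frequencies.
balanced_clf = DecisionTreeClassifier(class_weight="balanced", random_state=42).fit(X, y)

# Manual weights: errors on class 1 count twice as much as errors on class 0.
manual_clf = DecisionTreeClassifier(class_weight={0: 1, 1: 2}, random_state=42).fit(X, y)
```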
6. sample_weight
The sample_weight parameter is used to determine the importance of each individual example (data
point). For instance, when developing a medical diagnosis model, you may believe that the diagnosis
for some patients is more critical than others.
For example, consider the following scenarios:
Example 1 (Patient A): He/she has a critical condition, and accurate diagnosis is crucial.
Example 2 (Patient B): He/she has a less critical condition, and accurate diagnosis is important but not
a top priority.
By using sample_weight, you can assign a higher weight to Example 1, which helps the model pay
more attention to diagnosing critical cases.
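One thing worth noting is that in scikit-learn, sample_weight is passed to fit() rather than to the constructor. Below is a minimal sketch with made-up patient data; the features, labels, and the weight of 3.0 are all hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient records: [age, risk_factor] and a binary diagnosis label.
X = np.array([[63, 1], [45, 0], [58, 1], [30, 0], [70, 1], [25, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

# Patient A (the critical case, first row) gets three times the weight of the others.
weights = np.array([3.0, 1.0, 1.0, 1.0, 1.0, 1.0])

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y, sample_weight=weights)  # sample_weight is an argument of fit(), not of the constructor
```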
Some advantages of decision trees are:
• Simple to understand and to interpret. Trees can be visualized.
• Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Some tree and algorithm combinations support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
• Able to handle both numerical and categorical data.
• Able to handle multi-output problems.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic.

The disadvantages of decision trees include:
• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
• Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
• The problem of learning an optimal decision tree is NP-complete, so practical algorithms often make locally optimal decisions. These algorithms cannot guarantee the globally best decision tree.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.