Decision Trees and Regression Techniques
Module 3
Decision Tree
• Introduction
In a decision tree, to predict the class label for a record we start from the root of the tree. We compare the value of the
root attribute with the corresponding attribute of the record and, on the basis of this comparison, follow the branch
corresponding to that value and jump to the next node.
It is called a decision tree because, like a tree, it starts with a root node that expands into further branches, forming a
tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits into subtrees.
A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is a
key step when creating a machine learning model. Below are two reasons for using a decision tree:
•Decision trees usually mimic the way humans think when making decisions, so they are easy to understand.
•The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
•Root Node: The root node is where the decision tree starts. It represents the entire dataset, which is further divided into two or
more homogeneous sets.
•Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is reached.
•Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
•Branch/Sub Tree: A subtree formed by splitting a node of the tree.
•Pruning: Pruning is the process of removing the unwanted branches from the tree.
•Parent/Child node: A node that splits into sub-nodes is called the parent node, and the sub-nodes are called its child nodes.
Decision tree learners build a model in the form of a tree structure. The model itself comprises a series
of logical decisions, similar to a flowchart, with decision nodes that indicate a decision to be made on
an attribute. These split into branches that indicate the decision's choices. The tree is terminated by leaf
nodes (also known as terminal nodes) that denote the result of following a combination of decisions.
Data to be classified begins at the root node, where it is passed through the various decisions in the tree
according to the values of its features. The path that the data takes funnels each record into a leaf node,
which assigns it a predicted class.
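To make this traversal concrete, the following minimal Python sketch stores a hypothetical tree as nested dictionaries and routes a record from the root to a leaf; the feature names, thresholds, and class labels are invented purely for illustration.

# Hypothetical tree: internal nodes test a feature against a threshold;
# plain strings are leaf nodes carrying the predicted class.
tree = {
    "feature": "income",
    "threshold": 40000,
    "left":  "reject",                     # income < 40000
    "right": {                             # income >= 40000
        "feature": "debt_ratio",
        "threshold": 0.35,
        "left":  "approve",
        "right": "reject",
    },
}

def classify(record, node):
    # Follow branches until a leaf (a plain string) is reached.
    while isinstance(node, dict):
        branch = "left" if record[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node

print(classify({"income": 52000, "debt_ratio": 0.2}, tree))   # -> approve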
As the decision tree is essentially a flowchart, it is particularly appropriate for applications in which the
classification mechanism needs to be transparent for legal reasons or the results need to be shared in order
to facilitate decision making. Some potential uses include:
• Credit scoring models in which the criteria that cause an applicant to be rejected need to be well-specified
• Marketing studies of customer churn or customer satisfaction that will be shared with management or
advertising agencies
• Diagnosis of medical conditions based on laboratory measurements, symptoms, or rate of disease
progression
Divide and conquer
Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as
divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of
similar classes.
Beginning at the root node, which represents the entire dataset, the algorithm chooses a feature that is
the most predictive of the target class. The examples are then partitioned into groups of distinct values
of this feature; this decision forms the first set of tree branches. The algorithm continues to divide-and-
conquer the nodes, choosing the best candidate feature each time until a stopping criterion is reached.
This might occur at a node if:
• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among examples
• The tree has grown to a predefined size limit
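A minimal Python sketch of this divide-and-conquer loop is shown below. The helper choose_best_feature is a placeholder introduced here for illustration (a real learner such as C5.0 would pick the feature with the highest information gain, discussed later), and the stopping tests mirror the three criteria above.

from collections import Counter

def choose_best_feature(examples, features):
    # Placeholder selection rule for this sketch: pick the first remaining feature.
    # A real learner would pick the feature with the highest information gain.
    return features[0]

def build_tree(examples, features, max_depth=5, depth=0):
    # `examples` is a list of (feature_dict, label) pairs.
    labels = [label for _, label in examples]

    # Stopping criteria: the node is pure, no features remain, or a size limit is reached.
    if len(set(labels)) == 1 or not features or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class

    feature = choose_best_feature(examples, features)
    node = {"feature": feature, "branches": {}}

    # Divide: partition the examples by the distinct values of the chosen feature.
    partitions = {}
    for record, label in examples:
        partitions.setdefault(record[feature], []).append((record, label))

    # Conquer: build a subtree for each partition, without reusing the chosen feature.
    remaining = [f for f in features if f != feature]
    for value, subset in partitions.items():
        node["branches"][value] = build_tree(subset, remaining, max_depth, depth + 1)
    return node

# Toy usage with made-up data:
toy = [({"outlook": "sunny"}, "yes"), ({"outlook": "rainy"}, "no"), ({"outlook": "sunny"}, "yes")]
print(build_tree(toy, ["outlook"]))   # {'feature': 'outlook', 'branches': {'sunny': 'yes', 'rainy': 'no'}}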
To illustrate the tree building process, let's consider a simple example. Imagine that you are working for a
Hollywood film studio, and your desk is piled high with screenplays. Rather than read each one cover-to-
cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into
one of three categories: mainstream hit, critic's choice, or box office bust.
To gather data for your model, you turn to the studio archives to examine the previous ten years of
movie releases.
After reviewing the data for 30 different movie scripts, a pattern emerges.
There seems to be a relationship between the film's proposed shooting budget, the number of A-list
celebrities lined up for starring roles, and the film's category of success.
A scatter plot of this data might look something like the following diagram:
To build a simple decision tree using this data, we can apply a divide-and-conquer strategy. Let's first split
on the feature indicating the number of celebrities, partitioning the movies into groups with and without a low
number of A-list stars:
Next, among the group of movies with a larger number of celebrities, we can make another split between
movies with and without a high budget:
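Hard-coding those two manual splits yields a very small tree, sketched below in Python; the thresholds and the class assigned to each leaf are illustrative assumptions, since they would depend on the studio data.

def predict_movie(celebrity_count, budget_millions):
    # First split: number of A-list celebrities (illustrative threshold).
    if celebrity_count < 3:
        return "box office bust"
    # Second split, within the high-celebrity group: proposed shooting budget.
    if budget_millions > 50:
        return "mainstream hit"
    return "critic's choice"

print(predict_movie(celebrity_count=6, budget_millions=30))   # -> critic's choice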
Function C5.0(S)
This function applies the divide-and-conquer strategy to the example set S to create a decision tree DT.
If all the examples in S belong to the same class (or another stopping criterion is met):
    Return a leaf node labelled with the most common class in S
Else:
    i) Choose the attribute A with the highest information gain and partition S into example sets Si, one for each value Vi of A
    ii) Use C5.0() to construct a decision tree DTi for each example set Si
    iii) Return a decision node that tests A, with branch Vi leading to subtree DTi
• There are numerous implementations of decision trees, but one of the most well-known is the C5.0
algorithm.
• This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior
algorithm, C4.5, which itself is an improvement over his ID3 (Iterative Dichotomiser 3) algorithm.
• The C5.0 algorithm has become an industry standard for producing decision trees because it does well
on most types of problems directly out of the box. Compared to more advanced machine learning models (such as
neural networks and Support Vector Machines), the decision trees built by the C5.0 algorithm generally perform nearly as well
but are much easier to understand and deploy.
• The first challenge that a decision tree faces is to identify which feature to split upon. If a
segment of the data contains only a single class, it is considered pure.
C5.0 uses the concept of entropy for measuring purity. The entropy of a sample of data indicates how
mixed the class values are; the minimum value of 0 indicates that the sample is completely
homogeneous, while the maximum value (1 for two classes, log2(c) in general) indicates the greatest amount of disorder.
The definition of entropy can be specified as:
Entropy(S) = Σ (from i = 1 to c) −Pi × log2(Pi)
For a given segment of data (S), the term c refers to the number of class levels and Pi refers to the
proportion of values falling into class level i. For example, suppose we have a partition of data with two
classes: red (60 percent) and white (40 percent). We can calculate the entropy as follows:
Entropy(S) = −0.60 × log2(0.60) − 0.40 × log2(0.40) = 0.9709506
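As a quick check of that number, here is a short Python sketch of the entropy calculation; the 60/40 red and white proportions come from the example above.

from math import log2

def entropy(proportions):
    # Entropy of a segment: minus the sum of p_i * log2(p_i) over the class proportions.
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([0.60, 0.40]))   # 0.9709505944546686
print(entropy([0.50, 0.50]))   # 1.0  (maximum disorder for two classes)
print(entropy([1.00]))         # 0.0  (completely homogeneous)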
To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that
would result from a split on each possible feature, a measure known as information gain. The information gain for a feature F is
calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from
the split (S2):
InfoGain(F)=Entropy(S1)−Entropy(S2)
One complication is that after a split, the data is divided into more than one partition. Therefore, the
function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this
by weighting each partition's entropy by the proportion of records falling into that partition.
This can be stated in a formula as:
Entropy(S2) = Σ (from i = 1 to n) wi × Entropy(Pi)
In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions,
weighted by the proportion of examples falling into that partition (wi).
The higher the information gain, the better a feature is at creating homogeneous groups after a split
on this feature. If the information gain is zero, there is no reduction in entropy for splitting on this
feature.
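The weighted-entropy and information-gain calculations can be sketched in Python as follows; the ten-record split below is hypothetical and is only meant to show the arithmetic.

from math import log2

def entropy(labels):
    # Entropy computed from a list of class labels rather than precomputed proportions.
    total = len(labels)
    props = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in props if p > 0)

def information_gain(parent_labels, partitions):
    # Entropy(S1) minus the weighted entropy of the partitions (S2).
    total = len(parent_labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted

# Hypothetical split: 10 records divided into two partitions by some feature F.
parent = ["red"] * 6 + ["white"] * 4
split  = [["red"] * 5 + ["white"] * 1, ["red"] * 1 + ["white"] * 3]
print(information_gain(parent, split))   # about 0.256, so the split reduces entropy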
Pruning the Decision Tree
Decision trees can continue to grow indefinitely, choosing splitting features and dividing the data into
smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features
to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and
the model will overfit the training data. To overcome this, we can prune a decision tree, a process of
reducing its size so that it generalizes better to unseen data.
One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions,
or if the decision node contains only a small number of examples. This method is known as pre-pruning, or
early stopping the decision tree. As the tree avoids doing unnecessary work, this can quite often be an
appealing strategy.
One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it takes care of many of
the decisions automatically, using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It
first grows a large tree that overfits the training data. Later, nodes and branches that have little effect on
the classification errors are removed. In some cases, entire branches are moved further up the tree or
replaced by simpler decisions. These processes of grafting branches are known as subtree raising and
subtree replacement, respectively.
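C5.0 itself ships as an R package, but the pre-pruning and post-pruning ideas can be illustrated with scikit-learn's CART-style decision tree in Python; this is an analogous learner rather than C5.0, and the parameter values below are arbitrary.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): cap the tree's size while it grows.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow a full tree, then trim branches that contribute little,
# here via cost-complexity pruning (ccp_alpha > 0).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned  test accuracy:", pre_pruned.score(X_test, y_test))
print("post-pruned test accuracy:", post_pruned.score(X_test, y_test))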
Understanding classification rules
Classification rules represent knowledge in the form of logical if-else statements that assign a class to
unlabelled examples. They are specified in terms of an antecedent and a consequent; these form a
hypothesis stating that "if this happens, then that happens."
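For instance, a single hypothetical rule, with its antecedent (the "if" part) and consequent (the "then" part), might be sketched in Python as follows; the feature names and threshold are invented for illustration.

# One hypothetical classification rule: antecedent -> consequent.
# IF operating_temperature > 90 AND vibration_level == "high" THEN class = "failure risk"

def apply_rule(record):
    if record["operating_temperature"] > 90 and record["vibration_level"] == "high":
        return "failure risk"   # consequent assigned when the antecedent holds
    return None                 # rule does not fire; other rules would be tried

print(apply_rule({"operating_temperature": 95, "vibration_level": "high"}))  # -> failure risk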
Rule learners are often used in a manner similar to decision tree learners. Like decision trees, they can be
used for applications that generate knowledge for future action, such as:
• Identifying conditions that lead to a hardware failure in mechanical devices
• Describing the defining characteristics of groups of people for customer segmentation
• Finding conditions that precede large drops or increases in the prices of shares on the stock market