
GOA COLLEGE OF ENGINEERING

Affiliated to Goa University


INFORMATION TECHNOLOGY DEPARTMENT

UNIT 2
Vision
Impart high quality knowledge and skills to students in the field of Information Technology ,motivate research, encourage industry consultancy projects and
nurture human values and life skills.

Machine Learning Algorithms (2)
Analysing, Visualizing and Applying Data Science
Semester VIII, B.E.(I.T.), Academic Year 2024-25

Ms. Seeya Gude
Asst. Professor, Information Technology Department
Goa College of Engineering
MINOR DEGREE DATA SCIENCE SEM 8
Decision trees
A decision tree is a graphical representation of the different options for solving a problem and shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called the root node, which branches out into different possible outcomes, where:
• Root Node: the starting point; it represents the entire dataset.
• Branches: the lines that connect nodes; they show the flow from one decision to another.
• Internal Nodes: points where decisions are made based on the input features.
• Leaf Nodes: the terminal nodes at the ends of branches; they represent final outcomes or predictions.

Example of Decision tree

Decision trees
Advantages of Decision Trees
• Simplicity and Interpretability: decision trees are straightforward and easy to understand. You can visualize them like a flowchart, which makes it simple to see how decisions are made.
• Versatility: they can be used for different types of tasks and work well for both classification and regression.
• No Need for Feature Scaling: they don’t require you to normalize or scale your data.
• Handles Non-linear Relationships: they are capable of capturing non-linear relationships between features and target variables.
Decision trees
Disadvantages of Decision Trees
• Overfitting: overfitting occurs when a decision tree captures noise and details in the training data, so it performs poorly on new data.
• Instability: the model can be unreliable; slight variations in the input can lead to significant differences in the predictions.
• Bias towards Features with More Levels: decision trees can become biased towards features with many categories, focusing too much on them during decision-making. This can cause the model to miss other important features, leading to less accurate predictions.

Decision trees
Applications of Decision Trees
• Loan Approval in Banking
• Medical Diagnosis
• Predicting Exam Results in Education

Decision trees using Python
Key Components of Decision Trees in Python
• Root Node: The decision tree’s starting node, which stands for the complete
dataset.
• Branch Nodes: Internal nodes that represent decision points, where the data
is split based on a specific attribute.
• Leaf Nodes: Terminal nodes that represent the final classification or prediction.
• Decision Rules: Rules that govern the splitting of data at each branch node.
• Attribute Selection: The process of choosing the most informative attribute
for each split.
• Splitting Criteria: Metrics like information gain, entropy, or the Gini Index are
used to calculate the optimal split.
Decision trees using Python
Pseudocode of Decision tree
1. Find the best attribute and place it on the root node of the tree.
2. Split the training set into subsets such that each subset contains records with the same value for that attribute.
3. Repeat steps 1 and 2 on each subset until leaf nodes are reached in all branches (see the sketch below).
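A minimal scikit-learn sketch of this flow (illustrative, not from the slides; the iris dataset stands in for any training set):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a sample dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() performs the attribute-selection and splitting loop described above
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on unseen data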

Decision trees using Python

• Key concepts in Decision Trees:
1. Gini index
2. Entropy
3. Information gain

Decision trees using Python
• Key concepts in Decision Trees
• The Gini index and information gain are both used to select, from the n attributes of the dataset, the attribute to be placed at the root node or at an internal node.
• Gini index
• Gini Index = 1 − Σ_j p_j², where p_j is the proportion of instances belonging to class j.
• The Gini index is a metric that measures how often a randomly chosen element would be incorrectly classified.
• An attribute with a lower Gini index should therefore be preferred.
• Sklearn supports the “gini” criterion for the Gini index and uses “gini” by default; a quick sketch follows.
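A small sketch of the Gini computation itself (illustrative; the 9-yes / 5-no labels anticipate the play-golf example used later in these slides):

import numpy as np

def gini_index(labels):
    # Gini = 1 - sum_j p_j^2, where p_j is the proportion of class j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

print(gini_index(["yes"] * 9 + ["no"] * 5))   # ~0.459; 0 would mean a pure node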

Decision trees using Python
• Key concepts in Decision Trees
• Information Gain
• Definition: suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A. Then:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy typically changes. Information gain is a measure of this change in entropy.
• Sklearn supports the “entropy” criterion for information gain; to use information gain in sklearn, we have to specify it explicitly.
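In sklearn that explicit choice is a single constructor argument; a minimal sketch:

from sklearn.tree import DecisionTreeClassifier

# criterion defaults to "gini"; information gain must be requested as "entropy"
clf = DecisionTreeClassifier(criterion="entropy")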
Decision trees using Python
• Key concepts in Decision Trees
• Entropy
• If a random variable x can take N different values, the i-th value x_i occurring with probability p(x_i), we can associate the following entropy with x:

H(x) = −Σ_{i=1}^{N} p(x_i) log₂ p(x_i)

• Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content.
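A direct translation of H(x) into code (a sketch; the probabilities are assumed to be given):

import numpy as np

def entropy(probs):
    # H(x) = -sum_i p(x_i) * log2(p(x_i)); terms with p = 0 contribute nothing
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([1.0]))        # 0.0: a pure collection has no impurity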

Decision trees using ID3

Decision: can I play golf today?

The entropy of a two-class set can be written as:

Entropy(p, n) = −(p / (p + n)) log₂(p / (p + n)) − (n / (p + n)) log₂(n / (p + n))

where:
p – positive outcomes (yes)
n – negative outcomes (no)
p + n – total records

Decision trees using ID3
Decision: can I play golf today? (p = 9, n = 5)

Entropy of entire data set
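The slide shows the calculation as an image; with p = 9 and n = 5 it works out to:

Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940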

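The per-attribute gain calculations on the preceding slides appear as images; a self-contained sketch that reproduces them (the table values are the classic play-golf dataset, an assumption since the slides’ own table is not reproduced here):

import numpy as np
import pandas as pd

# The classic 14-row play-golf dataset (9 yes / 5 no)
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity": ["high", "high", "high", "high", "normal", "normal", "normal",
                 "high", "normal", "normal", "normal", "high", "normal", "high"],
    "wind": ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
             "weak", "weak", "weak", "strong", "strong", "weak", "strong"],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

def entropy(series):
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def info_gain(df, feature, target="play"):
    # Weighted average entropy of the subsets created by splitting on `feature`
    weighted = sum(len(sub) / len(df) * entropy(sub[target])
                   for _, sub in df.groupby(feature))
    return entropy(df[target]) - weighted

for f in ["outlook", "temperature", "humidity", "wind"]:
    print(f, round(info_gain(data, f), 3))
# outlook has the highest gain (~0.247), so it becomes the root node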
Decision trees using ID3

Decision: can I play golf today?

Now you will work only on the subset of the dataset where outlook = sunny.

(Figures: the sunny subset, and its splits by temperature, wind, and humidity.)

Now you will work only on the subset of the dataset where outlook = rainy.

(Figures: the rainy subset, and its splits by temperature, wind, and humidity.)
ID3 using scikit learn - Python
Libraries needed:
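The slide’s library list is a screenshot; given the steps that follow, it likely includes at least the following (an assumption; the section title suggests sklearn may also be imported):

import numpy as np    # log2 for the entropy calculations
import pandas as pd   # reading and manipulating the diabetes CSV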

ID3 using scikit learn - Python
Dataset used in the following program: diabetes

ID3 using scikit learn - Python
Step1: Reading the data set:
df = pd.read_csv('/content/diabetes.csv')
df.head()   # returns the first 5 rows of the dataset (optional)

ID3 using scikit learn - Python
Step2: Calculating Entropy for dataset:
The code defines a function, calculate_entropy, which computes
entropy for a dataset based on a specified target column. It starts by
determining the total number of rows and unique values in the target
column. Then, it iterates through these values, calculating the
proportion of instances for each value and updating the entropy
accordingly.

ID3 using scikit learn - Python
Step2: Calculating Entropy for dataset:
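The code itself appears as a screenshot; a sketch consistent with the description on the previous slide (the function name comes from the slides; the body is an assumption):

import numpy as np

def calculate_entropy(df, target_column):
    # Total number of rows, and the unique values of the target column
    total_rows = len(df)
    entropy = 0.0
    for value in df[target_column].unique():
        # Proportion of instances taking this target value
        p = (df[target_column] == value).sum() / total_rows
        if p > 0:
            entropy -= p * np.log2(p)
    return entropy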

ID3 using scikit learn - Python
• Step3: Calculating Entropy and Information Gain
The `calculate_information_gain` function calculates the information
gain for a specified feature by computing the weighted average entropy
of subsets created by splitting the data based on that feature. The final
information gain is obtained by subtracting this weighted entropy from
the overall entropy of the dataset. This approach helps assess the
effectiveness of a feature in reducing uncertainty about the target
variable.

ID3 using scikit learn - Python
• Step3: Calculating Entropy and Information Gain
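Again a sketch matching the description (function name from the slides, body assumed); it reuses calculate_entropy from Step 2:

def calculate_information_gain(df, feature, target_column):
    # Weighted average entropy of the subsets created by splitting on `feature`
    total_rows = len(df)
    weighted_entropy = 0.0
    for value in df[feature].unique():
        subset = df[df[feature] == value]
        weighted_entropy += (len(subset) / total_rows) * calculate_entropy(subset, target_column)
    # Gain = overall entropy minus the weighted subset entropy
    return calculate_entropy(df, target_column) - weighted_entropy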

ID3 using scikit learn - Python
• Step4: Assessing best feature with highest information gain
The code iterates over each column in the DataFrame, excluding the
last column ('Outcome'), and calculates both entropy and information
gain for each column concerning the target variable ('Outcome'). For
each iteration, it computes the entropy using the calculate_entropy function
and the information gain using the calculate_information_gain function. Hence, the feature with the highest information gain is obtained.

ID3 using scikit learn - Python
• Step4: Assessing best feature with highest information gain
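A sketch of the loop the description outlines (the column and target names follow the slides; the details are assumptions):

results = {}
for column in df.columns[:-1]:   # every feature column, excluding 'Outcome'
    ent = calculate_entropy(df, 'Outcome')
    gain = calculate_information_gain(df, column, 'Outcome')
    results[column] = gain
    print(f"{column}: entropy = {ent:.4f}, information gain = {gain:.4f}")

best_feature = max(results, key=results.get)
print("Feature with highest information gain:", best_feature)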

ID3 using scikit learn - Python
• Step6: Build the ID3 Algorithm
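The slide’s implementation is a screenshot; a compact recursive sketch of ID3 built on the two helper functions above (structure and names are assumptions):

def build_id3_tree(df, target_column, features):
    labels = df[target_column].unique()
    # Base case 1: the subset is pure, so return the label as a leaf
    if len(labels) == 1:
        return labels[0]
    # Base case 2: no features left, so return the majority label
    if not features:
        return df[target_column].mode()[0]
    # Recursive case: split on the feature with the highest information gain
    best = max(features, key=lambda f: calculate_information_gain(df, f, target_column))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in df[best].unique():
        subset = df[df[best] == value]
        tree[best][value] = build_id3_tree(subset, target_column, remaining)
    return tree

# Note: ID3 expects categorical features; the diabetes columns are continuous
# and would normally be discretized first.
tree = build_id3_tree(df, 'Outcome', list(df.columns[:-1]))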

Thank You

