
Unit – 3(Decision Tree)

A decision tree is generally seen as a supervised learning algorithm for
classification tasks, but it can also be used for regression problems and, in some
situations, in unsupervised learning.
Reasoning- the action of thinking about something in a logical, sensible way.
Inductive Reasoning- Specific to General
'My elder brother is good at math. My friend's elder brother is good at math. My
neighbor's elder brother is a math tutor. Therefore, all elder brothers are good at
math.'
We have probably heard people use this type of reasoning in life, and we know it is
not always true: being an older brother does not inherently make you good at math.
What we have done is draw a generalized conclusion (all older brothers are good at
math) from three specific premises: my, my friend's and my neighbor's elder
brothers are all good at math. These specific instances are not representative of the
entire population of older brothers. Because inductive reasoning is based on
specific instances, it can often produce weak or invalid arguments.
We can remember inductive reasoning like this: inductive reasoning is bottom-up
reasoning; it starts from specific premises and induces a probable general
conclusion.

Deductive Reasoning- General to Specific


Deductive reasoning is reasoning in which true premises lead to a true and valid
conclusion. Deductive reasoning uses general principles to reach a specific
conclusion. It is also known as 'top-down reasoning' because it starts from the
general and works its way down to the specific.
For example, 'All cars have engines. I have a car. Therefore, my car has an engine.'
1. Decision Tree
1.1 Introduction

Decision tree learning is a method for approximating discrete-valued target
functions, in which the learned function is represented by a decision tree. Learned
trees can also be re-represented as sets of if-then rules to improve human
readability. These learning methods are among the most popular of inductive
inference algorithms and have been successfully applied to a broad range of tasks,
from learning to diagnose medical cases to learning to assess the credit risk of loan
applicants.

1.2 Decision Tree Representation


Consider the set of training examples in the following Table 1 and the
corresponding decision tree in Figure 1. The attributes Day, Outlook, Temperature,
Humidity and Wind form the input data set, and PlayTennis is the label/output.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

TABLE 1: Training examples for the target concept PlayTennis.

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

Figure 1: Decision tree for the target concept PlayTennis

Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. Each node in the
tree specifies a test of some attribute of the instance, and each branch descending
from that node corresponds to one of the possible values for this attribute. An
instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the
value of the attribute in the given example. This process is then repeated for the
subtree rooted at the new node. In general, decision trees represent a disjunction of
conjunctions of constraints on the attribute values of instances. Each path from the
tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself
to a disjunction of these conjunctions. For example, the decision tree shown in

Figure 1 corresponds to the following expression for PlayTennis = Yes:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
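
To make this disjunction concrete, here is a small illustrative sketch (not part of the original notes) that encodes the Figure 1 tree as if-then rules in Python; the function name classify_play_tennis is made up for this example.

```python
def classify_play_tennis(outlook, humidity, wind):
    """Hypothetical if-then encoding of the decision tree in Figure 1."""
    if outlook == "Sunny":
        # The Sunny branch tests Humidity
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        # Overcast days are always classified Yes
        return "Yes"
    if outlook == "Rain":
        # The Rain branch tests Wind
        return "Yes" if wind == "Weak" else "No"
    return None  # value of Outlook not seen in training


# Day D10 from Table 1 (Rain, Normal humidity, Weak wind) is classified Yes
print(classify_play_tennis("Rain", "Normal", "Weak"))  # prints: Yes
```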

1.3 Appropriate Problems For Decision Tree Learning

Decision tree learning is generally best suited to problems with the following
characteristics:
--Instances are represented by attribute-value pairs. Instances are described by
a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The
easiest situation for decision tree learning is when each attribute takes on a small
number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to
the basic algorithm also allow handling real-valued attributes (e.g., representing
Temperature numerically).
--The target function has discrete output values. The decision tree in Figure.1
assigns a boolean classification (e.g., yes or no) to each example. Decision tree
methods easily extend to learning functions with more than two possible output
values. Learning target functions with real-valued outputs is also possible, though
the application of decision trees in this setting is less common.
--Disjunctive descriptions may be required. As noted above, decision trees
naturally represent disjunctive expressions.
--The training data may contain errors. Decision tree learning methods are
robust to errors, both errors in classifications of the training examples and errors in
the attribute values that describe these examples.
--The training data may contain missing attribute values. Decision tree
methods can be used even when some training examples have unknown values
(e.g., if the Humidity of the day is known for only some of the training examples).

Decision tree learning has therefore been applied to problems such as learning to
classify medical patients by their disease, equipment malfunctions by their cause,
and loan applicants by their likelihood of defaulting on payments. Such problems,
in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as Classification problems.

1.4 The Basic Decision Tree Learning Algorithm- ID3 Algorithm

ID3 learns decision trees by constructing them top-down, beginning with the
question "Which attribute should be tested at the root of the tree?" To answer this
question, each instance attribute is evaluated using a statistical test to determine
how well it alone classifies the training examples.
The best attribute is selected and used as the test at the root node of the tree. A
descendant of the root node is then created for each possible value of this attribute.
The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the tree. This
forms a greedy search for an acceptable decision tree, in which the algorithm never
backtracks to reconsider earlier choices.
1.4.1 Which Attribute Is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each
node in the tree. We would like to select the attribute that is most useful for
classifying examples. What is a good quantitative measure of the worth of an
attribute? We define a statistical property, called information gain, that measures
how well a given attribute separates the training examples according to their target
classification. ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.

1.4.2 Entropy Measures Homogeneity Of Examples


In order to define information gain precisely, we begin by defining a measure
commonly used in information theory, called entropy, that characterizes the
(im)purity of an arbitrary collection of examples. Given a collection S, containing
positive and negative examples of some target concept, the entropy of S relative to
this boolean classification is
Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)        ... EQ(1)
where p+ is the proportion of positive examples in S and p- is the proportion of
negative examples in S. Suppose S is a collection of 14 examples of some boolean
concept, including 9 positive and 5 negative examples. Then the entropy of S
relative to this boolean classification is
Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Entropy is 0 if all members of S belong to the same class. For example, if all
members are positive then p- is 0, and Entropy(S) = -1·log2(1) - 0·log2(0)
= -1·0 - 0 = 0 (taking 0·log2(0) to be 0). The entropy is 1 when the collection
contains an equal number of positive and negative examples. If the collection
contains unequal numbers of positive and negative examples, the entropy is
between 0 and 1.
More generally, if the target attribute can take on c different values, then the
entropy of S relative to this c-wise classification is defined as

Entropy(S) = Σ (i = 1 to c)  -pi log2(pi)

where pi is the proportion of S belonging to class i.
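
For illustration (a minimal sketch added to these notes, assuming the usual convention 0·log2(0) = 0), entropy can be computed directly from per-class counts; the helper below reproduces the 0.940 value for the [9+, 5-] collection.

```python
from math import log2

def entropy(class_counts):
    """Entropy of a collection given a list of counts per class.

    Uses the convention 0 * log2(0) = 0.
    """
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count > 0:
            p = count / total
            ent -= p * log2(p)
    return ent


print(round(entropy([9, 5]), 3))   # 0.94 -> Entropy([9+, 5-])
print(round(entropy([7, 7]), 3))   # 1.0  -> equal positives and negatives
print(round(entropy([14, 0]), 3))  # 0.0  -> all members in one class
```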


1.4.3 Information Gain Measures The Expected Reduction In Entropy


Given entropy as a measure of the impurity in a collection of training examples,
we can now define a measure of the effectiveness of an attribute in classifying the
training data. The measure we will use, called information gain, is simply the
expected reduction in entropy caused by partitioning the examples according to
this attribute. More precisely, the information gain, Gain(S, A) of an attribute A,
relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A))  (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset
of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
For example, suppose S is a collection of training-example days described by
attributes including Wind, which can have the values Weak or Strong. As before,
assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples,
suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and
the remainder have Wind = Strong. The information gain due to sorting the original
14 examples by the attribute Wind may then be calculated as

Values(Wind) = {Weak, Strong}
S = [9+, 5-]
S_Weak = [6+, 2-]
S_Strong = [3+, 3-]

Gain(S, Wind) = Entropy(S) - Σ (v ∈ {Weak, Strong})  (|Sv| / |S|) Entropy(Sv)
             = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)

Now from EQ(1):
Entropy(S_Weak) = -(6/8) log2(6/8) - (2/8) log2(2/8)
                = -(0.75)(-0.415) - (0.25)(-2.0)
                = 0.311 + 0.500 = 0.811
Entropy(S_Strong) = -(3/6) log2(3/6) - (3/6) log2(3/6)
                  = -(0.5)(-1.0) - (0.5)(-1.0) = 1.0

So, Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Information gain is precisely the measure used by ID3 to select the best attribute at
each step in growing the tree.
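
As a hedged illustration (not from the original notes), Gain(S, A) can be coded directly from class counts; the example below re-derives Gain(S, Wind) = 0.048 from the [6+, 2-] / [3+, 3-] split described above.

```python
from math import log2

def entropy(counts):
    # Entropy from per-class counts, with 0 * log2(0) taken as 0.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv).

    parent_counts: class counts of S, e.g. [9, 5]
    child_counts_list: class counts of each subset Sv, e.g. [[6, 2], [3, 3]]
    """
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted


# Wind splits S = [9+, 5-] into Weak = [6+, 2-] and Strong = [3+, 3-]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```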

1.4.4 Building Decision Tree


Let us try to build a decision tree from Table 1.
Which attribute is the best classifier? The attribute with the highest information
gain.

Let us find the information gain of each attribute. First take Humidity.

S: [9+, 5-], Entropy(S) = 0.940 (calculated above)

Humidity splits S into:
  High:   [3+, 4-]
  Normal: [6+, 1-]

Entropy(S_High) = -(3/7) log2(3/7) - (4/7) log2(4/7)
                = -(0.428)(-1.224) - (0.571)(-0.808)
                = 0.524 + 0.461 = 0.985
Entropy(S_Normal) = -(6/7) log2(6/7) - (1/7) log2(1/7)
                  = -(0.857)(-0.222) - (0.143)(-2.807)
                  = 0.191 + 0.401 = 0.592

Gain(S, Humidity) = Entropy(S) - Σ (v ∈ {High, Normal})  (|Sv| / |S|) Entropy(Sv)
                  = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Now take Wind.

S: [9+, 5-], Entropy(S) = 0.940

Wind splits S into:
  Weak:   [6+, 2-], Entropy(S_Weak) = 0.811
  Strong: [3+, 3-], Entropy(S_Strong) = 1.0

Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048  (details above)

Now take Outlook.

S: [9+, 5-], Entropy(S) = 0.940

Outlook splits S into:
  Sunny:    [2+, 3-]
  Overcast: [4+, 0-]
  Rain:     [3+, 2-]

Entropy(S_Sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5)
                 = -(0.4)(-1.322) - (0.6)(-0.737)
                 = 0.529 + 0.442 = 0.971
Entropy(S_Overcast) = -(4/4) log2(1) - (0/4) log2(0) = 0
Entropy(S_Rain) = -(3/5) log2(3/5) - (2/5) log2(2/5)
                = 0.442 + 0.529 = 0.971

Gain(S, Outlook) = Entropy(S) - Σ (v ∈ {Sunny, Overcast, Rain})  (|Sv| / |S|) Entropy(Sv)
                 = 0.940 - ((5/14)(0.971) + (4/14)(0) + (5/14)(0.971))
                 = 0.940 - 0.694 = 0.246

Now take Temperature.

S: [9+, 5-], Entropy(S) = 0.940

Temperature splits S into:
  Hot:  [2+, 2-]
  Mild: [4+, 2-]
  Cool: [3+, 1-]

Entropy(S_Hot) = -(2/4) log2(2/4) - (2/4) log2(2/4)
               = -(0.5)(-1) - (0.5)(-1) = 0.5 + 0.5 = 1.0
Entropy(S_Mild) = -(4/6) log2(4/6) - (2/6) log2(2/6)
                = -(0.667)(-0.585) - (0.333)(-1.585)
                = 0.390 + 0.528 = 0.918
Entropy(S_Cool) = -(3/4) log2(3/4) - (1/4) log2(1/4)
                = -(0.75)(-0.415) - (0.25)(-2) = 0.311 + 0.5 = 0.811

Gain(S, Temperature) = Entropy(S) - Σ (v ∈ {Hot, Mild, Cool})  (|Sv| / |S|) Entropy(Sv)
                     = 0.940 - ((4/14)(1.0) + (6/14)(0.918) + (4/14)(0.811))
                     = 0.940 - (0.286 + 0.393 + 0.232) = 0.029

So,
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

Therefore, Outlook is selected as the decision attribute for the root node, and
branches are created below the root for each of its possible values (i.e., Sunny,
Overcast, and Rain). The resulting partial decision tree is shown in Figure 2, along
with the training examples sorted to each new descendant node. Note that every
example for which Outlook = Overcast is also a positive example of PlayTennis.
Therefore, this node of the tree becomes a leaf node with the classification
PlayTennis = Yes. In contrast, the descendants corresponding to Outlook = Sunny
and Outlook = Rain still have nonzero entropy, and the decision tree will be further
elaborated below these nodes.
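
Before elaborating the tree further, here is a small sanity-check sketch (added for illustration, not part of the original notes) that recomputes the four gains directly from Table 1 in Python; the exact values differ from the hand calculation only in the third decimal place, because the hand calculation rounds its intermediate entropies.

```python
from collections import Counter, defaultdict
from math import log2

# Table 1 as (Outlook, Temperature, Humidity, Wind, PlayTennis) tuples
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[-1])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {gain(DATA, i):.3f}")
# Expected (approximate) output:
#   Gain(S, Outlook) = 0.247
#   Gain(S, Temperature) = 0.029
#   Gain(S, Humidity) = 0.152
#   Gain(S, Wind) = 0.048
# (The hand calculation above, which rounds intermediate entropies,
#  gives 0.246 and 0.151 for Outlook and Humidity.)
```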

Outlook
  Sunny    -> ?   (which attribute should be tested here?)
  Overcast -> Yes
  Rain     -> ?   (which attribute should be tested here?)

Figure 2: The partially learned decision tree, with the training examples sorted to
each new descendant node

S_Sunny = {D1, D2, D8, D9, D11} = [2+, 3-]
Entropy(S_Sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = -(0.4)(-1.322) - (0.6)(-0.737)
                 = 0.529 + 0.442 = 0.971

Humidity has two values, High and Normal. For Outlook = Sunny, Humidity is High
three times and Normal two times. For Outlook = Sunny and Humidity = High there
are three negative examples and no positive ones, and for Humidity = Normal there
are two positive examples and no negative ones.

Gain(S_Sunny, A) = Entropy(S_Sunny) - Σ (v ∈ Values(A))  (|Sv| / |S_Sunny|) Entropy(Sv)

Gain(S_Sunny, Humidity) = 0.971 - [(3/5)(-(0/3) log2(0/3) - (3/3) log2(3/3))
                                   + (2/5)(-(2/2) log2(2/2) - (0/2) log2(0/2))]
                        = 0.971 - [0 + 0] = 0.971

Gain(S_Sunny, Temperature) = 0.971 - [(2/5)(-(0/2) log2(0/2) - (2/2) log2(2/2))
                                      + (2/5)(-(1/2) log2(1/2) - (1/2) log2(1/2))
                                      + (1/5)(-(1/1) log2(1/1) - (0/1) log2(0/1))]
                           = 0.971 - [0 + 0.4 + 0] = 0.571

Gain(S_Sunny, Wind) = 0.971 - [(2/5)(1.0) + (3/5)(-(1/3) log2(1/3) - (2/3) log2(2/3))]
                    = 0.971 - [0.4 + 0.6((0.333)(1.585) + (0.667)(0.585))]
                    = 0.971 - [0.4 + 0.6(0.528 + 0.390)]
                    = 0.971 - 0.951 = 0.020
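
The same kind of check can be repeated on the subset of days with Outlook = Sunny (again an illustrative sketch, not part of the original notes); it confirms that Humidity has the highest gain on this branch.

```python
from collections import Counter, defaultdict
from math import log2

# Days with Outlook = Sunny: (Day, Temperature, Humidity, Wind, PlayTennis)
SUNNY = [
    ("D1", "Hot", "High", "Weak", "No"),
    ("D2", "Hot", "High", "Strong", "No"),
    ("D8", "Mild", "High", "Weak", "No"),
    ("D9", "Cool", "Normal", "Weak", "Yes"),
    ("D11", "Mild", "Normal", "Strong", "Yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[-1])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

for i, name in [(1, "Temperature"), (2, "Humidity"), (3, "Wind")]:
    print(f"Gain(S_Sunny, {name}) = {gain(SUNNY, i):.3f}")
# Approximate output: Temperature = 0.571, Humidity = 0.971, Wind = 0.020
```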


Since Gain(S_Sunny, Humidity) is the highest, Humidity is chosen as the test for
the Sunny branch:

Outlook
  Sunny    -> Humidity
  Overcast -> Yes
  Rain     -> ?

Now consider the Rain branch.

S_Rain = {D4, D5, D6, D10, D14} = [3+, 2-]

Entropy(S_Rain) = -(3/5) log2(3/5) - (2/5) log2(2/5) = -(0.6)(-0.737) - (0.4)(-1.322)
                = 0.442 + 0.529 = 0.971

Gain(S_Rain, Wind) = Entropy(S_Rain) - Σ (v ∈ {Weak, Strong})  (|Sv| / |S_Rain|) Entropy(Sv)
                   = 0.971 - [(2/5)(-(0/2) log2(0/2) - (2/2) log2(2/2))
                              + (3/5)(-(3/3) log2(3/3) - (0/3) log2(0/3))]
                   = 0.971 - [0 + 0] = 0.971

Gain(S_Rain, Temperature) = 0.971 - [(2/5)(-(1/2) log2(1/2) - (1/2) log2(1/2))
                                     + (3/5)(-(2/3) log2(2/3) - (1/3) log2(1/3))]
                          = 0.971 - [0.4 + 0.6((0.667)(0.585) + (0.333)(1.585))]
                          = 0.971 - [0.4 + 0.6(0.390 + 0.528)]
                          = 0.971 - 0.951 = 0.020

Since Gain(S_Rain, Wind) is the highest, Wind is chosen as the test for the Rain
branch:

Outlook
  Sunny    -> Humidity
                High   -> ?
                Normal -> ?
  Overcast -> Yes
  Rain     -> Wind
                Strong -> ?
                Weak   -> ?

Now it can be observed that the entropy at each of the four remaining points is
zero. For example, the entropy at the left-most point (Outlook = Sunny,
Humidity = High) is

E({D1, D2, D8}) = 0   (all three examples are No)

and the gain of the remaining attribute Temperature at this point is

Gain({D1, D2, D8}, Temperature) = 0 - [(2/3)(0) + (1/3)(0)] = 0

So the gain at every such point with respect to the attribute Temperature is zero,
and each of these points becomes a leaf labelled with the class of its examples.
So, the final decision tree is:

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

Algorithm-
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose
value is to be predicted by the tree. Attributes is a list of other attributes that
may be tested by the learned decision tree. Returns a decision tree that
correctly classifies the given Examples.
● Create a Root node for the tree
● If all Examples are positive, Return the single-node tree Root, with label = +
● If all Examples are negative, Return the single-node tree Root, with label = -
● If Attributes is empty, Return the single-node tree Root, with label = most
  common value of Target_attribute in Examples
● Otherwise Begin
  ● A ← the attribute from Attributes that best classifies Examples
  ● The decision attribute for Root ← A
  ● For each possible value vi of A:
    ● Add a new tree branch below Root, corresponding to the test A = vi
    ● Let Examples_vi be the subset of Examples that have value vi for A
    ● If Examples_vi is empty, then below this new branch add a leaf node with
      label = most common value of Target_attribute in Examples
    ● Else below this new branch add the subtree
      ID3(Examples_vi, Target_attribute, Attributes - {A})
● End
● Return Root
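
For readers who prefer code, below is a minimal Python sketch of the ID3 procedure above (added for illustration; the data structures, a list of dicts for Examples and a nested dict for the tree, are my own choices, not prescribed by the notes).

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    groups = defaultdict(list)
    for e in examples:
        groups[e[attribute]].append(e[target])
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def id3(examples, target, attributes):
    """Return a decision tree: either a class label (leaf) or a dict
    {"attribute": A, "branches": {value: subtree, ...}}."""
    labels = [e[target] for e in examples]
    # All examples have the same label -> single-node tree
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left -> label with the most common target value
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # For brevity, only attribute values present in examples are expanded;
    # the pseudocode also adds a majority-label leaf for values with no examples.
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree["branches"][value] = id3(subset, target, remaining)
    return tree
```

Calling id3(rows, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) on the Table 1 rows (each stored as a dict) should reproduce the tree in Figure 1, with Outlook at the root.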

1.5 Inductive Bias In Decision Tree Learning

Inductive bias is the set of assumptions that, together with the training data,
deductively justify the classifications assigned by the learner to future instances.
Given a collection of training examples, there are typically many decision trees
consistent with these examples. Describing the inductive bias of ID3 therefore
consists of describing the basis by which it chooses one of these consistent
hypotheses over the others. Which of these decision trees does ID3 choose? ID3's
search strategy (a) favors shorter trees over longer ones, and (b) selects trees that
place the attributes with the highest information gain closest to the root.
1.6 Issues In Decision Tree Learning

Practical issues in learning decision trees include determining how deeply to grow
the decision tree, handling continuous attributes, choosing an appropriate attribute
selection measure, handling training data with missing attribute values, handling
attributes with differing costs, and improving computational efficiency.
2. Overfitting and Underfitting in Machine Learning model

The situation where a model performs very well on the training data but its
performance drops significantly on the test set is called overfitting. On the other
hand, if the model performs poorly on both the training set and the test set, we call
it an underfitting model. Overfitting and underfitting are the two main problems
that occur in machine learning and degrade the performance of machine learning
models.
The main goal of every machine learning model is to generalize well: after being
trained on the dataset, it should produce reliable and accurate output on new data.
Hence, underfitting and overfitting are the two conditions that need to be checked
to judge whether the model is generalizing well or not.

Overfitting

Overfitting occurs when a machine learning model tries to cover all the data
points, or more data points than required, in the given dataset. Because of this, the
model starts capturing the noise and inaccurate values present in the data, which
reduces the efficiency and accuracy of the model. An overfitted model has low bias
and high variance.

The chance of overfitting increases the more we train the model: the longer we
train, the more likely we end up with an overfitted model.
Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the following graph
of a linear regression output:

Fig: Overfitting
How to avoid overfitting in a model
Both overfitting and underfitting degrade the performance of a machine learning
model. There are some ways by which we can reduce the occurrence of overfitting
in our model (a brief sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
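
As a hedged illustration of two of these remedies, cross-validation and limiting model capacity, the sketch below uses scikit-learn's DecisionTreeClassifier on a synthetic dataset; the dataset and the max_depth value are arbitrary choices made for this example, not recommendations from the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, somewhat noisy classification data (arbitrary choice for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# A fully grown tree tends to memorize noise (low bias, high variance) ...
full_tree = DecisionTreeClassifier(random_state=0)
# ... while limiting depth is one simple way to reduce overfitting.
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0)

for name, model in [("full tree", full_tree), ("max_depth=4", pruned_tree)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

On noisy data the shallower tree usually scores close to, or better than, the fully grown tree on the held-out folds, which is the overfitting effect described above.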

Underfitting

Underfitting occurs when a machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can
be stopped at an early stage, but then the model may not learn enough from the
training data and may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training
data; hence its accuracy is reduced and it produces unreliable predictions.

An underfitted model has high bias and low variance.


Example: We can understand underfitting from the following output of a linear
regression model:

Fig: Underfitting
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.

o Goodness of Fit
o The term "goodness of fit" is taken from statistics, and the goal of a machine
learning model is to achieve a good fit. In statistical modeling, it describes how
closely the predicted values match the true values of the dataset.
o A model with a good fit lies between an underfitted and an overfitted model;
ideally it would make predictions with zero error, but in practice this is difficult
to achieve.
o As we train our model, the errors on the training data go down, and at first the
same happens on the test data. But if we train the model for too long, its
performance may decrease due to overfitting, as the model also learns the noise
present in the dataset. The errors on the test dataset then start increasing, so the
point just before the test error begins to rise is the good point, and we can stop
there to achieve a good model.
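
The "good point" described above can be illustrated by growing a decision tree deeper and deeper and watching training and test error diverge; this is a small sketch using scikit-learn on synthetic data (the dataset and depth range are arbitrary illustrative choices, not part of the original notes).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Training error keeps falling as depth grows; test error typically falls,
# then rises again once the tree starts fitting noise (overfitting).
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"depth={depth:2d}  train error={train_err:.2f}  test error={test_err:.2f}")
```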
