Unit 3 MLT
Figure 1: A decision tree for the PlayTennis concept, with the attribute Outlook tested at the root.
Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. Each node in the
tree specifies a test of some attribute of the instance, and each branch descending
from that node corresponds to one of the possible values for this attribute. An
instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the
value of the attribute in the given example. This process is then repeated for the
subtree rooted at the new node. In general, decision trees represent a disjunction of
conjunctions of constraints on the attribute values of instances. Each path from the
tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself
to a disjunction of these conjunctions. For example, the decision tree shown in
Figure 1 corresponds to the expression (Outlook = Sunny ∧ Humidity = Normal) ∨
(Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak).
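To make the sorting procedure concrete, here is a minimal Python sketch that classifies one instance with a tree of the shape described above; the nested-dictionary encoding, attribute values, and function name are illustrative assumptions rather than part of the original text.

```python
# Illustrative encoding of the Figure 1 tree: internal nodes name the attribute
# they test, branches map attribute values to subtrees, and strings are leaf labels.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Weak": "Yes", "Strong": "No"}},
    },
}

def classify(node, instance):
    """Sort an instance down the tree from the root node to a leaf."""
    if isinstance(node, str):            # leaf node: its label is the classification
        return node
    value = instance[node["attribute"]]  # test the attribute specified by this node
    return classify(node["branches"][value], instance)

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No
```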
Decision tree learning is generally best suited to problems with the following
characteristics:
--Instances are represented by attribute-value pairs. Instances are described by
a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The
easiest situation for decision tree learning is when each attribute takes on a small
number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to
the basic algorithm allow real-valued attributes to be handled as well (e.g.,
representing Temperature numerically).
--The target function has discrete output values. The decision tree in Figure 1
assigns a boolean classification (e.g., yes or no) to each example. Decision tree
methods easily extend to learning functions with more than two possible output
values. Learning target functions with real-valued outputs is also possible, though
the application of decision trees in this setting is less common.
--Disjunctive descriptions may be required. As noted above, decision trees
naturally represent disjunctive expressions.
--The training data may contain errors. Decision tree learning methods are
robust to errors, both errors in classifications of the training examples and errors in
the attribute values that describe these examples.
--The training data may contain missing attribute values. Decision tree
methods can be used even when some training examples have unknown values
(e.g., if the Humidity of the day is known for only some of the training examples).
Decision tree learning has therefore been applied to problems such as learning to
classify medical patients by their disease, equipment malfunctions by their cause,
and loan applicants by their likelihood of defaulting on payments. Such problems,
in which the task is to classify examples into one of a discrete set of possible
categories, are often referred to as Classification problems.
ID3 learns decision trees by constructing them top-down, beginning with the
question "Which attribute should be tested at the root of the tree?" To answer this
question, each instance attribute is evaluated using a statistical test to determine
how well it alone classifies the training examples.
The best attribute is selected and used as the test at the root node of the tree. A
descendant of the root node is then created for each possible value of this attribute.
The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the tree. This
forms a greedy search for an acceptable decision tree, in which the algorithm never
backtracks to reconsider earlier choices.
1.4.1 Which Attribute Is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each
node in the tree. We would like to select the attribute that is most useful for
classifying examples. What is a good quantitative measure of the worth of an
attribute? We define a statistical property, called information gain, that measures
how well a given attribute separates the training examples according to their target
classification. ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.
For a collection S containing positive and negative examples of some target
concept, the entropy of S relative to this boolean classification is
Entropy(S) = −P+ log2 P+ − P− log2 P−
where P+ and P− are the proportions of positive and negative examples in S (with
0 log2 0 taken to be 0). Entropy is 0 if all members of S belong to the same class.
For example, if all members are positive then P− is 0, and Entropy(S) = −1 · log2 1
− 0 · log2 0 = 0 − 0 = 0. Note that the entropy is 1 when the collection contains an
equal number of positive and negative examples. If the collection contains unequal
numbers of positive and negative examples, the entropy is between 0 and 1.
More generally, if the target attribute can take on c different values, then the
entropy of S relative to this c-wise classification is defined as
Entropy(S) = ∑_{i=1}^{c} −Pi log2 Pi
where Pi is the proportion of S belonging to class i.
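As a quick numerical check of this definition, a couple of lines of Python (illustrative, not part of the original text) reproduce the entropy values used in the worked example that follows.

```python
import math

# Entropy of a collection with 9 positive and 5 negative examples, S: [9+, 5-]
p_pos, p_neg = 9 / 14, 5 / 14
print(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg))   # ~0.940

# A perfectly mixed collection has entropy 1; a pure collection has entropy 0
print(-0.5 * math.log2(0.5) - 0.5 * math.log2(0.5))           # 1.0
```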
Information gain is precisely the measure used by ID3 to select the best attribute at
each step in growing the tree. The information gain of an attribute A relative to a
collection of examples S is defined as
Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of possible values of attribute A, and Sv is the subset of
S for which A has value v.
Let us find the information gain of all the attributes. First take Humidity.
S: [9+, 5-]
E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Humidity = High: [3+, 4-], E = 0.985; Humidity = Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) = 0.151
Next take Wind.
Wind = Weak: [6+, 2-], E = 0.811; Wind = Strong: [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.0) = 0.048
Now take Outlook.
S: [9+, 5-], E(S) = 0.940
Outlook = Sunny: [2+, 3-], E = −(2/5) log2(2/5) − (3/5) log2(3/5) = (0.4)(1.322) + (0.6)(0.737) = 0.529 + 0.442 = 0.971
Outlook = Overcast: [4+, 0-], E = −(4/4) log2(1) − 0 · log2(0) = 0
Outlook = Rain: [3+, 2-], E = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971
Gain(S, Outlook) = 0.940 − (5/14)(0.971) − (4/14)(0) − (5/14)(0.971) = 0.246
A similar calculation for the remaining attribute, Temperature, gives Gain(S, Temperature) = 0.029.
So,
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
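These gain values can be checked with a short Python sketch. The class counts for each branch are taken from the calculations above, except the Temperature counts, which are assumed here for illustration; the helper-function names are also illustrative.

```python
import math

def entropy(counts):
    """Entropy of a collection given its class counts, e.g. [9, 5] for [9+, 5-]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    """Information gain of splitting a collection into the given branches."""
    n = sum(parent_counts)
    remainder = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - remainder

S = [9, 5]
print(gain(S, [[2, 3], [4, 0], [3, 2]]))   # Outlook      ~0.247
print(gain(S, [[3, 4], [6, 1]]))           # Humidity     ~0.152
print(gain(S, [[6, 2], [3, 3]]))           # Wind         ~0.048
print(gain(S, [[2, 2], [4, 2], [3, 1]]))   # Temperature  ~0.029 (counts assumed)
```

Small differences in the third decimal place relative to the values above come from rounding the intermediate entropies.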
Therefore, Outlook is selected as the decision attribute for the root node, and
branches are created below the root for each of its possible values (i.e., Sunny,
Overcast, and Rain). The resulting partial decision tree is shown in Figure 2, along
with the training examples sorted to each new descendant node. Note that every
example for which Outlook = Overcast is also a positive example of PlayTennis.
Therefore, this node of the tree becomes a leaf node with the classification
PlayTennis = Yes. In contrast, the descendants corresponding to Outlook = Sunny
and Outlook = Rain still have nonzero entropy, and the decision tree will be further
elaborated below these nodes.
Figure 2: The partially learned decision tree. Outlook is tested at the root; the Overcast branch is already a Yes leaf, while the Sunny and Rain branches still ask "Which attribute should be tested here?"
Ssunny = {D1, D2, D8, D9, D11}
E(Ssunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = (0.4)(1.322) + (0.6)(0.737)
= 0.529 + 0.442 = 0.971
Humidity has two values, High and Normal. Among the Outlook = Sunny examples,
three have Humidity = High and two have Humidity = Normal. For Outlook = Sunny
and Humidity = High, all three examples are negative; for Humidity = Normal, both
examples are positive. Both branches are therefore pure (entropy 0), so
Gain(Ssunny, Humidity) = 0.971 − (3/5)(0) − (2/5)(0) = 0.971
which is the largest possible gain at this node, so Humidity is chosen as the test
below Outlook = Sunny.
(Partial tree: Outlook at the root; Humidity tested below the Sunny branch; the Overcast branch remains a Yes leaf.)
Now consider the Rain branch.
Srain = {D4, D5, D6, D10, D14}, E(Srain) = 0.971
A candidate attribute that splits Srain into a [1+, 1-] subset and a [2+, 1-] subset
(as Humidity and Temperature both do here) gives
Gain = 0.971 − [(2/5)(1.0) + (3/5)(0.918)] = 0.971 − 0.951 = 0.020
whereas Wind separates Srain perfectly (Wind = Weak: [3+, 0-], Wind = Strong:
[0+, 2-]), giving Gain(Srain, Wind) = 0.971, so Wind is selected as the test at the
Rain node.
Now it can be observed that the entropy at each of the four bottom nodes is zero.
For example, the entropy at the leftmost node (Outlook = Sunny, Humidity = High) is
E({D1, D2, D8}) = 0
and splitting a pure node on any remaining attribute yields zero gain, e.g.
Gain({D1, D2, D8}, Temperature) = 0 − [(2/3)(0) + (1/3)(0)] = 0
so each of these nodes becomes a leaf.
(The final decision tree: Outlook at the root; below Sunny, Humidity = High → No and Humidity = Normal → Yes; Overcast → Yes; below Rain, Wind = Strong → No and Wind = Weak → Yes.)
Algorithm-
ID3(Examples, Target attribute, Attributes)
Examples are the training examples. Target attribute is the attribute whose
value is to be predicted by the tree. Attributes is a list of other attributes that
may be tested by the learned decision tree. Returns a decision tree that
correctly classifies the given Examples.
● Create a Root node for the tree
● If all Examples are positive, Return the single-node tree Root, with label = +
● If all Examples are negative, Return the single-node tree Root, with label = -
● If Attributes is empty, Return the single-node tree Root, with label = most
common value of Target attribute in Examples
● Otherwise Begin
● A ← the attribute from Attributes that best classifies Examples
● The decision attribute for Root ← A
● For each possible value, vi, of A:
  o Add a new tree branch below Root, corresponding to the test A = vi
  o Let Examples_vi be the subset of Examples that have value vi for A
  o If Examples_vi is empty, then below this new branch add a leaf node with label = most common value of Target attribute in Examples
  o Else, below this new branch add the subtree ID3(Examples_vi, Target attribute, Attributes − {A})
● End
● Return Root
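For concreteness, here is a minimal Python sketch of this algorithm for discrete attributes, with training examples represented as dictionaries; the function and variable names are illustrative, not a reference implementation.

```python
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                       # all examples agree: leaf node
        return labels[0]
    if not attributes:                              # nothing left to test: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain as the decision attribute.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:     # one branch per value seen in Examples
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```

With the 14 PlayTennis examples loaded as a list of dictionaries (keys Outlook, Temperature, Humidity, Wind, PlayTennis), calling id3(examples, "PlayTennis", ["Outlook", "Temperature", "Humidity", "Wind"]) would reproduce the tree derived above; unlike the pseudocode, this sketch only creates branches for attribute values that actually occur in Examples.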
Inductive bias is the set of assumptions that, together with the training data,
deductively justify the classifications assigned by the learner to future instances.
Given a collection of training examples, there are typically many decision trees
consistent with these examples. Describing the inductive bias of ID3 therefore
consists of describing the basis by which it chooses one of these consistent
hypotheses over the others. Which of these decision trees does ID3 choose? ID3's
search strategy (a) prefers shorter trees over longer ones, and (b) prefers trees that
place the attributes with the highest information gain closest to the root.
1.6 Issues In Decision Tree Learning
Practical issues in learning decision trees include determining how deeply to grow
the decision tree, handling continuous attributes, choosing an appropriate attribute
selection measure, handling training data with missing attribute values, handling
attributes with differing costs, and improving computational efficiency.
2. Overfitting and Underfitting in Machine Learning Models
A model that performs very well on the training data but whose performance drops
significantly on the test set is called an overfitting model. On the other hand, if the
model performs poorly on both the training set and the test set, we call it an
underfitting model. Overfitting and underfitting are the two main problems that
occur in machine learning and degrade the performance of machine learning models.
The main goal of every machine learning model is to generalize well: after being
trained on the dataset, it should produce reliable and accurate output on new data.
Hence, underfitting and overfitting are the two conditions that need to be checked
to judge whether the model is generalizing well or not.
Overfitting
Overfitting occurs when our machine learning model tries to fit all the data points,
or more data points than necessary, in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these
factors reduce the efficiency and accuracy of the model. An overfitted model has
low bias and high variance.
Example: The concept of overfitting can be understood from the graph of a linear
regression output shown below:
Fig: Overfitting
How to avoid Overfitting in a Model
Both overfitting and underfitting degrade the performance of a machine learning
model. There are several ways to reduce the occurrence of overfitting in our model
(a short sketch follows the list below):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
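As an illustration of two of these ideas, cross-validation and limiting model capacity, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the dataset, the depth values, and the variable names are illustrative assumptions, not part of the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree can grow until it memorizes the training data (overfitting).
# Restricting max_depth limits capacity, and 5-fold cross-validation estimates how
# well each setting generalizes beyond the data it was trained on.
for depth in [None, 3, 5]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

A large gap between training accuracy and cross-validated accuracy is a typical sign of overfitting; shrinking that gap, by constraining the model, training on more data, or removing features, is the purpose of the techniques listed above.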
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data may
be stopped at an early stage, but then the model may not learn enough from the
training data and may fail to capture the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data; as a result its accuracy is reduced and it produces unreliable predictions.
Fig: Underfitting
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
Goodness of Fit
o The "Goodness of fit" term is taken from the statistics, and the goal of the
machine learning models to achieve the goodness of fit. In statistics
modeling, it defines how closely the result or predicted values match the true
values of the dataset.
o The model with a good fit is between the underfitted and overfitted model,
and ideally, it makes predictions with 0 errors, but in practice, it is difficult
to achieve it.
o As when we train our model for a time, the errors in the training data go
down, and the same happens with test data. But if we train the model for a
long duration, then the performance of the model may decrease due to the
overfitting, as the model also learn the noise present in the dataset. The
errors in the test dataset start increasing, so the point, just before the raising
of errors, is the good point, and we can stop here for achieving a good
model.
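The last point can be made concrete with a small sketch that tracks training and test error as model capacity grows (tree depth is used as the capacity knob here); the dataset, the depth range, and the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training error keeps falling as the tree gets deeper, while test error levels
# off and eventually rises; the depth just before that rise is the "good point".
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    test_err = 1 - model.score(X_test, y_test)
    print(f"depth={depth:2d}  train error={train_err:.3f}  test error={test_err:.3f}")
```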