ML Unit 3 Part 1
DECISION TREE LEARNING
The Decision Tree algorithm is a supervised machine
learning algorithm in which the data is repeatedly
split at each node based on certain rules until the
final outcome is generated.
Decision tree learning is a method for approximating
discrete-valued target functions, in which the learned
function is represented by a decision tree. Learned trees
can also be re-represented as sets of if-then rules to
improve human readability.
These learning methods are among the most popular of
inductive inference algorithms. The learning procedure
is iterative and inductive: it generates a set of
classification rules of the form IF-THEN from a set of
examples, producing rules at each iteration and
appending them to the rule set.
These have been successfully applied to a broad range
of tasks from learning to diagnose medical cases to
learning to assess credit risk of loan applicants.
DECISION TREE
REPRESENTATION
Decision trees classify instances by sorting them
down the tree from the root to
some leaf node, which provides the classification of the
instance. Each node in the tree specifies a test of some
attribute of the instance, and each branch descending
from that node corresponds to one of the possible
values for this attribute.
In general, decision trees represent a disjunction of
conjunctions of constraints on the attribute values
of instances. Each path from the tree root to a leaf
corresponds to a conjunction of attribute tests, and the
tree itself to a disjunction of these conjunctions. For
example, the decision tree shown in Figure 3.1
corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
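To make this representation concrete, the sketch below (a minimal Python rendering, assuming instances are given as dictionaries of attribute-value pairs) walks the Figure 3.1 tree as nested if-then tests; each root-to-leaf path is one conjunction, and the paths ending in Yes together form the disjunction above.

def play_tennis(instance):
    # Classify a PlayTennis instance by walking the Figure 3.1 tree.
    # Each branch below corresponds to one conjunction of attribute tests.
    if instance["Outlook"] == "Sunny":
        # Outlook = Sunny AND Humidity = Normal -> Yes
        return "Yes" if instance["Humidity"] == "Normal" else "No"
    elif instance["Outlook"] == "Overcast":
        # Outlook = Overcast -> Yes
        return "Yes"
    else:  # Outlook = Rain
        # Outlook = Rain AND Wind = Weak -> Yes
        return "Yes" if instance["Wind"] == "Weak" else "No"

print(play_tennis({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # prints No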
APPROPRIATE PROBLEMS FOR
DECISION TREE LEARNING:
Decision tree learning is generally best suited to
problems with the following characteristics:
1. Instances are represented by attribute-value pairs:
Instances are described by a fixed set of attributes
(e.g., Temperature) and their values (e.g., Hot); a small
sketch of this representation is given after this list.
2. The target function has discrete output values: The decision
tree in Figure 3.1 assigns a boolean classification (e.g., yes or
no) to each example. Decision tree methods easily extend to
learning functions with more than two possible output values.
3. Disjunctive descriptions may be required: Decision trees
naturally represent disjunctive expressions.
4. The training data may contain errors: Decision tree
learning methods are robust to errors, both errors in
classifications of the training examples and errors in
the attribute values that describe these examples.
5. The training data may contain missing attribute
values: Decision tree methods can be used even when
some training examples have unknown values (e.g., if
the Humidity of the day is known for only some of
the training examples).
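As referenced in item 1, here is a minimal sketch of the attribute-value representation, assuming the PlayTennis attributes used later in this unit; a real dataset would supply its own attribute names and values.

# One training example: a fixed set of attribute-value pairs plus its
# discrete target label (PlayTennis).
example = {
    "Outlook": "Sunny",
    "Temperature": "Hot",
    "Humidity": "High",
    "Wind": "Weak",
    "PlayTennis": "No",   # target attribute with discrete output values
}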
ID3 Algorithm:
ID3 stands for Iterative Dichotomiser 3 and is
named such because the algorithm iteratively
(repeatedly) dichotomizes (divides) features into two
or more groups at each step.
Invented by Ross Quinlan, ID3 uses a top-down
greedy approach to build a decision tree. In simple
words, the top-down approach means that we start
building the tree from the top and the greedy
approach means that at each iteration we select the
best feature at the present moment to create a node.
The ID3 algorithm selects the best feature at each
step while building a Decision tree. It uses
Information Gain or just Gain to find the best
feature.
Entropy is the measure of disorder, and the entropy
of a dataset is the measure of disorder in the target
feature of the dataset.
In the case of binary classification (where the target
column has only two classes),
Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)
where p+ and p- are the proportions of positive and
negative examples in S. Entropy is 0 if all values in the
target column are homogeneous (all the same class) and 1 if
the target column has an equal number of examples of each class.
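A minimal Python sketch of this entropy computation, assuming the target column is passed in as a list of class labels:

from collections import Counter
from math import log2

def entropy(target_values):
    # Entropy of a list of target labels: sum over classes of -p * log2(p).
    total = len(target_values)
    counts = Counter(target_values)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

print(entropy(["Yes"] * 7))               # 0.0 -- homogeneous target column
print(entropy(["Yes"] * 7 + ["No"] * 7))  # 1.0 -- balanced binary target column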
Information Gain calculates the reduction in the
entropy and measures how well a given feature
separates or classifies the target classes. The feature
with the highest Information Gain is selected as the
best one.
Information Gain for a feature column A is calculated
as:
Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)
where the sum runs over each value v of A, and Sv is the
subset of S for which attribute A has value v.
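Building on the entropy sketch above, the same formula in Python; examples is assumed to be a list of attribute-value dictionaries like the one shown earlier, and the function name and default target are illustrative.

def information_gain(examples, attribute, target="PlayTennis"):
    # Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv|/|S|) * Entropy(Sv)
    total = len(examples)
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        # Sv: target labels of the examples for which attribute A has value v
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder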
An Illustrative Example:
To illustrate the operation of ID3, consider the
learning task represented by the training examples of
Table 3.2. Here the target attribute PlayTennis,
which can have values yes or no for different
Saturday mornings, is to be predicted based on other
attributes of the morning in question.
ID3 determines the information gain for each
candidate attribute (i.e., Outlook, Temperature,
Humidity, and Wind), then selects the one with
highest information gain.
To illustrate, suppose S is a collection of 14 examples
of some Boolean concept, including 9 positive and 5
negative examples (we adopt the notation [9+, 5-] to
summarize such a sample of data).
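A quick check of this example with the entropy sketch given earlier confirms that the entropy of a [9+, 5-] collection is about 0.940:

S = ["+"] * 9 + ["-"] * 5      # the [9+, 5-] collection of 14 examples
print(round(entropy(S), 3))    # 0.94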
Information gain is precisely the measure used by
ID3 to select the best attribute at each step in
growing the tree. The use of information gain to
evaluate the relevance of attributes is summarized in
Figure 3.3.
The information gain values for all four attributes are
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
According to the information gain measure, the
Outlook attribute provides the best prediction of the
target attribute, PlayTennis, over the training
examples. Therefore, Outlook is selected as the
decision attribute for the root node, and branches are
created below the root for each of its possible values
(i.e., Sunny, Overcast, and Rain).
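Picking the root attribute is then just a matter of taking the maximum over these gain values; a small illustrative sketch using the numbers listed above:

gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temperature": 0.029}
root = max(gains, key=gains.get)
print(root)   # Outlook -- the attribute chosen for the root node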
Every example for which Outlook = Overcast is also
a positive example of PlayTennis. Therefore, this
node of the tree becomes a leaf node with the
classification PlayTennis = Yes. In contrast, the
descendants corresponding to Outlook = Sunny and
Outlook = Rain still have nonzero entropy, and the
decision tree will be further elaborated below these
nodes.
The process of selecting a new attribute and
partitioning the training examples is now repeated for
each non-terminal descendant node, this time using
only the training examples associated with that node.
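This repeated select-and-partition step is the core of ID3. The sketch below puts it together as a recursion, reusing the entropy and information_gain helpers shown earlier; it assumes examples are attribute-value dictionaries with a discrete target column and omits refinements such as handling missing attribute values.

from collections import Counter

def id3(examples, attributes, target="PlayTennis"):
    # Build a decision tree represented as nested dicts: {attribute: {value: subtree}}.
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:
        return labels[0]                             # all examples agree: leaf node
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority leaf
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree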
HYPOTHESIS SPACE SEARCH IN
DECISION TREE LEARNING:
As with other inductive learning methods, ID3 can be
characterized as searching a space of hypotheses for one
that fits the training examples.
The hypothesis space searched by ID3 is the set of possible
decision trees.
ID3 performs a simple-to-complex, hill-climbing search
through this hypothesis space, beginning with the empty
tree, then considering progressively more elaborate
hypotheses in search of a decision tree that correctly
classifies the training data.
The evaluation function that guides this hill-climbing search
is the information gain measure.
Gain(S_Sunny, Humidity) = 0.970
Gain(S_Sunny, Temperature) = 0.570
Gain(S_Sunny, Wind) = 0.019
FIGURE 3.4:
The partially learned decision tree resulting from the first
step of ID3. The training examples are sorted to the
corresponding descendant nodes. The Overcast descendant
has only positive examples and therefore becomes a leaf
node with classification Yes. The other two nodes will be
further expanded, by selecting the attribute with highest
information gain relative to the new subsets of examples.
INDUCTIVE BIAS IN DECISION
TREE LEARNING:
Inductive bias is the set of assumptions that, together with
the training data, deductively justify the classifications
assigned by the learner to future instances.
Describing the inductive bias of ID3 consists of describing
the basis by which it chooses one of these consistent
hypotheses over the others.
ID3's search strategy (a) selects in favor of shorter trees
over longer ones, and (b) selects trees that place the
attributes with highest information gain closest to the root.
ID3 does not always find the shortest consistent tree, and it
is biased to favor trees that place attributes with high
information gain closest to the root.
Restriction Biases and Preference
Biases:
ID3 searches a complete hypothesis space (i.e., one
capable of expressing any finite discrete-valued
function). It searches incompletely through this
space, from simple to complex hypotheses, until its
termination condition is met (e.g., until it finds a
hypothesis consistent with the data).
The version space CANDIDATE-ELIMINATION
algorithm searches an incomplete hypothesis space
(i.e., one that can express only a subset of the
potentially teachable concepts), but it searches this
space completely, finding every hypothesis consistent
with the training data.
The inductive bias of ID3 follows from its search
strategy, whereas the inductive bias of the
CANDIDATE-ELIMINATION algorithm follows from
the definition of its search space.
The inductive bias of ID3 is thus a preference for
certain hypotheses over others (e.g., for shorter
hypotheses), with no hard restriction on the hypotheses
that can be eventually enumerated. This form of bias is
typically called a preference bias (or, alternatively, a
search bias). In contrast, the bias of the
CANDIDATE-ELIMINATION algorithm is in the form of a
categorical restriction on the set of hypotheses
considered. This form of bias is typically called a
restriction bias (or, alternatively, a language bias).
Typically, a preference bias is more desirable than a
restriction bias, because it allows the learner to work
within a complete hypothesis space that is assured to
contain the unknown target function. In contrast, a
restriction bias that strictly limits the set of potential
hypotheses is generally less desirable, because it
introduces the possibility of excluding the unknown
target function altogether.
Why Prefer Short Hypotheses?
Is ID3's inductive bias favoring shorter decision trees a
sound basis for generalizing beyond the training data?
William of Occam was one of the first to discuss this
question, around the year 1320, so this bias often goes by
the name of Occam's razor.
Occam's razor: Prefer the simplest hypothesis that fits the
data.
One argument is that because there are fewer short
hypotheses than long ones, it is less likely that one will find
a short hypothesis that coincidentally fits the training data.
In contrast, there are often many very complex hypotheses
that fit the current training data but fail to generalize
correctly to subsequent data.