
CLASSIFICATION ALGORITHMS

DECISION TREE
NAÏVE BAYESIAN
MULTILAYER PERCEPTRON
Decision trees

“Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Decision tree learning is one of the most widely used and practical methods for inductive inference.”

Decision tree: Example


Consider the following situation. Somebody is hunting for a job. At the very beginning, he decides that he will consider only those jobs for which the monthly salary is at least Rs. 50,000. Our job hunter does not like spending much time traveling to his place of work; he is comfortable only if the commuting time is less than one hour. Also, he expects the company to arrange for free coffee every morning! The decisions to be made before accepting or rejecting a job offer can be represented schematically as a decision tree.
Two types of decision trees

There are two types of decision trees.

1. Classification trees
Tree models where the target variable can take a
discrete set of values are called classification trees. In
these tree structures, leaves represent class labels
and branches represent conjunctions of features that
lead to those class labels.
2. Regression trees
Decision trees where the target variable can take
continuous values (real numbers) like the price of a
house, or a patient’s length of stay in a hospital, are
called regression trees.
Classification trees

We illustrate the concept with an example.

Consider the data given in Table 8.1, which specifies the features of certain vertebrates and the class to which they belong. For each species, four features have been identified: “gives birth”, “aquatic animal”, “aerial animal” and “has legs”. There are five class labels, namely “amphibian”, “bird”, “fish”, “mammal” and “reptile”. The problem is how to use this data to identify the class of a newly discovered vertebrate.
Construction of the tree
Step 1
We split the set of examples given into disjoint subsets according to
the values of the feature “gives birth”. Since there are only two
possible values for this feature, we have only two subsets: One
subset consisting of those examples for which the value of “gives
birth” is “yes” and one subset for which the value is “no”.
Step 2
We now consider the examples in Table 8.2, that is, those for which “gives birth” is “yes”. We split these examples based on the values of the feature “aquatic animal”. There are three possible values for this feature, but only two of these appear in Table 8.2. Accordingly, we need consider only two subsets. These are shown in Tables 8.4 and 8.5.
• Table 8.4 contains only one example and hence no further splitting is required. It
leads to the assignment of the class label “fish”.
• The examples in Table 8.5 need to be split into subsets based on the values of
“aerial animal”. It can be seen that these subsets immediately lead to unambiguous
assignment of class labels: The value of “no” leads to “mammal” and the value “yes”
leads to ”bird”.

At this stage, the partially constructed classification tree is as shown in the figure.


Step 3
Next we consider the examples in Table 8.3 and split them into disjoint subsets based on the values of “aquatic animal”. We get the examples in Table 8.6 for “yes”, the examples in Table 8.7 for “no” and the examples in Table 8.8 for “semi”. We then split the resulting subsets based on the values of “has legs”, and so on. Putting all these together, we get the diagram in Figure 8.5 as the classification tree for the data in Table 8.1.
1. On the elements of a classification tree
The various elements in a classification tree are identified as follows.
• Nodes in the classification tree are identified by the feature names of the given data.
• Branches in the tree are identified by the values of the features.
• The leaf nodes are identified by the class labels.

2. On the order in which the features are selected


In the example discussed above, initially we chose the feature “gives
birth” to split the data set into disjoint subsets and then the feature
“aquatic animal”, and so on. There was no theoretical justification for
this choice. We could as well have chosen the feature “aquatic
animal”, or any other feature, as the initial feature for splitting the
data. The classification tree depends on the order in which the
features are selected for partitioning the data.
3. Stopping criteria
A real-world dataset will contain many more example records than the one we considered earlier. In general, there will be a large number of features, each having several possible values, so the corresponding classification trees will naturally be more complex. In such cases, it may not be advisable to construct all branches and leaf nodes of the tree. The following are some of the commonly used criteria for stopping the construction of further nodes and branches (a short sketch showing how such limits are set in practice follows the list).
• All (or nearly all) of the examples at the node have the same class.
• There are no remaining features to distinguish among the examples.
• The tree has grown to a predefined size limit.
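As a concrete illustration, libraries such as scikit-learn expose these stopping criteria as parameters of the tree learner. The following is a minimal sketch; the dataset and the parameter values are illustrative assumptions, not taken from the text.

# Minimal sketch of pre-set stopping criteria, using scikit-learn.
# The dataset and parameter values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="entropy",   # use information gain for feature selection
    max_depth=3,           # stop once the tree reaches a predefined size limit
    min_samples_split=5,   # stop splitting nodes with too few examples
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())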
Feature selection measures
If a dataset has n attributes, then deciding which attribute should be placed at the root, or at the different levels of the tree as internal nodes, is a complicated problem. It is not enough to just randomly select any attribute for the root: doing so may give bad results with low accuracy. The most important problem in implementing the decision tree algorithm is therefore deciding which features are to be used at the root node and at each level. Several methods have been developed to assign numerical values to the various features such that the values reflect their relative importance. These are called feature selection measures. Two of the popular feature selection measures are information gain and Gini index. These are explained in the next section.
Entropy
• The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called a pure class.

• Informally, entropy is a measure of “impurity” in a dataset. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality.

• Entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1.

• For n classes, entropy ranges from 0 to log2 n.

• In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. A small computational sketch of this measure follows.
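The sketch below computes entropy from a list of class labels; the example class counts are hypothetical.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy, in bits, of a list of class labels.
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 10))              # 0.0: completely homogeneous (may print -0.0)
print(entropy(["yes"] * 5 + ["no"] * 5))  # 1.0: maximally diverse for two classes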
Example
Problem
Use ID3 algorithm to construct a decision tree for the
data in Table 8.9.
Solution
Note that, in the given data, there are four features but only two class labels (values of the target variable), namely “yes” and “no”.
Step 1
We first create a root node for the tree (see Figure 8.7).

Step 2
Note that not all examples are positive (class label “yes”) and not all examples are
negative (class label “no”). Also the number of features is not zero.
Step 3
We have to decide which feature is to be placed at the root node. For this, we
have to calculate the information gains corresponding to each of the four features.
The computations are shown below.
(i) Calculation of Entropy(S)
Entropy(S) = −Σ pi log2 pi, where the sum is over the class labels and pi is the proportion of examples in S having the i-th label.
(ii) Calculation of Gain(S, outlook)
For any attribute A, Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|) Entropy(Sv), where the sum is over the values v of A and Sv is the subset of S with A = v. The values of the attribute “outlook” are “sunny”, “overcast” and “rain”. We have to calculate Entropy(Sv) for v = sunny, v = overcast and v = rain.
(iii) Calculation of Gain(S, temperature)
The values of the attribute “temperature” are “hot”, “mild” and “cool”. We have to calculate Entropy(Sv) for v = hot, v = mild and v = cool.
(iv) Calculation of Gain(S, humidity) and Gain(S, wind)
These information gains can be calculated in a similar way.
Step 4
We find the highest information gain, which is the maximum among Gain(S, outlook), Gain(S, temperature), Gain(S, humidity) and Gain(S, wind). Therefore, we have:
highest information gain = max{0.2469, 0.0293, 0.151, 0.048}
= 0.2469
This corresponds to the feature “outlook”. Therefore, we place “outlook” at the root node. We now split the root node in Figure 8.7 into three branches according to the values of the feature “outlook”, as in Figure 8.8. The computation below reproduces these gains.
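This is a minimal sketch that assumes Table 8.9 is the standard fourteen-example “play tennis” dataset; that assumption is consistent, up to rounding, with the gain values quoted above.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, feature, target="play"):
    # Information gain of splitting `rows` on `feature`.
    total = entropy([r[target] for r in rows])
    n = len(rows)
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Assumed contents of Table 8.9 (the classic "play tennis" data).
columns = ("outlook", "temperature", "humidity", "wind", "play")
rows = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
data = [dict(zip(columns, r)) for r in rows]

for f in columns[:-1]:
    print(f, round(gain(data, f), 4))
# outlook 0.2467, temperature 0.0292, humidity 0.1518, wind 0.0481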
Step 5
Let S(1) denote the subset of examples for which “outlook” = “sunny”. The maximum of Gain(S(1), temperature), Gain(S(1), humidity) and Gain(S(1), wind) is Gain(S(1), humidity). Hence we place “humidity” at Node 1 and split this node into two branches according to the values of the feature “humidity” to get the tree in Figure 8.9.
Step 6
It can be seen that all the examples in the data set corresponding to Node 4 in Figure 8.9 have the same class label “no” and all the examples corresponding to Node 5 have the same class label “yes”. So we represent Node 4 as a leaf node with value “no” and Node 5 as a leaf node with value “yes”. Similarly, all the examples corresponding to Node 2 have the same class label “yes”, so we convert Node 2 into a leaf node with value “yes”. Finally, let S(2) denote the subset of examples for which “outlook” = “rain”. The highest information gain for this data set is Gain(S(2), humidity). The branches resulting from splitting this node according to the values “high” and “normal” of “humidity” lead to leaf nodes with class labels “no” and “yes”. With these changes, we get the tree in Figure 8.10.
Regression trees
A regression problem is the problem of determining a relation between one or more independent variables and an output variable, which is a real continuous variable, and then using the relation to predict the values of the dependent variable. Regression problems are in general related to the prediction of numerical values of variables. Trees can also be used to make such predictions. A tree used for making predictions of numerical variables is called a prediction tree or a regression tree.
Issues in decision tree learning
Avoiding overfitting of data
When we construct a decision tree, the various branches are grown (that is, sub-branches are constructed) just deeply enough to perfectly classify the training examples. This leads to difficulties when there is noise in the data or when the number of training examples is too small. In these cases the algorithm can produce trees that overfit the training examples.
Definition
We say that a hypothesis overfits the training examples if some other
hypothesis that fits the training examples less well actually performs better
over the entire distribution of instances, including instances beyond the
training set.
Impact of overfitting
Figure 8.14 illustrates the impact of overfitting in typical decision tree learning. From the figure, we can see that the accuracy of the tree over the training examples increases monotonically, whereas the accuracy measured over independent test samples first increases and then decreases.
Approaches to avoiding overfitting
• The main approach to avoid overfitting is pruning.
• Pruning is a technique that reduces the size of decision trees by
removing sections of the tree that provide little power to classify
instances.
• Pruning reduces the complexity of the final classifier, and hence
improves predictive accuracy by the reduction of overfitting.
• We may apply pruning earlier, that is, before the tree reaches the point where it perfectly classifies the training data.
• We may allow the tree to overfit the data, and then post-prune the tree.
• Now there is the problem of what criterion is to be used to determine the correct final tree size. One commonly used criterion is to use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree. A minimal pruning sketch follows.
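As a concrete sketch of post-pruning, scikit-learn implements cost-complexity pruning through the ccp_alpha parameter. The dataset, the held-out split and the alpha value below are illustrative assumptions, not the text's own procedure.

# Minimal post-pruning sketch using cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Hold out a set of examples, distinct from the training examples,
# to evaluate the utility of pruning.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full:  ", full.get_n_leaves(), "leaves, test accuracy", full.score(X_test, y_test))
print("pruned:", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_test, y_test))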
Problem of missing attributes
Table 8.15 shows a dataset with missing attribute values; the missing values are indicated by “?”s. The following are some of the methods used to handle the problem of missing attributes (a short sketch of the common strategies follows the list).
• Deleting cases with missing attribute values
• Replacing a missing attribute value by the most
common value of that attribute
• Assigning all possible values to the missing attribute
value
• Replacing a missing attribute value by the mean
for numerical attributes
• Assigning to a missing attribute value the corresponding value taken from the closest cases, or replacing a missing attribute value by a new value
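The following sketch illustrates the deletion, most-common-value and mean-replacement strategies using pandas; the small dataset is hypothetical.

import pandas as pd

df = pd.DataFrame({
    "outlook": ["sunny", None, "rain", "sunny"],  # categorical, one value missing
    "temperature": [30.0, 25.0, None, 28.0],      # numerical, one value missing
    "label": ["no", "yes", "yes", "no"],
})

dropped = df.dropna()                                               # delete cases with missing values
mode_filled = df.fillna({"outlook": df["outlook"].mode()[0]})       # most common value
mean_filled = df.fillna({"temperature": df["temperature"].mean()})  # mean, for numerical attributes
print(dropped, mode_filled, mean_filled, sep="\n\n")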
Bayesian classifier and ML estimation
• The Bayesian classifier is an algorithm for classifying multiclass datasets.

• It is based on Bayes’ theorem in probability theory.

• Bayes, after whom the theorem is named, was an English statistician known for having formulated a specific case of the theorem that bears his name.

• The classifier is also known as the “naive Bayes algorithm”, where the word “naive” is an English word meaning simple, unsophisticated, or primitive.

• We first explain Bayes’ theorem and then describe the algorithm. Of course, we require the notion of conditional probability.
Bayes’ theorem
Theorem
Let A and B be any two events in a random experiment. If P(A) ≠ 0, then
P(B∣A) = P(A∣B) P(B) / P(A).

1. The importance of the result is that it helps us to “invert” conditional probabilities, that is, to express the conditional probability P(A∣B) in terms of the conditional probability P(B∣A).

2. The following terminology is used in this context:

• A is called the proposition and B is called the evidence.

• P(A) is called the prior probability of the proposition and P(B) is called the prior probability of the evidence.

• P(A∣B) is called the posterior probability of A given B.

• P(B∣A) is called the likelihood of B given A.


Example
Problem 1
Consider a set of patients coming for treatment in a certain clinic. Let A
denote the event that a “Patient has liver disease” and B the event that
a “Patient is an alcoholic.” It is known from experience that 10% of the
patients entering the clinic have liver disease and 5% of the patients are
alcoholics. Also, among those patients diagnosed with liver disease, 7%
are alcoholics. Given that a patient is alcoholic, what is the probability
that he will have liver disease?
Solution
Using the notation of probability, we have
P(A) = 10% = 0.10
P(B) = 5% = 0.05
P(B∣A) = 7% = 0.07
P(A∣B) = P(B∣A) P(A) / P(B)
= (0.07 × 0.10) / 0.05
= 0.14
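The same computation can be checked in a few lines of code:

p_A = 0.10          # P(patient has liver disease)
p_B = 0.05          # P(patient is an alcoholic)
p_B_given_A = 0.07  # P(alcoholic | liver disease)

p_A_given_B = p_B_given_A * p_A / p_B  # Bayes' theorem
print(round(p_A_given_B, 2))           # 0.14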
Naive Bayes algorithm
Assumption
The naive Bayes algorithm is based on the following assumptions:
• All the features are independent and are unrelated to each other. Presence
or absence of a feature does not influence the presence or absence of any
other feature.
• The data has class-conditional independence, which means that events are
independent so long as they are conditioned on the same class value.
These assumptions rarely hold exactly in real-world problems. It is because of these simplifying assumptions that the algorithm is called “naive”.
Basic idea
• Suppose we have a training data set consisting of N examples having n features. Let the features be named F1, . . . , Fn.
• A feature vector is of the form (f1, f2, . . . , fn).
• Associated with each example there is a certain class label. Let the set of class labels be {c1, c2, . . . , cp}.
• Suppose we are given a test instance having the feature vector X = (x1, x2, . . . , xn).
• We are required to determine the most appropriate class label that should be assigned to the test instance.
• For this purpose we compute the conditional probabilities P(c1∣X), P(c2∣X), . . . , P(cp∣X) (6.5) and choose the maximum among them.
• Let the maximum probability be P(ci∣X). Then we choose ci as the most appropriate class label for the test instance having X as the feature vector.
• The direct computation of the probabilities given in Eq. (6.5) is difficult for a number of reasons.
• Bayes’ theorem can be applied to obtain a simpler method. This is explained below.
Computation of probabilities
Using Bayes’ theorem, we have:
P(ck∣X) = P(X∣ck) P(ck) / P(X)
By the class-conditional independence assumption, P(X∣ck) = P(x1∣ck) P(x2∣ck) ⋯ P(xn∣ck). Since the denominator P(X) is the same for every class, it is enough to choose the class ck that maximises the numerator P(X∣ck) P(ck).
Example
Problem
Consider a training data set consisting of the fauna of the world. Each unit has three features named “Swim”, “Fly” and “Crawl”.
Let the possible values of these features be as follows:
Swim - Fast, Slow, No
Fly - Long, Short, Rarely, No
Crawl - Yes, No
For simplicity, each unit is classified as “Animal”, “Bird” or “Fish”. Let the training data set be as in Table 6.1. Use the naive Bayes algorithm to classify a particular species if its features are (Slow, Rarely, No).
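A minimal computational sketch of the procedure follows. Since Table 6.1 is not reproduced here, the training rows below are hypothetical placeholders; the point is the procedure itself: estimate the class prior P(c) and the per-feature conditional probabilities P(xi∣c) from counts, multiply them, and take the class with the largest product.

from collections import Counter

# Hypothetical stand-in for Table 6.1: (Swim, Fly, Crawl, Class).
train = [
    ("Fast", "No", "No", "Fish"),
    ("Fast", "No", "Yes", "Animal"),
    ("Slow", "No", "Yes", "Animal"),
    ("No", "Long", "No", "Bird"),
    ("No", "Short", "No", "Bird"),
    ("Slow", "Rarely", "No", "Bird"),
    ("Fast", "No", "No", "Fish"),
    ("Slow", "No", "No", "Animal"),
]

def naive_bayes(train, x):
    n = len(train)
    scores = {}
    for c, count in Counter(row[-1] for row in train).items():
        rows = [row for row in train if row[-1] == c]
        score = count / n              # prior P(c)
        for i, value in enumerate(x):  # class-conditional P(x_i | c)
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = score              # proportional to P(c | x)
    return max(scores, key=scores.get), scores

label, scores = naive_bayes(train, ("Slow", "Rarely", "No"))
print(label, scores)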
Neural networks
• An Artificial Neural Network (ANN) models the relationship between a set of input signals and an output signal using a model derived from our understanding of how a biological brain responds to stimuli from sensory inputs.

• Just as a brain uses a network of interconnected cells called neurons to create a massive parallel processor, an ANN uses a network of artificial neurons or nodes to solve learning problems.

Artificial neurons
Definition
• An artificial neuron is a mathematical function conceived as a model of
biological neurons.
• Artificial neurons are elementary units in an artificial neural network.
• The artificial neuron receives one or more inputs (representing excitatory
postsynaptic potentials and inhibitory postsynaptic potentials at neural
dendrites) and sums them to produce an output.
• Each input is separately weighted, and the sum is passed through a function
known as an activation function or transfer function.
• The small circles in the schematic representation of the artificial neuron shown in Figure 9.3 are called the nodes of the neuron.
• The circles on the left side, which receive the values of x0, x1, . . . , xn, are called the input nodes, and the circle on the right side, which outputs the value of y, is called the output node.
• The squares represent the processes that take place before the result is output.
Activation function
Definition
In an artificial neural network, the function which takes the incoming signals as input
and produces the output signal is known as the activation function.

Some simple activation functions

The following are some of the simple activation functions.
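The figure illustrating them is not reproduced here; as a stand-in, this minimal sketch implements four functions commonly used in this role (threshold, logistic/sigmoid, tanh and ReLU), an illustrative selection rather than the text's own list.

import math

def threshold(x):
    # Unit step: output 1 once the input crosses 0, else -1.
    return 1 if x >= 0 else -1

def logistic(x):
    # Sigmoid: squashes any real input into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes the input into (-1, 1).
    return math.tanh(x)

def relu(x):
    # Rectified linear unit: passes positive inputs, clips negatives to 0.
    return max(0.0, x)

for f in (threshold, logistic, tanh, relu):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.0, 2.0)])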
Perceptron
The perceptron is a special type of artificial neuron in which the activation function
has a special form.
Definition
A perceptron is an artificial neuron in which the activation function is the threshold
function.
Consider an artificial neuron having x1, x2, ⋯, xn as the input signals and w1, w2, ⋯,
wn as the associated weights. Let w0 be some constant. The neuron is called a
perceptron if the output of the neuron is given by the following function:
Remarks
The quantity −w0 can be looked upon as a “threshold” that should be crossed by the
weighted sum w1x1 + ⋯ + wnxn in order for the neuron to output a “1”.

Representations of boolean functions by perceptrons

Boolean functions such as x1 AND x2 can be represented by perceptrons. To be consistent with the conventions in the definition of a perceptron, we assume that the values −1 and 1 represent the boolean constants “false” and “true” respectively.

Representation of x1 AND x2
Let x1 and x2 be two boolean variables. Then the boolean function x1 AND x2 can be represented by a perceptron with suitably chosen weights. It can easily be verified that the perceptron shown in Figure 9.13 represents this function.
Representations of OR, NAND and NOR
The functions x1 OR x2, x1 NAND x2 and x1 NOR x2 can also be represented by perceptrons. Table 9.2 shows the values to be assigned to the weights w0, w1, w2 for obtaining these boolean functions; since the table is not reproduced here, the sketch below uses illustrative weight values.
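The following sketch realizes AND, OR and NAND under the −1/1 convention above. The specific weight values are illustrative choices that satisfy the definitions.

def perceptron(w0, w1, w2):
    # Two-input perceptron: output 1 if w0 + w1*x1 + w2*x2 > 0, else -1.
    return lambda x1, x2: 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

AND = perceptron(-0.8, 0.5, 0.5)    # fires only when both inputs are 1
OR = perceptron(0.8, 0.5, 0.5)      # fires when at least one input is 1
NAND = perceptron(0.8, -0.5, -0.5)  # negation of AND

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2), "NAND:", NAND(x1, x2))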

Learning a perceptron
By “learning a perceptron” we mean the process of assigning values to the weights
and the threshold such that the perceptron produces correct output for each of the
given training examples.
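One classical procedure for this is the perceptron training rule, wi ← wi + η(t − o)xi, where t is the target output and o the perceptron's current output. The following is a minimal sketch; the learning rate, the number of epochs and the AND training set are illustrative choices.

def train_perceptron(samples, eta=0.1, epochs=50):
    # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    w0, w1, w2 = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for x1, x2, t in samples:
            o = 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1
            w0 += eta * (t - o)  # the bias input is fixed at 1
            w1 += eta * (t - o) * x1
            w2 += eta * (t - o) * x2
    return w0, w1, w2

# The AND function under the -1/1 convention used above.
and_samples = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]
print(train_perceptron(and_samples))  # weights realizing AND, e.g. about (-0.2, 0.2, 0.2)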
Artificial neural networks
• An artificial neural network (ANN) is a computing system inspired by the
biological neural networks that constitute human brains.
• An ANN is based on a collection of connected units called artificial
neurons. Each connection between artificial neurons can transmit a
signal from one to another.
• The artificial neuron that receives the signal can process it and then send
it to the artificial neurons connected to it.
• Each connection between artificial neurons has a weight attached to it that gets adjusted as learning proceeds.
• Artificial neurons may have a threshold such that the signal is sent onward only if the aggregate signal crosses that threshold.
• Artificial neurons are organized in layers. Different layers may perform
different kinds of transformations on their inputs. Signals travel from the
input layer to the output layer, possibly after traversing the layers
multiple times.
Characteristics of an ANN
An ANN can be defined and implemented in several different
ways. The way the following characteristics are defined
determines a particular variant of an ANN.
• The activation function
This function defines how a neuron’s combined input signals are transformed into a single output signal to be broadcast further in the network.
• The network topology (or architecture)
This describes the number of neurons in the model as well as
the number of layers and manner in which they are
connected.
• The training algorithm
This algorithm specifies how connection weights are set in
order to inhibit or excite neurons in proportion to the input
signal.
Activation functions
The activation function is the mechanism by which the artificial
neuron processes incoming information and passes it throughout the
network. Just as the artificial neuron is modeled after the biological
version, so is the activation function modeled after nature’s design.
Let x1, x2, . . . , xn be the input signals, w1, w2, . . . , wn be the associated weights and −w0 the threshold. Let
x = w0 + w1x1 + ⋯ + wnxn.
Network topology
By “network topology” we mean the patterns and structures in the collection
of interconnected nodes.
The topology determines the complexity of tasks that can be learned by the
network. Generally, larger and more complex networks are capable of
identifying more subtle patterns and complex decision boundaries. However,
the power of a network is not only a function of the network size,
but also the way units are arranged.
Different forms of network architecture can be differentiated by the following characteristics:
• The number of layers
• Whether information in the network is allowed to travel backward
• The number of nodes within each layer of the network
The number of layers
• In an ANN, the input nodes are those nodes which
receive unprocessed signals directly from the input
data.
• The output nodes (there may be more than one) are
those nodes which generate the final predicted
values.
• A hidden node is a node that processes the signals
from the input nodes (or other such nodes) prior to
reaching the output nodes.
• The nodes are arranged in layers. The set of nodes
which receive the unprocessed signals from the input
data constitute the first layer of nodes.
• The set of hidden nodes which receive the outputs
from the nodes in the first layer of nodes constitute
the second layer of nodes.
• In a similar way we can define the third, fourth, etc. layers. Figure 9.14 shows an ANN with only one layer of nodes. Figure 9.15 shows an ANN with two layers.
The direction of information travel
• Networks in which the input signal is fed
continuously in one direction from connection to
connection until it reaches the output layer are
called feedforward networks.
• Networks which allow signals to travel in both directions using loops are called recurrent networks (or feedback networks).
• In spite of their potential, recurrent networks are still largely theoretical and are rarely used in practice.
• On the other hand, feedforward networks have been extensively applied to real-world problems. In fact, the multilayer feedforward network, sometimes called the Multilayer Perceptron (MLP), is the de facto standard ANN topology.
The number of nodes in each layer
• The number of input nodes is predetermined by the number of features in the input data.

• Similarly, the number of output nodes is predetermined by the number of outcomes to be modeled or the number of class levels in the outcome.

• However, the number of hidden nodes is left to the user to decide prior to training the model. Unfortunately, there is no reliable rule to determine the number of neurons in the hidden layer.

• The appropriate number depends on the number of input nodes, the amount of training data, the amount of noisy data, and the complexity of the learning task, among many other factors.
The training algorithm

The algorithm which is now commonly used to train an ANN is known simply as backpropagation.

The cost function

Definition
In a machine learning algorithm, the cost function is a function that measures how well the algorithm approximates the target function that it is trying to learn or, more generally, a function that determines how well the algorithm performs in an optimization problem.
The cost function is also called the loss function, the objective function, the scoring function, or the error function.
Example
Let y be the output variable. Let y1, . . . , yn be the actual values of y in n examples and ŷ1, . . . , ŷn be the values predicted by an algorithm. A commonly used cost function is the mean squared error:
E = (1/n) [(y1 − ŷ1)² + ⋯ + (yn − ŷn)²].
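A minimal sketch of this cost function in code; the actual and predicted values below are illustrative.

def mse(actual, predicted):
    # Mean squared error: the average of (y_i - yhat_i)^2.
    return sum((y - yh) ** 2 for y, yh in zip(actual, predicted)) / len(actual)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))  # (0.25 + 0 + 0.25) / 3, approximately 0.1667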
Outline of the algorithm
1. Initially the weights are assigned at random.
2. Then the algorithm iterates through many cycles of two processes until
a stopping criterion is reached. Each cycle is known as an epoch. Each
epoch includes:
(a) A forward phase in which the neurons are activated in sequence
from the input layer to the output layer, applying each neuron’s weights
and activation function along the way. Upon reaching the final layer, an
output signal is produced.
(b) A backward phase in which the network’s output signal resulting from
the forward phase is compared to the true target value in the training
data. The difference between the network’s output signal and the true
value results in an error that is propagated backwards in the network to
modify the connection weights between neurons and reduce future
errors.
3. The technique used to determine how much a weight should be changed is known as the gradient descent method. At every stage of the computation, the error is a function of the weights. If we plot the error against the weights, we get a higher-dimensional analogue of something like a curve or surface. At any point on this surface, the gradient indicates how steeply the error will be reduced or increased for a change in the weight. The algorithm will attempt to make the changes in the weights that result in the greatest reduction in error (see Figure 9.17 and the sketch below).
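A minimal sketch of gradient descent on a one-dimensional error surface; the error function and learning rate are illustrative stand-ins for the high-dimensional surface described above.

def gradient_descent(grad, w=5.0, eta=0.1, steps=50):
    # Repeatedly move the weight against the gradient of the error.
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# Illustrative error surface E(w) = (w - 2)^2, whose gradient is 2(w - 2);
# the minimum is at w = 2.
print(gradient_descent(lambda w: 2 * (w - 2)))  # approximately 2.0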
Illustrative example
To illustrate the various steps in the backpropagation algorithm, we consider a small network with two inputs, two outputs and one hidden layer, as shown in Figure 9.18.
We assume that there are two observations:

We are required to estimate the optimal values of the weights w1, . . . , w8, b1, b2. Here b1 and b2 are the biases. For simplicity, we have assigned the same bias to both nodes in the same layer.
Step 1. We initialise the connection weights to small random values. These initial weights are shown in Figure 9.19.
Step 2. Present the first sample inputs and the corresponding output targets to
the network.
Step 3. Pass the input values to the first layer (the layer with nodes h1 and h2).
Step 4. We calculate the outputs from h1 and h2. We use the logistic activation function
f(x) = 1 / (1 + e^(−x)).
Step 5. We repeat this process for every layer. We get the outputs from
the nodes in the output layer as follows:
Step 6. We begin the backward phase and adjust the weights. We first adjust the weights leading to the nodes o1 and o2 in the output layer, and then the weights leading to the nodes h1 and h2 in the hidden layer; this yields the adjusted values of the weights w1, . . . , w8 and of the biases. The computations use a certain constant η called the learning rate. In the following we have taken η = 0.5.
(a) Computation of the adjusted weights leading to o1 and o2.
Step 7. We choose the next sample input and the corresponding output targets, present them to the network, and repeat Steps 2 to 6.
Step 8. The process in Step 7 is repeated until the root mean square of the output errors is minimised.
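To make Steps 2 to 6 concrete, the following is a minimal sketch of one forward and one backward pass for a 2-2-2 network like the one in Figure 9.18, with η = 0.5 as in the text and the squared error E = ½ Σ (t − o)² as the cost. The initial weights, biases, inputs and targets are invented placeholders, since the values in Figure 9.19 are not reproduced here.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical initial values; Figure 9.19's numbers are not reproduced here.
w = [0.15, 0.20, 0.25, 0.30,   # w1..w4: inputs x1, x2 -> hidden nodes h1, h2
     0.40, 0.45, 0.50, 0.55]   # w5..w8: hidden h1, h2 -> output nodes o1, o2
b1, b2 = 0.35, 0.60            # one shared bias per layer, as in the text
x1, x2 = 0.05, 0.10            # one sample input (assumed)
t1, t2 = 0.01, 0.99            # its output targets (assumed)
eta = 0.5                      # learning rate, as in the text

# Forward phase (Steps 3 to 5).
h1 = sigmoid(b1 + w[0] * x1 + w[1] * x2)
h2 = sigmoid(b1 + w[2] * x1 + w[3] * x2)
o1 = sigmoid(b2 + w[4] * h1 + w[5] * h2)
o2 = sigmoid(b2 + w[6] * h1 + w[7] * h2)

# Backward phase (Step 6): error terms for the squared-error cost,
# using f'(x) = f(x)(1 - f(x)) for the logistic function.
d_o1 = (o1 - t1) * o1 * (1 - o1)
d_o2 = (o2 - t2) * o2 * (1 - o2)
d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * h1 * (1 - h1)
d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * h2 * (1 - h2)

# Adjusted weights and biases: each parameter moves by -eta times
# the partial derivative of the error with respect to it.
w_new = [w[0] - eta * d_h1 * x1, w[1] - eta * d_h1 * x2,
         w[2] - eta * d_h2 * x1, w[3] - eta * d_h2 * x2,
         w[4] - eta * d_o1 * h1, w[5] - eta * d_o1 * h2,
         w[6] - eta * d_o2 * h1, w[7] - eta * d_o2 * h2]
b1_new = b1 - eta * (d_h1 + d_h2)  # shared bias of the hidden layer
b2_new = b2 - eta * (d_o1 + d_o2)  # shared bias of the output layer
print([round(v, 4) for v in w_new], round(b1_new, 4), round(b2_new, 4))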
