Classification and Prediction
The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy for
humans to assimilate.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.
Algorithm For Decision Tree Induction:
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, …, av}, based on the training data.
1 A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.
2 A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively,
where split_point is the split-point returned by the attribute selection method as part of the
splitting criterion.
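As a rough illustration of these two scenarios, the following Python sketch (the dictionary tuple representation, the attribute names, and the split_point argument are assumptions made only for this example) shows how a node's training tuples can be partitioned once a splitting attribute A has been chosen.

# A minimal sketch (not the full induction algorithm) of partitioning a
# node's tuples on a chosen splitting attribute A.
def partition_discrete(tuples, attr):
    """Scenario 1: A is discrete-valued -- one branch per known value aj of A."""
    branches = {}
    for t in tuples:
        branches.setdefault(t[attr], []).append(t)
    return branches                      # {aj: tuples with A = aj}

def partition_continuous(tuples, attr, split_point):
    """Scenario 2: A is continuous-valued -- two branches, A <= split_point and A > split_point."""
    left = [t for t in tuples if t[attr] <= split_point]
    right = [t for t in tuples if t[attr] > split_point]
    return left, right

# Toy usage with hypothetical attributes:
data = [{"age": 25, "income": "high", "buys": "no"},
        {"age": 45, "income": "medium", "buys": "yes"},
        {"age": 35, "income": "high", "buys": "yes"}]
print(partition_discrete(data, "income"))
print(partition_continuous(data, "age", split_point=30))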
Bayesian Classification:
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered "evidence," and it is described
by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
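In the notation above, Bayes’ theorem states that
P(H|X) = P(X|H) P(H) / P(X)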
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption
of class-conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class
Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
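The following Python sketch is a simplified illustration of this procedure, assuming categorical attribute values and omitting refinements such as the Laplacian correction; it estimates P(Ci) and P(xk|Ci) by counting and predicts the class that maximizes P(X|Ci)P(Ci).

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    classes = Counter(labels)                  # counts of each class Ci
    counts = defaultdict(Counter)              # counts[(Ci, attr)][value]
    for x, c in zip(tuples, labels):
        for attr, value in x.items():
            counts[(c, attr)][value] += 1
    return classes, counts

def predict(x, classes, counts, n_total):
    best_class, best_score = None, -1.0
    for c, class_count in classes.items():
        score = class_count / n_total          # P(Ci)
        for attr, value in x.items():          # P(X|Ci) = product of P(xk|Ci)
            score *= counts[(c, attr)][value] / class_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with hypothetical attributes:
X = [{"income": "high", "student": "no"},
     {"income": "low", "student": "yes"},
     {"income": "low", "student": "yes"}]
y = ["no", "yes", "yes"]
classes, counts = train_naive_bayes(X, y)
print(predict({"income": "low", "student": "yes"}, classes, counts, len(X)))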
In a multilayer feed-forward neural network, the inputs to the network correspond to the
attributes measured for each training tuple. The inputs are fed simultaneously into the units
making up the input layer. These inputs pass through the input layer and are then weighted
and fed simultaneously to a second layer known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for the given tuples.
Classification by Backpropagation:
Backpropagation is a neural network learning algorithm.
A neural network is a set of connected input/output units in which each connection has a
weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples.
Neural network learning is also referred to as connectionist learning due to the connections
between units.
Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
For each training tuple, the weights are modified so as to minimize the mean squared
error between the network’s prediction and the actual target value. These modifications
are made in the "backwards" direction, that is, from the output layer, through each
hidden layer, down to the first hidden layer, hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:
These include a high tolerance of noisy data as well as the ability to classify patterns on
which they have not been trained.
They can be used when you may have little knowledge of the relationships between
attributes and classes.
They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
Process:
Initialize the weights:
The weights in the network are initialized to small random numbers (e.g., ranging from -1.0
to 1.0), and each unit has a bias associated with it that is similarly initialized to a small
random number.
Propagate the inputs forward:
First, the training tuple is fed to the input layer of the network. The inputs pass through the
input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value,
Ij. Next, the net input and output of each unit in the hidden and output layers are computed.
The net input to a unit in the hidden or output layers is computed as a linear combination of
its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected
to it in the previous layer. Each connection has a weight. To compute the net input to the unit,
each input connected to the unit is multiplied by its corresponding weight, and this is summed.
That is, the net input, Ij, to unit j is
Ij = Σi wi,j Oi + Ɵj
where wi,j is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and
Ɵj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of
the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation
function to it.
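For example, with the commonly used logistic (sigmoid) function, the output Oj of unit j is computed as
Oj = 1 / (1 + e^(-Ij))
which maps the net input Ij to a value between 0 and 1.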
Backpropagate the error:
The error is propagated backward by updating the weights and biases to reflect the error of
the network’s prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 - Oj) (Tj - Oj)
where Oj is the actual output of unit j, and Tj is the known target value of the given training
tuple.
The error of a hidden layer unit j is
Errj = Oj (1 - Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij and
l is the learning rate:
Δwij = (l) Errj Oi
wij = wij + Δwij
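The Python sketch below uses the sigmoid activation function and the update rules above to perform one forward pass and one backward weight-update pass for a single training tuple through a network with one hidden layer; the network shape, learning rate l, and toy data are assumptions made only for illustration.

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_one_tuple(x, target, w_ih, w_ho, bias_h, bias_o, l=0.5):
    """One forward and one backward pass for a single training tuple x."""
    # Propagate the inputs forward: net input = weighted sum + bias, then sigmoid.
    hidden = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))) + bias_h[j])
              for j in range(len(bias_h))]
    output = [sigmoid(sum(w_ho[j][k] * hidden[j] for j in range(len(hidden))) + bias_o[k])
              for k in range(len(bias_o))]
    # Backpropagate the error.
    err_o = [output[k] * (1 - output[k]) * (target[k] - output[k])      # output-layer error
             for k in range(len(output))]
    err_h = [hidden[j] * (1 - hidden[j]) *
             sum(err_o[k] * w_ho[j][k] for k in range(len(output)))     # hidden-layer error
             for j in range(len(hidden))]
    # Update weights and biases: delta_w = l * Errj * Oi.
    for j in range(len(hidden)):
        for k in range(len(output)):
            w_ho[j][k] += l * err_o[k] * hidden[j]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_ih[i][j] += l * err_h[j] * x[i]
    for k in range(len(output)):
        bias_o[k] += l * err_o[k]
    for j in range(len(hidden)):
        bias_h[j] += l * err_h[j]
    return output

# Toy usage: 2 inputs, 2 hidden units, 1 output unit; weights start as small random numbers.
random.seed(0)
w_ih = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w_ho = [[random.uniform(-1, 1)] for _ in range(2)]
print(train_one_tuple([1.0, 0.0], [1.0], w_ih, w_ho, bias_h=[0.0, 0.0], bias_o=[0.0]))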
k-Nearest-Neighbor Classifier:
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-
dimensional space. In this way, all of the training tuples are stored in an n-dimensional
pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the
pattern space for the k training tuples that are closest to the unknown tuple. These k training
tuples are the k nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( Σi=1 to n (x1i - x2i)^2 )
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in
the range [0, 1] by computing
v' = (v - minA) / (maxA - minA)
where minA and maxA are the minimum and maximum values of attribute A.
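A minimal Python sketch of this procedure is given below (the toy data and the choice k = 3 are assumptions made only for the example); it min-max normalizes the numeric attributes, computes Euclidean distances, and returns the majority class among the k nearest neighbors.

import math
from collections import Counter

def min_max_normalize(rows):
    """Map each value v of an attribute A to (v - minA) / (maxA - minA), i.e. into [0, 1]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)] for row in rows]

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train_x, train_y, query, k=3):
    """Return the majority class among the k training tuples closest to the query tuple."""
    dists = sorted(zip((euclidean(x, query) for x in train_x), train_y))
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Toy usage:
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ["A", "A", "B", "B"]
Xn = min_max_normalize(X)
print(knn_classify(Xn, y, query=[0.1, 0.1], k=3))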
Genetic Algorithms:
Example:
Regression Analysis:
Regression analysis can be used to model the relationship between one or more independent
or predictor variables and a dependent or response variable, which is continuous-valued.
In the context of data mining, the predictor variables are the attributes of interest describing
the tuple (i.e., making up the attribute vector).
In general, the values of the predictor variables are known.
The response variable is what we want to predict.
Linear Regression:
Straight-line regression analysis involves a response variable, y, and a single predictor
variable x.
It is the simplest form of regression, and models y as a linear function of x.
That is,
y = b + wx
where the variance of y is assumed to be constant, and b and w are regression coefficients
specifying the Y-intercept and slope of the line, respectively.
The regression coefficients, w and b, can also be thought of as weights, so that we can
equivalently write y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates
the best-fitting straight line as the one that minimizes the error between the actual data
and the estimate of the line.
Let D be a training set consisting of values of the predictor variable, x, for some population
and their associated values for the response variable, y. The training set contains |D| data
points of the form (x1, y1), (x2, y2), …, (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:
w1 = Σi=1 to |D| (xi - x̄)(yi - ȳ) / Σi=1 to |D| (xi - x̄)^2
w0 = ȳ - w1 x̄
where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
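The following Python sketch computes w0 and w1 directly from the equations above; the (x, y) values are a made-up toy example of years of experience versus salary in $1000s.

def linear_regression(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope) for y = w0 + w1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w1 = num / den
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Toy usage:
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = linear_regression(xs, ys)
print(round(w0, 2), round(w1, 2))   # fitted intercept and slope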
Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor
variable.
It allows response variable y to be modeled as a linear function of, say, n predictor
variables or attributes, A1, A2, …, An, describing a tuple, X.
An example of a multiple linear regression model based on two predictor attributes or
variables, A1 and A2, is
y = w0 + w1x1 + w2x2
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple regression problems are commonly solved with the use of statistical software
packages, such as SAS, SPSS, and S-Plus.
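As a brief illustration of solving such a model by least squares in code (assuming NumPy is available; the data values are made up for the example), a design matrix with a leading column of ones for the intercept w0 can be passed to a standard least-squares solver.

import numpy as np

# Toy data: two predictor attributes (x1, x2) and a response y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([7.0, 6.0, 14.0, 13.0, 18.0])

# Design matrix with a leading column of 1s for the intercept w0.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # estimated w0, w1, w2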
Nonlinear Regression:
It can be modeled by adding polynomial terms to the basic linear model.
By applying transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares.
Polynomial regression is a special case of multiple regression. That is, the addition of
high-order terms like x^2, x^3, and so on, which are simple functions of the single variable, x,
can be considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x^2 + w3x^3
To convert this equation to linear form, we define new variables:
x1 = x, x2 = x^2, x3 = x^3
It can then be converted to linear form by applying the above assignments, resulting in the
equation
y = w0 + w1x1 + w2x2 + w3x3
which is easily solved by themethod of least squares using software for regression analysis.
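As a brief sketch of this transformation (assuming NumPy is available; the toy data are made up), the cubic model can be fit by building the transformed columns x, x^2, x^3 and solving the resulting multiple linear regression by least squares.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.1, 9.2, 28.3, 65.1, 126.4])   # toy data, roughly 1 + x^3

# Design matrix with columns [1, x1, x2, x3] = [1, x, x^2, x^3].
A = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # estimated coefficients w0, w1, w2, w3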
Classifier Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
theclassifier, that is, it reflects how well the classifier recognizes tuples of the various
classes.
The error rate or misclassification rate of a classifier, M, is simply 1 - Acc(M), where
Acc(M) is the accuracy of M.
The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes.
True positives refer to the positive tuples that were correctly labeled by the
classifier. True negatives are the negative tuples that were correctly labeled by
the classifier.
False positives are the negative tuples that were incorrectly labeled as positive, and false
negatives are the positive tuples that were incorrectly labeled as negative.
The sensitivity and specificity measures can be used to assess how well the classifier
recognizes positive and negative tuples, respectively.
Accuracy is a function of sensitivity and specificity.
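With pos and neg denoting the numbers of positive and negative tuples, and t_pos and t_neg the numbers of true positives and true negatives, these measures are commonly defined as
sensitivity = t_pos / pos
specificity = t_neg / neg
accuracy = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg)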