Classification and Prediction


Classification and Prediction:


Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, while prediction models
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers
on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

Issues Regarding Classification and Prediction:


1. Preparing the Data for Classification and Prediction:
The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.

(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most probable
value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
(ii) Relevance analysis:
Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes are
statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained
using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task.
Such analysis can help improve classification efficiency and scalability.
(iii) Data transformation and reduction:
The data may be transformed by normalization, particularly when neural networks or
methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1 to +1 or 0 to 1.
The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such
as binning, histogram analysis, and clustering.
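
As a simple illustration of these transformations, the following Python sketch applies
min-max normalization to a numeric attribute and then generalizes the normalized values
to the discrete ranges low, medium, and high. The income values and cut points are made
up for the example.

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Scale all values of the attribute into [new_min, new_max].
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
                for v in values]

    def generalize_income(v):
        # Map a normalized income to a higher-level concept (made-up cut points).
        if v < 0.33:
            return "low"
        elif v < 0.66:
            return "medium"
        return "high"

    incomes = [28000, 42000, 55000, 91000, 120000]   # illustrative data
    labels = [generalize_income(v) for v in min_max_normalize(incomes)]
    print(list(zip(incomes, labels)))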

Comparing Classification and Prediction Methods:


 Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).
The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
 Speed:
This refers to the computational costs involved in generating and using the
given classifier or predictor.
 Robustness:
This is the ability of the classifier or predictor to make correct
predictions given noisy data or data with missing values.
 Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
 Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or
predictor.
Interpretability is subjective and therefore more difficult to assess.
Classification by Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where
 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.
Algorithm For Decision Tree Induction:

The algorithm is called with three parameters:


 Data partition
 Attribute list
 Attribute selection method

The parameter attribute list is a list of attributes describing the tuples.


Attribute selection method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class.
The tree starts as a single node, N, representing the training tuples in D.
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with
that class.
All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the "best"
way to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, …, av}, based on the training data.

1. A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.

2. A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split point and A > split point, respectively, where split point is the split-point
returned by Attribute selection method as part of the splitting criterion.

3. A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?".


SA is the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A.
(Figure: the three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued;
(c) A is discrete-valued and a binary tree must be produced.)
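
To make the algorithm concrete, here is a minimal Python sketch of recursive decision tree
induction for discrete-valued attributes, assuming information gain as the attribute selection
method. The data layout (each tuple a pair of an attribute dictionary and a class label) and
all names are illustrative.

    import math
    from collections import Counter

    def entropy(tuples):
        # Expected information needed to classify a tuple in this partition.
        counts = Counter(label for _, label in tuples)
        n = len(tuples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def info_gain(tuples, attr):
        # Reduction in entropy obtained by splitting on a discrete attribute.
        parts = {}
        for features, label in tuples:
            parts.setdefault(features[attr], []).append((features, label))
        expected = sum(len(p) / len(tuples) * entropy(p) for p in parts.values())
        return entropy(tuples) - expected

    def build_tree(tuples, attr_list):
        labels = {label for _, label in tuples}
        if len(labels) == 1:                 # all tuples in one class: leaf node
            return labels.pop()
        if not attr_list:                    # no attributes left: majority class
            return Counter(l for _, l in tuples).most_common(1)[0][0]
        best = max(attr_list, key=lambda a: info_gain(tuples, a))
        parts = {}
        for features, label in tuples:       # one branch per known value of best
            parts.setdefault(features[best], []).append((features, label))
        rest = [a for a in attr_list if a != best]
        return {(best, value): build_tree(p, rest) for value, p in parts.items()}

Calling build_tree(training_tuples, ["age", "income", "student"]) on such data would return
a nested dictionary whose keys are (attribute, value) branches and whose leaves are class
labels; the attribute names here are hypothetical.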

Bayesian Classification:

Bayesian classifiers are statistical classifiers.


They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.

Bayesian classification is based on Bayes’ theorem.

Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered "evidence," and it is described
by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):

P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption
of class-conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the
training tuples. For each attribute, we look at whether the attribute is categorical or
continuous-valued. For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value
xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty
straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with a
mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

so that P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and standard deviation of
the values of attribute Ak for training tuples of class Ci.

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class
Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj)   for 1 ≤ j ≤ m, j ≠ i.
A Multilayer Feed-Forward Neural Network:


The backpropagation algorithm performs learning on a multilayer feed-forward neural
network.
It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden
layers, and an output layer.
Example:

The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for the given tuples.

Classification by Backpropagation:
Backpropagation is a neural network learning algorithm.
A neural network is a set of connected input/output units in which each connection has a
weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples.
Neural network learning is also referred to as connectionist learning due to the connections
between units.
Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
For each training tuple, the weights are modified so as to minimize the mean squared
error between the network’s prediction and the actual target value. These modifications
are made in the "backwards" direction, that is, from the output layer, through each
hidden layer down to the first hidden layer; hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:
These include a high tolerance of noisy data, as well as the ability to classify patterns on
which the network has not been trained.
They can be used when you may have little knowledge of the relationships between
attributes and classes.
They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
Process:
Initialize the weights:

The weights in the network are initialized to small random numbers
ranging from -1.0 to 1.0, or -0.5 to 0.5. Each unit has a bias associated with it. The biases are
similarly initialized to small random numbers.
Each training tuple, X, is processed by the following steps.

Propagate the inputs forward:

First, the training tuple is fed to the input layer of the network. The inputs pass through the
input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value,
Ij. Next, the net input and output of each unit in the hidden and output layers are computed.
The net input to a unit in the hidden or output layers is computed as a linear combination of
its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected
to it in the previous layer. Each connection has a weight. To compute the net input to the unit,
each input connected to the unit is multiplied by its corresponding weight, and this is summed:

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and
θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of
the unit.

Each unit in the hidden and output layers takes its net input and then applies an activation
function to it, typically the logistic (sigmoid) function: Oj = 1 / (1 + e^(−Ij)).
Backpropagate the error:

The error is propagated backward by updating the weights and biases to reflect the error of
the network’s prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 − Oj)(Tj − Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training
tuple.
The error of a hidden-layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij:

Δwij = (l) Errj Oi
wij = wij + Δwij

Here, l is the learning rate, a constant typically having a value between 0.0 and 1.0.
Biases are updated by the following equations, where Δθj is the change in bias θj:

Δθj = (l) Errj
θj = θj + Δθj

Algorithm:
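
As a sketch of the algorithm, the following Python code runs one backpropagation training
epoch under the equations above, assuming a single hidden layer, sigmoid activations, and a
fixed learning rate l; all names and the data layout are illustrative.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def train_epoch(data, w_hidden, b_hidden, w_out, b_out, l=0.5):
        # data: list of (input_vector, target_vector) training pairs.
        for x, t in data:
            # Propagate the inputs forward.
            o_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                   for ws, b in zip(w_hidden, b_hidden)]
            o_o = [sigmoid(sum(w * oh for w, oh in zip(ws, o_h)) + b)
                   for ws, b in zip(w_out, b_out)]
            # Backpropagate the error: output layer first, then hidden layer.
            err_o = [o * (1 - o) * (tk - o) for o, tk in zip(o_o, t)]
            err_h = [oh * (1 - oh) *
                     sum(e * w_out[k][j] for k, e in enumerate(err_o))
                     for j, oh in enumerate(o_h)]
            # Update weights and biases in the backwards direction.
            for k, e in enumerate(err_o):
                w_out[k] = [w + l * e * oh for w, oh in zip(w_out[k], o_h)]
                b_out[k] += l * e
            for j, e in enumerate(err_h):
                w_hidden[j] = [w + l * e * xi for w, xi in zip(w_hidden[j], x)]
                b_hidden[j] += l * e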

k-Nearest-Neighbor Classifier:
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-
dimensional space. In this way, all of the training tuples are stored in an n-dimensional
pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the
pattern space for the k training tuples that are closest to the unknown tuple. These k training
tuples are the k nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is

dist(X1, X2) = √( Σi=1..n (x1i − x2i)² )
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v′ in
the range [0, 1] by computing

v′ = (v − minA) / (maxA − minA)

where minA and maxA are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common
class among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to
it in pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued
prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated
with the k nearest neighbors of the unknown tuple.
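
A minimal Python sketch of k-nearest-neighbor classification and numeric prediction, using
Euclidean distance; the data layout and names are illustrative, and attributes are assumed
to be numeric and already normalized.

    import math
    from collections import Counter

    def euclidean(x1, x2):
        # Distance between two numeric attribute vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    def nearest(unknown, training, k):
        # training: list of (attribute_vector, label) pairs.
        return sorted(training, key=lambda t: euclidean(unknown, t[0]))[:k]

    def knn_classify(unknown, training, k=3):
        # Assign the most common class among the k nearest neighbors.
        votes = Counter(label for _, label in nearest(unknown, training, k))
        return votes.most_common(1)[0][0]

    def knn_predict(unknown, training, k=3):
        # For prediction, return the average of the real-valued labels.
        return sum(label for _, label in nearest(unknown, training, k)) / k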

Other Classification Methods:

Genetic Algorithms:

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic
learning starts as follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are two
classes, C1 and C2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100,"
where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost
bit represents the class.
Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001."
If an attribute has k values, where k > 2, then k bits may be used to encode the attribute’s
values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of training
samples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules.
In mutation, randomly selected bits in a rule’s string are inverted.
The process of generating new populations based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a prespecified fitness
threshold.
Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate
the fitness of other algorithms.
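
A minimal Python sketch of the genetic operators on bit-string rules like "100" above; the
single-point crossover, the mutation rate, and the survivor fraction are illustrative choices,
not prescribed by the text.

    import random

    def crossover(rule1, rule2):
        # Swap substrings of two bit-string rules at a random point.
        p = random.randint(1, len(rule1) - 1)
        return rule1[:p] + rule2[p:], rule2[:p] + rule1[p:]

    def mutate(rule, rate=0.1):
        # Invert randomly selected bits in the rule's string.
        flip = lambda b: "1" if b == "0" else "0"
        return "".join(flip(b) if random.random() < rate else b for b in rule)

    def next_generation(population, fitness, survive=0.5):
        # Keep the fittest rules, then refill the population with offspring.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:max(2, int(len(ranked) * survive))]
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)
            child, _ = crossover(a, b)
            children.append(mutate(child))
        return survivors + children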

Fuzzy Set Approaches:


Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
that a certain value has in a given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-value logic and
probability theory.
It lets us work at a high level of abstraction and offers a means for dealing with imprecise
measurement of data.
Most important, fuzzy set theory allows us to deal with vague or inexact facts.
Unlike the notion of traditional "crisp" sets, where an element either belongs to a set S or its
complement, in fuzzy set theory, elements can belong to more than one fuzzy set.
Fuzzy set theory is useful for data mining systems performing rule-based classification.
It provides operations for combining fuzzy measurements.
Several procedures exist for translating the resulting fuzzy output into a defuzzified or crisp
value that is returned by the system.
Fuzzy logic systems have been used in numerous areas for classification, including
market research, finance, health care, and environmental engineering.

Example:
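
As an illustration, the following Python sketch defines overlapping membership functions
for the income attribute, with made-up triangular boundaries; it shows how one value can
belong to more than one fuzzy set at the same time.

    def membership_low(income):
        # Degree of membership in the fuzzy set "low" (made-up cut points).
        if income <= 30000:
            return 1.0
        if income >= 50000:
            return 0.0
        return (50000 - income) / 20000.0

    def membership_medium(income):
        # Triangular membership for "medium", peaking at 50,000.
        if income <= 30000 or income >= 70000:
            return 0.0
        if income <= 50000:
            return (income - 30000) / 20000.0
        return (70000 - income) / 20000.0

    # A value can belong to more than one fuzzy set at once:
    print(membership_low(42000), membership_medium(42000))   # 0.4 0.6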

Regression Analysis:
Regression analysis can be used to model the relationship between one or more independent
or predictor variables and a dependent or response variable which is continuous-valued.
In the context of data mining, the predictor variables are the attributes of interest describing
the tuple (i.e., making up the attribute vector).
In general, the values of the predictor variables are known.
The response variable is what we want to predict.

Linear Regression:
Straight-line regression analysis involves a response variable, y, and a single predictor
variable x.
It is the simplest form of regression, and models y as a linear function of x.
That is, y = b + wx,
where the variance of y is assumed to be constant, and
b and w are regression coefficients specifying the y-intercept and slope of the line.
The regression coefficients, w and b, can also be thought of as weights, so that we can
equivalently write y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates
the best-fitting straight line as the one that minimizes the error between the actual data
and the estimate of the line.
Let D be a training set consisting of values of predictor variable, x, for some population and
their associated values for response variable, y. The training set contains |D| data points of
the form (x1, y1), (x2, y2), …, (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:

w1 = Σi=1..|D| (xi − x̄)(yi − ȳ) / Σi=1..|D| (xi − x̄)²
w0 = ȳ − w1 x̄

where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
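
A minimal Python sketch of these least-squares estimates; the sample data points (years of
experience versus salary) are made up for illustration.

    def least_squares(points):
        # Estimate w0 (y-intercept) and w1 (slope) for y = w0 + w1*x.
        n = len(points)
        x_bar = sum(x for x, _ in points) / n
        y_bar = sum(y for _, y in points) / n
        w1 = (sum((x - x_bar) * (y - y_bar) for x, y in points)
              / sum((x - x_bar) ** 2 for x, _ in points))
        w0 = y_bar - w1 * x_bar
        return w0, w1

    data = [(3, 30), (8, 57), (9, 64), (13, 72), (16, 83)]  # illustrative
    w0, w1 = least_squares(data)
    print(f"y = {w0:.2f} + {w1:.2f} x")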
Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor
variable.
It allows response variable y to be modeled as a linear function of, say, n predictor
variables or attributes, A1, A2, …, An, describing a tuple, X.
An example of a multiple linear regression model based on two predictor attributes or
variables, A1 and A2, is y = w0 + w1x1 + w2x2,
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple regression problems are instead commonly solved with the use of statistical
software packages, such as SAS, SPSS, and S-Plus.

Nonlinear Regression:
It can be modeled by adding polynomial terms to the basic linear model.
By applying transformations to the variables, we can convert the nonlinear model into a
linear one that can then be solved by the method of least squares.
Polynomial regression is a special case of multiple regression. That is, the addition of
high-order terms like x², x³, and so on, which are simple functions of the single variable, x,
can be considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by

y = w0 + w1x + w2x² + w3x³

To convert this equation to linear form, we define new variables:

x1 = x,  x2 = x²,  x3 = x³

It can then be converted to linear form by applying the above assignments, resulting in the
equation

y = w0 + w1x1 + w2x2 + w3x3

which is easily solved by the method of least squares using software for regression analysis.
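
A small sketch of this substitution in Python, assuming NumPy's lstsq as the least-squares
solver; the sample points are made up.

    import numpy as np

    # Made-up sample points with a roughly cubic trend.
    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    ys = np.array([1.0, 3.4, 6.2, 8.8, 10.6, 11.0])

    # Define the new variables x1 = x, x2 = x^2, x3 = x^3 (plus a column of
    # ones for w0), turning the cubic model into multiple linear regression.
    X = np.column_stack([np.ones_like(xs), xs, xs ** 2, xs ** 3])
    w, *_ = np.linalg.lstsq(X, ys, rcond=None)
    print("w0..w3 =", w)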

Classifier Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier; that is, it reflects how well the classifier recognizes tuples of the various
classes.
The error rate or misclassification rate of a classifier, M, is simply 1 − Acc(M), where
Acc(M) is the accuracy of M.
The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes.
True positives refer to the positive tuples that were correctly labeled by the
classifier. True negatives are the negative tuples that were correctly labeled by
the classifier.
False positives are the negative tuples that were incorrectly labeled as positive.
The sensitivity and specificity measures can be used to assess how well the classifier
recognizes the positive and negative tuples, respectively.
Accuracy is a function of sensitivity and specificity:

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity · (pos / (pos + neg)) + specificity · (neg / (pos + neg))

where t_pos is the number of true positives, pos is the number of positive tuples, t_neg is
the number of true negatives, neg is the number of negative tuples, and f_pos is the number
of false positives.
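
A small Python sketch computing these measures from raw confusion-matrix counts; the
example counts are made up.

    def evaluate(t_pos, f_neg, t_neg, f_pos):
        # Sensitivity, specificity, precision, and accuracy from raw counts.
        pos = t_pos + f_neg              # all tuples that are actually positive
        neg = t_neg + f_pos              # all tuples that are actually negative
        sensitivity = t_pos / pos
        specificity = t_neg / neg
        precision = t_pos / (t_pos + f_pos)
        accuracy = (sensitivity * pos / (pos + neg)
                    + specificity * neg / (pos + neg))
        return sensitivity, specificity, precision, accuracy

    print(evaluate(t_pos=90, f_neg=10, t_neg=85, f_pos=15))  # illustrative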
