Unit-4 DM
Classification
I) Basic concepts:
i) What is classification?
Classification is a form of data analysis that extracts models describing important
data classes. Such models, called classifiers, predict categorical (discrete, unordered) class
labels. For example, we can build a classification model to categorize bank loan applications
as either safe or risky. Such analysis can help provide us with a better understanding of the
data at large. Classification has numerous applications, including fraud detection, target
marketing, performance prediction, manufacturing, and medical diagnosis.
A bank loans officer needs analysis of her data to learn which loan applicants are
“safe” and which are “risky” for the bank. A marketing manager at AllElectronics needs data
analysis to help guess whether a customer with a given profile will buy a new computer. The
data analysis task is classification, where a model or classifier is constructed to predict class
(categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for
the marketing data.
Suppose that the marketing manager wants to predict how much a given customer will
spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or ordered
value, as opposed to a class label. This model is a predictor. Regression analysis is a
statistical methodology that is most often used for numeric prediction.
ii) General Approach to Classification:
Data classification is a two-step process, consisting of a learning step (where a
classification model is constructed) and a classification step (where the model is used to
predict class labels for given data). The process is shown for the loan application data of
Figure(a).
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm builds
the classifier by analyzing or “learning from” a training set made up of database tuples and
their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2, . . . , xn), depicting n measurements made on the tuple from n database attributes,
respectively, A1, A2, . . . , An. Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class label attribute. The class label
attribute is discrete-valued and unordered. The individual tuples making up the training set
are referred to as training tuples and are randomly sampled from the database under
analysis. This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f(X), that can predict the associated class label y of a given tuple X.
In the second step (Figure b), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to
overfit the data. Therefore, a test set is used, made up of test tuples and their associated
class labels. They are independent of the training tuples, meaning that they were not used to
construct the classifier.
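As a minimal sketch of this two-step process, the following Python code builds a classifier on a training set and estimates its accuracy on an independent test set. It assumes scikit-learn is available; the tiny loan-style dataset and its attribute encoding are made-up placeholders, not the data from the figure.

```python
# Minimal sketch of the two-step classification process (learning step + classification step),
# assuming scikit-learn is available. The data below is a synthetic placeholder for the
# loan-application example; attribute values are illustrative only.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy attribute vectors (e.g., encoded income level, credit history flag) and class labels.
X = [[40, 1], [25, 0], [60, 1], [30, 0], [55, 1], [20, 0], [45, 1], [35, 0]]
y = ["safe", "risky", "safe", "risky", "safe", "risky", "safe", "risky"]

# Hold out an independent test set so the accuracy estimate is not overly optimistic.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Step 1 (learning step): build the classifier from the training tuples.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Step 2 (classification step): estimate accuracy on the test set, then predict new tuples.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("prediction for a new applicant:", clf.predict([[50, 1]]))
```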
II) Decision Tree Induction (Attribute Selection Measures):
Step 1: The expected information needed to classify a tuple in D is
Info(D) = − Σ (i = 1 to m) pi log2(pi),
where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|. Info(D) is the average amount of information needed to identify the class label of a tuple in D; it is also known as the entropy of D.
Step 2: The expected information required to classify a tuple from D, based on the partitioning by attribute A into v subsets {D1, D2, . . . , Dv}, is calculated by
InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj).
Step 3: Information gain is defined as the difference between the original information requirement and the new requirement:
Gain(A) = Info(D) − InfoA(D).
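As a rough illustration of Steps 1 to 3, the following Python sketch computes Info(D), InfoA(D), and Gain(A) from lists of class labels. The function names and the data representation are assumptions made for this sketch only.

```python
# Sketch of Steps 1-3: entropy Info(D), expected information Info_A(D), and Gain(A).
# Pure standard-library Python; partitions are represented as lists of class labels.
from math import log2
from collections import Counter

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_a(partitions):
    """Info_A(D) = sum(|D_j|/|D| * Info(D_j)) over the partitions induced by attribute A."""
    total = sum(len(p) for p in partitions)
    return sum((len(p) / total) * info(p) for p in partitions)

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(partitions)
```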
Gain ratio:
C4.5 (or J48), a successor of ID3, uses an extension to information gain known as the gain ratio, which attempts to overcome the bias of information gain toward attributes having a large number of values. It applies a kind of normalization to information gain using a "split information" value, defined analogously with Info(D) as
SplitInfoA(D) = − Σ (j = 1 to v) (|Dj| / |D|) × log2(|Dj| / |D|).
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D).
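The split information and gain ratio can be sketched the same way. In the usage lines below, the partition sizes (4, 6, 4) are those of the income attribute in the standard AllElectronics training table, which is not reproduced in this text, and Gain(income) = 0.029 bits is taken from the example that follows.

```python
# Sketch of SplitInfo_A(D) and GainRatio(A). Only the partition sizes matter for the
# split information, so they are passed in directly.
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|D_j|/|D| * log2(|D_j|/|D|))."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain_a, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain_a / split_info(partition_sizes)

# Example: attribute "income" splits the 14 tuples into partitions of size 4, 6 and 4.
print(split_info([4, 6, 4]))          # ≈ 1.557 bits
print(gain_ratio(0.029, [4, 6, 4]))   # ≈ 0.019
```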
Example for Information Gain:
In this example, the class label attribute, buys computer, has two distinct values, namely {yes, no}; therefore, there are two distinct classes. Let class C1 correspond to yes and class C2 correspond to no. There are 9 tuples of class yes and 5 tuples of class no. We must compute the information gain of each attribute. First, compute the expected information needed to classify a tuple in D:
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.
Next, we need to compute the expected information requirement for each attribute.
Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for
each category of age. For the age category youth, there are 2 yes tuples and 3 no tuples. For
the category middle aged, there are 4 yes tuples and zero no tuples. For the category senior,
there are 3 yes tuples and 2 no tuples
The expected information needed to classify a tuple in D if the tuples are partitioned according to age is
Infoage(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × (−(4/4) log2(4/4)) + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits.
Hence Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain (income) = 0.029 bits, Gain (student) = 0.151 bits,
and Gain (credit rating) = 0.048 bits. Because age has the highest information gain among
the attributes, it is selected as the splitting attribute.
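These figures can be checked with a few lines of Python. The sketch below recomputes Info(D), Infoage(D), and Gain(age) from the yes/no counts given above; the helper name info is an illustrative choice for this sketch.

```python
# Numeric check of the information-gain calculation for the age attribute.
from math import log2

def info(counts):
    """Expected information (entropy) for a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

info_d = info([9, 5])                 # Info(D): 9 yes tuples, 5 no tuples
# Partitions by age: youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no).
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum((sum(p) / 14) * info(p) for p in partitions)
print(f"{info_d:.3f} {info_age:.3f} {info_d - info_age:.3f}")
# -> 0.940 0.694 0.247 (0.246 bits in the text, where the rounded values are subtracted)
```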
iii) Tree pruning: There are two common approaches to tree pruning: prepruning and postpruning.
In the prepruning approach, a tree is “pruned” by halting its construction early (e.g.,
by deciding not to further split or partition the subset of training tuples at a given node). Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset
tuples or the probability distribution of those tuples. When constructing a tree, measures such
as statistical significance, information gain, Gini index, and so on can be used to assess the
goodness of a split. If partitioning the tuples at a node would result in a split that falls below a
prespecified threshold, then further partitioning of the given subset is halted.
The second and more common approach is postpruning, which removes subtrees
from a “fully grown” tree. A subtree at a given node is pruned by removing its branches and
replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree
being replaced. For example, notice the subtree at node “A3?” in the unpruned tree of Figure
6.6. Suppose that the most common class within this subtree is “class B.” In the pruned
version of the tree, the subtree in question is pruned by replacing it with the leaf “class B.”
III) Bayesian Classification Algorithm:
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as
the probability that a given tuple belongs to a particular class.
i) Bayes' Theorem:
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) × P(c) / P(x)
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
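As a tiny numeric illustration of the theorem, with made-up values for P(c), P(x|c), and P(x):

```python
# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x). The numbers are made up for illustration.
p_c = 0.3           # prior probability of the class, P(c)
p_x_given_c = 0.5   # likelihood, P(x|c)
p_x = 0.4           # prior probability of the predictor, P(x)

p_c_given_x = p_x_given_c * p_c / p_x   # posterior probability, P(c|x)
print(p_c_given_x)  # 0.375
```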
ii) Algorithm:
Step 1: Let D be a training set of tuples and their associated class labels. As usual, each tuple
is represented by an n-dimensional attribute vector, X = (x1, x2. . . xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, . . . , An.
Step 2: Suppose that there are m classes, C1, C2,... Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
Step 3: As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P (X|Ci) P(Ci).
Step 4: Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the attributes' values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,
P(X|Ci) = Π (k = 1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × · · · × P(xn|Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), . . . , P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, the attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).
Step 5: In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
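A compact Python sketch of Steps 1 to 5 for categorical attributes is given below. Probabilities are estimated by counting over the training tuples; the toy dataset and function names are assumptions made for illustration, and no smoothing is applied, so an unseen attribute value gives a zero probability.

```python
# Naive Bayesian classification for categorical attributes (Steps 1-5).
# P(Ci) and P(xk|Ci) are estimated by counting over the training tuples; the class with
# the largest P(X|Ci) * P(Ci) is predicted.
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    class_counts = Counter(labels)
    # value_counts[(attribute index, value, class)] = number of matching training tuples
    value_counts = defaultdict(int)
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            value_counts[(k, v, c)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n                              # P(Ci)
        for k, v in enumerate(x):
            score *= value_counts[(k, v, c)] / count   # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up toy data: attributes are (age group, student?), class is buys_computer.
X_train = [("youth", "yes"), ("youth", "no"), ("senior", "no"), ("senior", "yes")]
y_train = ["yes", "no", "no", "yes"]
model = train_naive_bayes(X_train, y_train)
print(predict(("youth", "yes"), *model))  # -> "yes"
```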
Example: Predicting a class label using naïve Bayesian classification for the given data:
The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1
correspond to the class buys computer = yes and C2 correspond to buys computer = no. The
tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can
be computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Using the above probabilities, we obtain
P(X|buys computer = yes) = 0.222×0.444×0.667×0.667 = 0.044.
Similarly,
P(X|buys computer = no) = 0.600×0.400×0.200×0.400 = 0.019.
To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute
P(X|buys computer = yes)P(buys computer = yes) = 0.044×0.643 = 0.028
P(X|buys computer = no)P(buys computer = no) = 0.019×0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
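The arithmetic above can be reproduced in a few lines of Python, using the rounded conditional probabilities listed in the text:

```python
# Reproducing the naive Bayes arithmetic for tuple X, using the conditional probabilities
# listed above (rounded to three decimals, as in the text).
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667      # P(X | buys_computer = yes)
p_x_no = 0.600 * 0.400 * 0.200 * 0.400       # P(X | buys_computer = no)
print(round(p_x_yes, 3), round(p_x_no, 3))                    # -> 0.044 0.019
print(round(p_x_yes * 0.643, 3), round(p_x_no * 0.357, 3))    # -> 0.028 0.007
```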
Classifier evaluation measures:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,
accuracy(M) = (TP + TN) / (P + N),
where TP and TN are the numbers of true positives and true negatives, and P and N are the total numbers of positive and negative tuples in the test set. We can also speak of the error rate or misclassification rate of a classifier, which is defined as
error rate(M) = (FP + FN) / (P + N) = 1 − accuracy(M),
where FP and FN are the numbers of false positives and false negatives.
The sensitivity and specificity measures can be used to assess how well the classifier recognizes the positive and the negative tuples, respectively. Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), and specificity is the true negative rate (the proportion of negative tuples that are correctly identified). They are measured as
sensitivity = TP / P and specificity = TN / N.
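These evaluation measures can be sketched directly from the confusion-matrix counts; the TP/TN/FP/FN values in the usage line are made up for illustration.

```python
# Evaluation measures from confusion-matrix counts: accuracy, error rate, sensitivity
# (true positive rate) and specificity (true negative rate). Counts are made up.
def evaluate(tp, tn, fp, fn):
    p = tp + fn          # number of positive tuples
    n = tn + fp          # number of negative tuples
    accuracy = (tp + tn) / (p + n)
    error_rate = (fp + fn) / (p + n)   # equals 1 - accuracy
    sensitivity = tp / p
    specificity = tn / n
    return accuracy, error_rate, sensitivity, specificity

print(evaluate(tp=90, tn=80, fp=20, fn=10))  # -> (0.85, 0.15, 0.9, 0.8)
```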