Unit-4 DM


Unit-IV

Classification
I) Basic concepts:
i) What is classification? :
Classification is a form of data analysis that extracts models describing important
data classes. Such models, called classifiers, predict categorical (discrete, unordered) class
labels. For example, we can build a classification model to categorize bank loan applications
as either safe or risky. Such analysis can help provide us with a better understanding of the
data at large. Classification has numerous applications, including fraud detection, target
marketing, performance prediction, manufacturing, and medical diagnosis.
A bank loan officer needs analysis of her data to learn which loan applicants are “safe” and which are “risky” for the bank. A marketing manager at AllElectronics needs data
analysis to help guess whether a customer with a given profile will buy a new computer. The
data analysis task is classification, where a model or classifier is constructed to predict class
(categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for
the marketing data.
Suppose that the marketing manager wants to predict how much a given customer will
spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or ordered
value, as opposed to a class label. This model is a predictor. Regression analysis is a
statistical methodology that is most often used for numeric prediction.
ii) General Approach to Classification:
Data classification is a two-step process, consisting of a learning step (where a
classification model is constructed) and a classification step (where the model is used to
predict class labels for given data). The process is shown for the loan application data of
Figure (a).
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm builds
the classifier by analyzing or “learning from” a training set made up of database tuples and
their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2, . . . , xn), depicting n measurements made on the tuple from n database attributes,
respectively, A1, A2, . . . , An. Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class label attribute. The class label
attribute is discrete-valued and unordered. The individual tuples making up the training set
are referred to as training tuples and are randomly sampled from the database under
analysis. This first step of the classification process can also be viewed as the learning of a
mapping or function, y = f(X), that can predict the associated class label y of a given tuple X.
In the second step (Figure b), the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to
overfit the data. Therefore, a test set is used, made up of test tuples and their associated
class labels. They are independent of the training tuples, meaning that they were not used to
construct the classifier.
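As a concrete illustration of the two steps, the following is a minimal sketch using scikit-learn’s DecisionTreeClassifier; the loan figures and their numeric encoding are invented for the example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Learning step: hypothetical loan-application tuples encoded as
# (income in thousands, years employed), with class labels "safe"/"risky".
X_train = [[40, 1], [90, 10], [25, 0], [120, 15], [60, 4], [30, 2]]
y_train = ["risky", "safe", "risky", "safe", "safe", "risky"]
classifier = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: first estimate accuracy on an independent test set,
# then use the model to predict class labels for new, unlabeled tuples.
X_test, y_test = [[80, 8], [20, 1]], ["safe", "risky"]
print("estimated accuracy:", accuracy_score(y_test, classifier.predict(X_test)))
print("new applicant:", classifier.predict([[55, 3]]))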

Figure: a) Learning Process b) Classification Process


II) Decision Tree Induction:
i) What is Decision Tree?
A decision tree is a flowchart-like structure that includes a root node, branches, and
leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, indicating whether a
customer at a company is likely to buy a computer or not.

Figure: A decision tree for the concept buy_computer; each internal node represents a test on an attribute and each leaf node holds a class label (yes or no).
ii) Decision Tree Induction Algorithm:


The machine learning researcher J. Ross Quinlan developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5, the
successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the
trees are constructed in a top-down, recursive, divide-and-conquer manner.
The algorithm proceeds as follows:
Input: a training dataset D, attribute_list, and Attribute_selection_method.
Output: A Decision Tree.
Method:
(1) create a node N;
(2) if the tuples in D are all of the same class, C, then
(3)     return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5)     return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the “best” splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits are allowed then
(9)     attribute_list ← attribute_list − splitting_attribute;
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)     if Dj is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
(15) return N;
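To make the recursion concrete, here is a minimal Python sketch of this procedure. The data layout (a list of (attribute_dict, class_label) pairs) and the helper names majority_class and select_attribute are assumptions made for illustration, not part of the algorithm’s specification:

from collections import Counter

def majority_class(D):
    # D is a list of (attribute_dict, class_label) pairs.
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    labels = {label for _, label in D}
    if len(labels) == 1:                      # steps (2)-(3): pure partition
        return labels.pop()
    if not attribute_list:                    # steps (4)-(5): majority voting
        return majority_class(D)
    A = select_attribute(D, attribute_list)   # step (6): find the "best" attribute
    remaining = [a for a in attribute_list if a != A]   # step (9)
    tree = {A: {}}                            # step (7): label node with criterion
    for v in {attrs[A] for attrs, _ in D}:    # step (10): one branch per outcome
        Dj = [(attrs, label) for attrs, label in D if attrs[A] == v]
        # Steps (12)-(13) guard against an empty partition; since the outcomes
        # here are drawn from D itself, Dj is never empty, so recurse (step 14).
        tree[A][v] = generate_decision_tree(Dj, remaining, select_attribute)
    return tree                               # step (15)

The select_attribute parameter plays the role of Attribute_selection_method above; passing in an information gain function yields ID3-style splits, while a Gini index function yields CART-style splits.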
iii) Attribute Selection Measures:
During tree construction, attribute selection measures are used to select the attributes
that best partition the tuples into distinct classes. Common attribute selection measures
include information gain, gain ratio, and the Gini index.
a) Information Gain:
Step 1: Let D be a set of training tuples. The expected information needed to classify a
tuple in D is given by

Info(D) = −Σ_{i=1}^{m} pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci, estimated by
|Ci,D|/|D|.
Step 2: Info(D) is the average amount of information needed to identify the class label of a
tuple in D; it is also known as the entropy of D. The expected information still required to
classify a tuple from D after partitioning on attribute A is calculated by

Info_A(D) = Σ_{j=1}^{v} (|Dj|/|D|) × Info(Dj)
Step 3: Information gain is defined as the difference between the original information
requirement and the new requirement:

Gain(A) = Info(D) − Info_A(D)
Gain ratio:
C4.5 (or J48), a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome the bias of information gain toward tests with many outcomes.
It applies a kind of normalization to information gain using a “split information” value
defined analogously to Info(D) as

SplitInfo_A(D) = −Σ_{j=1}^{v} (|Dj|/|D|) × log2(|Dj|/|D|)
This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. It differs
from information gain, which measures the information with respect to classification that is
acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)
Example for Information Gain:
The class-labeled training tuples from the AllElectronics customer database are:

RID | age         | income | student | credit_rating | Class: buys_computer
 1  | youth       | high   | no      | fair          | no
 2  | youth       | high   | no      | excellent     | no
 3  | middle_aged | high   | no      | fair          | yes
 4  | senior      | medium | no      | fair          | yes
 5  | senior      | low    | yes     | fair          | yes
 6  | senior      | low    | yes     | excellent     | no
 7  | middle_aged | low    | yes     | excellent     | yes
 8  | youth       | medium | no      | fair          | no
 9  | youth       | low    | yes     | fair          | yes
10  | senior      | medium | yes     | fair          | yes
11  | youth       | medium | yes     | excellent     | yes
12  | middle_aged | medium | no      | excellent     | yes
13  | middle_aged | high   | yes     | fair          | yes
14  | senior      | high   | no      | excellent     | no
In this example, the class label attribute, buys_computer, has two distinct values, namely
{yes, no}; therefore, there are two distinct classes. Let class C1 correspond to yes and class
C2 correspond to no. There are 9 tuples of class yes and 5 tuples of class no. We must
compute the information gain of each attribute. First, compute the expected information
needed to classify a tuple in D:

Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Next, we need to compute the expected information requirement for each attribute.
Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for
each category of age. For the age category youth, there are 2 yes tuples and 3 no tuples. For
the category middle aged, there are 4 yes tuples and zero no tuples. For the category senior,
there are 3 yes tuples and 2 no tuples
The expected information needed to classify a tuple in D if the tuples are partitioned
according to age is

Info_age(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5))
            + (4/14) × (−(4/4) log2(4/4))
            + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5))
            = 0.694 bits
Hence, the gain in information from such a partitioning would be

Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits,
and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among
the attributes, it is selected as the splitting attribute.
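These figures are easy to verify in code. The following sketch recomputes Info(D), Info_age(D), and Gain(age) from the class counts given above:

from math import log2

def info(counts):
    # Expected information (entropy) for a list of class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution in D: 9 "yes" and 5 "no" tuples.
info_D = info([9, 5])                                  # 0.940 bits

# Distribution of (yes, no) tuples within each age partition.
partitions = {"youth": [2, 3], "middle_aged": [4, 0], "senior": [3, 2]}
n = sum(sum(p) for p in partitions.values())           # 14 tuples in total
info_age = sum(sum(p) / n * info(p) for p in partitions.values())  # 0.694 bits

print(f"Info(D)     = {info_D:.3f} bits")
print(f"Info_age(D) = {info_age:.3f} bits")
print(f"Gain(age)   = {info_D - info_age:.3f} bits")   # 0.246 bits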
iv) Tree Pruning: There are two common approaches to tree pruning:
prepruning and postpruning.
In the prepruning approach, a tree is “pruned” by halting its construction early (e.g.,
by deciding not to further split or partition the subset of training tuples at a given node). Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset
tuples or the probability distribution of those tuples. When constructing a tree, measures such
as statistical significance, information gain, Gini index, and so on can be used to assess the
goodness of a split. If partitioning the tuples at a node would result in a split that falls below a
prespecified threshold, then further partitioning of the given subset is halted.
The second and more common approach is postpruning, which removes subtrees
from a “fully grown” tree. A subtree at a given node is pruned by removing its branches and
replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree
being replaced. For example, suppose that a subtree at a node “A3?” in an unpruned tree
has “class B” as its most common class. In the pruned version of the tree, the subtree in
question is replaced by the leaf “class B.”
III) Bayesian Classification Algorithm:
Bayesian classification is based on Bayes’ theorem. Bayesian classifiers are
statistical classifiers: they can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
i) Bayes’ Theorem:
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c),
P(x), and P(x|c):

P(c|x) = (P(x|c) × P(c)) / P(x)
Above,

• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.

ii) Algorithm:
Step 1: Let D be a training set of tuples and their associated class labels. As usual, each tuple
is represented by an n-dimensional attribute vector, X = (x1, x2, . . . , xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, . . . , An.
Step 2: Suppose that there are m classes, C1, C2,... Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called
the maximum posterior hypothesis.

Step 3: As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P (X|Ci) P(Ci).
Step 4: Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). To reduce this computation, the naïve assumption of
class-conditional independence is made: the attributes’ values are presumed to be
conditionally independent of one another, given the class label of the tuple (i.e., there are
no dependence relationships among the attributes). Thus,

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci)
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), . . . , P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, it is typically assumed to have a Gaussian distribution with
mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) × e^(−(x − μ)² / (2σ²)),   so that   P(xk|Ci) = g(xk, μCi, σCi)
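For the continuous case, a small sketch of this density estimate (the function name and the sample values are illustrative):

from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian_likelihood(x, values):
    # Estimate P(x | Ci) for a continuous attribute from the attribute
    # values observed in the class-Ci training tuples.
    mu, sigma = mean(values), stdev(values)
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# e.g., ages of "buys_computer = yes" customers (hypothetical values)
print(gaussian_likelihood(35, [25, 35, 38, 42, 30, 45, 33, 36, 40]))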
Step 5: In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i.
Example: Predicting a class label using naïve Bayesian classification for the given data:
(The training data are the same class-labeled AllElectronics tuples shown in the information gain example above.)
The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1
correspond to the class buys computer = yes and C2 correspond to buys computer = no. The
tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can
be computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Using the above probabilities, we obtain

P(X|buys computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly,
P(X|buys computer = no) = 0.600×0.400×0.200×0.400 = 0.019.
To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute
P(X|buys computer = yes)P(buys computer = yes) = 0.044×0.643 = 0.028
P(X|buys computer = no)P(buys computer = no) = 0.019×0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
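The arithmetic of this example is easy to reproduce. The following script transcribes the prior and conditional probabilities computed above and applies the naïve independence assumption:

# Conditional probabilities P(attribute value | class), transcribed from above.
likelihoods = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
priors = {"yes": 9/14, "no": 5/14}

# Naive assumption: P(X|Ci) is the product of the per-attribute likelihoods.
scores = {}
for c in priors:
    p_x_given_c = 1.0
    for p in likelihoods[c].values():
        p_x_given_c *= p
    scores[c] = p_x_given_c * priors[c]            # P(X|Ci) P(Ci)

print(scores)                                      # {'yes': ~0.028, 'no': ~0.007}
print("prediction:", max(scores, key=scores.get))  # buys_computer = yes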

IV) Model Evaluation:


This section covers metrics for evaluating the predictive accuracy of a
classifier, along with methods for estimating that accuracy: holdout, random subsampling,
cross-validation, and the bootstrap.
i) Metrics for Evaluating Classifier Performance:
This section presents measures for assessing how good or how “accurate” your
classifier is at predicting the class label of tuples. They include accuracy (also known as
recognition rate), sensitivity (or recall), specificity, precision, etc.
We can talk in terms of positive tuples (tuples of the main class of interest) and
negative tuples (all other tuples). There are four additional terms we need to know that are
the “building blocks” used in computing many evaluation measures:
• True positives (TP): These refer to the positive tuples that were correctly labelled by
the classifier.
• True negatives (TN): These are the negative tuples that were correctly labelled by the
classifier.
• False positives (FP): These are the negative tuples that were incorrectly labelled as
positive (e.g., tuples of class buys computer = no for which the classifier predicted
buys computer = yes).
• False negatives (FN): These are the positive tuples that were mislabelled as negative
(e.g., tuples of class buys computer = yes for which the classifier predicted buys
computer = no).
These terms are summarized in the following confusion matrix:

                 Predicted: yes   Predicted: no   Total
Actual: yes      TP               FN              P
Actual: no       FP               TN              N
Total            P′               N′              P + N
The accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier. That is,

accuracy = (TP + TN) / (P + N)
We can also speak of the error rate or misclassification rate of a classifier, which is
defined as

error rate = (FP + FN) / (P + N)
When the main class of interest is rare, accuracy alone can be misleading (the class
imbalance problem); the sensitivity and specificity measures can be used in that case.
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of
positive tuples that are correctly identified), and specificity is the true negative rate (the
proportion of negative tuples that are correctly identified). They are measured as

sensitivity = TP / P
specificity = TN / N
Precision can be thought of as a measure of exactness (i.e., what percentage of tuples
labelled as positive are actually such), whereas recall is a measure of completeness (what
percentage of positive tuples are labelled as such):

precision = TP / (TP + FP)
recall = TP / (TP + FN) = TP / P

An alternative way to use precision and recall is to combine them into a single
measure. This is the approach of the F measure (also known as the F1 score or F-score)
and the Fβ measure. They are defined as

F = (2 × precision × recall) / (precision + recall)
Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)
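A compact sketch computing all of these measures from the four counts (the counts shown are hypothetical):

def evaluate(tp, tn, fp, fn):
    # Compute the standard evaluation measures from a confusion matrix.
    p, n = tp + fn, fp + tn            # actual positives and negatives
    precision = tp / (tp + fp)
    recall = tp / p                    # same as sensitivity
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision":   precision,
        "recall":      recall,
        "F1":          2 * precision * recall / (precision + recall),
    }

print(evaluate(tp=90, tn=9560, fp=140, fn=210))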
ii) Holdout Method and Random Subsampling:


The holdout method is what we have alluded to so far in our discussions about
accuracy. In this method, the given data are randomly partitioned into two independent sets, a
training set and a test set. Typically, two-thirds of the data are allocated to the training set,
and the remaining one-third is allocated to the test set. The training set is used to derive the
model. The model’s accuracy is then estimated with the test set, as shown in the figure:

Figure: Estimating classifier accuracy with the holdout method.
Random subsampling is a variation of the holdout method in which the holdout
method is repeated k times. The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
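A minimal sketch of both methods in plain Python; the train_and_score callback and the two-thirds split are assumptions for illustration:

import random

def holdout_split(data, train_fraction=2/3, seed=None):
    # Randomly partition data into independent training and test sets.
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def random_subsampling(data, train_and_score, k=10):
    # Repeat the holdout method k times and average the accuracies.
    accuracies = []
    for i in range(k):
        train, test = holdout_split(data, seed=i)
        accuracies.append(train_and_score(train, test))
    return sum(accuracies) / k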
iii) Cross-validation: Cross-validation is a technique to evaluate predictive models by
partitioning the original sample into a training set to train the model and a test set to
evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k
equal-size subsamples; each subsample serves exactly once as the test set, with the
remaining k − 1 subsamples used for training, as sketched below.
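A sketch that generates the k train/test index partitions (fold assignment by striding is one simple choice):

import random

def k_fold_indices(n, k=10, seed=0):
    # Yield (train_indices, test_indices) pairs for k-fold cross-validation.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-size folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each tuple is used k-1 times for training and exactly once for testing.
for train, test in k_fold_indices(n=10, k=5):
    print(sorted(test))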
iv) Bootstrap: The bootstrap method samples the given training tuples uniformly with
replacement. That is, each time a tuple is selected, it is equally likely to be selected again and
re-added to the training set. For instance, imagine a machine that randomly selects tuples for
our training set. In sampling with replacement, the machine is allowed to select the same
tuple more than once.
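A sketch of sampling with replacement: each bootstrap sample has as many tuples as the original data set, and on average about 63.2% of the original tuples appear in it, leaving the rest usable as a test set:

import random

def bootstrap_sample(data, seed=None):
    # Sample len(data) tuples uniformly with replacement; tuples that were
    # never chosen form the out-of-sample test set.
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    test = [t for t in data if t not in sample]
    return sample, test

data = list(range(1000))
sample, test = bootstrap_sample(data, seed=42)
print(len(set(sample)) / len(data))   # fraction of distinct tuples drawn, ≈ 0.632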
