Unit 3 - Classification
Classification
3.1 Introduction
Classification is the process in which a model, or classifier, is constructed to predict the
categorical labels of previously unseen data. Classification problems aim to identify the
characteristics that indicate the group to which each case belongs. The resulting pattern can be
used both to understand the existing data and to predict how new instances will behave.
Definition: Classification is the task of learning a target function f that maps each attribute set X to
one of the predefined class labels Y.
Examples include classifying loan applicants as “safe” or “risky” for a bank, predicting whether a
customer with a given profile will buy a new computer, and deciding whether a patient is a good
candidate for a surgical procedure.
Data classification is a two-step process:
I. Model construction
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This
is the learning step (or training phase), where a classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of database tuples and their associated class
labels.
II. Model usage
In the second step, the model is used for classification. Test data are used to estimate the accuracy
of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
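The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the loan data, the single income attribute, and the threshold-based "classifier" are all hypothetical and were chosen to keep the example short.

```python
# Hypothetical labeled tuples: (income in thousands, class label).
training_set = [(20, "risky"), (25, "risky"), (30, "risky"),
                (45, "safe"), (50, "safe"), (60, "safe")]
test_set = [(22, "risky"), (35, "risky"), (48, "safe"), (55, "safe")]


def build_classifier(tuples):
    """Step I (model construction): learn the income threshold that
    classifies the most training tuples correctly."""
    best_threshold, best_correct = None, -1
    for threshold, _ in tuples:
        correct = sum((label == "safe") == (income >= threshold)
                      for income, label in tuples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def classify(threshold, income):
    """Apply the learned rule to a single tuple."""
    return "safe" if income >= threshold else "risky"


def accuracy(threshold, tuples):
    """Step II (model usage): estimate accuracy on tuples that were
    not used for training."""
    hits = sum(classify(threshold, income) == label
               for income, label in tuples)
    return hits / len(tuples)


threshold = build_classifier(training_set)
print("learned threshold:", threshold)
print("test accuracy:", accuracy(threshold, test_set))
```

If the estimated accuracy is acceptable, `classify` can then be applied to new tuples whose class label is unknown.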
Training set
Arjun Lamichhane 1
Data Mining and Data Warehousing Unit 3: Classification
Example:
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can
be used to identify whether any two given attributes are statistically related. For example, a strong
correlation between attributes A1 and A2 would suggest that one of the two could be removed from
further analysis.
Data transformation and reduction: The data may be transformed by normalization, particularly
when neural networks or methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0. Data can also be reduced by applying methods
such as binning, histogram analysis, and clustering.
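As a minimal sketch of the normalization described above, min-max scaling maps the values of one attribute linearly onto a chosen range. The function name and the sample income values are illustrative assumptions, not part of these notes.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant attribute carries no information; map it to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


incomes = [12000, 40000, 73600, 98000]  # hypothetical attribute values
print(min_max_normalize(incomes))             # scaled to 0.0 .. 1.0
print(min_max_normalize(incomes, -1.0, 1.0))  # scaled to -1.0 .. 1.0
```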
Classification methods can be compared and evaluated according to the following criteria:
Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label information).
Speed: This refers to the computational costs involved in generating and using the given classifier
or predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy
data or data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently given large
amounts of data.
Types of Classifiers:
Bayesian Classifier
The problem of constructing a decision tree can be expressed recursively. First, select an attribute
to place at the root node and make one branch for each possible value. This splits up the example
set into subsets, one for every value of the attribute. Now the process can be repeated recursively
for each branch, using only those instances that actually reach the branch. If at any time all
instances at a node have the same classification, stop developing that part of the tree.
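The recursive procedure can be sketched as follows. To keep the sketch short, the attribute to split on is simply taken in the order given, which is an assumption; a real algorithm would choose it with an attribute selection measure. The five-row data set is a hypothetical toy example.

```python
from collections import Counter


def build_tree(rows, attributes, target):
    """Recursively build a decision tree as nested dictionaries."""
    labels = [row[target] for row in rows]
    # Stop when all instances at this node have the same classification.
    if len(set(labels)) == 1:
        return labels[0]
    # No attribute left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    attribute, remaining = attributes[0], attributes[1:]
    tree = {attribute: {}}
    # One branch per value, using only the instances that reach it.
    for value in sorted({row[attribute] for row in rows}):
        subset = [row for row in rows if row[attribute] == value]
        tree[attribute][value] = build_tree(subset, remaining, target)
    return tree


rows = [  # hypothetical toy data
    {"Outlook": "sunny", "Wind": "weak", "Play": "no"},
    {"Outlook": "sunny", "Wind": "strong", "Play": "no"},
    {"Outlook": "overcast", "Wind": "weak", "Play": "yes"},
    {"Outlook": "rain", "Wind": "weak", "Play": "yes"},
    {"Outlook": "rain", "Wind": "strong", "Play": "no"},
]
print(build_tree(rows, ["Outlook", "Wind"], "Play"))
```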
“How are decision trees used for classification?” Given a tuple, X, for which the associated class
label is unknown, the attribute values of the tuple are tested against the decision tree. A path is
traced from the root to a leaf node, which holds the class prediction for that tuple (X). Decision
trees can easily be converted to classification rules as well.
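Tracing a tuple from the root to a leaf might look like the sketch below. The tree is written as nested dictionaries, which is an assumed representation; its structure follows the weather example used in these notes (split on Outlook, then on Humidity or Wind), with attribute values in lowercase.

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels.
tree = {"Outlook": {
    "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"Wind": {"weak": "yes", "strong": "no"}},
}}


def classify(node, tuple_x):
    """Follow the path selected by the tuple's attribute values."""
    while isinstance(node, dict):
        attribute = next(iter(node))            # attribute tested at this node
        node = node[attribute][tuple_x[attribute]]
    return node                                 # leaf: the class prediction


x = {"Outlook": "sunny", "Humidity": "normal", "Wind": "strong"}
print(classify(tree, x))
```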
The only (and the most important) thing left to decide is how to determine which attribute to split
on, given a set of examples with different classes. Consider the weather data in Table 1. There
are four possibilities for the root node (Outlook, Temperature, Humidity and Wind). Which is the best
choice? The number of yes and no classes are shown at the leaves. Any leaf with only one class—
yes or no—will not have to be split further, and the recursive process down that branch will
terminate. Because we seek small trees, we would like this to happen as soon as possible. If we
had a measure of the purity of each node, we could choose the attribute that produces the purest
child nodes.
An attribute selection measure is an approach for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes. If we
were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally
each partition would be pure (i.e. all of the tuples that fall into a given partition would belong to
the same class). Conceptually, the “best” splitting criterion is the one that most closely results in
such a scenario. Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split.
There are three popular attribute selection measures: information gain, gain ratio, and Gini index.
a. Information Gain
Information gain is used by the ID3 algorithm. The attribute with the highest information gain is
chosen as the splitting attribute for node N. This attribute minimizes the information needed to
classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in
these partitions.
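The expected information needed to classify a tuple in D is given by
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
where p_i is the probability that a tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|, and
m is the number of classes. If attribute A partitions D into v subsets D_1, ..., D_v, the expected
information still required after the split is
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information
required to classify a tuple from D based on the partitioning by A.
Finally, the information gain of any attribute A can be calculated as:
    Gain(A) = Info(D) - Info_A(D)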
In other words, Gain(A) tells us how much would be gained by branching on A. The attribute A
with the highest information gain, (Gain(A)), is chosen as the splitting attribute at node N. This is
equivalent to saying that we want to partition on the attribute A that would do the “best
classification,” so that the amount of information still required to finish classifying the tuples is
minimal (i.e., minimum InfoA(D)).
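The following sketch computes these quantities from class counts. The function names are illustrative, and Info(D) is taken to be the usual entropy, -sum(p_i * log2(p_i)). The counts are those of the Outlook split in the weather example: 9 yes and 5 no overall, split into partitions with (2 yes, 3 no), (4 yes, 0 no), and (3 yes, 2 no).

```python
from math import log2


def info(counts):
    """Expected information (entropy) of a class distribution, in bits."""
    total = sum(counts)
    # Empty classes are skipped, following the convention 0 * log2(0) = 0.
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_after_split(partitions):
    """Weighted information of the partitions produced by an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


def gain(class_counts, partitions):
    """Information gain: Info(D) minus Info_A(D)."""
    return info(class_counts) - info_after_split(partitions)


outlook = [[2, 3], [4, 0], [3, 2]]   # [yes, no] counts per Outlook value
print(round(info([9, 5]), 3))              # about 0.940
print(round(info_after_split(outlook), 3)) # about 0.694
print(round(gain([9, 5], outlook), 3))     # about 0.247 without intermediate rounding
```

Subtracting the rounded intermediate values (0.940 - 0.694) gives 0.246, which is the figure usually quoted for this example.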
Example 1: Decision tree using information gain.
The table below presents a training set, D, of class-labeled tuples from the weather dataset, which
records the weather on each of the last 14 days and whether a match was played on that day.
Using a decision tree, we need to predict whether a game will be played on a day with the given
attribute values.
    Info_{Outlook}(D) = \frac{5}{14} \times [-\frac{2}{5}\log_2(\frac{2}{5}) - \frac{3}{5}\log_2(\frac{3}{5})]
                      + \frac{4}{14} \times [-\frac{4}{4}\log_2(\frac{4}{4}) - \frac{0}{4}\log_2(\frac{0}{4})]
                      + \frac{5}{14} \times [-\frac{3}{5}\log_2(\frac{3}{5}) - \frac{2}{5}\log_2(\frac{2}{5})]
                      = 0.694 bits
Hence, the gain in information from such a partitioning would be
    Gain(Outlook) = Info(D) - Info_{Outlook}(D) = 0.940 - 0.694 = 0.246 bits
Similarly,
    Info_{Humidity}(D) = \frac{7}{14} \times [-\frac{3}{7}\log_2(\frac{3}{7}) - \frac{4}{7}\log_2(\frac{4}{7})]
                       + \frac{7}{14} \times [-\frac{6}{7}\log_2(\frac{6}{7}) - \frac{1}{7}\log_2(\frac{1}{7})]
                       = 0.787 bits
Hence, the gain in information from such a partitioning would be
    Gain(Humidity) = Info(D) - Info_{Humidity}(D) = 0.940 - 0.787 = 0.153 bits
In a similar way, we can compute
Gain(Temperature) = 0.031 bits
Gain(Wind) = 0.048 bits
Because Outlook has the highest information gain (0.246) among the attributes, it is selected as
the splitting attribute. Node N is labeled with Outlook, and branches are grown for each of the
attribute’s values. The tuples are then partitioned accordingly. So, initially our decision tree will
look like:
Notice that the tuples falling into the partition for Outlook = overcast all belong to the same class.
Because they all belong to class “yes,” a leaf should therefore be created at the end of this branch
and labeled with “yes”.
As we can see, for Outlook being sunny, there are 2 yes and 3 no. So we have to further split the
decision tree. To do so, we need to compute information gain of attributes for the sub table on the
left using same methods as above.
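So, for the five tuples with Outlook = sunny,
    Info(D) = -[\frac{2}{5}\log_2(\frac{2}{5}) + \frac{3}{5}\log_2(\frac{3}{5})] = 0.971 bits
The expected information needed to classify a tuple in this D if the tuples are partitioned
according to Temperature is
    Info_{Temperature}(D) = \frac{2}{5} \times [-\frac{0}{2}\log_2(\frac{0}{2}) - \frac{2}{2}\log_2(\frac{2}{2})]
                          + \frac{2}{5} \times [-\frac{1}{2}\log_2(\frac{1}{2}) - \frac{1}{2}\log_2(\frac{1}{2})]
                          + \frac{1}{5} \times [-\frac{1}{1}\log_2(\frac{1}{1}) - \frac{0}{1}\log_2(\frac{0}{1})]
                          = 0.400 bits
Hence, Gain(Temperature) = 0.971 - 0.400 = 0.571 bits. Similarly,
Gain(Humidity) = 0.971 bits
Gain(Wind) = 0.020 bits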
So, we select Humidity as the splitting criterion. In a similar way, we can compute the information
gain for all attributes in the right sub table, where Wind will be the splitting criterion. So our
final decision tree will be as follows:
b. Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to
select attributes having a large number of values. C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of
normalization to information gain using a “split information” value defined analogously with
Info(D) as
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
This value represents the potential information generated by splitting the training data set, D, into
v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome,
it considers the number of tuples having that outcome with respect to the total number of tuples in
D. It differs from information gain, which measures the information with respect to classification
that is acquired based on the same partitioning. The gain ratio is defined as
    GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
The attribute with the maximum gain ratio is selected as the splitting attribute.
For example
A test on Temperature splits the data of Table 1 into three partitions, namely cool, mild, and hot,
containing four, six, and four tuples, respectively. To compute the gain ratio of Temperature, we
first compute
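    SplitInfo_{Temperature}(D) = -[\frac{4}{14}\log_2(\frac{4}{14}) + \frac{6}{14}\log_2(\frac{6}{14}) + \frac{4}{14}\log_2(\frac{4}{14})] = 1.557 bits
From the example above, we have Gain(Temperature) = 0.031 bits.
Therefore, GainRatio(Temperature) = 0.031/1.557 = 0.020. Being a ratio of two quantities in bits,
the gain ratio has no unit.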
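The split information and gain ratio can be checked with a short sketch. The function names are illustrative; the partition sizes (4, 6, 4) and Gain(Temperature) = 0.031 are taken from the example above.

```python
from math import log2


def split_info(partition_sizes):
    """Potential information generated by splitting D into partitions."""
    total = sum(partition_sizes)
    return -sum(n / total * log2(n / total)
                for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """Information gain normalized by the split information."""
    return gain / split_info(partition_sizes)


sizes = [4, 6, 4]                       # cool, mild, hot
print(round(split_info(sizes), 3))      # about 1.557
print(round(gain_ratio(0.031, sizes), 3))
```

Because the split information grows with the number of partitions, an attribute with many distinct values is penalized relative to plain information gain.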
c. Gini Index
The Gini index is used in CART. Using the notation described above, the Gini index measures the
impurity of D, a data partition or set of training tuples, as
    Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|.
The sum is computed over m classes.
The Gini index considers a binary split for each attribute. Let’s first consider the case where A is
a discrete-valued attribute having v distinct values, {a1, a2, . . . , av}, occurring in D. To determine
the best binary split on A, we examine all of the possible subsets that can be formed using known
values of A.
When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition. For example, if a binary split on A partitions D into D1 and D2, the gini index of D given
that partitioning is
    Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
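For each attribute, each of the possible binary splits is considered. The reduction in impurity that
would be incurred by a binary split on a discrete- or continuous-valued attribute A is
    \Delta Gini(A) = Gini(D) - Gini_A(D)
The attribute that maximizes this reduction in impurity (equivalently, that has the minimum Gini
index) is selected as the splitting attribute.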
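A small sketch of the Gini computations for one binary split follows. The function names are illustrative. The split shown, Outlook in {overcast} versus Outlook in {sunny, rain}, is chosen here only as an example, using the class counts of the weather data: (4 yes, 0 no) against (5 yes, 5 no).

```python
def gini(counts):
    """Gini impurity of a class distribution."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(d1_counts, d2_counts):
    """Weighted Gini index of a binary split into D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return n1 / total * gini(d1_counts) + n2 / total * gini(d2_counts)


def gini_reduction(d1_counts, d2_counts):
    """Reduction in impurity achieved by the binary split."""
    parent = [a + b for a, b in zip(d1_counts, d2_counts)]
    return gini(parent) - gini_split(d1_counts, d2_counts)


overcast, sunny_or_rain = [4, 0], [5, 5]   # [yes, no] counts
print(round(gini([9, 5]), 3))
print(round(gini_split(overcast, sunny_or_rain), 3))
print(round(gini_reduction(overcast, sunny_or_rain), 3))
```

To pick the best binary split on an attribute, the same computation would be repeated for every candidate subset of its values.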
-------------Refer Text Book------------------
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due
to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such
methods typically use statistical measures to remove the least reliable branches. The pruned trees
are smaller and less complex. The dual goal of pruning is reduced complexity of the final classifier
as well as better predictive accuracy by the reduction of overfitting and removal of sections of a
classifier that may be based on noisy data. A tree that is too large risks overfitting the training data
and poorly generalizing to new samples.
A small tree might not capture important structural information about the sample space. But it is
hard to tell when a tree algorithm should stop because it is impossible to tell if the addition of a
single extra node will dramatically decrease error. A common strategy is to grow the tree until
each node contains a small number of instances then use pruning to remove nodes that do not
provide additional information.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured
by a test set or using cross-validation.
Approaches:
i Pre pruning
In the pre pruning approach, a tree is “pruned” by halting its construction early (e.g. by deciding
not to further split or partition the subset of training tuples at a given node). Upon halting, the node
becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the
probability distribution of those tuples. When constructing a tree, measures such as statistical
significance, information gain, Gini index, and so on can be used to assess the goodness of a split.
If partitioning the tuples at a node would result in a split that falls below a pre specified threshold,
then further partitioning of the given subset is halted. There are difficulties, however, in choosing
an appropriate threshold. High thresholds could result in oversimplified trees, whereas low
thresholds could result in very little simplification.
ii Post pruning
Post pruning removes subtrees from a “fully grown” tree. A subtree at a given node is pruned by
removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class
among the subtree being replaced. Post pruning can be done by computing the cost complexity of a
subtree. For each internal node N, the cost complexity of the subtree rooted at N is compared with
the cost complexity that would result if the subtree were replaced by a leaf node. If pruning
results in a lower cost complexity, the subtree is pruned; otherwise it is kept.
n_covers = 2 [the antecedent is true for 2 tuples, i.e., two tuples have age = youth as well as
student = yes]
n_correct = 2 [both the antecedent and the consequent are true for 2 tuples]
So,
    coverage(R) = \frac{n_{covers}}{|D|} = \frac{2}{14} = 14.29%
And,
    accuracy(R) = \frac{n_{correct}}{n_{covers}} = \frac{2}{2} = 100%
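Coverage and accuracy of a rule can be computed with a sketch like the one below. The 14-tuple data set is hypothetical and built only so that two tuples satisfy age = youth and student = yes; the consequent buys_computer = yes is likewise an assumption made for illustration.

```python
def coverage_and_accuracy(rows, antecedent, consequent):
    """Return (coverage, accuracy) of the rule IF antecedent THEN consequent."""
    # Tuples covered by the rule: every condition of the antecedent holds.
    covered = [r for r in rows
               if all(r[attr] == value for attr, value in antecedent.items())]
    attr, value = consequent
    # Covered tuples for which the predicted class is also correct.
    correct = [r for r in covered if r[attr] == value]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical data: 2 tuples match the antecedent, 12 do not.
rows = ([{"age": "youth", "student": "yes", "buys_computer": "yes"}] * 2
        + [{"age": "senior", "student": "no", "buys_computer": "no"}] * 12)

cov, acc = coverage_and_accuracy(
    rows, {"age": "youth", "student": "yes"}, ("buys_computer", "yes"))
print(f"coverage = {cov:.2%}, accuracy = {acc:.0%}")
```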
How does a rule based classifier work?
If a rule is satisfied by a test tuple X, the rule is said to be triggered. Triggering, however,
does not necessarily lead to the rule being fired. If more than one rule is triggered, we have a
potential problem: what if they each specify a different class? And what if no rule is satisfied
by X? Three cases may therefore arise:
Case I
If only one rule is satisfied, then the rule fires by returning the class prediction for X.
Case II
If more than one rule is triggered, we need a conflict resolution strategy to figure out which rule
gets to fire and assign its class prediction to X. There are many possible strategies. Rule ordering
or rule ranking or rule priority can be set in case of rules conflict. A rule ordering may be class-
based or rule-based.
Rule-based ordering: Individual rules are ranked based on their quality i.e. according to
accuracy, coverage etc.
Class-based ordering: Rules that belong to the same class appear together. The classes are
sorted in decreasing order of importance.
When rule-based ordering is used, the rule set is known as a decision list.
Case III
If a tuple does not trigger any rule, a default class is used for classification. The default class
is usually the most frequent class in the training data.
For the decision tree above, there are five possible rules which can be extracted (because there are
five leaf nodes). They are as follows:
R1: IF Outlook = sunny AND Humidity = High THEN Play_Tennis = no
R2: IF Outlook = sunny AND Humidity = Normal THEN Play_Tennis = yes
R3: IF Outlook = Overcast THEN Play_Tennis = yes
R4: IF Outlook = Rain AND Wind = Weak THEN Play_Tennis = yes
R5: IF Outlook = Rain AND Wind = Strong THEN Play_Tennis = no
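A rule-based classifier built from rules R1 to R5 can be sketched as an ordered list in which the first triggered rule fires. The lowercase attribute values and the choice of "yes" as the default class are assumptions made for this sketch.

```python
# Each rule is (antecedent, predicted class), kept in priority order.
RULES = [
    ({"Outlook": "sunny", "Humidity": "high"}, "no"),     # R1
    ({"Outlook": "sunny", "Humidity": "normal"}, "yes"),  # R2
    ({"Outlook": "overcast"}, "yes"),                     # R3
    ({"Outlook": "rain", "Wind": "weak"}, "yes"),         # R4
    ({"Outlook": "rain", "Wind": "strong"}, "no"),        # R5
]
DEFAULT_CLASS = "yes"  # assumption: taken to be the most frequent class


def classify(tuple_x, rules=RULES, default=DEFAULT_CLASS):
    """Fire the first rule whose antecedent is satisfied by the tuple."""
    for antecedent, label in rules:
        if all(tuple_x.get(attr) == value
               for attr, value in antecedent.items()):
            return label
    return default  # no rule was triggered


print(classify({"Outlook": "sunny", "Humidity": "high", "Wind": "weak"}))
print(classify({"Outlook": "foggy"}))  # falls through to the default class
```

Rules extracted from a decision tree are mutually exclusive, so at most one of them is triggered here; the ordering only matters for rule sets where several rules can match the same tuple.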
The k-nearest neighbor (k-NN) classifier is an example of instance-based learning, in which the
training data set is stored, so that a classification for a new unclassified record may be found
simply by comparing it to the most similar records in the training set.
A nearest neighbor classifier requires:
A set of stored records.
A distance metric to compute the distance between records. Any standard measure, such as
Euclidean distance, can be used.
The value of k, the number of nearest neighbors to retrieve.
To classify an unknown record:
Compute its distance to every training record.
Identify the k nearest neighbors.
Use the class labels of the nearest neighbors to determine the class label of the unknown
record. In case of conflict, use the majority vote for classification.
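The steps above can be sketched as follows, using Euclidean distance and a majority vote. The training records and the choice k = 3 are hypothetical; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest records."""
    # Steps 1 and 2: compute distances and keep the k closest records.
    neighbors = sorted(training, key=lambda rec: dist(rec[0], query))[:k]
    # Step 3: the most common class label among the neighbors wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training = [  # hypothetical (attribute vector, class label) records
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 5.0)))
```

Every query scans the whole training set, which is why the method becomes computationally expensive for large data sets.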
For Example,
Attributes may have to be scaled to prevent the distance measure from being dominated by one
of the attributes, e.g., height or temperature.
iii. Distance computing for non-numeric data.
iv. Missing values
Disadvantages
i. Poor accuracy when data have noise and irrelevant attributes
ii. Computationally expensive
For example, suppose our world of data tuples is confined to customers described by the attributes
age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H
is the hypothesis that our customer will buy a computer. Then,
P(H|X) → the probability that customer X will buy a computer given that we know the customer’s
age and income. It is the posterior probability, or a posteriori probability, of H conditioned on X
P(X|H) → the probability that a customer, X, is 35 years old and earns $40,000, given that we
know the customer will buy a computer. It is the posterior probability of X conditioned on H.
P(H) → the probability that any given customer will buy a computer, regardless of age, income,
or any other information. It is the prior probability, or a priori probability, of H.
P(X) → the probability that a person from our set of customers is 35 years old and earns $40,000.
It is the prior probability of X.
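These four quantities are related by Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X). A minimal numeric sketch follows; the three probabilities are hypothetical values chosen only for illustration.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x


# Hypothetical values for the customer example.
p_x_given_h = 0.10  # P(X|H): a buyer is 35 years old and earns $40,000
p_h = 0.60          # P(H): any customer buys a computer
p_x = 0.08          # P(X): any customer is 35 years old and earns $40,000

print(round(posterior(p_x_given_h, p_h, p_x), 2))  # P(H|X)
```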
An Artificial Neural Network (ANN) is a massively parallel distributed processor made up of simple
processing units. It has the ability to learn experiential knowledge, expressed through inter-unit
connection strengths, and can make such knowledge available for use.
An ANN is a very basic attempt to imitate the type of nonlinear learning that occurs in nature.
The inputs (x_i) are collected from upstream neurons (or the data set) and combined through a
combination function such as summation (Σ), which is then input into a (usually nonlinear)
activation function to produce an output response (y), which is then channeled downstream to other
neurons.
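A single artificial neuron of this kind can be sketched as a weighted sum followed by an activation function. The sigmoid activation, the optional bias term, and the sample inputs and weights are assumptions made for this sketch.

```python
from math import exp


def sigmoid(x):
    """A commonly used nonlinear activation function."""
    return 1.0 / (1.0 + exp(-x))


def neuron_output(inputs, weights, bias=0.0):
    """Combine the inputs by weighted summation, then apply the activation."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)


# Hypothetical inputs from three upstream neurons and their weights.
print(neuron_output([1.0, 0.0, 1.0], [0.2, 0.4, -0.2]))
```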
Backpropagation learns by iteratively processing a data set of training tuples, comparing the
network’s prediction for each tuple with the actual known target value. For each training tuple,
the weights are modified so as to minimize the mean squared error between the network’s
prediction and the actual target value. These modifications are made in the “backwards” direction,
that is, from the output layer, through each hidden layer down to the first hidden layer (hence the
name backpropagation). Although it is not guaranteed, in general the weights will eventually
converge, and the learning process stops.
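One backpropagation step for a tiny network (two inputs, two hidden units, one output unit) might be sketched as below. Several details are assumptions rather than statements from these notes: sigmoid units without bias terms, the hidden-unit error taken as O_j(1 - O_j) times the output error weighted by the connecting weight, the update rule w = w + rate * Err_j * O_i, and the initial weights and learning rate.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def forward(x, w_hidden, w_output):
    """Propagate the inputs forward; return hidden outputs and the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_output, rate=0.5):
    """Update all weights once for a single training tuple."""
    hidden, output = forward(x, w_hidden, w_output)
    # Error of the output unit: O(1 - O)(T - O).
    err_out = output * (1 - output) * (target - output)
    # Error of each hidden unit, propagated backwards through its weight.
    err_hidden = [h * (1 - h) * err_out * w
                  for h, w in zip(hidden, w_output)]
    new_w_output = [w + rate * err_out * h
                    for w, h in zip(w_output, hidden)]
    new_w_hidden = [[w + rate * eh * xi for w, xi in zip(ws, x)]
                    for ws, eh in zip(w_hidden, err_hidden)]
    return new_w_hidden, new_w_output


# Hypothetical initial weights; w_hidden[j] holds the weights into hidden unit j.
w_hidden, w_output = [[0.2, -0.3], [0.4, 0.1]], [-0.3, -0.2]
x, target = [1.0, 0.0], 1.0
for _ in range(50):
    w_hidden, w_output = train_step(x, target, w_hidden, w_output)
print(forward(x, w_hidden, w_output)[1])  # prediction moves toward the target
```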
Before training, the network topology and training parameters must be decided:
i. Number of input nodes/units: Depends upon the number of independent variables in the data
set.
ii. Number of hidden layers: Generally only one hidden layer is used for most problems. Two
layers can be used for complex problems. The number of nodes in the hidden layer can be
adjusted iteratively.
iii. Number of output nodes/units: Depends upon the number of class labels in the data set.
iv. Learning rate: Can be adjusted iteratively.
v. Learning algorithm: Any appropriate learning algorithm can be selected for the training phase.
vi. Bias value: Can be adjusted iteratively.
Algorithm
1. Initialize the weights and inputs
2. Calculate the outputs as
For a unit j in the input layer,
    O_j = I_j
For a unit j in a hidden or output layer,
    I_j = \sum_{i} W_{ij} O_i
    O_j = \frac{1}{1 + e^{-I_j}}
3. Calculate the error as
For output layer
    Err_j = O_j (1 - O_j)(T_j - O_j)
For hidden layer
Advantages
i. High tolerance of noisy data
ii. Classify patterns on which they have not been trained
iii. Can be used in various applications such as handwriting recognition, image classification, text
narration etc.
iv. Parallelization can be implemented
Disadvantages
i. Require long training times
ii. Require a number of parameters whose best values are not known in advance and are typically
determined empirically
iii. Difficult to interpret the meaning of the weights and hidden units
Underfitting:
It refers to a model that can neither model the training data nor generalize to new data. An underfit
machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data. Underfitting is often not discussed as it is easy to detect given a
good performance metric. The remedy is to move on and try alternate machine learning algorithms.
Validation
Validation is the process of evaluating a trained model on data held out from the training dataset.
It is commonly done with a resampling technique called cross-validation.
is trained on subsets D1, D3, . . . , Dk and tested on D2; and so on. Unlike the holdout and random
subsampling methods above, here, each sample is used the same number of times for training and
once for testing. For classification, the accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total number of tuples in the initial data.
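A k-fold cross-validation loop can be sketched as follows. The helper names and the `train_fn`/`predict_fn` interface are assumptions of this sketch, and the data is assumed to have been shuffled beforehand so that each fold is representative.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_accuracy(data, k, train_fn, predict_fn):
    """Correct classifications over all k iterations / total tuples."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        correct += sum(predict_fn(model, data[i][0]) == data[i][1]
                       for i in test_idx)
    return correct / len(data)


for train_idx, test_idx in k_fold_indices(10, 3):
    print("test fold:", test_idx)
```

Each tuple appears in exactly one test fold and in the training set of the other k - 1 iterations.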
Model Comparison
Models can be evaluated based on their output using different methods:
i. Confusion Matrix
ii. ROC Analysis
Confusion Matrix
A confusion matrix, sometimes called a classification matrix, is used to assess the prediction
accuracy of a model. It measures whether a model is confused or not, that is, whether the model is
making mistakes in its predictions or not.
In the two-class case with classes yes and no, buys computer or not, plays golf or not and so on, a
single prediction has the four different possible outcomes shown in Table 3. Given m classes, a
confusion matrix is a table of at least size m by m. An entry CM_{i,j} in the first m rows and m columns
indicates the number of tuples of class i that were labeled by the classifier as class j.
                         Predicted Class
                         Yes                    No                     Total
Actual Class    Yes      True Positive (TP)     False Negative (FN)    P
                No       False Positive (FP)    True Negative (TN)     N
                Total    P'                     N'                     P + N
Table 3: Confusion Matrix
True positives (TP) are the positive tuples that were correctly labeled by the classifier.
True negatives (TN) are the negative tuples that were correctly labeled by the classifier.
A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when
it is actually no (negative). e.g., tuples of class buys_computer = no for which the classifier
predicted buys_computer = yes
A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is
actually positive. e.g., tuples of class buys_computer = yes for which the classifier predicted
buys_computer = no
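The four outcomes can be counted with a short sketch. The function name and the small lists of actual and predicted labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, TN, FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn


actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))  # (TP, FP, TN, FN)
```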
Accuracy is not always the best measure of the quality of a classification model. This is
especially true for real-world problems where the distribution of classes is unbalanced. Consider,
for example, the problem of distinguishing healthy persons from those with a disease. In many
cases the medical database used for training and testing will contain mostly healthy persons (99%)
and only a small percentage of people with the disease (about 1%). A classifier that labels every
person as healthy would be 99% accurate on such data while failing to detect a single case of the
disease. Therefore, we need other measures of model quality. In practice, several measures have
been developed, and some of the best known are as follows:
    Accuracy = \frac{TP + TN}{P + N}
    Error Rate = \frac{FP + FN}{P + N}
    Precision = \frac{TP}{TP + FP}
    Recall = \frac{TP}{TP + FN}
    Specificity = \frac{TN}{FP + TN}
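These measures can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name is illustrative, and the counts describe a hypothetical set of 1000 persons, 10 of them with the disease, all labeled healthy by the classifier.

```python
def metrics(tp, fp, tn, fn):
    """Evaluation measures derived from the confusion-matrix counts."""
    p, n = tp + fn, fp + tn   # actual positives and actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        # Guard against division by zero when a denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / p if p else 0.0,
        "specificity": tn / n if n else 0.0,
    }


# Every person is predicted healthy (negative): no positives are found.
for name, value in metrics(tp=0, fp=0, tn=990, fn=10).items():
    print(f"{name}: {value:.2f}")
```

The accuracy is high even though recall is zero, which is why accuracy alone can be misleading on unbalanced data.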
On the ROC curve, we move right and plot a point. This process is repeated for each of the test
tuples, each time moving up on the curve for a true positive or toward the right for a false positive.
To assess the accuracy of a model, we can measure the area under the curve. The closer the area
is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an
area of 1.0.
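The construction of the curve and the area under it can be sketched as follows. The scored test tuples are hypothetical, scores are assumed to be distinct, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scored):
    """scored: list of (score, is_positive). Returns (FPR, TPR) points."""
    positives = sum(1 for _, is_pos in scored if is_pos)
    negatives = len(scored) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Visit tuples from the highest to the lowest score.
    for _, is_pos in sorted(scored, key=lambda s: s[0], reverse=True):
        if is_pos:
            tp += 1   # true positive: move up
        else:
            fp += 1   # false positive: move right
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scored = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(roc_points(scored))
print(auc(roc_points(scored)))
```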