UNIT-III
Basic concepts:
What is Classification?
Classification is a supervised learning technique used to categorize data into predefined classes
or labels.
(or)
In other words, Classification is a data mining technique that categorizes items in a collection
based on some predefined properties.
Examples:
1. A bank loans officer needs analysis of her data to learn which loan applicants are “safe”
and which are “risky” for the bank.
2. A marketing manager needs data analysis to help guess whether a customer with a given
profile will buy a new computer.
3. A medical researcher wants to analyse breast cancer data to predict which one of three
specific treatments a patient should receive.
In each of these examples, the data analysis task is classification, where a model or classifier is
constructed to predict class labels, such as
- "safe" or "risky" for the loan application data;
- "yes" or "no" for the marketing data;
- "treatment A," "treatment B," or "treatment C" for the medical data.
These categories/class labels can be represented by discrete values, where the ordering among
values has no meaning.
How does classification work?
Data classification is a two-step process, consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels for
given data).
- Learning step (or training phase):
In this step, a classification algorithm builds the classifier/model by analyzing, or
"learning from," a training dataset and its associated class labels. The learned/trained
model or classifier is represented in the form of classification rules.
- Classification step:
In this step, test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the classification of new data
tuples; i.e., the classifier/model is used to predict class labels for new data.
Example:
A bank loans officer needs analysis of her data to learn which loan applicants are "safe" and
which are "risky" for the bank. In the learning step, a classifier is built from the class-labelled
loan application data; in the classification step, the classifier is used to label new applications.
The accuracy of a classifier on a given test data set is the percentage of test set tuples that are
correctly classified by the classifier. The associated class label of each test tuple is compared with
the learned classifier's class prediction for that tuple.
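As a concrete sketch of the two steps (assuming Python with scikit-learn available; the tiny loan-applicant dataset below is made up for illustration), the learning step fits a classifier on training tuples, and the classification step estimates its accuracy on held-out test tuples before labelling new data:

    # Hypothetical illustration of the two-step classification process
    # (assumes scikit-learn is installed; the tiny dataset below is made up).
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Illustrative loan-applicant tuples: [age, income]; labels: "safe" / "risky".
    X = [[25, 30000], [40, 80000], [35, 60000], [22, 20000],
         [50, 90000], [28, 25000], [45, 70000], [30, 40000]]
    y = ["risky", "safe", "safe", "risky", "safe", "risky", "safe", "risky"]

    # Learning step: build the classifier from the training tuples.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Classification step: estimate accuracy on test tuples, then label new data.
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    print("New applicant:", model.predict([[33, 55000]])[0])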
Decision Tree:
"What is a decision tree?" A decision tree is a flowchart-like tree structure, where each internal
node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. It
represents the best attribute selected for classification.
Decision Tree is a data mining technique that is used to build classification models. It
builds classification models in the form of a tree-like structure, just as its name suggests. This type of
mining belongs to supervised learning.
In supervised learning, the target result is already known. Decision trees can be used for both
categorical and numerical data. Categorical data include gender, marital status, etc., while
numerical data include age, temperature, etc.
Example:
The following is a typical decision tree. It represents the concept buys_computer; that is, it predicts
whether a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by
rectangles, and leaf nodes are denoted by ovals.
"How are decision trees used for classification?" Given a tuple, X, for which the associated class
label is unknown, the attribute values of the tuple are tested against the decision tree. A path is
traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees
can easily be converted to classification rules.
Example: According to the decision tree provided above, the tuple (age = senior, credit_rating =
excellent) is associated with the class label (buys_computer = yes).
"Why are decision tree classifiers so popular?" The construction of decision tree classifiers does
not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery. Decision trees can handle multidimensional data.
Decision tree induction algorithms have been used for classification in many application areas such
as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Decision Tree Induction:
In 1980, J. Ross Quinlan, a researcher in machine learning, created the ID3 (Iterative
Dichotomiser) algorithm, which is a decision tree algorithm. Later on, he introduced C4.5, which
was the successor of ID3. Both ID3 and C4.5 use a greedy approach, where the trees are
constructed in a top-down recursive divide-and-conquer manner without any backtracking.
Generate_decision_tree: Generate a decision tree from the training tuples of data partition, D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, which describes the method for selecting the attribute that best
discriminates among the tuples.
Method:
Step-1: Begin the tree with the root node, say R, which contains the complete dataset D.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide R into subsets, one for each possible value of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; such a
final node is called a leaf node. (A minimal code sketch of this procedure is given below.)
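A minimal sketch of this top-down, recursive divide-and-conquer procedure in Python (the helper names and the information-gain-based attribute selection are illustrative choices, not the exact textbook pseudocode):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def best_attribute(rows, labels, attributes):
        """Pick the attribute with the highest information gain (the ASM of Step-2)."""
        base = entropy(labels)
        def gain(attr):
            expected = 0.0
            for value in set(row[attr] for row in rows):
                subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
                expected += len(subset) / len(labels) * entropy(subset)
            return base - expected
        return max(attributes, key=gain)

    def build_tree(rows, labels, attributes):
        """Steps 1-5: recursively grow the tree; leaves hold class labels."""
        if len(set(labels)) == 1:            # pure partition -> leaf node
            return labels[0]
        if not attributes:                   # no attributes left -> majority class
            return Counter(labels).most_common(1)[0][0]
        attr = best_attribute(rows, labels, attributes)
        node = {attr: {}}
        for value in set(row[attr] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[attr] == value]
            node[attr][value] = build_tree([rows[i] for i in idx],
                                           [labels[i] for i in idx],
                                           [a for a in attributes if a != attr])
        return node

The returned nested dictionary mirrors the tree structure: each internal node is an attribute test and each leaf is a class label.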
Example: Suppose there is a person who has a job offer and wants to decide whether he should
accept or decline the offer. Solve this problem using a decision tree.
Input:
attribute_list: Salary, Distance and Cab_facility.
class_labels: Accept, Decline.
Method:
The decision tree starts with the root node (Salary attribute).
The root node splits further into the next decision node (Distance attribute) and one leaf
node, based on the corresponding labels.
The next decision node further splits into one decision node (Cab_facility attribute) and
one leaf node.
Finally, the decision node (Cab_facility attribute) splits into two leaf nodes (Accept and
Decline).
Output:
The below diagram is the decision tree:
The algorithm calls the Attribute Selection Method to determine the splitting criterion. The splitting
criterion tells us which attribute to test at node N by determining the "best" way to separate or
partition the tuples in D into individual classes.
There are three possibilities for partitioning tuples based on the splitting criterion.
(a) If A is discrete-valued, then one branch is grown for each known value of A.
Example:
(b) If A is continuous-valued, then two branches are grown, corresponding to A <= split_point
and A > split_point.
Example:
(c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ SA,
where SA is the splitting subset for A. (A small code sketch of the three test forms is given below.)
Example:
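As a small illustration of the three test forms (the attribute names, the split point, and the subset SA below are made-up values):

    # The three possible node tests for a splitting attribute A
    # (attribute names and split values below are illustrative only).

    def test_discrete(t):
        # (a) A is discrete-valued: one branch is grown per known value of A.
        return t["income"]                      # branches: "low", "medium", "high"

    def test_continuous(t, split_point=30):
        # (b) A is continuous-valued: two branches, A <= split_point and A > split_point.
        return t["age"] <= split_point

    def test_binary_subset(t, SA=frozenset({"low", "medium"})):
        # (c) A is discrete-valued and a binary tree is required: test of the form A in SA.
        return t["income"] in SA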
Attribute Selection Measures:
When implementing a Decision tree, the primary challenge is to choose the most suitable attribute
for the root node and sub-nodes. However, there is a technique known as Attribute selection
measure that can help resolve such issues. With the help of this measurement, we can easily
determine the best attribute for the nodes of the tree.
Entropy
The calculation of entropy is performed for each attribute after every split, and the splitting
process is then continued. A lower entropy indicates a better model, as it implies that the
classes are split more effectively due to reduced uncertainty.
Entropy is a measure of impurity (or uncertainty) in the data. For a two-class problem it lies
between 0 and 1 and is calculated using the formula below.
For example, for a use case with 2 classes, Yes and No:
Entropy(D) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)
Here Yes and No are the two classes and D is the set of samples. P(Yes) is the
probability of Yes and P(No) is the probability of No.
If there are m classes, then the formula is
Entropy(D) = - Σ (i = 1 to m) pi log2(pi)
where pi is the probability that a sample in D belongs to class Ci.
Then the expected information required to classify a tuple from D based on attribute A is
Entropy_A(D) = Σ (j = 1 to v) ( |Dj| / |D| ) × Entropy(Dj)
The term |Dj| / |D| acts as the weight of the jth partition, and Entropy_A(D) is the expected
information required to classify a tuple from D based on the partitioning by A. The information
gain of attribute A is then
Gain(A) = Entropy(D) - Entropy_A(D)
and the attribute with the highest information gain is chosen as the splitting attribute.
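A short sketch of these formulas in Python (function names are chosen for illustration); it reproduces the values that appear in the buys_computer example that follows:

    import math

    def entropy_from_counts(counts):
        """Entropy(D) from the class counts in a partition, e.g. [9, 5]."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def expected_entropy(partitions):
        """Entropy_A(D): weighted entropy of the partitions produced by attribute A."""
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * entropy_from_counts(p) for p in partitions)

    # buys_computer: 9 "yes" and 5 "no" tuples overall.
    entropy_D = entropy_from_counts([9, 5])                    # ~0.940 bits
    # Partitions of D by age: youth (2 yes, 3 no), middle_aged (4, 0), senior (3, 2).
    entropy_age = expected_entropy([[2, 3], [4, 0], [3, 2]])   # ~0.694 bits
    gain_age = entropy_D - entropy_age                         # ~0.246 bits
    print(round(entropy_D, 3), round(entropy_age, 3), round(gain_age, 3))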
Let us illustrate the ID3 algorithm with a simple example of classifying whether a customer buys a
computer based on some conditions. (or) Construct a decision tree to classify "buys_computer."
In the given customer dataset, we have 5 attributes/columns: age, income, student,
credit_rating, and buys_computer.
The age, income, student, and credit_rating attributes are categorized as input features,
whereas buys_computer is categorized as the output/target feature.
Here, we are going to classify whether a customer buys a computer or not based on the input
features, and we will use the ID3 algorithm to build the decision tree.
Let us start the solution:
First, we need to count the number of "yes" and "no" values for the output column/feature called
buys_computer. The given dataset has 2 classes, yes and no, so the Entropy for buys_computer is
Entropy(buys_computer) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
- P(yes) = count(yes) / (count(yes) + count(no)) = 9 / (9 + 5) = 9/14 for the "yes" component of buys_computer.
- P(no) = count(no) / (count(yes) + count(no)) = 5 / (9 + 5) = 5/14 for the "no" component.
So Entropy(buys_computer) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.
The counts of buys_computer for each value of age are:
age          | yes | no | total
youth        |  2  |  3 |   5
middle_aged  |  4  |  0 |   4
senior       |  3  |  2 |   5
total        |  9  |  5 |  14
Now we can calculate the Entropy for each value of the age column:
Entropy(youth) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Entropy(middle_aged) = -(4/4) log2(4/4) - 0 = 0
Entropy(senior) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
Note: If the Entropy is zero (0), the subset is categorized as a pure subset.
Once we have the Entropy for each value in the age column, we can find the Entropy of the
complete age column with respect to the buys_computer column:
Entropy_age(buys_computer) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694 bits
so Information Gain (buys_computer, age) = 0.940 - 0.694 = 0.246 bits.
Similarly, we can compute Information Gain (buys_computer, income) = 0.029 bits, Information
Gain (buys_computer, student) = 0.151 bits, and Information Gain (buys_computer, credit_rating)
= 0.048 bits.
The attribute age has the highest information gain and therefore becomes the splitting attribute at
the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are
partitioned accordingly.
Notice that the tuples falling into the partition for age = middle_aged all belong to the same class.
Because they all belong to class "yes," a leaf should therefore be created at the end of this branch
and labelled "yes."
Let’s continue with the next level. If the age is “youth” then the dataset is
In this partition, income, student, and credit_rating are categorized as input attributes and
buys_computer is categorized as the output attribute. There are 5 instances, among which 2 are yes
and 3 are no for the buys_computer attribute.
Now, we will calculate the Information Gain for all the attributes with respect to the target/class
attribute buys_computer. To calculate Information Gain, we first need to calculate the Entropy for
all the attributes.
Let's calculate the Entropy for the class/target attribute buys_computer in this partition:
Entropy(buys_computer) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits
Once we have the Entropy of the target variable, we find the Entropy for each
column/variable/attribute with respect to the target variable to find the most homogeneous split.
The counts of buys_computer for each value of income (within the age = youth partition) are:
income  | yes | no | total
low     |  1  |  0 |   1
medium  |  1  |  1 |   2
high    |  0  |  2 |   2
total   |  2  |  3 |   5
Now we can calculate the Entropy for each value of the income column:
Entropy(low) = 0, Entropy(medium) = 1, Entropy(high) = 0
Entropy_income(buys_computer) = (1/5)(0) + (2/5)(1) + (2/5)(0) = 0.4 bits
so Information Gain (buys_computer, income) = 0.971 - 0.4 = 0.571 bits.
Similarly, we can compute Information Gain (buys_computer, student) and Information Gain
(buys_computer, credit_rating); the student attribute gives the highest gain, so it becomes the
splitting attribute for the age = youth partition.
Notice that the tuples falling into the partition for student = yes all belong to the same class.
Because they all belong to class "yes," a leaf should therefore be created at the end of this branch
and labelled "yes." Likewise, the tuples in the partition for student = no all belong to the same
class, so a leaf is created at the end of that branch and labelled "no."
If we continue the same process for the age = senior partition, the attribute credit_rating has the
highest information gain and therefore becomes the splitting attribute at that node of the decision
tree. Branches are grown for each outcome of credit_rating, and the tuples are partitioned
accordingly.
Now we reach the leaf nodes of the decision tree; i.e., we have derived the class label for every
branch of the tree.
The final decision tree for the above dataset, built using the ID3 algorithm, is shown below.
Let us illustrate the ID3 algorithm with a simple example of classifying golf play based on weather
conditions. (or) Construct a decision tree to classify "golf play."
Bayesian Classification:
Bayesian classification is a statistical approach to data classification in data mining that uses
Bayes' theorem to make predictions about the class of a data point based on observed data.
The basic idea behind Bayesian classification is to assign a class label to a new data record based
on the probability that it belongs to a particular class.
The Bayesian classification is a powerful technique for probabilistic inference and decision-making
and is widely used in various applications such as medical diagnosis, spam classification, fraud
detection, etc.
Bayesian classification is based on Bayes' theorem; the simplest Bayesian classifier is known as the
naïve Bayesian classifier.
Bayes’ theorem:
Bayes’ theorem is named after Thomas Bayes, an 18th-century British mathematician who first
formulated it.
In data mining, Bayes' theorem is used to compute the probability of a hypothesis (such as a class
label) given some observed evidence (such as a set of attributes or features).
Bayes' theorem is a technique for predicting the class label of a new tuple/instance based on the
probabilities of the different class labels and the observed attributes/features of the instance.
Bayes' theorem involves the posterior probability P(H|X) and the prior probability P(H), where X
represents the data tuple and H represents a hypothesis.
P(H|X) = ( P(X|H) × P(H) ) / P(X)
Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by
measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H
holds given the observed data tuple X. In other words, we are looking for the probability that tuple
X belongs to class C, given that we know the attribute description of X.
Posterior probability:
For example, suppose we have data tuples from the customer database described by the
attributes age and income, and that X is a 30-year-old customer with an income of
$40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X)
represents the probability that customer X will purchase a computer, given the customer's age
and income information.
Prior probability:
For example, this is the probability that any given customer will buy a computer, regardless of
age, income, or any other information. That is, the prior probability P(H) is
independent of the data tuple X.
Similarly, P(X) is the prior probability of X. Using our example, it is the probability that a person
from our data set of customers is 30 years old and earns $40,000.
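A quick numeric sketch of the theorem (all three probabilities below are hypothetical, chosen only to show the arithmetic):

    # Hypothetical probabilities for the customer example (not taken from any dataset).
    p_x_given_h = 0.20   # P(X|H): chance a buyer is 30 years old and earns $40,000
    p_h = 0.60           # P(H): prior probability that a customer buys a computer
    p_x = 0.30           # P(X): prior probability of a 30-year-old, $40,000 customer

    p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
    print(p_h_given_x)                      # 0.4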
Naive Bayes Classifier Algorithm:
The Naive Bayes classifier applies Bayes' theorem under the assumption of class-conditional
independence: the effect of an attribute value on a given class is assumed to be independent of the
values of the other attributes.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other, each of these properties independently
contributes to the probability that this fruit is an apple, and that is why it is known as "Naive".
Example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we follow the steps below.
Problem: If the Weather is Sunny, should the Player play or not?
First, build a frequency table from the dataset:
Weather   | Yes | No
Overcast  |  5  |  0
Rainy     |  2  |  2
Sunny     |  3  |  2
Total     | 10  |  4
Next, generate the likelihood table by finding the probabilities of the given features as follows:
Weather   | P(Weather|Yes) | P(Weather|No) | P(Weather)
Overcast  | 5/10 = 0.50    | 0/4 = 0.00    | 5/14 = 0.35
Rainy     | 2/10 = 0.20    | 2/4 = 0.50    | 4/14 = 0.29
Sunny     | 3/10 = 0.30    | 2/4 = 0.50    | 5/14 = 0.35
Applying Bayes' theorem:
P(Yes|Sunny) = ( P(Sunny|Yes) × P(Yes) ) / P(Sunny)
Where,
P(Sunny|Yes) = 3/10 = 0.3, P(Yes) = 10/14 = 0.71, and P(Sunny) = 5/14 = 0.35
P(Yes|Sunny) = (0.3 × 0.71) / 0.35 = 0.60
P(No|Sunny) = ( P(Sunny|No) × P(No) ) / P(Sunny)
Where,
P(Sunny|No) = 2/4 = 0.5, P(No) = 4/14 = 0.29, and P(Sunny) = 5/14 = 0.35
P(No|Sunny) = (0.5 × 0.29) / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a Sunny day.
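A short Python sketch of this calculation, using the counts from the frequency table above (variable names are illustrative):

    # Frequency counts from the weather table: {weather: (count_yes, count_no)}.
    counts = {"Overcast": (5, 0), "Rainy": (2, 2), "Sunny": (3, 2)}
    total_yes, total_no = 10, 4
    total = total_yes + total_no

    def posterior(weather):
        """Return P(Yes|weather) and P(No|weather) via Bayes' theorem."""
        p_weather = sum(counts[weather]) / total         # e.g. P(Sunny) = 5/14
        p_w_given_yes = counts[weather][0] / total_yes   # e.g. P(Sunny|Yes) = 3/10
        p_w_given_no = counts[weather][1] / total_no     # e.g. P(Sunny|No) = 2/4
        p_yes = total_yes / total                        # P(Yes) = 10/14
        p_no = total_no / total                          # P(No)  = 4/14
        return (p_w_given_yes * p_yes / p_weather,
                p_w_given_no * p_no / p_weather)

    p_yes_sunny, p_no_sunny = posterior("Sunny")
    print(round(p_yes_sunny, 2), round(p_no_sunny, 2))   # 0.6 and 0.4 (0.41 with the rounded values above)
    print("Play" if p_yes_sunny > p_no_sunny else "Don't play")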
Exercise:
Problem: If the Weather is Sunny and the Humidity is High, should the Player play or not?
Solution:
If the data tuple (X) is (Sunny, High) and the hypothesis (H) is Yes, then the probability of Play is:
P(Yes|Sunny, High) = ( P(Sunny|Yes) × P(High|Yes) × P(Yes) ) / ( P(Sunny) × P(High) )
If the data tuple (X) is (Sunny, High) and the hypothesis (H) is No, then the probability of Play is:
P(No|Sunny, High) = ( P(Sunny|No) × P(High|No) × P(No) ) / ( P(Sunny) × P(High) )
Rule-Based Classification:
Rule-based classification in data mining is a technique in which the records are classified /
categorized by using a set of “IF…THEN…” rules.
Rules are a good way of representing information or bits of knowledge. A rule-based classifier
uses a set of IF-THEN rules for classification.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The
“THEN” part (or right side) is the rule consequent.
For example, consider rule R1:
R1: IF age = youth AND student = yes THEN buys_computer = yes
In the rule antecedent, the condition consists of one or more attribute tests (e.g., age = youth and
student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this
case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.
Coverage: A rule's coverage is the percentage of tuples in the data set that are covered by the rule.
Accuracy: A rule's accuracy is the percentage of the tuples it covers that are correctly classified by the rule.
Given a rule R and a class-labelled data set D, let ncovers be the number of tuples covered by R,
ncorrect be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We
can define the coverage and accuracy of rule R as
Coverage(R) = ncovers / |D|
Accuracy(R) = ncorrect / ncovers
Example: Rule accuracy and coverage. Let’s go back to our data set in Table-1. These are class
labelled tuples from the customer database. Our task is to predict whether a customer will buy a
computer.
Consider rule R1, which covers 2 of the 14 tuples. It can correctly classify both tuples. Therefore,
Coverage(R1) = 2/14 = 14.28% and Accuracy(R1) = 2/2 = 100%.
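A minimal sketch of these two measures in Python (the rule is represented by a simple condition function, and the two tuples shown merely stand in for the full class-labelled data set):

    def coverage_and_accuracy(rule, dataset):
        """Coverage(R) = n_covers / |D|; Accuracy(R) = n_correct / n_covers."""
        covered = [(x, label) for x, label in dataset if rule["condition"](x)]
        n_covers = len(covered)
        n_correct = sum(1 for _, label in covered if label == rule["consequent"])
        return n_covers / len(dataset), (n_correct / n_covers) if n_covers else 0.0

    # Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
    R1 = {"condition": lambda x: x["age"] == "youth" and x["student"] == "yes",
          "consequent": "yes"}

    # Two illustrative tuples standing in for a class-labelled data set D.
    D = [({"age": "youth", "student": "yes"}, "yes"),
         ({"age": "senior", "student": "no"}, "no")]
    print(coverage_and_accuracy(R1, D))   # (0.5, 1.0) for this shortened data set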
Rule Extraction from a Decision Tree:
Decision tree classifiers are a popular method of classification: it is easy to understand how
decision trees work, and they are known for their accuracy. However, decision trees can become
large and difficult to interpret.
Here, we look at how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree. In comparison with a decision tree, the IF-THEN rules may be easier for humans to
understand, particularly if the decision tree is very large.
To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.
Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF”
part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).
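As a sketch, assuming the nested-dictionary tree representation used in the induction sketch earlier (the branch values of the job-offer tree below are made up for illustration), rules can be collected by walking every root-to-leaf path:

    def extract_rules(tree, conditions=()):
        """Turn each root-to-leaf path of a nested-dict tree into an IF-THEN rule."""
        if not isinstance(tree, dict):                 # leaf node -> one finished rule
            antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
            return [f"IF {antecedent} THEN class = {tree}"]
        rules = []
        for attribute, branches in tree.items():
            for value, subtree in branches.items():
                rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
        return rules

    # Hypothetical job-offer tree (branch values are made up for illustration).
    tree = {"Salary": {"good": {"Distance": {"near": "Accept",
                                             "far": {"Cab_facility": {"yes": "Accept",
                                                                      "no": "Decline"}}}},
                       "low": "Decline"}}
    for rule in extract_rules(tree):
        print(rule)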
Lazy Learners vs. Eager Learners:
Eager learners, when given a set of training tuples, construct a classification model before
receiving new (e.g., test) tuples to classify. We can think of the model as being ready and eager to
classify previously unseen tuples.
A lazy learner simply stores the training tuples (or performs minimal processing) and waits until a
test tuple is encountered. Only when it sees the test tuple does it perform generalization to classify
the tuple based on its similarity to the stored training tuples.
Unlike eager learning methods, lazy learners do less work when a training tuple is presented and
more work when making a classification or numeric prediction for the test tuple.
K-Nearest Neighbors (K-NN) Algorithm:
The K-Nearest Neighbors (K-NN) algorithm is a popular lazy-learning algorithm used mostly for
solving classification problems.
The K-NN algorithm stores all the available data (training data) and classifies a new data point (test
data) based on similarity.
In other words, the K-NN algorithm compares a new data point to the data/values in a given data
set (with different classes or categories). Based on its closeness or similarity to a given number (K)
of neighbors, the algorithm assigns the new data point to a class or category in the training data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead, it stores the dataset and, at the time of classification, performs an action on it.
Suppose we have two categories, namely Category A and Category B, and a new data point, p1.
The question arises: to which category does this data point belong? To address this problem, we
can use the K-NN algorithm, which easily identifies the category or class of a particular data point.
How does the k-Nearest-Neighbor (K-NN) Algorithm work?
Step-1: Select a value for K (the number of neighbors).
Step-2: Calculate the distance from the new (test) tuple to every tuple in the training data.
Step-3: Take the K nearest neighbors according to the calculated distances.
Step-4: Among these K neighbors, count the number of tuples in each class.
Step-5: Assign the new tuple to the class with the maximum number of neighbors.
How to Compute the Distance? (or) Distance Measures Used in the KNN Algorithm:
The K-NN algorithm classifies or categorizes a new data point by calculating its distance from the
existing data points. To calculate the distance, we use one of the following distance measures.
Euclidean distance:
Euclidean distance can be simply explained as the ordinary distance between two points.
Mathematically, it computes the root of the sum of squared differences between the coordinates of
two objects.
The Euclidean distance between two points or tuples, say X = (x1, x2, . . . , xn) and Y = (y1, y2,
. . . , yn), is
dist(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + . . . + (xn - yn)^2 )
Manhattan distance:
The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the
distance between two real-valued vectors.
The Manhattan distance metric is generally used when we are interested in the total distance
travelled by the object instead of the displacement. This metric is calculated by summing the
absolute differences between the coordinates of the points in n dimensions.
The Manhattan distance between two points or tuples, say X = (x1, x2, . . . , xn) and Y = (y1, y2,
. . . , yn), is
dist(X, Y) = |x1 - y1| + |x2 - y2| + . . . + |xn - yn|
Example:
For the two points X = (1, 2) and Y = (3, 5):
The Euclidean distance between the two data points is
dist(X, Y) = sqrt( (1 - 3)^2 + (2 - 5)^2 ) = sqrt( (-2)^2 + (-3)^2 ) = sqrt(4 + 9) = sqrt(13) = 3.61
The Manhattan distance between the two data points is
dist(X, Y) = |1 - 3| + |2 - 5| = 2 + 3 = 5
Note:
Euclidean distance is the most popular distance measure to calculate the distance between data
points or tuples.
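Both measures are easy to express in Python; the sketch below reproduces the two small calculations above for X = (1, 2) and Y = (3, 5):

    import math

    def euclidean(x, y):
        """Root of the summed squared coordinate differences."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):
        """Sum of the absolute coordinate differences."""
        return sum(abs(a - b) for a, b in zip(x, y))

    X, Y = (1, 2), (3, 5)
    print(round(euclidean(X, Y), 2))   # 3.61
    print(manhattan(X, Y))             # 5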
Example:
Consider the following dataset of Height (in CM) and Weight (in KG) with class labels Normal and
Overweight. Apply the K-NN algorithm to find the class label for the test tuple (Height = 148,
Weight = 60).
Solution:
In the given dataset, we have two columns – Height (in CM) and Weight (in KG). Each row in the
dataset has a class of either Normal or Overweight.
Step-1:
To start working with K-NN algorithm, first we need to choose the value for K.
Let’s assume the value of K is 5.
Step-2:
In this step, we calculate the distance between the new (test) tuple and every tuple in the existing
(training) dataset, using the Euclidean distance:
distance = √( (X₂ - X₁)² + (Y₂ - Y₁)² )
Where:
- X₂ = New entry's Height (148).
- X₁= Existing entry's Height.
- Y₂ = New entry's Weight (60).
- Y₁ = Existing entry's Weight.
Let's start the calculation.
Distance #1
Distance #2
Like above we have to calculate the distance from all the other data entries/tuples.
After calculating the distances from all other existing data tuples, the updated table or dataset is
Step -3:
Since we chose 5 as the value of K, we'll only consider the first five rows.
Step -4:
In the above 5 nearest neighbors, 3 neighbors have Overweight as a class and 2 neighbours
have Normal as a class.
Step -5:
As you can see above, the class Overweight has the maximum number of neighbors. Therefore,
we'll classify the new entry/tuple as Overweight.
Advantages of the K-NN algorithm:
- It is simple to implement.
- No training is required before classification.
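A compact sketch of the whole procedure (Euclidean distance followed by a majority vote); the training tuples below are made up for illustration and are not the table from the worked example:

    import math
    from collections import Counter

    def knn_classify(test_tuple, training_data, k):
        """Classify test_tuple by a majority vote among its k nearest neighbours."""
        distances = sorted(
            (math.dist(test_tuple, features), label)
            for features, label in training_data
        )
        nearest_labels = [label for _, label in distances[:k]]
        return Counter(nearest_labels).most_common(1)[0][0]

    # Made-up (height cm, weight kg) tuples; NOT the table from the worked example.
    training_data = [((150, 61), "Overweight"), ((152, 55), "Normal"),
                     ((155, 62), "Overweight"), ((160, 58), "Overweight"),
                     ((163, 60), "Normal"), ((170, 63), "Normal"),
                     ((175, 64), "Normal")]
    # Prints "Overweight": 3 of the 5 nearest made-up tuples are Overweight.
    print(knn_classify((148, 60), training_data, k=5))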
Exercise:
Apply the K-NN algorithm to find the class label for the following test tuple.