Unit-2 Material (1)
22CS63
Unit 2
Overview
Classification
• Classification refers to the process of identifying which
category or class an object belongs to, based on its features.
• It is a type of supervised learning, where the algorithm is trained using labeled data.
• Applications :
o Disease Prediction (e.g., cancer, diabetes, heart disease)
o Medical Imaging Classification (e.g., tumor detection in X-rays, MRIs)
o Patient Risk Assessment (e.g., classifying high-risk patients)
o Spam Email Detection
o Sentiment Analysis of Customer Feedback or Reviews
o Credit Scoring (e.g., loan approval prediction)
o Fraud Detection (e.g., fraudulent transaction identification)
Supervised Learning
– Supervision: Supervision in machine learning (ML) refers to the process of
training a model using labeled data.
– In supervised learning, the algorithm learns from input-output pairs where the
output (also called the label or target) is known. The goal is for the model to learn
the mapping from inputs (features) to outputs to make predictions or
classifications on new, unseen data.
– Classification and numeric prediction are the two major types of prediction
problems.
– Regression analysis is a statistical methodology that is most often used for
numeric prediction. Ex: predict house prices based on features like size, bedrooms,
and location.
– Ranking is another type of numeric prediction, where the model predicts ordered values (i.e., ranks). For example, a web search engine (e.g., Google) ranks the relevant web pages for a given query, with higher-ranked web pages being more relevant to the query.
Classification
[Figure: classification workflow. Training instances are given to a learning algorithm to build a model; the model is then applied to test instances to predict the class label (Positive or Negative).]
Unsupervised learning (clustering)
Example of Loan Application –
Classification step
• The accuracy of a classifier on a given test set is the percentage of test tuples that are
correctly classified by the classifier.
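As a quick illustration of this accuracy measure, here is a minimal Python sketch; the loan-application labels are made up for illustration only.

```python
# A minimal sketch of the accuracy measure: the fraction of test tuples
# whose predicted class matches the true class.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical loan-application labels, for illustration only.
y_true = ["approve", "reject", "approve", "approve"]
y_pred = ["approve", "reject", "reject", "approve"]
print(accuracy(y_true, y_pred))  # 0.75 -> 75% of test tuples correctly classified
```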
Decision tree Induction - overview
• It is the learning of decision trees from class-labeled training tuples.
• It is a flowchart-like tree structure, where each internal node (nonleaf node)
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• Rectangles denote internal nodes and leaf nodes are denoted by ovals (or
circles).
• Some decision tree algorithms produce only binary trees, where each internal node branches to exactly two other nodes (e.g., CART), whereas others can produce nonbinary (multiway) trees (e.g., ID3, C4.5).
Learning of decision trees from
class-labeled training tuples
Decision tree construction: A top-down, recursive, divide-
and-conquer process
https://www.saedsayad.com/decision_tree.htm
Decision Tree Induction:
Algorithm
• The algorithms ID3 (Iterative Dichotomiser 3), C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner
• A Basic algorithm
– Tree is constructed in a top-down, recursive, divide-and-
conquer manner
– At the start, all the training examples are at the root
– Examples are partitioned recursively based on selected
attributes
– At each node, the splitting attribute is selected using a heuristic or statistical measure (e.g., information gain, gain ratio, Gini index) computed over the training examples at that node
Decision Tree Induction Algorithm
- stopping
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning
– There are no samples left
• Prediction
– Majority voting is employed for classifying the leaf
Decision tree algorithm
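A minimal Python sketch of the basic top-down, recursive, divide-and-conquer procedure described above (illustrative only; names such as build_tree, select_attribute, and majority_class are hypothetical, and the attribute-selection measure is supplied by the caller):

```python
from collections import Counter

def majority_class(rows, target):
    # Majority voting over the class labels of the tuples at this node.
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, select_attribute):
    """Top-down, recursive, divide-and-conquer decision tree induction.

    rows:       list of dicts mapping attribute name -> value (plus the target)
    attributes: list of candidate splitting attributes
    target:     name of the class-label attribute
    select_attribute(rows, attributes, target) -> best splitting attribute
    """
    labels = {row[target] for row in rows}
    if len(labels) == 1:                 # stopping: all tuples in the same class
        return labels.pop()
    if not attributes:                   # stopping: no attributes left -> majority vote
        return majority_class(rows, target)
    best = select_attribute(rows, attributes, target)    # e.g., max information gain
    node = {best: {}}
    for value in {row[best] for row in rows}:            # one branch per observed outcome
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target, select_attribute)
    return node
```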
Decision tree splitting criteria
How to Handle Continuous-Valued
Attributes?
• Method 1: Discretize continuous values and treat them as
categorical values
– E.g., age: < 20, 20..30, 30..40, 40..50, > 50
• Method 2: Determine the best split point for continuous-valued
attribute A
– Sort the values of A, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
– Each midpoint $(a_i + a_{i+1})/2$ of adjacent sorted values is a possible split point
• e.g., (15+18)/2 = 16.5, then 19.5, 21.5, 23, 24.5, 27, 30, …
– The point with the maximum information gain for A is
selected as the split point for A
• Split: Based on split point P
– The set of tuples in D satisfying A ≤ P vs. those with A > P
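A minimal sketch of Method 2 in Python, assuming a small list of values and class labels; the helper names and the sample ages are illustrative, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    # Info of a list of class labels: -sum p * log2(p).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, information_gain) for a continuous attribute.

    Candidate split points are the midpoints (a_i + a_{i+1}) / 2 of the sorted
    attribute values; the midpoint with maximum information gain is selected.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        point = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= point]
        right = [l for v, l in pairs if v > point]
        info = (len(left) / len(pairs)) * entropy(left) + \
               (len(right) / len(pairs)) * entropy(right)
        gain = base - info
        if gain > best[1]:
            best = (point, gain)
    return best

# Hypothetical ages and class labels, for illustration only.
print(best_split_point([15, 18, 21, 22, 24, 25, 29, 31],
                       ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]))
```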
Pros and Cons
Pros
1. Easy to Understand and Interpret: Decision trees are simple to understand and
visualize, making them very interpretable.
2. No Need for Feature Scaling: Decision trees do not require feature scaling (like
normalization or standardization) because they split the data based on conditions
rather than distances between values.
3. Handle Both Numerical and Categorical Data: Decision trees can handle both
types of data without the need for special preprocessing.
4. Non-Linear Relationships: They can model complex non-linear relationships
between the features and the target, unlike linear models (such as linear regression).
5. Works Well with Large Datasets: Decision trees perform relatively well with
large datasets and can capture hidden patterns in the data.
6. Can Capture Interactions Between Features: Decision trees can capture
interactions between different features, which can be very useful in some problems.
7. Automatic Feature Selection: They automatically perform feature selection,
identifying the most important variables for the prediction task.
Cons
1. Prone to Overfitting: Decision trees tend to overfit the training data, especially when the tree is too deep. Overfitting can lead to poor generalization to unseen data.
2. Instability: Decision trees can be very sensitive to small changes in the
data. A small variation in the dataset can result in a completely different
tree structure.
3. Biased Towards Features with More Levels: Decision trees can be biased
towards features with more levels (categories) because they tend to prefer
splits that involve features with more possible values.
4. Greedy Algorithm: Decision trees are built with a greedy algorithm that makes the locally optimal choice at each step, which does not necessarily lead to the globally optimal tree structure.
5. Hard to Optimize: Tuning hyperparameters such as tree depth, minimum
samples per leaf, etc., can be tricky, and a poor choice can greatly affect the
tree's performance.
Information Gain: An Attribute Selection
Measure
❑ ID3 uses information gain as its attribute selection measure (C4.5 uses an extension of it, the gain ratio, discussed later).
❑ Let node N represent or hold the tuples of partition D. The attribute with the highest information
gain is chosen as the splitting attribute for node N.
❑ The expected information needed to classify a tuple in D is given by

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

where $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$ and is estimated by $|C_{i,D}|/|D|$. A log function to the base 2 is used because the information is encoded in bits. Info(D) is also known as the entropy of D.
❑ Attribute A can split D into v partitions or subsets, {D1, D2,..., Dv}, where Dj contains those tuples
in D that have outcome aj of A. These partitions would correspond to the branches grown from
node N.
❑ The amount of information we still need (after the partitioning) to arrive at an exact classification is given by

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

The term $|D_j|/|D|$ acts as the weight of the jth partition. $Info_A(D)$ is the expected information required to classify a tuple from D based on the partitioning by A.
❑ The information gained by branching on A is $Gain(A) = Info(D) - Info_A(D)$.
Example: Compute Gain(outlook) for the Play Golf data set

Outlook    Temp  Humidity  Windy  Play Golf
Rainy      Hot   High      False  No
Rainy      Hot   High      True   No
Overcast   Hot   High      False  Yes
Sunny      Mild  High      False  Yes
Sunny      Cool  Normal    False  Yes
Sunny      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Rainy      Mild  High      False  No
Rainy      Cool  Normal    False  Yes
Sunny      Mild  Normal    False  Yes
Rainy      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Sunny      Mild  High      True   No

The expected information needed to classify a tuple in D, if the tuples are partitioned according to outlook, is computed from the class counts in each partition:

outlook    yes  no  Info(yes, no)
rainy      2    3   0.971
overcast   4    0   0
sunny      3    2   0.971

$Info_{outlook}(D) = \frac{5}{14} Info(2,3) + \frac{4}{14} Info(4,0) + \frac{5}{14} Info(3,2)$

The term $\frac{5}{14} Info(2,3)$ means that "outlook = rainy" covers 5 of the 14 samples, with 2 "Yes" and 3 "No". Hence,

$Gain(outlook) = Info(D) - Info_{outlook}(D) = 0.940 - 0.694 = 0.246$
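The figures above can be checked with a short Python sketch; the info helper simply implements the Info(D) formula, and the counts are taken from the table.

```python
import math

def info(counts):
    # Info(D) = -sum_i p_i * log2(p_i), with p_i = count_i / total (0 log 0 := 0)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                                 # 9 "Yes", 5 "No" -> ~0.940
partitions = [(2, 3), (4, 0), (3, 2)]                 # rainy, overcast, sunny
info_outlook = sum(sum(p) / 14 * info(p) for p in partitions)   # ~0.694
print(round(info_D - info_outlook, 3))                # Gain(outlook) ~ 0.247 (0.246 with rounded terms)
```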
Example: Attribute Selection with Information Gain

Temp   Yes  No  Info(Yes, No)
Hot    2    2   ?
Mild   4    2   ?
Cool   3    1   ?

Windy  Yes  No  Info(Yes, No)
True   ?    ?   ?
False  ?    ?   ?

Humidity  Yes  No  Info(Yes, No)
Normal    6    1   ?
High      3    4   ?

Similarly, we can get
$Gain(temp) = 0.029$, $Gain(humidity) = 0.151$, $Gain(windy) = 0.048$
Gain Ratio: A Refined Measure for
Attribute Selection
• The information gain measure is biased towards multivalued attributes, i.e., attributes with a large number of distinct values (e.g., an ID attribute)
• Gain ratio: Overcomes the problem (as a normalization to information gain)
– $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
– $GainRatio(A) = Gain(A) \,/\, SplitInfo_A(D)$
– The attribute with the maximum gain ratio is selected as the splitting
attribute
– Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R.
Quinlan
– However, it tends to prefer unbalanced splits in which one partition is much smaller than the others
• Example:
– $SplitInfo_{temp}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$
– $GainRatio(temp) = 0.029 / 1.557 = 0.019$
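A small Python sketch of the same calculation, assuming the partition sizes for temp (4 hot, 6 mild, 4 cool) and the Gain(temp) = 0.029 value from the previous slide.

```python
import math

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total) for n in partition_sizes if n > 0)

si_temp = split_info([4, 6, 4])          # temp = hot(4), mild(6), cool(4) -> ~1.557
print(round(0.029 / si_temp, 3))         # GainRatio(temp) = Gain(temp)/SplitInfo ~ 0.019
```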
Gini impurity
• The Gini impurity (or Gini, in short) is used in CART. It measures the impurity of D, a data partition or set of training tuples, as

  $Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$

  where $p_i$ is the probability that a tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
• Intuitively, the Gini impurity measures how random (mixed) the class labels in D are.
• A binary split on a discrete- or continuous-valued attribute A partitions D into $D_1$ and $D_2$, with

  $Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$

• The reduction in impurity (gain) incurred by the binary split is

  $\Delta Gini(A) = Gini(D) - Gini_A(D)$
Example for Gini Impurity
We have 2 classes (0 and 1), so m = 2; class 1 has 9 tuples and class 0 has 5 tuples. Then
$Gini(D) = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 \approx 0.459$
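A minimal Python sketch of the Gini computation for this example; the gini helper implements $1 - \sum_i p_i^2$.

```python
def gini(counts):
    # Gini(D) = 1 - sum_i p_i^2, with p_i = count_i / total
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # 9 tuples of class 1, 5 of class 0 -> ~0.459
```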
Exercise: Find the best-split attribute for the given database using the Gini index.
Comparing Decision Tree Attribute Selection Measures

Information Gain
– Purpose: Measures how much "information" a feature provides in predicting the target variable.
– Key Advantage: Captures class uncertainty or randomness.
– Key Disadvantage: Bias toward features with many distinct values.
– Best Used In: Data with balanced class distribution and not too many features with distinct values.

Gain Ratio
– Purpose: Adjusts information gain to prevent bias towards features with many categories, making it more balanced.
– Key Advantage: Corrects the bias of information gain.
– Key Disadvantage: Computationally more complex; tends to prefer unbalanced splits in which one partition is much smaller than the others.
– Best Used In: Data with categorical features that have many distinct values.

Gini Index
– Purpose: Measures the impurity of a dataset, aiming to create splits that result in pure groups.
– Key Advantage: Simpler to compute, no bias toward feature cardinality.
– Key Disadvantage: May perform poorly on imbalanced class problems; tends to favor tests that result in equal-sized partitions with purity in both partitions.
– Best Used In: Both categorical and continuous features, typically for binary or multi-class classification.
Tree Pruning
• Tree pruning is a technique used in decision tree algorithms to reduce the size of a
decision tree.
• The main objective of pruning is to enhance generalization by removing nodes or
branches that do not contribute significantly to the predictive power of the tree.
• Pruning aims to reduce overfitting, improve the model’s accuracy on unseen data,
and increase computational efficiency.
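As a practical illustration (not from the slides), scikit-learn's DecisionTreeClassifier supports pre-pruning through growth constraints such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning via ccp_alpha. A minimal sketch on synthetic data, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via depth / leaf-size constraints.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow fully, then apply cost-complexity pruning (ccp_alpha > 0).
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))   # accuracy on unseen data
```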
Tree Pruning example
Pre-Pruning, Post-Pruning, and Combined Method

Pre-pruning
– Definition: Tree construction is halted early, e.g., when a split would not improve a goodness measure beyond a threshold, or when depth or leaf-size limits are reached.
– Goal: Prevent overfitting before it happens.

Post-pruning
– Definition: Subtrees are removed from a fully grown tree and replaced by leaves labeled with the majority class.
– Goal: Simplify an already built tree to improve generalization.

Combined Method
– Definition: Combines both pre-pruning and post-pruning techniques. A tree is partially pruned during growth and fully pruned afterward.
– Goal: Balance between preventing overfitting and building an optimal tree.
– Advantages: Flexible; can avoid both underfitting and overfitting; better generalization.
– Disadvantages: More complex to implement; can be computationally expensive.
– Best For: Complex datasets; when there is a need for a flexible pruning strategy.
Bayes Classification Methods
• Bayesian classifiers are statistical classifiers(Gaussian Naïve Bayes,
Multinomial Naïve Bayes, Bernoulli Naïve Bayes). They can predict
class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier, known as the naïve Bayesian classifier, has been found to be comparable in performance with decision tree and selected neural network classifiers.
• Bayesian classifiers have also exhibited high accuracy and speed
when applied to large databases.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class-conditional independence.
• It is made to simplify the computations involved and, in this sense, is
considered “naïve”. That is, features are conditionally independent,
given the target class.
Bayes’ Theorem: Basics
❑ Let X be a data tuple (“evidence”).
❑ Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
❑ For classification problems, determine P(H|X), the probability that the hypothesis H
holds given the “evidence” or observed data tuple X.
❑ In other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
❑ P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
❑ Ex: suppose H is the hypothesis that our customer will buy a computer. Then P(H|X)
reflects the probability that customer X will buy a computer given that we know the
customer’s age and income.
❑ In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
❑ Similarly, P(X|H) is the conditional probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer. In classification, P(X|H) is also often called the likelihood.
❑ P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000. In classification, P(X) is also often called the marginal probability.
Bayes’ Theorem: Basics
❑ Bayes’ theorem relates these quantities as

  $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
Bayes’ Theorem :Example 1
You have planned a picnic, but the morning is cloudy. What is the probability that it will rain today?
From historical data, we know that:
• 50% of rainy days start off cloudy.
• 40% of all days start off cloudy.
• This is a dry month, and only 3 out of 30 days tend to be rainy.
• Using Bayes' Theorem, calculate the probability that today will be rainy given that the morning is
cloudy.
Solution: P(Rain) = 3/30 = 0.10, P(Cloudy) = 0.40, and P(Cloudy | Rain) = 0.50, so
P(Rain | Cloudy) = P(Cloudy | Rain) × P(Rain) / P(Cloudy) = (0.50 × 0.10) / 0.40 = 0.125.
The probability that it will rain, given that the morning is cloudy, is 12.5%.
So, while the morning clouds might look worrying, there is still an 87.5% chance that it won't rain!
Bayes’ Theorem :Example 2
A factory has two machines, A and B. Machine A produces 60% of the total items, and Machine B
produces 40%. It is known that 2% of the items produced by Machine A are defective, while 5% of
the items produced by Machine B are defective. If an item is selected at random and found to be
defective, what is the probability that it was produced by Machine B?
By Bayes' theorem,
P(B | defective) = P(defective | B) × P(B) / P(defective)
                 = (0.05 × 0.40) / (0.02 × 0.60 + 0.05 × 0.40) = 0.020 / 0.032 = 0.625
The probability that a randomly selected defective item was produced by Machine B is 0.625, or 62.5%.
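The same calculation as a small Python sketch, with the values taken from the problem statement.

```python
# Example 2 worked out with the law of total probability and Bayes' theorem.
p_A, p_B = 0.60, 0.40                  # prior probabilities of the two machines
p_def_A, p_def_B = 0.02, 0.05          # P(defective | machine)

p_def = p_def_A * p_A + p_def_B * p_B  # P(defective) = 0.012 + 0.020 = 0.032
p_B_given_def = p_def_B * p_B / p_def  # Bayes' theorem
print(round(p_B_given_def, 3))         # 0.625
```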
Bayes’ Theorem : Exercises
1. A school has 60% boys and 40% girls. It is known that 70% of the boys wear glasses,
while 50% of the girls wear glasses. If a student wearing glasses is randomly selected, what
is the probability that the student is a boy?
2. A certain disease affects 0.1% (1 in 1000 people) of a population. A test for the disease is
99% accurate for positive cases (true positive rate) and 98% accurate for negative cases (true
negative rate). If a randomly selected person tests positive, what is the probability that they
actually have the disease?
Bayes’ Theorem: Exercise 2 (solution)
If a person tests positive, the probability that they actually have the disease is only about 4.72%:
P(Disease | Positive) = (0.99 × 0.001) / (0.99 × 0.001 + 0.02 × 0.999) ≈ 0.00099 / 0.02097 ≈ 0.0472
Naïve Bayes Classifier: Making
a Naïve Bayes Assumption
1. Let D be a training set of tuples and their associated class labels. Each tuple is
represented by an n-dimensional attribute vector, X = (x1,x2,...,xn).
2. Suppose that there are m classes, C1, C2,…, Cm. The classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the
naïve Bayesian classifier predicts that tuple X belongs to the class $C_i$ if and only if $P(C_i|X) > P(C_j|X)$ for $1 \le j \le m$, $j \ne i$.
By Bayes’ theorem, $P(C_i|X) = P(X|C_i)\,P(C_i)\,/\,P(X)$.
3. As P(X) is constant for all classes, we only need to find out which class maximizes
P(X|Ci)P(Ci). If the class prior probabilities are not known, then it is commonly assumed
that the classes are equally likely, that is, P(C1) = P(C2) =···=P(Cm), and we would
therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class
prior probabilities may be estimated by P(Ci)=|Ci,D|/|D|, where |Ci,D| is the number of
training tuples of class Ci in D.
Naïve Bayes Classifier..
Categorical and continuous-valued attribute
4. To reduce computation in evaluating P(X|Ci), the naïve assumption of class-
conditional independence is made. This presumes that the attributes’ values are
conditionally independent of one another, given the class label of the tuple
(i.e., there are no dependence relationships among the attributes, if we know
which class the tuple belongs to). Thus,

  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we
look at whether the attribute is categorical or continuous-valued. For instance, to
compute P(X|Ci), we consider the following:
a. If $A_k$ is categorical, then $P(x_k|C_i)$ is the number of tuples of class $C_i$ in D having the value $x_k$ for $A_k$, divided by $|C_{i,D}|$, the number of tuples of class $C_i$ in D.
Naïve Bayes Classifier
b. If $A_k$ is a continuous-valued attribute, it is typically assumed to have a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, defined by

  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$,  so that  $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj)  for 1 ≤ j ≤ m, j ≠ i.
In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the
maximum.
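A minimal, counting-based Python sketch of a naïve Bayes classifier for categorical attributes (illustrative only; no Laplace smoothing, and the attribute and class names are whatever the caller supplies):

```python
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Counting-based naive Bayes for categorical attributes (no smoothing)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)          # |Ci,D| for each class
        self.total = len(labels)                     # |D|
        # counts[(class, attribute_index, value)] = frequency of value in that class
        self.counts = defaultdict(int)
        for row, c in zip(rows, labels):
            for k, value in enumerate(row):
                self.counts[(c, k, value)] += 1
        return self

    def predict(self, row):
        best_class, best_score = None, -1.0
        for c, n_c in self.class_counts.items():
            score = n_c / self.total                       # P(Ci)
            for k, value in enumerate(row):
                score *= self.counts[(c, k, value)] / n_c  # P(xk | Ci)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Training amounts to counting class frequencies and per-class attribute-value frequencies; predict then evaluates P(X|Ci)P(Ci) for each class and returns the class with the maximum score, e.g., for a tuple such as X = (age = youth, income = medium, student = yes, credit_rating = fair).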
Naïve Bayes Classifier: Example 1
Classify the tuple X = (age = youth, income = medium, student = yes, credit_rating = fair) using the naïve Bayes classifier, i.e., by Bayes’ theorem, find the class Ci that maximizes P(X|Ci)P(Ci).
For example, P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400.
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
Naïve Bayes Classifier: Exercise 1
Naïve Bayes Classifier: Exercise 1 (solution)
By Bayes’ theorem, find the class Ci that maximizes P(X|Ci)P(Ci):
P(Temp = mild | Play Golf = no) = 2/5 = 0.400
P(Humidity = Normal | Play Golf = yes) = 6/9 = 0.667
P(Humidity = Normal | Play Golf = no) = 1/5 = 0.200
P(Windy = True | Play Golf = yes) = 3/9 = 0.333
P(Windy = True | Play Golf = no) = 3/5 = 0.600
Similarly, P(X | Play Golf = no) = 0.400 × 0.400 × 0.200 × 0.600 = 0.019
To find the class Ci that maximizes P(X|Ci)P(Ci), we compute:
P(X | Play Golf = yes) P(Play Golf = yes) = 0.030 × 0.643 = 0.021
P(X | Play Golf = no) P(Play Golf = no) = 0.019 × 0.357 = 0.007
Since 0.021 > 0.007, the classifier predicts Play Golf = yes for X.
Naïve Bayes Classifier: Strength vs.
Weakness
• Weakness
– Assumption of class-conditional independence among attributes, which can cause a loss of accuracy
• E.g., Patient’s Profile: (age, family history),
• Patient’s Symptoms: (fever, cough),
• Patient’s Disease: (lung cancer, diabetes).
• Dependencies among these cannot be modeled by
Naïve Bayes Classifier
– How to deal with these dependencies?
Use Bayesian Belief Networks (chapter 7)
Summary
Classification is commonly required for data
mining applications
Decision tree and Bayesian methods are
important and popular techniques to use for
classification