The document provides an overview of classification in data mining, detailing supervised and unsupervised learning methods, particularly focusing on decision tree induction and Bayesian classification. It explains the processes involved in classification, including attribute selection measures like information gain and gain ratio, as well as the advantages and disadvantages of decision trees. Applications of classification techniques are highlighted, including disease prediction, spam detection, and credit scoring.


Data Mining

22CS63
Unit 2

1
Overview

Classification – Basic Concepts: What is classification? General approaches to classification
Decision Tree Induction: Decision tree induction, attribute selection measures, tree pruning
Bayesian Classification Methods: Bayes' theorem, naïve Bayesian classification

2
Classification
• Classification refers to the process of identifying which
category or class an object belongs to, based on its features.
• It is a type of supervised learning, where the algorithm is
trained using labeled data
• Applications:
o Disease Prediction (e.g., cancer, diabetes, heart disease)
o Medical Imaging Classification (e.g., tumor detection in X-rays, MRIs)
o Patient Risk Assessment (e.g., classifying high-risk patients)
o Spam Email Detection
o Sentiment Analysis of Customer Feedback or Reviews
o Credit Scoring (e.g., loan approval prediction)
o Fraud Detection (e.g., fraudulent transaction identification)
3
Supervised Learning
– Supervision: Supervision in machine learning (ML) refers to the process of
training a model using labeled data.
– In supervised learning, the algorithm learns from input-output pairs where the
output (also called the label or target) is known. The goal is for the model to learn
the mapping from inputs (features) to outputs to make predictions or
classifications on new, unseen data.
– Classification and numeric prediction are the two major types of prediction
problems.
– Regression analysis is a statistical methodology that is most often used for
numeric prediction. Ex: predict house prices based on features like size, bedrooms,
and location.
– Ranking is another type of numerical prediction where the model predicts the
ordered values (i.e., ranks), for example, a web search engine (e.g., Google) ranks
the relevant web pages for a given query, with the higher-ranked webpages being
more relevant to the query.
4
Classification

Figure: the two-step process. Training instances feed a learning algorithm that produces a model; the model then assigns each test instance a predicted class (positive or negative).

5
Unsupervised learning (clustering)

▪ Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data.
▪ The goal is to find hidden patterns in the data.
▪ Clustering is one of the most common techniques used in unsupervised learning.
▪ Clustering involves grouping data points into clusters, where points in the same cluster are more similar to each other than to points in other clusters.
▪ Anomaly detection, association rule learning, dimensionality reduction, etc. are other examples of unsupervised learning. (A minimal clustering sketch follows.)
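A small clustering sketch, assuming scikit-learn purely for illustration; the synthetic blobs stand in for any unlabeled dataset.

# Cluster unlabeled points into 3 groups with k-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # unlabeled points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment of the first few points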
6
General Approach to Classification
Data classification: A two-step process, a learning step (where a classification model is
constructed) and a classification step (where the model is used to predict class labels for
given data).
Loan application example – the learning step can be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given tuple X.

7
Example of Loan Application –
Classification step

• The accuracy of a classifier on a given test set is the percentage of test tuples that are
correctly classified by the classifier.
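A minimal sketch of the two-step process in code, assuming scikit-learn and its bundled Iris data purely for illustration (the loan-application data from the slide is not reproduced here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out test tuples so accuracy is measured on data not used for learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Learning step: construct the classification model y = f(X) from labeled tuples.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Classification step: predict class labels for the test tuples.
y_pred = model.predict(X_test)

# Accuracy = percentage of test tuples correctly classified.
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")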

8
Decision tree Induction - overview
• It is the learning of decision trees from class-labeled training tuples.
• It is a flowchart-like tree structure, where each internal node (nonleaf node)
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• Rectangles denote internal nodes and leaf nodes are denoted by ovals (or
circles).
• Some decision tree algorithms produce only binary trees, where each internal node branches to exactly two other nodes (e.g., CART), whereas others can produce nonbinary (multiway) trees (e.g., ID3 and C4.5, which can grow one branch per value of a categorical attribute).

9
Learning of decision trees from
class-labeled training tuples
Decision tree construction: A top-down, recursive, divide-
and-conquer process

Each internal (nonleaf) node represents a test on an attribute. Each leaf


node represents a class (either buys_computer = yes or
buys_computer = no).
10
Decision Tree Induction: An Example
Training data set: Play Golf?

Outlook   Temp  Humidity  Windy  Play Golf
Rainy     Hot   High      False  No
Rainy     Hot   High      True   No
Overcast  Hot   High      False  Yes
Sunny     Mild  High      False  Yes
Sunny     Cool  Normal    False  Yes
Sunny     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Rainy     Mild  High      False  No
Rainy     Cool  Normal    False  Yes
Sunny     Mild  Normal    False  Yes
Rainy     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Sunny     Mild  High      True   No

❑ Resulting tree: the root node tests outlook?
– Overcast → Yes
– Sunny → windy? (False → Yes, True → No)
– Rainy → humidity? (High → No, Normal → Yes)
11
https://www.saedsayad.com/decision_tree.htm
Decision Tree Induction:
Algorithm
• Algorithms ID3 (Iterative Dichotomizer), C4.5, and CART adopt a
greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer
manner
• A Basic algorithm
– Tree is constructed in a top-down, recursive, divide-and-
conquer manner
– At the start, all the training examples are at the root
– Examples are partitioned recursively based on selected
attributes
– On each node, attributes are selected based on the training
examples on that node, and a heuristic or statistical
measure (e.g., information gain, Gain ratio, Gini index)
12
Decision Tree Induction Algorithm
- stopping
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning
– There are no samples left
• Prediction
– Majority voting is employed for classifying the leaf

13
Decision tree algorithm
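A simplified Python sketch of the basic top-down, recursive, divide-and-conquer induction described above, assuming categorical attributes, rows stored as dicts, and information gain as the selection measure (the slide's own algorithm figure is not reproduced here):

import math
from collections import Counter

def entropy(rows, target="label"):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="label"):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target="label"):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # all samples belong to the same class
        return labels[0]
    if not attrs:                      # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:   # partition recursively on the best attribute
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attrs if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree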

14
Decision tree splitting criteria

15
How to Handle Continuous-Valued
Attributes?
• Method 1: Discretize continuous values and treat them as
categorical values
– E.g., age: < 20, 20..30, 30..40, 40..50, > 50
• Method 2: Determine the best split point for continuous-valued
attribute A
– Sort the values, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
– Possible split point: (ai+ai+1)/2
• e.g., (15+18)/2 = 16.5, 19.5, 21.5, 23, 24.5, 27, 30, …
– The point with the maximum information gain for A is
selected as the split point for A
• Split: Based on split point P
– The set of tuples in D satisfying A ≤ P vs. those with A > P 16
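A sketch of Method 2: candidate split points are the midpoints of consecutive sorted values, and the point minimizing the weighted entropy (i.e., maximizing information gain) is chosen. The age values follow the slide; the class labels below are hypothetical, only to make the example runnable.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    best_point, best_remainder = None, float("inf")
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue
        point = (xs[i] + xs[i + 1]) / 2              # (a_i + a_{i+1}) / 2
        left = [y for x, y in zip(xs, ys) if x <= point]
        right = [y for x, y in zip(xs, ys) if x > point]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(xs)
        if remainder < best_remainder:               # min remainder = max information gain
            best_point, best_remainder = point, remainder
    return best_point

ages = [15, 18, 21, 22, 24, 25, 29, 31]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]   # hypothetical labels
print(best_split_point(ages, buys))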
Pro’s and Con’s
Pro’s
1. Easy to Understand and Interpret: Decision trees are simple to understand and
visualize, making them very interpretable.
2. No Need for Feature Scaling: Decision trees do not require feature scaling (like
normalization or standardization) because they split the data based on conditions
rather than distances between values.
3. Handle Both Numerical and Categorical Data: Decision trees can handle both
types of data without the need for special preprocessing.
4. Non-Linear Relationships: They can model complex non-linear relationships
between the features and the target, unlike linear models (such as linear regression).
5. Works Well with Large Datasets: Decision trees perform relatively well with
large datasets and can capture hidden patterns in the data.
6. Can Capture Interactions Between Features: Decision trees can capture
interactions between different features, which can be very useful in some problems.
7. Automatic Feature Selection: They automatically perform feature selection,
identifying the most important variables for the prediction task.
17
Con’s
1. Prone to Overfitting: Decision trees tend to overfit the training data, especially when the tree is too deep. Overfitting leads to poor generalization to unseen data.
2. Instability: Decision trees can be very sensitive to small changes in the
data. A small variation in the dataset can result in a completely different
tree structure.
3. Biased Towards Features with More Levels: Decision trees can be biased
towards features with more levels (categories) because they tend to prefer
splits that involve features with more possible values.
4. Greedy Algorithm: Decision trees use a greedy algorithm to build the tree.
This means they make the optimal choice at each step, but that does not
necessarily lead to the globally optimal tree structure.
5. Hard to Optimize: Tuning hyperparameters such as tree depth, minimum
samples per leaf, etc., can be tricky, and a poor choice can greatly affect the
tree's performance.

18
Information Gain: An Attribute Selection
Measure
❑ ID3 /C4.5 uses information gain as its attribute selection measure.
❑ Let node N represent or hold the tuples of partition D. The attribute with the highest information
gain is chosen as the splitting attribute for node N.
❑ The expected information needed to classify a tuple in D is given by

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

  where p_i is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log function to the base 2 is used because the information is encoded in bits. Info(D) is also known as the entropy of D.
❑ Attribute A can split D into v partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A. These partitions would correspond to the branches grown from node N.
❑ The amount of information we still need (after the partitioning) to arrive at an exact classification is given by

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

  The term |Dj|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
❑ The information gain from branching on A is

  Gain(A) = Info(D) - Info_A(D)

19


Example: Attribute Selection with Information Gain
Training data: the Play Golf data set shown earlier (14 tuples).
Tuples with Yes = 9; tuples with No = 5; m = 2.
Compute the expected information needed to classify a tuple in D:

Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 bits
20
Example: Attribute Selection with Information Gain

Compute Gain(outlook). The expected information needed to classify a tuple in D if the tuples are partitioned according to outlook is computed from the class counts in each branch:

outlook    yes  no  Info(yes, no)
rainy      2    3   0.971
overcast   4    0   0
sunny      3    2   0.971

Info_{outlook}(D) = \frac{5}{14} Info(2,3) + \frac{4}{14} Info(4,0) + \frac{5}{14} Info(3,2) = 0.694

The term \frac{5}{14} Info(2,3) means that "outlook = rainy" covers 5 of the 14 samples, with 2 "yes" and 3 "no". The candidate root node outlook? has the branches sunny, overcast, and rainy.

Hence,
Gain(outlook) = Info(D) - Info_{outlook}(D) = 0.940 - 0.694 = 0.246

21
Example: Attribute Selection with Information Gain

Temp      Yes  No  Info(Yes, No)
Hot       2    2   ?
Mild      4    2   ?
Cool      3    1   ?

Windy     Yes  No  Info(Yes, No)
True      ?    ?   ?
False     ?    ?   ?

Humidity  Yes  No  Info(Yes, No)
Normal    6    1   ?
High      3    4   ?

Similarly, we can get Gain(Temp) = 0.029, Gain(Humidity) = 0.151, Gain(Windy) = 0.048.

Because Outlook has the highest information gain among the attributes, Outlook is selected as the splitting attribute.
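A short check of the worked example in code, assuming the Play Golf outlook and class columns listed on the earlier training-data slide; it reproduces Info(D) = 0.940, Info_outlook(D) = 0.694 and Gain(outlook) = 0.246.

import math
from collections import Counter

def info(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

outlook = ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
           "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

info_d = info(play)
info_outlook = sum(
    len(subset) / len(play) * info(subset)
    for v in set(outlook)
    for subset in [[p for o, p in zip(outlook, play) if o == v]]
)
print(round(info_d, 3), round(info_outlook, 3), round(info_d - info_outlook, 3))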
22
Exercise
Find the best-split
attribute for the given
database using
Information Gain

23
Gain Ratio: A Refined Measure for
Attribute Selection
• Information gain measure is biased towards multivalued attributes, with a
large number of values (e.g. ID)
• Gain ratio: Overcomes the problem (as a normalization to information gain)

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

– The attribute with the maximum gain ratio is selected as the splitting
attribute
– Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R.
Quinlan
– tends to prefer unbalanced splits in which one partition is much smaller
than the others
• Example:
– SplitInfo_{temp}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
– GainRatio(temp) = 0.029 / 1.557 = 0.019
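As a quick check of the SplitInfo arithmetic, a sketch assuming the temp value counts read off the Play Golf table (Hot = 4, Mild = 6, Cool = 4 of 14) and the Gain(temp) = 0.029 value derived earlier:

import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total) for n in partition_sizes)

split_info_temp = split_info([4, 6, 4])       # |D_j| for Hot, Mild, Cool
gain_temp = 0.029
print(round(split_info_temp, 3))              # ~1.557
print(round(gain_temp / split_info_temp, 3))  # GainRatio(temp) ~ 0.019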
24
Gini impurity
• The Gini impurity (or Gini for short) is used in CART. It measures the impurity of D, a data partition or set of training tuples, as

  Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

  where p_i is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum is computed over m classes.
• The Gini impurity considers a binary split for each attribute.
• When evaluating a binary split, we calculate a weighted sum of the impurity of each resulting partition. For instance, if a binary split on attribute A divides dataset D into two subsets, D1 and D2, the Gini impurity of D given this partitioning is

  Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

• Gini impurity measures the randomness in our data, i.e., how mixed the classes are within a partition.
• The reduction in impurity (the gain) incurred by a binary split on a discrete- or continuous-valued attribute A is

  \Delta Gini(A) = Gini(D) - Gini_A(D)
25
Example for Gini Impurity
We have 2 classes (Yes and No); m = 2; #Yes = 9; #No = 5.

Gini(D) = 1 - p(No)² - p(Yes)² = 1 - (5/14)² - (9/14)² = 0.459

The next step is to calculate the Gini impurity for the 4 features (outlook, temp, humidity, windy) and decide which feature will be the root node.
Calculate the Gini impurity for Outlook; the Outlook feature is a categorical variable with three possible values (sunny, overcast, and rainy).

Gini(outlook = rainy) = 1 - (2/5)² - (3/5)² = 0.48
Gini(outlook = overcast) = 1 - (4/4)² - (0/4)² = 0
Gini(outlook = sunny) = 1 - (3/5)² - (2/5)² = 0.48
26
Example for Gini Impurity
Calculate the Gini impurity of outlook by weighting the impurity of each branch by how many elements it has:

Gini_outlook(D) = 5/14 × 0.48 + 4/14 × 0 + 5/14 × 0.48 = 0.34

Gini gain, ΔGini, is calculated by subtracting the weighted impurities of the branches from the original impurity:

ΔGini(outlook) = Gini(D) - Gini_outlook(D) = 0.459 - 0.34 = 0.119
ΔGini(temp) = 0.459 - 0.440 = 0.019
ΔGini(humidity) = 0.459 - 0.367 = 0.092
ΔGini(windy) = 0.459 - 0.429 = 0.027

Which feature should be used as the root decision node?
The best split is chosen by maximizing the Gini gain, or equivalently by minimizing the weighted Gini impurity of the split.
In our example, outlook has the minimum weighted Gini impurity and the maximum Gini gain, so it is chosen as the root node to split the data.
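A short sketch reproducing the outlook figures (up to rounding), assuming the class counts read off the Play Golf table: D holds 9 Yes / 5 No, and the outlook branches hold (2 Yes, 3 No), (4 Yes, 0 No) and (3 Yes, 2 No).

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d = gini([9, 5])                            # ~0.459
branches = [(2, 3), (4, 0), (3, 2)]              # rainy, overcast, sunny
total = sum(sum(b) for b in branches)
gini_outlook = sum(sum(b) / total * gini(b) for b in branches)   # ~0.343
print(round(gini_d, 3), round(gini_outlook, 3), round(gini_d - gini_outlook, 3))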

27
Exercise: Find
the best-split
attribute for the
given database
using Gini Index
Comparing Decision Tree Attribute Selection
Measures
Information Gain
– Purpose: measures how much "information" a feature provides in predicting the target variable.
– Key advantage: captures class uncertainty or randomness.
– Key disadvantage: bias toward features with many distinct values.
– Best used in: data with a balanced class distribution and not too many features with many distinct values.

Gain Ratio
– Purpose: adjusts information gain to prevent bias towards features with many categories, making it more balanced.
– Key advantage: corrects the bias of information gain.
– Key disadvantage: computationally more complex; tends to prefer unbalanced splits in which one partition is much smaller than the others.
– Best used in: data with categorical features that have many distinct values.

Gini Index
– Purpose: measures the impurity of a dataset, aiming to create splits that result in pure groups.
– Key advantage: simpler to compute, no bias toward feature cardinality.
– Key disadvantage: may perform poorly on imbalanced class problems; tends to favor tests that result in equal-sized partitions with purity in both partitions.
– Best used in: both categorical and continuous features, typically for binary or multi-class classification.

29
Tree Pruning
• Tree pruning is a technique used in decision tree algorithms to reduce the size of a
decision tree.
• The main objective of pruning is to enhance generalization by removing nodes or
branches that do not contribute significantly to the predictive power of the tree.
• Pruning aims to reduce overfitting, improve the model’s accuracy on unseen data,
and increase computational efficiency.

Types of Tree Pruning:


1. Pre-Pruning (Early Stopping): This method involves stopping the tree from
growing before it becomes overly complex, imposing constraints during tree-
building.
2. Post-Pruning (Post-Growth Pruning or Cost-Complexity Pruning): In this
method, the tree is allowed to grow fully first, and then irrelevant branches or nodes
are pruned after the tree construction.
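A hedged sketch of both pruning styles, assuming scikit-learn and its bundled breast-cancer data purely for illustration (the slides name no library): pre-pruning uses growth constraints, while post-pruning grows the full tree and then selects a cost-complexity pruning level (ccp_alpha) on a validation split.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via constraints such as max_depth / min_samples_leaf.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                    random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then keep the ccp_alpha that scores best on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha = max(path.ccp_alphas,
                 key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                 .fit(X_train, y_train).score(X_val, y_val))
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                     random_state=0).fit(X_train, y_train)

print(pre_pruned.get_depth(), post_pruned.get_depth())  # pruned trees are shallower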

30
Tree Pruning example

• An unpruned decision tree (left) and a pruned version of it (right). Pruning


methods typically use statistical measures to remove the least-reliable
branches.
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend. They are usually faster and better at correctly classifying
independent test data (i.e., of previously unseen tuples) than unpruned trees.

31
Pre Pruning & Post Pruning &
Combined method
Pre-Pruning
– Definition: stops tree growth early, before the tree fully develops, by setting constraints (e.g., max depth, minimum samples).
– Goal: prevent overfitting by halting growth early.
– Advantages: faster training time; prevents overfitting by stopping growth early.
– Disadvantages: may underfit (too simplistic a tree); difficult to set optimal stopping criteria.
– Best for: small datasets; situations where overfitting is a major concern.

Post-Pruning
– Definition: fully builds the tree first, then prunes branches that do not improve accuracy (usually using a validation set).
– Goal: reduce complexity after the tree has been built.
– Advantages: allows full tree exploration; more accurate final tree; often results in better generalization.
– Disadvantages: slower training time; requires additional data (a validation set).
– Best for: large datasets; when overfitting is not a major concern but generalization is important.

Combined Method
– Definition: combines both pre-pruning and post-pruning techniques; the tree is partially pruned during growth and fully pruned afterward.
– Goal: balance between preventing overfitting and building an optimal tree.
– Advantages: flexible; can avoid both underfitting and overfitting; better generalization.
– Disadvantages: more complex to implement; can be computationally expensive.
– Best for: complex datasets; when a flexible pruning strategy is needed.
32
Bayes Classification Methods
• Bayesian classifiers are statistical classifiers(Gaussian Naïve Bayes,
Multinomial Naïve Bayes, Bernoulli Naïve Bayes). They can predict
class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier, known as the naïve Bayesian classifier, has been found to be comparable in performance with decision trees and selected neural network classifiers.
• Bayesian classifiers have also exhibited high accuracy and speed
when applied to large databases.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class-conditional independence.
• It is made to simplify the computations involved and, in this sense, is
considered “naïve”. That is, features are conditionally independent,
given the target class.

33
Bayes’ Theorem: Basics
❑ Let X be a data tuple (“evidence”).
❑ Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
❑ For classification problems, determine P(H|X), the probability that the hypothesis H
holds given the “evidence” or observed data tuple X.
❑ In other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
❑ P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
❑ Ex: suppose H is the hypothesis that our customer will buy a computer. Then P(H|X)
reflects the probability that customer X will buy a computer given that we know the
customer’s age and income.
❑ In contrast, P(H)is the prior probability, or a priori probability, of H. For our example,
this is the probability that any given customer will buy a computer, regardless of age,
income, or any other information, for that matter.
❑ Similarly, P(X|H)is the conditional probability of X conditioned on H. That is, it is the
probability that a customer, X, is 35 years old and earns $40,000, given that we know
the customer will buy a computer. In classification, P(X|H)is also often called
likelihood.
❑ P(X) is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000. In classification, P(X) is
also often called marginal probability. 34
Bayes’ Theorem: Basics

Combining these quantities, Bayes' theorem states:

P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}
35
Bayes’ Theorem :Example 1
You have planned a picnic, but the morning is cloudy. What is the probability that it will rain today?
From historical data, we know that:
• 50% of rainy days start off cloudy.
• 40% of all days start off cloudy.
• This is a dry month, and only 3 out of 30 days tend to be rainy.
• Using Bayes' Theorem, calculate the probability that today will be rainy given that the morning is
cloudy.
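Filling in the computation from the stated figures (P(Rain) = 3/30 = 0.1, P(Cloud) = 0.4, P(Cloud|Rain) = 0.5):

P(\text{Rain}\mid\text{Cloud}) = \frac{P(\text{Cloud}\mid\text{Rain})\,P(\text{Rain})}{P(\text{Cloud})} = \frac{0.5 \times 0.1}{0.4} = 0.125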

The probability that it will rain given that the morning is cloudy is 12.5%.
So, while the morning clouds might look worrying, there's still an 87.5% chance that it won't rain!
36
Bayes’ Theorem :Example 2
A factory has two machines, A and B. Machine A produces 60% of the total items, and Machine B
produces 40%. It is known that 2% of the items produced by Machine A are defective, while 5% of
the items produced by Machine B are defective. If an item is selected at random and found to be
defective, what is the probability that it was produced by Machine B?
Bayes' theorem states:
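Substituting the stated rates (P(A) = 0.6, P(B) = 0.4, P(D|A) = 0.02, P(D|B) = 0.05):

P(B\mid D) = \frac{P(D\mid B)\,P(B)}{P(D\mid A)\,P(A) + P(D\mid B)\,P(B)} = \frac{0.05 \times 0.4}{0.02 \times 0.6 + 0.05 \times 0.4} = \frac{0.020}{0.032} = 0.625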

The probability that a randomly selected defective item was produced by Machine B is 0.625 or 62.5%. 37
Bayes’ Theorem : Exercises

1. A school has 60% boys and 40% girls. It is known that 70% of the boys wear glasses,
while 50% of the girls wear glasses. If a student wearing glasses is randomly selected, what
is the probability that the student is a boy?

2. A certain disease affects 0.1% (1 in 1000 people) of a population. A test for the disease is
99% accurate for positive cases (true positive rate) and 98% accurate for negative cases (true
negative rate). If a randomly selected person tests positive, what is the probability that they
actually have the disease?

38
Bayes’ Theorem : Exercise 1
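A worked solution from the stated rates (P(Boy) = 0.6, P(Girl) = 0.4, P(Glasses|Boy) = 0.7, P(Glasses|Girl) = 0.5):

P(\text{Boy}\mid\text{Glasses}) = \frac{0.7 \times 0.6}{0.7 \times 0.6 + 0.5 \times 0.4} = \frac{0.42}{0.62} \approx 0.677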

The probability that a randomly selected student wearing glasses is a boy is 67.7% (or 0.677).
39
Solutions: Exercise 2
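A worked solution from the stated rates (P(Disease) = 0.001, true positive rate 0.99, true negative rate 0.98, hence false positive rate 0.02):

P(\text{Disease}\mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.02 \times 0.999} = \frac{0.00099}{0.02097} \approx 0.0472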

If a person tests positive, the probability that they actually have the disease is 4.72%.
40
Naïve Bayes Classifier: Making
a Naïve Bayes Assumption
1. Let D be a training set of tuples and their associated class labels. Each tuple is
represented by an n-dimensional attribute vector, X = (x1,x2,...,xn).
2. Suppose that there are m classes, C1, C2,…, Cm. The classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the
naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. By Bayes' theorem,

P(Ci|X) = \frac{P(X|Ci)\,P(Ci)}{P(X)}
3. As P(X) is constant for all classes, we only need to find out which class maximizes
P(X|Ci)P(Ci). If the class prior probabilities are not known, then it is commonly assumed
that the classes are equally likely, that is, P(C1) = P(C2) =···=P(Cm), and we would
therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class
prior probabilities may be estimated by P(Ci)=|Ci,D|/|D|, where |Ci,D| is the number of
training tuples of class Ci in D.
41
Naïve Bayes Classifier..
Categorical and continuous-valued attribute
4. To reduce computation in evaluating P(X|Ci), the naïve assumption of class-
conditional independence is made. This presumes that the attributes’ values are
conditionally independent of one another, given the class label of the tuple
(i.e., there are no dependence relationships among the attributes, if we know
which class the tuple belongs to). Thus,

P(X|Ci) = \prod_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) \times P(x_2|Ci) \times \cdots \times P(x_n|Ci)
Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we
look at whether the attribute is categorical or continuous-valued. For instance, to
compute P(X|Ci), we consider the following:

a. If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having


the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

42
Naïve Bayes Classifier
b. If Ak is a continuous-valued attribute, it is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad P(x_k|Ci) = g(x_k, \mu_{Ci}, \sigma_{Ci})

5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the
maximum.
43
Naïve Bayes Classifier :Example 1
Classify the tuple X = (age = youth, income = medium, student = yes, credit_rating = fair) using the naïve Bayes classifier.

The naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. By Bayes' theorem, P(Ci|X) = P(X|Ci)P(Ci)/P(X), so we find the class that maximizes P(X|Ci)P(Ci).

The prior probability of each class can be computed from the training tuples:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
44
Naïve Bayes Classifier :Example 1
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Using these probabilities, we obtain
P(X | buys_computer = yes) = P(age = youth | yes) × P(income = medium | yes) × P(student = yes | yes) × P(credit_rating = fair | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly, P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

To find the class Ci that maximizes P(X|Ci)P(Ci), we compute
P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.

45
Naïve Bayes Classifier: Exercise 1

Classify the tuple X=(Outlook=Sunny,


Temp=mild, Humidity=Normal,
Windy=True) using Naïve Bayes
classifier.

46
Naïve Bayes Classifier : Exercise 1
Classify the tuple X = (Outlook = Sunny, Temp = mild, Humidity = Normal, Windy = True) using the naïve Bayes classifier. As before, we find the class Ci that maximizes P(X|Ci)P(Ci).

The prior probability of each class can be computed from the training tuples:
P(Play Golf = yes) = 9/14 = 0.643
P(Play Golf = no) = 5/14 = 0.357
47
Naïve Bayes Classifier : Exercise 1
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(Outlook = Sunny | Play Golf = yes) = 3/9 = 0.333
P(Outlook = Sunny | Play Golf = no) = 2/5 = 0.400
P(Temp = mild | Play Golf = yes) = 4/9 = 0.444
P(Temp = mild | Play Golf = no) = 2/5 = 0.400
P(Humidity = Normal | Play Golf = yes) = 6/9 = 0.667
P(Humidity = Normal | Play Golf = no) = 1/5 = 0.200
P(Windy = True | Play Golf = yes) = 3/9 = 0.333
P(Windy = True | Play Golf = no) = 3/5 = 0.600

Using these probabilities, we obtain
P(X | Play Golf = yes) = P(Outlook = Sunny | yes) × P(Temp = mild | yes) × P(Humidity = Normal | yes) × P(Windy = True | yes) = 0.333 × 0.444 × 0.667 × 0.333 = 0.033
Similarly, P(X | Play Golf = no) = 0.400 × 0.400 × 0.200 × 0.600 = 0.019

To find the class Ci that maximizes P(X|Ci)P(Ci), we compute
P(X | Play Golf = yes) P(Play Golf = yes) = 0.033 × 0.643 = 0.021
P(X | Play Golf = no) P(Play Golf = no) = 0.019 × 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts Play Golf = yes for tuple X.
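For reference, a compact sketch of the same categorical naïve Bayes computation in code, assuming the 14-row Play Golf table shown earlier (re-encoded inline) and the naïve independence assumption; the scores match the hand calculation up to rounding.

from collections import Counter

rows = [  # (Outlook, Temp, Humidity, Windy, PlayGolf)
    ("Rainy", "Hot", "High", False, "No"),     ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),  ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]
X = ("Sunny", "Mild", "Normal", True)      # tuple to classify

priors = Counter(r[-1] for r in rows)      # class counts: Yes = 9, No = 5
scores = {}
for c, n_c in priors.items():
    score = n_c / len(rows)                # P(Ci)
    for k, x_k in enumerate(X):            # product of P(x_k | Ci)
        match = sum(1 for r in rows if r[-1] == c and r[k] == x_k)
        score *= match / n_c
    scores[c] = score

print(scores)                              # Yes ~0.0212, No ~0.0069
print(max(scores, key=scores.get))         # -> "Yes"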
48
Avoiding the
Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional
probability be non-zero
– Otherwise, the predicted probability will be zero
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \cdot P(x_2|C_i) \cdots P(x_n|C_i)
• Example. Suppose a dataset with 1,000 tuples:
income = low (0), income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 (or a small integer) to each case
Prob(income = low) = 1/(1000 + 3)
Prob(income = medium) = (990 + 1)/(1000 + 3)
Prob(income = high) = (10 + 1)/(1000 + 3)
– The “corrected” probability estimates are close to their
“uncorrected” counterparts
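A tiny sketch of the Laplacian correction applied to the income example above: add 1 to each of the q = 3 value counts and q to the denominator.

counts = {"low": 0, "medium": 990, "high": 10}
total, q = sum(counts.values()), len(counts)
smoothed = {v: (c + 1) / (total + q) for v, c in counts.items()}
print(smoothed)  # low ~0.001, medium ~0.988, high ~0.011 — no zero probabilities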
49
Naïve Bayes Classifier: Strength vs.
Weakness
• Strength
– Performance: a naïve Bayesian classifier has performance comparable with decision tree and selected neural network classifiers
– Incremental: Each training example can
incrementally increase/decrease the probability
that a hypothesis is correct—prior knowledge can
be combined with observed data

50
Naïve Bayes Classifier: Strength vs.
Weakness
• Weakness
– Assumption: attributes conditional independence,
therefore loss of accuracy
• E.g., Patient’s Profile: (age, family history),
• Patient’s Symptoms: (fever, cough),
• Patient’s Disease: (lung cancer, diabetes).
• Dependencies among these cannot be modeled by
Naïve Bayes Classifier
– How to deal with these dependencies?
Use Bayesian Belief Networks (chapter 7)

51
Summary
Classification is commonly required for data
mining applications
Decision tree and Bayesian methods are
important and popular techniques to use for
classification

52
