
Multimedia Application
By Minhaz Uddin Ahmed, PhD
Department of Computer Engineering
Inha University Tashkent
Email: minhaz.ahmed@gmail.com
Content
 Naive Bayes Classifiers
 Training the Naive Bayes Classifier
 Worked example
 Optimizing for Sentiment Analysis
 Naive Bayes for other text classification tasks
 Naive Bayes as a Language Model
Bayes theorem

 Bayes' theorem (also known as Bayes' Rule or Bayes' law) is used to determine the probability of a hypothesis given prior knowledge. It is based on conditional probability:

P(A|B) = P(B|A) * P(A) / P(B)

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
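As a small, hedged sketch (not from the slides), Bayes' rule can be written as a helper function in Python; the numbers are invented purely for illustration.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers for illustration only:
# P(B|A) = 0.8, P(A) = 0.3, P(B) = 0.5  ->  P(A|B) = 0.48
print(bayes_posterior(0.8, 0.3, 0.5))
```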
Types of Naive Bayes:

 There are three types of Naive Bayes model in the scikit-learn library:
Gaussian: used for classification when the features are continuous and assumed to follow a normal distribution.
Multinomial: used for discrete counts. For example, in a text classification problem we can go one step further than Bernoulli trials: instead of "word occurs in the document" we use "how often the word occurs in the document", i.e. "the number of times outcome x_i is observed over the n trials".
Bernoulli: the binomial model is useful if your feature vectors are binary. One application is text classification with a 'bag of words' model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
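The three variants correspond to GaussianNB, MultinomialNB, and BernoulliNB in scikit-learn. A minimal sketch, with toy data invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features -> GaussianNB (toy data, two classes)
X_cont = np.array([[1.2, 3.4], [0.9, 2.8], [5.1, 7.0], [4.8, 6.5]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Word-count features -> MultinomialNB
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# Binary (word present / absent) features -> BernoulliNB
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```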
Example

(The original slide shows a table of fruit counts. From the solution below one can infer: 1200 fruits in total, of which 650 are mangoes and 800 are yellow, and 350 mangoes are yellow.)

Example solution

 Solution:
 P(A|B) = (P(B|A) * P(A)) / P(B)
1. Mango:
 P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
 = ((350/800) * (800/1200)) / (650/1200)
 P(Yellow | Mango) = 0.53 → (1)
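The arithmetic in step (a) can be checked directly; this is just the slide's numbers plugged into Bayes' rule:

```python
# P(Yellow | Mango) = P(Mango | Yellow) * P(Yellow) / P(Mango)
p_mango_given_yellow = 350 / 800
p_yellow = 800 / 1200
p_mango = 650 / 1200

p_yellow_given_mango = (p_mango_given_yellow * p_yellow) / p_mango
print(round(p_yellow_given_mango, 2))  # 0.54 (the slide truncates this to 0.53)
```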
Text Classification

 Assigning subject categories, topics, or genres


 Spam detection
 Authorship identification
 Age/gender identification
 Language Identification
 Sentiment analysis
Who wrote which Federalist papers?

 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton
 Authorship of 12 of the letters in dispute
 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]
Male or female author from a given text

"By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam …"

"Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…"
Text Classification: definition

 Input:
 a document d
 a fixed set of classes C = {c1, c2,…, cJ}

 Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
 Rules based on combinations of words or other features
 spam: black-list-address OR ("dollars" AND "have been selected")
 Accuracy can be high
 If rules are carefully refined by an expert
 But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
 Input:
 a document d
 a fixed set of classes C = {c1, c2,…, cJ}
 a training set of m hand-labeled documents (d1,c1), ..., (dm,cm)

 Output:
 a learned classifier γ: d → c
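As a hedged end-to-end sketch of this setup in scikit-learn (the tiny training set and labels below are invented for illustration, not from the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled training documents (d_i, c_i)
docs = ["win money now", "meeting at noon", "cheap pills win", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# gamma: d -> c, learned from the training set
gamma = make_pipeline(CountVectorizer(), MultinomialNB())
gamma.fit(docs, labels)

print(gamma.predict(["win cheap money"]))     # expected: ['spam']
print(gamma.predict(["team meeting today"]))  # expected: ['ham']
```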
Classification Methods:
Supervised Machine Learning
 Any kind of classifier

Naïve Bayes
Logistic regression
Support-vector machines
k-Nearest Neighbors
Naive Bayes Intuition

 Simple ("naive") classification method based on Bayes rule


 Relies on very simple representation of document
 Bag of words
The Bag of Words Representation

 We preprocess the dataset by converting each email into a bag-of-words representation, where each word is a feature and its frequency in the email is its value. We also assign a label (spam or not spam) to each email.
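A bag of words is simply a word-to-count mapping; a minimal sketch with Python's Counter (the example email text is made up):

```python
from collections import Counter

email = "limited offer click now click here now"
bag_of_words = Counter(email.lower().split())
print(bag_of_words)
# Counter({'click': 2, 'now': 2, 'limited': 1, 'offer': 1, 'here': 1})

label = "spam"  # hand-assigned label for this training example
```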
The Bag of Words Representation

[Figure: a document is reduced to a bag-of-words vector of word counts (e.g. seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, ...), and the classifier γ maps that vector to a class c.]
Training

 We train the Naïve Bayes classifier on the labeled dataset. During training, the classifier calculates the probabilities of each word occurring in spam and not-spam emails, as well as the prior probabilities of spam and not-spam emails in the dataset.
Prediction

Step 1: Given a new email, we convert it into a bag-of-words representation.
Step 2: For each word in the email, we calculate its conditional probability of occurring in spam and not-spam emails, based on the probabilities learned during training.
Step 3: We multiply the conditional probabilities of all words in the email, and multiply the result by the prior probabilities of spam and not spam.
Step 4: We compare the calculated scores for spam and not spam, and classify the email according to the higher one.
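The four steps can be written out directly. The sketch below assumes the priors and word likelihoods have already been estimated (the numbers are invented placeholders) and works in log space to avoid underflow:

```python
import math

# Hypothetical learned parameters (from training)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"win": 0.05, "money": 0.04, "meeting": 0.001},
    "ham":  {"win": 0.002, "money": 0.005, "meeting": 0.03},
}

def predict(email, priors, likelihoods):
    words = email.lower().split()          # Step 1: bag of words
    scores = {}
    for c in priors:
        log_score = math.log(priors[c])    # Step 3: start from the prior
        for w in words:
            if w in likelihoods[c]:        # Step 2: per-word likelihood (unknown words skipped)
                log_score += math.log(likelihoods[c][w])
        scores[c] = log_score
    return max(scores, key=scores.get)     # Step 4: pick the higher score

print(predict("win money", priors, likelihoods))  # -> spam
```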
Bayes' Rule Applied to Documents and Classes

 For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)

Naive Bayes Classifier (I)

MAP is "maximum a posteriori" = most likely class:

c_MAP = argmax_c P(c|d)

Bayes Rule:

c_MAP = argmax_c P(d|c) P(c) / P(d)

Dropping the denominator (P(d) is the same for every class):

c_MAP = argmax_c P(d|c) P(c)
Learning the Multinomial Naive Bayes Model

 First attempt: maximum likelihood estimates
 simply use the frequencies in the data

P̂(cj) = N_cj / N_total
Parameter estimation

P̂(wi | cj) = count(wi, cj) / Σ_w∈V count(w, cj)
  = fraction of times word wi appears among all words in documents of topic cj

 Create a mega-document for topic j by concatenating all docs in this topic
 Use the frequency of wi in the mega-document
Problem with Maximum Likelihood

 What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)? Then P̂(fantastic | positive) = 0.

 Zero probabilities cannot be conditioned away, no matter the other evidence!
Laplace (add-1) smoothing for Naïve Bayes

P̂(wi | c) = (count(wi, c) + 1) / (Σ_w∈V count(w, c) + |V|)
Multinomial Naïve Bayes: Learning

 From training corpus, extract Vocabulary
 Calculate P(cj) terms:
   For each cj in C do
     docsj ← all docs with class = cj
 Calculate P(wk | cj) terms:
   Textj ← single doc containing all docsj
   For each word wk in Vocabulary:
     nk ← # of occurrences of wk in Textj
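A minimal from-scratch sketch of this training procedure with add-1 smoothing (function and variable names are ours, not a reference implementation):

```python
from collections import Counter, defaultdict
import math

def train_nb(documents, labels):
    """documents: list of token lists, labels: list of class names."""
    vocab = {w for doc in documents for w in doc}
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in set(labels):
        docs_c = [d for d, y in zip(documents, labels) if y == c]
        log_prior[c] = math.log(len(docs_c) / len(documents))
        # "mega-document": concatenate all docs of class c and count words
        counts = Counter(w for d in docs_c for w in d)
        total = sum(counts.values())
        for w in vocab:
            # add-1 (Laplace) smoothed likelihood
            log_likelihood[c][w] = math.log((counts[w] + 1) / (total + len(vocab)))
    return vocab, log_prior, log_likelihood
```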
Unknown words
 What about unknown words
 that appear in our test data
 but not in our training data or vocabulary?
 We ignore them
 Remove them from the test document!
 Pretend they weren't there!
 Don't include any probability for them at all!
 Why don't we build an unknown word model?
 It doesn't help: knowing which class has more unknown
words is not generally helpful!
Stop words

 Some systems ignore stop words


 Stop words: very frequent words like the and a.
 Sort the vocabulary by word frequency in training set
 Call the top 10 or 50 words the stopword list.
 Remove all stop words from both training and test sets
 As if they were never there!
 But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't use
stopword lists
Naive Bayes: Learning

Sentiment Example: A worked sentiment example with add-1 smoothing

1. Priors from training: P̂(cj) = N_cj / N_total, so P(-) = 3/5 and P(+) = 2/5
2. Drop "with" (it does not appear in the training vocabulary)
3. Likelihoods from training: P(wi | c) = (count(wi, c) + 1) / (Σ_w∈V count(w, c) + |V|)
4. Scoring the test set
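As a hedged illustration of these steps in code: the five training documents below are invented so that the priors come out to 3/5 negative and 2/5 positive (they are not necessarily the slide's actual example), and "with" is dropped because it does not occur in the training vocabulary.

```python
from collections import Counter
import math

# Hypothetical training data: 3 negative and 2 positive reviews (invented)
train = [
    (["boring", "plot", "dull", "acting"], "-"),
    (["dull", "and", "predictable"], "-"),
    (["no", "fun", "at", "all"], "-"),
    (["great", "fun", "film"], "+"),
    (["powerful", "and", "great", "acting"], "+"),
]
vocab = {w for doc, _ in train for w in doc}

def log_score(doc, c):
    docs_c = [d for d, y in train if y == c]
    counts = Counter(w for d in docs_c for w in d)
    total = sum(counts.values())
    score = math.log(len(docs_c) / len(train))       # prior: 3/5 or 2/5
    for w in doc:
        if w in vocab:                               # unknown words are dropped
            score += math.log((counts[w] + 1) / (total + len(vocab)))  # add-1 smoothing
    return score

test = ["predictable", "with", "no", "fun"]          # "with" is unknown -> dropped
print(max(["-", "+"], key=lambda c: log_score(test, c)))  # -> "-"
```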
Optimizing for sentiment analysis

 For tasks like sentiment, word occurrence seems to be more important than word frequency.
 The occurrence of the word fantastic tells us a lot
 The fact that it occurs 5 times may not tell us much more
 Binary multinomial naive Bayes, or binary NB:
 Clip our word counts at 1 (see the short sketch below)
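Clipping word counts at 1 is a one-liner; a minimal sketch:

```python
from collections import Counter

doc = "great great great fun fun film".split()
counts = Counter(doc)                              # {'great': 3, 'fun': 2, 'film': 1}
binary_counts = {w: min(c, 1) for w, c in counts.items()}
print(binary_counts)                               # {'great': 1, 'fun': 1, 'film': 1}
```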
Binary Multinomial Naïve Bayes: Learning

 From training corpus, extract Vocabulary
 Remove duplicates in each doc:
   For each word type w in docj
     Retain only a single instance of w
 Calculate P(cj) terms:
   For each cj in C do
     docsj ← all docs with class = cj
 Calculate P(wk | cj) terms:
   Textj ← single doc containing all docsj
   For each word wk in Vocabulary:
     nk ← # of occurrences of wk in Textj
Binary Multinomial Naive Bayes on a test document d

First remove all duplicate words from d
Then compute NB using the same equation

Binary Multinomial Naive Bayes

 Counts can still be 2! Binarization is within-doc!
More on Sentiment Classification

 I really like this movie

I really don't like this movie

Negation changes the meaning of "like" to negative.


Negation can also change negative to positive-ish
◦ Don't dismiss this film
◦ Doesn't let us get bored
Sentiment Classification: Lexicons

Sometimes we don't have enough labeled training data


In that case, we can make use of pre-built word lists
Called lexicons
There are various publicly available lexicons
MPQA Subjectivity Cues Lexicon

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.

 Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/


 6885 words from 8221 lemmas, annotated for intensity (strong/weak)
 2718 positive
 4912 negative
 + : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
 − : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh,
hate
Using Lexicons in Sentiment
Classification
Add a feature that gets a count whenever a word from the lexicon occurs (see the sketch after this slide)
 E.g., a feature called "this word occurs in the positive lexicon" or "this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful,
wonderful) or negative words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of
the test set, dense lexicon features can help
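A minimal sketch of such lexicon-count features (the tiny word lists are placeholders, not the MPQA lexicon):

```python
# Placeholder lexicons for illustration only
positive_lexicon = {"great", "beautiful", "wonderful", "admirable"}
negative_lexicon = {"awful", "bad", "harsh", "hate"}

def lexicon_features(doc_tokens):
    """Two dense features: counts of positive- and negative-lexicon words."""
    return {
        "pos_lexicon_count": sum(w in positive_lexicon for w in doc_tokens),
        "neg_lexicon_count": sum(w in negative_lexicon for w in doc_tokens),
    }

print(lexicon_features("a great film with beautiful but harsh scenes".split()))
# {'pos_lexicon_count': 2, 'neg_lexicon_count': 1}
```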
Naive Bayes in Other tasks: Spam
Filtering
 Spam Assassin Features:
 Mentions millions of dollars ($NN,NNN,NNN.NN)
 From: starts with many numbers
 Subject is all capitals
 HTML has a low ratio of text to image area
 "One hundred percent guaranteed"
 Claims you can be removed from the list
Naive Bayes in Language ID

 Determining what language a piece of text is


written in.
Features based on character n-grams do very well
 Important to train on lots of varieties of each
language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
Summary: Naive Bayes is Not So Naive
 Very fast, low storage requirements
 Works well with very small amounts of training data
 Robust to irrelevant features
 Irrelevant features cancel each other out without affecting results
 Very good in domains with many equally important features
 Decision Trees suffer from fragmentation in such cases, especially if there is little data
 Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
 A good dependable baseline for text classification
Naïve Bayes: Relationship to Language
Modeling

 Generative Model for Multinomial Naïve Bayes

[Figure: the class node c = China generates the word nodes X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds]
Naïve Bayes and Language Modeling
 Naïve Bayes classifiers can use any sort of feature
 URL, email address, dictionaries, network features
 But if, as in the previous slides,
 we use only word features
 and we use all of the words in the text (not a subset)
 Then
 Naïve Bayes has an important similarity to language modeling.
Each class = a unigram language model

 Assigning each word: P(word | c)
 Assigning each sentence: P(s|c) = Π P(word|c)

Class pos unigram model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1

Sentence: "I love this fun film"
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
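The same computation as a tiny sketch, using the per-word probabilities from the slide:

```python
import math

# Unigram model for class "pos" (probabilities from the slide)
p_pos = {"i": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

sentence = "I love this fun film".lower().split()
prob = math.prod(p_pos[w] for w in sentence)
print(prob)  # ≈ 5e-07, i.e. 0.0000005
```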

Naïve Bayes: Relationship to Language Modeling

Precision, Recall, and F1

Evaluating Classifiers: How well does our classifier work?
Evaluating Classifiers: How well
does our classifier work?
Let's first address binary classifiers:
• Is this email spam?
spam (+) or not spam (-)
• Is this post about Delicious Pie Company?
about Del. Pie Co (+) or not about Del. Pie Co(-)

We'll need to know


1. What did our classifier say about each email or post?
2. What should our classifier have said, i.e., the correct
answer, usually as defined by humans ("gold label")
Evaluating Classifiers: How well
does our classifier work?
 Let's first consider binary classifiers, e.g. spam or not-spam. Or imagine that we are the proprietors of the Delicious Pie Company and we want to find out what people are saying about our pies on social media. We want to know whether a particular social media post talks about our pies positively or negatively.

 To evaluate such a binary classifier, we'll need to know two things: what our classifier said about each email or post, and what it should have said, i.e. the correct answer, usually as defined by human labelers.
First step in evaluation: The confusion matrix

The first step is the confusion matrix, a table for visualizing how an algorithm performs with respect to the human gold labels. Here we use two dimensions (system output and gold labels), and each cell labels a set of possible outcomes.

In the pie detection case, for example, true positives are posts that are indeed about Delicious Pie (indicated by human-created gold labels) and that our system correctly said were about pie. False negatives are posts that are indeed about pie but that our system incorrectly labeled as not about pie. False positives are posts that aren't about pie but that our system incorrectly said were. And true negatives are non-pie posts that our system correctly said were not about pie.
Accuracy on the confusion matrix

 Here is the equation for accuracy, i.e. what percentage of all the observations (for the spam or pie examples that means all emails or tweets) our system labeled correctly:

accuracy = (TP + TN) / (TP + FP + TN + FN)

 Although accuracy might seem a natural metric, we generally don't use it for text classification tasks.
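A quick sketch with made-up confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 70, 10, 30, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.96
```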
Why don't we use accuracy?

Accuracy doesn't work well when we're dealing


with uncommon or imbalanced classes
Suppose we look at 1,000,000 social media posts
to find Delicious Pie-lovers (or haters)
• 100 of them talk about our pie
• 999,900 are posts about something unrelated
Imagine the following simple classifier: every post is "not about pie"
Accuracy re: pie posts: 100 posts are about pie; 999,900 aren't
Why don't we use accuracy?

Accuracy of our "nothing is pie" classifier


999,900 true negatives and 100 false negatives
Accuracy is 999,900/1,000,000 = 99.99%!
But useless at finding pie-lovers (or haters)!!
Which was our goal!
Accuracy doesn't work well for unbalanced classes
Most tweets are not about pie!
Why don't we use accuracy?

 But this fabulous ‘no pie’ classifier would be completely useless, since
it wouldn’t find a single one of the customer comments we are
looking for. In other words, accuracy is not a good metric when the
goal is to discover something that is rare, or at least not completely
balanced in frequency, which is a very common situation in the world.
Instead of accuracy we use
precision and recall

Precision: % of selected items that are correct


Recall: % of correct items that are selected
Precision and Recall

 Precision: out of the things the system selected (the set of emails or tweets the system claimed were positive, i.e. spam or pie-related), how many did it get right? That is, how many were true positives out of everything selected: precision = TP / (TP + FP).

 Recall: out of all the items that should have been positive (the gold positives), what percentage did the system select? That is, how many did the system find as true positives: recall = TP / (TP + FN).

 So precision is about how much garbage we included in our findings; recall is more about making sure we didn't miss any treasure.
Precision and Recall

• 100 tweets talk about pie, 999,900 tweets don't


• Accuracy = 999,900/1,000,000 = 99.99%
But the Recall and Precision for this classifier are terrible:
Precision and Recall

 Recall and Precision will correctly evaluate our stupid "just say no" classifier as a bad classifier. The recall will be 0, since we returned no true positives out of the 100 true pie tweets (0 + 100). Precision is similarly 0, or in fact undefined, since both the numerator and denominator are 0. The metrics correctly assign bad scores to our useless classifier.
 [To get high precision, a system should be very reluctant to guess – but then it may miss some things and have poor recall]
 [To get high recall, a system should be very willing to guess – but then it may return some junk and have poor precision]
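The degenerate case in code, guarding the undefined 0/0 precision:

```python
tp, fp, fn = 0, 0, 100   # the "just say no" pie classifier

recall = tp / (tp + fn)                                # 0.0
precision = tp / (tp + fp) if (tp + fp) > 0 else None  # undefined (0/0)
print(precision, recall)  # None 0.0
```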
A combined measure: F1

 F1 is a combination of precision and recall.

F1 turns out to be the harmonic mean between precision and recall


F1 is a special case of the general "F-measure"

 F-measure is the (weighted) harmonic mean of precision and recall:

F = 1 / (α/P + (1−α)/R) = (β² + 1) P R / (β² P + R)

 F1 is the special case with β = 1 (α = ½):

F1 = 2 P R / (P + R)
F1 is a special case of the general
"F-measure"
 F1 is a special case of the F-measure: weighted harmonic mean of precision and recall.
 The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of
reciprocals.

 You can see here that the F score is a harmonic mean: if we replace α with ½ we get F1 = 2 / (1/P + 1/R) = 2PR / (P + R).
 The harmonic mean of two values is closer to the minimum of the two numbers than the arithmetic or geometric mean, so it weighs the lower of the two numbers more heavily.

That is, if P and R are far apart, F will be nearer the lower value, which makes it a kind of conservative mean in this situation. Thus to do well on F1, you have to do well on BOTH P and R.

Why the weights? In some applications you may care more about P or R. In practice we mainly use the balanced measure with β = 1 and α = ½.
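A short sketch computing precision, recall, and F1 from made-up counts; note how F1 sits closer to the lower of the two:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (beta = 1)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration
tp, fp, fn = 40, 10, 60
p = tp / (tp + fp)     # 0.8
r = tp / (tp + fn)     # 0.4
print(p, r, f1(p, r))  # 0.8 0.4 0.533... (pulled toward the lower value, recall)
```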
Suppose we have more than 2
classes
 Lots of text classification tasks have more than two classes.
 Sentiment analysis (positive, negative, neutral) , named entities
(person, location, organization)
 We can define precision and recall for multiple classes like this 3-way
email task:
Suppose we have more than 2 classes

 Lots of classification tasks have more than two classes; sentiment, for example, could be 3-way. Consider the confusion matrix for a hypothetical 3-way email categorization decision (urgent, normal, spam). Notice that the system mistakenly labeled one spam document as urgent. We can compute distinct precision and recall values for each class. For example, the precision of the urgent category is 8 (the true positives for urgent) over the true positives plus false positives (the 10 normal and the 1 spam documents also labeled urgent), i.e. 8 / (8 + 10 + 1). The result, however, is 3 separate precision values and 3 separate recall values!
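A sketch of per-class precision and recall from a confusion matrix. Only the "urgent" system row (8, 10, 1) is taken from the text; the other two rows are invented placeholders:

```python
# rows = system output, columns = gold label; order: urgent, normal, spam
classes = ["urgent", "normal", "spam"]
confusion = [
    [8, 10, 1],    # system said urgent (from the text)
    [5, 60, 50],   # system said normal (hypothetical)
    [3, 30, 200],  # system said spam (hypothetical)
]

for i, c in enumerate(classes):
    tp = confusion[i][i]
    precision = tp / sum(confusion[i])              # across the system row
    recall = tp / sum(row[i] for row in confusion)  # down the gold column
    print(f"{c}: precision={precision:.2f} recall={recall:.2f}")
# urgent: precision = 8/19 ≈ 0.42
```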
Reference

Chapter 4
Question
Thank you
