Multimedia Application L8
There are three types of Naive Bayes models in the scikit-learn
library:
Gaussian: It is used in classification and it assumes that features
follow a normal distribution.
Multinomial: It is used for discrete counts. For example, in a text
classification problem, instead of just recording whether a word occurs in
the document (a Bernoulli trial), we count how often the word occurs in the
document; you can think of each feature as “the number of times outcome
x_i is observed over the n trials”.
Bernoulli: The binomial model is useful if your feature vectors are
binary. One application would be text classification with ‘bag of words’
model where the 1s & 0s are “word occurs in the document” and “word
does not occur in the document” respectively.
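As a rough sketch of how these three variants look in practice, all of them share the same fit/predict interface in scikit-learn; the tiny arrays below are invented purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([1, 0, 1, 0])  # made-up class labels

# GaussianNB: continuous features assumed to follow a normal distribution per class.
X_cont = np.array([[1.8, 60.2], [1.6, 54.0], [1.9, 72.5], [1.5, 50.1]])
print(GaussianNB().fit(X_cont, y).predict([[1.7, 58.0]]))

# MultinomialNB: discrete counts, e.g. how often each word occurs in a document.
X_counts = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 0], [0, 3, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# BernoulliNB: binary features, e.g. whether a word occurs in a document at all.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))
```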
Example
Solution:
P(A|B) = (P(B|A) * P(A) )/ P(B)
1. Mango:
P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
= ((350/800) * (800/1200)) / (650/1200)
P(Yellow | Mango) ≈ 0.53
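The arithmetic can be checked directly; reading the fractions above as 350 of the 800 yellow fruits being mangoes, 800 yellow fruits, and 650 mangoes out of 1200 fruits in total, a few lines of Python reproduce the result:

```python
# Verify P(Yellow | Mango) via Bayes' rule using the counts from the example.
p_mango_given_yellow = 350 / 800   # P(Mango | Yellow)
p_yellow = 800 / 1200              # P(Yellow)
p_mango = 650 / 1200               # P(Mango)

p_yellow_given_mango = p_mango_given_yellow * p_yellow / p_mango
print(round(p_yellow_given_mango, 3))  # 0.538, rounded to 0.53 in the example above
```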
Text Classification
Clara never failed to be astonished by the extraordinary felicity of her own name.
She found it hard to trust herself to the mercy of fate, which had managed over
the years
to convert her greatest shame into one of her greatest assets…
Text Classification: definition
Input:
a document d
a fixed set of classes C = {c1, c2,…, cJ}
Output:
a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
Any kind of classifier
Naïve Bayes
Logistic regression
Support-vector machines
k-Nearest Neighbors
Naive Bayes Intuition
[Figure: a document is represented as a bag of words with counts (e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1), which the classifier maps to a class: γ(d) = c]
Training
MAP is “maximum a posteriori” = the most likely class:
c_MAP = argmax_{c ∈ C} P(c | d)
Bayes Rule:
c_MAP = argmax_{c ∈ C} (P(d | c) * P(c)) / P(d)
Dropping the denominator (P(d) is identical for every class):
c_MAP = argmax_{c ∈ C} P(d | c) * P(c)
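A quick sketch (with invented class scores) of why dropping the denominator is safe: dividing every class's score by the same P(d) cannot change which class attains the maximum.

```python
# Hypothetical unnormalized scores P(d | c) * P(c) for three classes.
scores = {"pos": 0.0012, "neg": 0.0007, "neutral": 0.0003}
p_d = sum(scores.values())  # the shared denominator P(d)

c_map_with_denominator = max(scores, key=lambda c: scores[c] / p_d)
c_map_without_denominator = max(scores, key=lambda c: scores[c])
assert c_map_with_denominator == c_map_without_denominator  # same winning class
print(c_map_with_denominator)  # pos
```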
Learning the Multinomial Naive Bayes Model
P̂(c_j) = N_c_j / N_total
Parameter estimation
Sentiment Example
A worked sentiment example with add-1 smoothing
1. Prior from training:
P̂(c_j) = N_c_j / N_total
P(-) = 3/5
P(+) = 2/5
2. Drop "with" (it does not appear in the training vocabulary)
3. Likelihoods from training:
P(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
4. Scoring the test set:
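Putting steps 1-4 together, here is a minimal Python sketch of multinomial Naive Bayes with add-1 smoothing. The five tiny training sentences are an assumption (a toy corpus in the spirit of the Chapter 4 example); what matters is that they reproduce the priors P(-) = 3/5 and P(+) = 2/5 above and that the test sentence contains the unknown word "with".

```python
from collections import Counter
import math

# Toy training corpus (assumed for illustration): 3 negative and 2 positive documents.
train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test = "predictable with no fun"

classes = sorted({label for label, _ in train})
vocab = {w for _, doc in train for w in doc.split()}

# 1. Priors: P(c) = N_c / N_total
prior = {c: sum(label == c for label, _ in train) / len(train) for c in classes}

# 3. Add-1 smoothed likelihoods: P(w | c) = (count(w, c) + 1) / (sum_w count(w, c) + |V|)
counts = {c: Counter(w for label, doc in train if label == c for w in doc.split())
          for c in classes}

def likelihood(w, c):
    return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))

# 2. + 4. Score each class, silently dropping words not in the vocabulary ("with").
for c in classes:
    log_score = math.log(prior[c]) + sum(math.log(likelihood(w, c))
                                         for w in test.split() if w in vocab)
    print(c, math.exp(log_score))   # "+": ~3.3e-05, "-": ~6.1e-05  ->  predict "-"
```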
Optimizing for sentiment analysis
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
Example, class pos: P(I | pos) = 0.1, P(love | pos) = 0.1, P(this | pos) = 0.01, P(fun | pos) = 0.05, P(film | pos) = 0.1.
For the sentence s = "I love this fun film":
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
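The sentence probability is just the product of the per-word (unigram) probabilities from the table above, which a few lines of Python confirm:

```python
# Per-word probabilities for class "pos" from the example above.
p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

p_s_given_pos = 1.0
for w in "I love this fun film".split():
    p_s_given_pos *= p_pos[w]
print(p_s_given_pos)  # 5e-07, i.e. 0.0000005
```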
Naïve Bayes: Relationship to Language Modeling
Precision, Recall, and F1
In the pie detection case, for example, true positives are posts that are
indeed about Delicious Pie (indicated by human-created gold labels) and that
our system correctly said were about pie. False negatives are posts that
are indeed about pie but that our system incorrectly labeled as not about pie.
False positives are posts that aren't about pie but that our system incorrectly
said were. And true negatives are non-pie posts that our system correctly
said were not about pie.
Accuracy on the confusion matrix
But this fabulous ‘no pie’ classifier (one that simply labels every post as
not about pie, and therefore scores very high accuracy when pie posts are
rare) would be completely useless, since it wouldn't find a single one of the
customer comments we are looking for. In other words, accuracy is not a good
metric when the goal is to discover something that is rare, or at least not
completely balanced in frequency, which is a very common situation in the world.
Instead of accuracy we use
precision and recall
Precision asks: out of the things the system selected (the set of emails
or tweets the system claimed were positive, i.e. spam or pie-related),
how many did it get right? That is, how many were true positives out of
everything the system selected: precision = true positives / (true positives + false positives).
Recall asks: out of all the items that should have been marked positive
(the gold positives), what percentage did the system actually select?
That is, recall = true positives / (true positives + false negatives).
The F score is a weighted harmonic mean of precision and recall:
F = 1 / (α/P + (1-α)/R). If we replace α with ½ we get F1 = 2 / (1/P + 1/R).
The harmonic mean of two values is closer to the minimum of the two numbers than the
arithmetic or geometric mean, so it weighs the lower of the two numbers more heavily.
That is, if P and R are far apart, F will be nearer the lower value, which makes it a kind of conservative
mean. Thus, to do well on F1, you have to do well on BOTH P and R.
Why the weights? In some applications you may care more about P or about R. In practice we
mainly use the balanced measure, with β = 1 (equivalently α = ½).
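A small sketch with made-up counts shows the formulas side by side, including the weighted F_beta form mentioned above:

```python
# Precision, recall, and F-measure from hypothetical confusion-matrix counts.
tp, fp, fn = 60, 40, 140   # invented numbers, purely for illustration

precision = tp / (tp + fp)          # of what the system selected, how much was right
recall = tp / (tp + fn)             # of the gold positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean (beta = 1)

# General weighted form: F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)
beta = 2.0                          # beta > 1 weighs recall more heavily
f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

print(precision, recall, round(f1, 3), round(f_beta, 3))
# 0.6 0.3 0.4 0.333 -- note F1 (0.4) sits nearer the lower of P and R than the
# arithmetic mean (0.45) would, illustrating the "conservative mean" behaviour.
```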
Suppose we have more than 2 classes
Lots of text classification tasks have more than two classes.
Sentiment analysis (positive, negative, neutral) , named entities
(person, location, organization)
We can define precision and recall for multiple classes like this 3-way
email task:
Suppose we have more than 2 classes
Lots of classification tasks have more than two classes, like sentiment
could be 3-way. Consider the confusion matrix for a hypothetical 3-
way email categorization decision (urgent, normal, spam). Notice that
the system mistakenly labeled one spam document as urgent. We can
compute distinct precision and recall values for each class. For
example, the precision of the urgent category is 8 (the true positive
urgents) divided by the true positives plus false positives (those 8, plus
the 10 normal and the 1 spam documents the system mislabeled as urgent),
i.e. 8/19 ≈ 0.42. The result, however, is 3 separate precision values and
3 separate recall values!
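A per-class computation might look like the sketch below. Only the urgent-column counts (8 true positives, plus 10 normal and 1 spam mislabeled as urgent) come from the discussion above; the remaining cells of the matrix are assumed for illustration.

```python
# Per-class precision and recall for a 3-way (urgent / normal / spam) task.
labels = ["urgent", "normal", "spam"]
# confusion[system_label][gold_label] = count; non-urgent cells are made up.
confusion = {
    "urgent": {"urgent": 8,  "normal": 10, "spam": 1},
    "normal": {"urgent": 5,  "normal": 60, "spam": 50},
    "spam":   {"urgent": 3,  "normal": 30, "spam": 200},
}

for c in labels:
    tp = confusion[c][c]
    selected = sum(confusion[c].values())              # everything the system called c
    gold = sum(confusion[s][c] for s in labels)        # everything that really is c
    print(f"{c}: precision={tp/selected:.2f} recall={tp/gold:.2f}")
# urgent: precision=0.42 recall=0.50  (the 8/19 example above), plus two more rows
```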
Reference
Chapter 4
Question
Thank you