Classification
x = a single document
y = Ancient Greek
Classification
h(x) = y
h(μῆνιν ἄειδε θεὰ) = Ancient Greek
Classification
Let h(x) be the “true”
mapping. We never know it.
How do we find the best ĥ(x)
to approximate it?
One option: rule-based
    if x has characters in the Unicode code point range 0370–03FF:
        ĥ(x) = Greek
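A minimal sketch of this rule in Python (the function name and the fallback label "other" are illustrative assumptions; 0370–03FF is the Greek and Coptic block):

```python
def h_hat(x: str) -> str:
    """Rule-based classifier: label a document as Greek if it contains
    any character in the Greek and Coptic Unicode block (0370-03FF)."""
    if any(0x0370 <= ord(ch) <= 0x03FF for ch in x):
        return "greek"
    return "other"

print(h_hat("μῆνιν ἄειδε θεὰ"))  # greek
print(h_hat("sing, goddess"))    # other
```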
Classification
Supervised learning: learn a function f : 𝒳 → 𝒴 from labeled ⟨x, y⟩ pairs.
[Table: example tasks with their input space 𝒳 and output space 𝒴]
• Political/product opinion mining
• Twitter sentiment: Dodds et al. (2011), "Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter," PLoS ONE.
Sentiment as tone
Golder and Macy (2011), “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across
Diverse Cultures,” Science. Positive affect (PA) and negative affect (NA) measured with LIWC.
Sentiment Dictionaries
• General Inquirer (1966)
• MPQA subjectivity lexicon (Wilson et al. 2005): http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
• NRC Word-Emotion Association Lexicon (EmoLex), Mohammad and Turney 2013

Example entries:
  pos: unlimited, prudent, superb, closeness, destined, blessing, steadfastly
  neg: lag, contortions, fright, lonely, outrage, allegations, disoriented
LIWC (Linguistic Inquiry and Word Count)
• 73 separate lexicons designed for applications in social psychology (SLP3)
Why is SA hard?
• Sentiment is a measure of a speaker’s private state,
which is unobservable.
Training pairs ⟨x, y⟩ for learning ĥ(x):

  x                 y
  loved it!         positive
  terrible movie    negative
  not too shabby    positive
• The classification function that we want to learn has two different components. One of them is the representation of the data. Here, a text is represented only as the counts of the words it contains:

  word      doc 1   doc 2
  the         1       1
  of          0       0
  hate        0       9
  genius      1       0
  stupid      0       1
  like        0       1
  …
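A minimal bag-of-words sketch in Python (whitespace tokenization and lowercasing are simplifying assumptions):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent a document only as the counts of the words it contains."""
    return Counter(text.lower().split())

print(bag_of_words("I hate this stupid movie hate it"))
# Counter({'hate': 2, 'i': 1, 'this': 1, 'stupid': 1, 'movie': 1, 'it': 1})
```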
Naive Bayes
• Given access to <x,y> pairs in training data, we can
train a model to estimate the class probabilities for a
new review.
X ∈ {1, 2, 3, 4, 5, 6}

Two conditions:
1. Between 0 and 1: 0 ≤ P(X = x) ≤ 1
2. Sum of all probabilities = 1: Σₓ P(X = x) = 1
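A quick check of both conditions in Python, using a fair die as the example distribution:

```python
p = {x: 1/6 for x in range(1, 7)}  # fair die: uniform over {1,...,6}

assert all(0 <= pr <= 1 for pr in p.values())  # condition 1
assert abs(sum(p.values()) - 1.0) < 1e-9       # condition 2
```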
Fair dice
X ∈ {1, 2, 3, 4, 5, 6}
[Bar chart: a uniform distribution; every face has probability 1/6 ≈ 0.17]
Weighted dice
X ∈ {1, 2, 3, 4, 5, 6}
[Bar chart: a non-uniform ("not fair") distribution over the six faces]
Inference
X ∈ {1, 2, 3, 4, 5, 6}
Given a sequence of rolls, which die produced it: the fair one or the weighted one?
[Two bar charts, labeled "fair" and "not fair", showing each die's distribution over faces 1–6]
Probability
[Repeated slide frames: the "fair" and "not fair" bar charts shown side by side while the observed roll sequence grows one roll at a time: 2, then 2 6, then 2 6 6, and so on up to 2 6 6 1 6 3 6 6 3 6. The final frame asks which die generated the sequence.]
Independence
• Two random variables are independent if: P(A, B) = P(A) P(B)
• In general: P(x₁, …, xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ)
• Information about one random variable (B) gives no
information about the value of another (A)
P( 2 6 6 | fair die ) = 0.17 × 0.17 × 0.17 ≈ 0.004913
[Bar chart: the fair die's uniform distribution over faces 1–6]
P( 2 6 6 | weighted die ) = 0.1 × 0.5 × 0.5 = 0.025
[Bar chart: the "not fair" die's distribution over faces 1–6]
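A sketch of both computations in Python (the weighted die's probabilities for faces other than 2 and 6 are not shown on the slide, so only the needed faces appear):

```python
import math

fair     = {x: 1/6 for x in range(1, 7)}  # uniform
weighted = {2: 0.1, 6: 0.5}               # faces needed for this sequence

rolls = [2, 6, 6]

def likelihood(rolls, p):
    """Product of per-roll probabilities, assuming independent rolls."""
    return math.prod(p[x] for x in rolls)

print(likelihood(rolls, fair))      # ~0.00463 (0.004913 on the slide uses the rounded 0.17)
print(likelihood(rolls, weighted))  # 0.025
```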
Data Likelihood
[Bar chart: unigram probabilities for "the", "of", "hate", "like", "stupid"]
Unigram probability
[Two bar charts: unigram distributions over "the", "of", "hate", "like", "stupid"]

P(X = the) = (# the) / (# total words)
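A sketch of this estimate in Python (the toy corpus is an illustrative assumption):

```python
from collections import Counter

corpus = "i like the movie but i hate the ending".split()
counts = Counter(corpus)
total = len(corpus)

p_the = counts["the"] / total  # (# the) / (# total words)
print(p_the)  # 2/9 ≈ 0.222
```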
Maximum Likelihood Estimate
• This is a maximum likelihood estimate for P(X): the parameter values under which the data we observe (X) are most likely.
Maximum Likelihood Estimate
Observed rolls: 2 6 6 1 6 3 6 6 3 6
The MLE sets each face's probability to its relative frequency in the data:
θ̂₁ = 0.1, θ̂₂ = 0.1, θ̂₃ = 0.2, θ̂₄ = 0.0, θ̂₅ = 0.0, θ̂₆ = 0.6
[Bar chart: the MLE distribution over faces 1–6]
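The same estimate, θ̂ᵢ = nᵢ / N, computed in Python:

```python
from collections import Counter

rolls = [2, 6, 6, 1, 6, 3, 6, 6, 3, 6]
counts = Counter(rolls)
N = len(rolls)

theta_hat = {face: counts[face] / N for face in range(1, 7)}
print(theta_hat)  # {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.0, 5: 0.0, 6: 0.6}
```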
Conditional Probability
P(X = x | Y = y)
P(Xᵢ = hate | Y = positive)
Sentiment analysis
Y = Positive vs. Y = Negative

log ∏ᵢ xᵢ = Σᵢ log xᵢ
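Why sum logs instead of multiplying probabilities: products of many small numbers underflow to zero in floating point, while log sums stay finite. A tiny illustration with arbitrary values:

```python
import math

probs = [1e-5] * 100          # 100 small per-word probabilities
product = math.prod(probs)    # underflows to 0.0
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # ≈ -1151.29, still usable for comparing classes
```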
A simple classifier
• The classifier we just specified is a maximum likelihood
classifier, where we compare the likelihood of the data under
each class and choose the class with the highest likelihood
P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y′ P(Y = y′) P(X = x | Y = y′)
• A cab was involved in a hit-and-run accident at night. 85% of the cabs in the city are Green and 15% are Blue.
• A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each of the two colors 80% of the time and failed 20% of the time.
• What is the probability that the cab involved in the accident was Blue rather than Green, given that this witness identified it as Blue?
Prior
P(Y = Green) = 0.85
P(Y = Blue) = 0.15
Posterior
P(Y = Blue ∣ X = Blue)
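A sketch of the Bayes-rule computation for the cab problem in Python:

```python
# Priors
p_green, p_blue = 0.85, 0.15
# Witness reliability: P(says Blue | actual color)
p_say_blue_given_blue  = 0.80
p_say_blue_given_green = 0.20

numerator   = p_blue * p_say_blue_given_blue
denominator = numerator + p_green * p_say_blue_given_green
posterior = numerator / denominator

print(posterior)  # ≈ 0.41: still more likely Green, despite the testimony
```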
P(Y = y) = (# texts with label y) / (# total texts)
Smoothing
• Maximum likelihood estimates can fail miserably
when features are never observed with a particular
class.
[Bar chart: an MLE that assigns zero probability to faces never observed, e.g. after seeing only the rolls 2, 4, 6]
Smoothing
• One solution: add a little probability mass to every
element.
[Bar charts: the estimate after adding probability mass to every face; smoothing with α = 1 gives every outcome nonzero probability]
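A sketch of additive smoothing in Python, using θ̂ᵢ = (nᵢ + α) / (N + αK) with K = 6 outcomes (the observed rolls are an illustrative assumption):

```python
from collections import Counter

rolls = [2, 4, 6]        # only three faces observed
alpha, K = 1.0, 6
counts = Counter(rolls)
N = len(rolls)

theta = {face: (counts[face] + alpha) / (N + alpha * K) for face in range(1, 7)}
print(theta)
# unseen faces get (0 + 1) / (3 + 6) ≈ 0.111 instead of 0.0
```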
Naive Bayes training
Training a Naive Bayes classifier consists of estimating
these two quantities from training data for all classes y
P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y′ P(Y = y′) P(X = x | Y = y′)
θ̂ᵢ = nᵢ / N
Examples:
• Probability of a particular feature being true
(e.g., review contains “hate”)
p̂_MLE = (1/N) Σᵢ₌₁ᴺ xᵢ
Bernoulli Naive Bayes
Rows are features; columns are data points:

        x1  x2  x3  x4  x5  x6  x7  x8
  f1     1   0   0   0   1   1   0   0
  f2     0   0   0   0   0   0   1   0
  f3     1   1   1   1   1   0   0   1
  f4     1   0   0   1   1   0   0   1
  f5     0   0   0   0   0   0   0   0
Bernoulli Naive Bayes
x1–x4 are the Positive class; x5–x8 are the Negative class. The per-class MLE for each feature is its mean within that class:

        x1  x2  x3  x4  x5  x6  x7  x8   pMLE,P  pMLE,N
  f1     1   0   0   0   1   1   0   0    0.25    0.50
  f2     0   0   0   0   0   0   1   0    0.00    0.25
  f3     1   1   1   1   1   0   0   1    1.00    0.50
  f4     1   0   0   1   1   0   0   1    0.50    0.50
  f5     0   0   0   0   0   0   0   0    0.00    0.00
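A sketch of Bernoulli Naive Bayes on this table in Python, assuming a uniform class prior and scoring in log space as above; smoothing is omitted so the estimates match the table, which means a zero probability can rule out a class entirely:

```python
import math

# Feature matrix from the slide: rows f1..f5, columns x1..x8
X = [
    [1, 0, 0, 0, 1, 1, 0, 0],  # f1
    [0, 0, 0, 0, 0, 0, 1, 0],  # f2
    [1, 1, 1, 1, 1, 0, 0, 1],  # f3
    [1, 0, 0, 1, 1, 0, 0, 1],  # f4
    [0, 0, 0, 0, 0, 0, 0, 0],  # f5
]
classes = {"P": [0, 1, 2, 3], "N": [4, 5, 6, 7]}  # x1-x4 positive, x5-x8 negative

# MLE: per-class probability that each feature is 1 (mean over that class's columns)
p = {y: [sum(row[j] for j in cols) / len(cols) for row in X]
     for y, cols in classes.items()}
print(p["P"])  # [0.25, 0.0, 1.0, 0.5, 0.0]
print(p["N"])  # [0.5, 0.25, 0.5, 0.5, 0.0]

def score(x, y, prior=0.5):
    """log P(Y=y) + sum_i log P(X_i = x_i | Y=y); -inf if any factor is zero."""
    s = math.log(prior)
    for xi, pi in zip(x, p[y]):
        pr = pi if xi == 1 else 1 - pi
        if pr == 0:
            return float("-inf")
        s += math.log(pr)
    return s

x_new = [1, 0, 1, 1, 0]  # a hypothetical new data point
print(max(classes, key=lambda y: score(x_new, y)))  # P
```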
Tricks for SA
• Use sentiment lexicons as features, e.g. the NRC Word-Emotion Association Lexicon (EmoLex), Mohammad and Turney 2013.
• Be sure to do the reading!