
Natural Language Processing

Lecture 2: Text classification


Classification

A mapping h from input data x (drawn from an instance space 𝓧) to a label (or labels) y from some enumerable output space 𝓨.

𝓧 = set of all documents
𝓨 = {english, mandarin, greek, … }

x = a single document
y = ancient greek
Classification

h(x) = y
h(μῆνιν ἄειδε θεὰ) = ancient grc
Classification

Let h(x) be the “true” mapping; we never know it. How do we find the best ĥ(x) to approximate it?

One option: rule-based

if x has characters in Unicode code point range 0370–03FF:
    ĥ(x) = greek
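
A minimal Python sketch of this rule-based classifier (the function name h_hat and the "other" fallback label are illustrative; U+0370–U+03FF is the Greek and Coptic block):

def h_hat(x: str) -> str:
    # Rule: label the document "greek" if it contains any character
    # in the Greek and Coptic Unicode block (U+0370–U+03FF).
    if any(0x0370 <= ord(ch) <= 0x03FF for ch in x):
        return "greek"
    return "other"

print(h_hat("μῆνιν ἄειδε θεὰ"))  # greek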
Classification

Supervised learning: given training data in the form of <x, y> pairs, learn ĥ(x).
Text categorization problems

task | 𝓧 | 𝓨
language ID | text | {english, mandarin, greek, … }
spam classification | email | {spam, not spam}
authorship attribution | text | {jk rowling, james joyce, … }
genre classification | novel | {detective, romance, gothic, … }
sentiment analysis | text | {positive, negative, neutral, mixed}

Sentiment analysis

• Document-level SA: is the entire text positive or negative (or both/neither) with respect to an implicit target?

• Movie reviews [Pang et al. 2002, Turney 2002]

Training data

positive: “… is a film which still causes real, not figurative, chills to run along my spine, and it is certainly the bravest and most ambitious fruit of Coppola's genius”
Roger Ebert, Apocalypse Now

negative: “I hated this movie. Hated hated hated hated hated this movie. Hated it. Hated every simpering stupid vacant audience-insulting moment of it. Hated the sensibility that thought anyone would like it.”
Roger Ebert, North
• Implicit signal: star ratings

• Either treat as an ordinal regression problem ({1, 2, 3, 4, 5}) or binarize the labels into {pos, neg}

Sentiment analysis

• Is the text positive or negative (or both/neither) with respect to an explicit target within the text?

Hu and Liu (2004), “Mining and Summarizing Customer Reviews”
Sentiment analysis

• Political/product opinion mining

[figures: Twitter sentiment; job approval polls]

O’Connor et al. (2010), “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series”
Sentiment as tone

• No longer the speaker’s attitude with respect to some particular target, but rather the positive/negative tone that is evinced.
Sentiment as tone

Dodds et al. (2011), "Temporal patterns of happiness and information in a global social network:
Hedonometrics and Twitter" (PLoS One)
Sentiment as tone

Golder and Macy (2011), “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across
Diverse Cultures,” Science. Positive affect (PA) and negative affect (NA) measured with LIWC.
Sentiment Dictionaries

• General Inquirer (1966)
• MPQA subjectivity lexicon (Wilson et al. 2005), http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
• LIWC (Linguistic Inquiry and Word Count, Pennebaker 2015)
• AFINN (Nielsen 2011)
• NRC Word-Emotion Association Lexicon (EmoLex), Mohammad and Turney 2013

Example entries:

pos | neg
unlimited | lag
prudent | contortions
superb | fright
closeness | lonely
impeccably | tenuously
fast-paced | plebeian
treat | mortification
destined | outrage
blessing | allegations
steadfastly | disoriented
LIWC

• 73 separate lexicons designed for applications in social psychology

[figure: SLP3]
Why is SA hard?

• Sentiment is a measure of a speaker’s private state, which is unobservable.

• Sometimes words are a good indicator of sentiment (love, amazing, hate, terrible); many times it requires deep world + contextual knowledge

“Valentine’s Day is being marketed as a Date Movie. I think it’s more of a First-Date Movie. If your date likes it, do not date that person again. And if you like it, there may not be a second date.”
Roger Ebert, Valentine’s Day
Classification

Supervised learning: given training data in the form of <x, y> pairs, learn ĥ(x).

x | y
loved it! | positive
terrible movie | negative
not too shabby | positive
ĥ(x)

• The classification function that we want to learn has two different components:

• the formal structure of the learning method (what’s the relationship between the input and output?) → Naive Bayes, logistic regression, convolutional neural network, etc.

• the representation of the data


Representation for SA

• Only positive/negative words in MPQA

• Only words in isolation (bag of words)

• Conjunctions of words (sequential, skip ngrams, other non-linear combinations)

• Higher-order linguistic structure (e.g., syntax)

“ … is a film which still causes real, not figurative,
chills to run along my spine, and it is certainly the
bravest and most ambitious fruit of Coppola's genius”

Roger Ebert, Apocalypse Now

“I hated this movie. Hated hated hated


hated hated this movie. Hated it. Hated
every simpering stupid vacant audience-
insulting moment of it. Hated the
sensibility that thought anyone would like
it.”
Roger Ebert, North
Bag of words

Representation of text only as the counts of words that it contains:

word | Apocalypse Now | North
the | 1 | 1
of | 0 | 0
hate | 0 | 9
genius | 1 | 0
bravest | 1 | 0
stupid | 0 | 1
like | 0 | 1
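
A minimal bag-of-words sketch in Python (whitespace tokenization and lowercasing are illustrative simplifications; real tokenizers also handle punctuation):

from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Represent a document only as the counts of the words it contains.
    return Counter(text.lower().split())

print(bag_of_words("loved loved this movie"))
# Counter({'loved': 2, 'this': 1, 'movie': 1})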


Naive Bayes

• Given access to <x, y> pairs in training data, we can train a model to estimate the class probabilities for a new review.

• With a bag of words representation (in which each word is independent of the others), we can use Naive Bayes.

• Probabilistic model; not as accurate as other models (see next two classes) but fast to train, and the foundation for many other probabilistic techniques.
Random variable

• A variable that can take values within a fixed set (discrete) or within some range (continuous).

X ∈ {1, 2, 3, 4, 5, 6}

X ∈ {the, a, dog, cat, runs, to, store}

P(X = x): probability that the random variable X takes the value x (e.g., 1)

X ∈ {1, 2, 3, 4, 5, 6}

Two conditions:
1. Between 0 and 1: 0 ≤ P(X = x) ≤ 1
2. Sum of all probabilities = 1: Σ_x P(X = x) = 1
Fair dice

X ∈ {1, 2, 3, 4, 5, 6}

[bar chart: uniform distribution, each face with probability 1/6]

Weighted dice

X ∈ {1, 2, 3, 4, 5, 6}

[bar chart: non-uniform distribution over the six faces]
Inference

X ∈ {1, 2, 3, 4, 5, 6}

We want to infer the probability distribution that generated the data we see.
[bar charts: the fair distribution (uniform) vs. a not-fair distribution; which one generated the data?]

[sequence of slides: observed rolls accumulate one at a time (2, 6, 6, 1, 6, 3, 6, 6, 3, 6), each shown against the fair and not-fair distributions, ending with the question of which distribution produced them]
Independence

• Two random variables are independent if:

P(A, B) = P(A) × P(B)

• In general:

P(x1, …, xn) = P(x1) × P(x2) × … × P(xn) = ∏_{i=1..n} P(xi)

• Information about one random variable (B) gives no information about the value of another (A):

P(A) = P(A | B)    P(B) = P(B | A)


Data Likelihood

P(2 6 6 | fair) = .17 × .17 × .17 = 0.004913

P(2 6 6 | not fair) = .1 × .5 × .5 = 0.025

[bar charts: the fair and not-fair distributions]
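
A quick check of these likelihoods in Python. The fair distribution is uniform; for the not-fair die only P(2) = .1 and P(6) = .5 are given on the slide, so the remaining faces are filled in (an assumption) just to make the example run:

import math

fair = {face: 1/6 for face in range(1, 7)}
not_fair = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}  # faces 1, 3, 4, 5 assumed

def likelihood(rolls, theta):
    # Probability of the observed rolls, assuming independent draws.
    return math.prod(theta[r] for r in rolls)

print(likelihood([2, 6, 6], fair))      # ≈ 0.00463 (the slide rounds 1/6 to .17, giving 0.004913)
print(likelihood([2, 6, 6], not_fair))  # 0.025 = .1 × .5 × .5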
Data Likelihood

• The likelihood gives us a way of discriminating between possible alternative parameters, but also a strategy for picking a single best parameter among all possibilities.
Word choice as weighted dice

[bar chart: unigram probabilities for the, of, hate, like, stupid]
Unigram probability

[bar charts: unigram probabilities of the, of, hate, like, stupid in positive reviews vs. negative reviews]

P(X = the) = (# the) / (# total words)
Maximum Likelihood
Estimate
• This is a maximum likelihood estimate for P(X); the
parameter values for which the data we observe (X)
is most likely.
Maximum Likelihood Estimate

Observed rolls: 2 6 6 1 6 3 6 6 3 6

[bar charts: three candidate parameter settings θ1, θ2, θ3 for the die]

θ1: P(X | θ1) = 0.0000311040
θ2: P(X | θ2) = 0.0000000992 (313x less likely)
θ3: P(X | θ3) = 0.0000031250 (10x less likely)
Conditional Probability

P(X = x | Y = y)

• Probability that one random variable takes a particular value given the fact that a different variable takes another

P(Xi = hate | Y = ⊖)
Sentiment analysis

“really really the worst movie ever”

Independence Assumption

really really the worst movie ever
x1     x2     x3  x4    x5    x6

P(really, really, the, worst, movie, ever) = P(really) × P(really) × P(the) × … × P(ever)
Independence Assumption

really really the worst movie ever
x1     x2     x3  x4    x5    x6

We will assume the features are independent given the class:

P(x1, x2, x3, x4, x5, x6 | c) = P(x1 | c) P(x2 | c) … P(x6 | c)

P(x1, …, xN | c) = ∏_{i=1..N} P(xi | c)
A simple classifier

really really the worst movie ever

Y=Positive                        Y=Negative

P(X=really | Y=⊕) = 0.0010        P(X=really | Y=⊖) = 0.0012
P(X=really | Y=⊕) = 0.0010        P(X=really | Y=⊖) = 0.0012
P(X=the | Y=⊕) = 0.0551           P(X=the | Y=⊖) = 0.0518
P(X=worst | Y=⊕) = 0.0001         P(X=worst | Y=⊖) = 0.0004
P(X=movie | Y=⊕) = 0.0032         P(X=movie | Y=⊖) = 0.0045
P(X=ever | Y=⊕) = 0.0005          P(X=ever | Y=⊖) = 0.0005

A simple classifier

really really the worst movie ever

P(X = “really really the worst movie ever” | Y = ⊕)
= P(X=really | Y=⊕) × P(X=really | Y=⊕) × P(X=the | Y=⊕) × P(X=worst | Y=⊕) × P(X=movie | Y=⊕) × P(X=ever | Y=⊕)
= 6.00e-18

P(X = “really really the worst movie ever” | Y = ⊖)
= P(X=really | Y=⊖) × P(X=really | Y=⊖) × P(X=the | Y=⊖) × P(X=worst | Y=⊖) × P(X=movie | Y=⊖) × P(X=ever | Y=⊖)
= 6.20e-17
Aside: use logs

• Multiplying lots of small probabilities (all are under 1) can lead to numerical underflow (converging to 0)

log ∏_i xi = Σ_i log xi
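
A small illustration in Python, using the ⊕ word probabilities from the table above; the same score is computed as a direct product and as a sum of logs (the slides report 6.00e-18, presumably from unrounded probabilities):

import math

probs = [0.0010, 0.0010, 0.0551, 0.0001, 0.0032, 0.0005]

product = math.prod(probs)                   # fine here, but underflows to 0.0 with many factors
log_score = sum(math.log(p) for p in probs)  # sum of logs instead of product of probabilities

print(product)              # ≈ 8.8e-18
print(math.exp(log_score))  # same value, recovered from log space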
A simple classifier

• The classifier we just specified is a maximum likelihood classifier, where we compare the likelihood of the data under each class and choose the class with the highest likelihood.

Likelihood: probability of data (here, under class y): P(X = x1, …, xn | Y = y)

Prior probability of class y: P(Y = y)


Bayes’ Rule

Numerator: prior belief that Y = y (before you see any data) × likelihood of the data given that Y = y.

P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y' P(Y = y') P(X = x | Y = y')

Posterior belief that Y = y given that X = x.

Bayes’ Rule

Numerator: prior belief that Y = positive (before you see any data) × likelihood of “really really the worst movie ever” given that Y = positive.

P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y' P(Y = y') P(X = x | Y = y')

Denominator: the sum ranges over y = positive and y = negative (so that the posterior sums to 1).

Posterior belief that Y = positive given that X = “really really the worst movie ever”.
Likelihood: probability of data (here, under class y): P(X = x1, …, xn | Y = y)

Prior probability of class y: P(Y = y)

Posterior belief in the probability of class y after seeing data: P(Y = y | X = x1, …, xn)
Naive Bayes Classifier

P(Y = ⊕ | X) = P(Y = ⊕) P(X = “really . . . ” | Y = ⊕) / [ P(Y = ⊕) P(X = “really . . . ” | Y = ⊕) + P(Y = ⊖) P(X = “really . . . ” | Y = ⊖) ]

Let’s say P(Y=⊕) = P(Y=⊖) = 0.5 (i.e., both are equally likely a priori):

0.5 × (6.00 × 10⁻¹⁸) / [ 0.5 × (6.00 × 10⁻¹⁸) + 0.5 × (6.2 × 10⁻¹⁷) ]

P(Y = ⊕ | X = “really . . . ”) = 0.088
P(Y = ⊖ | X = “really . . . ”) = 0.912
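
A minimal sketch of this posterior computation in Python, plugging in the prior 0.5 and the two likelihoods reported above (the function name is illustrative):

def posteriors(prior_pos, prior_neg, lik_pos, lik_neg):
    # Bayes' rule for two classes; the denominator sums over both.
    num_pos = prior_pos * lik_pos
    num_neg = prior_neg * lik_neg
    z = num_pos + num_neg
    return num_pos / z, num_neg / z

p_pos, p_neg = posteriors(0.5, 0.5, 6.00e-18, 6.2e-17)
print(round(p_pos, 3), round(p_neg, 3))  # 0.088 0.912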
Naive Bayes Classifier

• To turn probabilities into classification decisions, we just select the label with the highest posterior probability:

ŷ = argmax_{y ∈ 𝓨} P(Y = y | X)

P(Y = ⊕ | X = “really . . . ”) = 0.088
P(Y = ⊖ | X = “really . . . ”) = 0.912
Taxicab Problem
“A cab was involved in a hit and run accident at night. Two cab companies,
the Green and the Blue, operate in the city. You are given the following
data:

• 85% of the cabs in the city are Green and 15% are Blue.

• A witness identified the cab as Blue. The court tested the reliability of
the witness under the same circumstances that existed on the night of
the accident and concluded that the witness correctly identified each
one of the two colors 80% of the time and failed 20% of the time.

What is the probability that the cab involved in the accident was Blue rather
than Green knowing that this witness identified it as Blue?”

(Tversky & Kahneman 1981)


Y = true color of cab; X = reported color of cab

Prior:
P(Y = Green) = 0.85
P(Y = Blue) = 0.15

Likelihood:
P(X = Green ∣ Y = Blue) = 0.20
P(X = Blue ∣ Y = Blue) = 0.80
P(X = Green ∣ Y = Green) = 0.80
P(X = Blue ∣ Y = Green) = 0.20
Y = true color of cab; X = reported color of cab

Posterior:
P(Y = Blue ∣ X = Blue)

What we care about is this posterior value (the probability that the cab is blue given that the witness said it was blue). We can’t measure it directly, but we can plug the prior and likelihood into Bayes’ rule to get our answer.
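
Plugging the prior and likelihood above into Bayes’ rule (a quick check in Python; the arithmetic is not shown on the slides):

prior_blue, prior_green = 0.15, 0.85
lik_blue_given_blue = 0.80    # witness says Blue when the cab is Blue
lik_blue_given_green = 0.20   # witness says Blue when the cab is Green

numerator = prior_blue * lik_blue_given_blue                  # 0.12
denominator = numerator + prior_green * lik_blue_given_green  # 0.12 + 0.17 = 0.29
print(numerator / denominator)  # ≈ 0.41: still more likely Green, despite the Blue report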
Prior Belief

• Now let’s assume that there are 1000 times more positive reviews than negative reviews.

• P(Y = negative) = 0.000999
• P(Y = positive) = 0.999001

0.999001 × (6.00 × 10⁻¹⁸) / [ 0.999001 × (6.00 × 10⁻¹⁸) + 0.000999 × (6.2 × 10⁻¹⁷) ]

P(Y = ⊕ | X = “really . . . ”) = 0.990
P(Y = ⊖ | X = “really . . . ”) = 0.010
Priors

• Priors can be informed (reflecting expert knowledge), but in practice priors in Naive Bayes are often simply estimated from training data:

P(Y = y) = (# texts with label y) / (# total texts)
Smoothing

• Maximum likelihood estimates can fail miserably when features are never observed with a particular class.

What’s the probability of: 2 4 6?

[bar chart: MLE from the observed rolls; 4 was never observed, so its estimated probability is 0]
Smoothing

• One solution: add a little probability mass to every element.

Maximum likelihood estimate:

P(xi | y) = ni,y / ny

Smoothed estimates:

P(xi | y) = (ni,y + α) / (ny + V α)              (same α for all xi)

P(xi | y) = (ni,y + αi) / (ny + Σ_{j=1..V} αj)   (possibly different α for each xi)

ni,y = count of word i in class y
ny = number of words in y
V = size of vocabulary
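
A minimal sketch of the add-α estimate in Python, assuming a toy vocabulary and token list (both illustrative):

from collections import Counter

def smoothed_probs(tokens, vocab, alpha=1.0):
    # Add-alpha smoothed P(word | class): every vocabulary item gets
    # a little probability mass, even if it never occurs with this class.
    counts = Counter(tokens)
    n_y = len(tokens)          # n_y: total words observed in this class
    V = len(vocab)             # V: size of the vocabulary
    return {w: (counts[w] + alpha) / (n_y + V * alpha) for w in vocab}

vocab = ["the", "of", "hate", "genius", "bravest", "stupid", "like"]
tokens = ["the"] + ["hate"] * 9 + ["stupid", "like"]
probs = smoothed_probs(tokens, vocab)
print(round(probs["hate"], 3), round(probs["of"], 3))  # 0.526 0.053: unseen "of" is nonzero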
Smoothing

[bar charts: maximum likelihood estimate vs. smoothing with α = 1]
Naive Bayes training

Training a Naive Bayes classifier consists of estimating these two quantities from training data for all classes y:

P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / Σ_y' P(Y = y') P(X = x | Y = y')

At test time, use those estimated probabilities to calculate the posterior probability of each class y and select the class with the highest probability.
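
Putting the pieces together, here is a compact sketch (illustrative code, not the course’s reference implementation) of multinomial Naive Bayes with add-α smoothing, trained on the toy <x, y> pairs from earlier and applied at test time:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.log_prior, self.log_lik = {}, defaultdict(dict)
        counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            counts[y].update(doc)
        for c in self.classes:
            self.log_prior[c] = math.log(labels.count(c) / len(labels))  # P(Y = c)
            n_c = sum(counts[c].values())
            for w in self.vocab:
                # add-alpha smoothed P(w | c)
                p = (counts[c][w] + self.alpha) / (n_c + self.alpha * len(self.vocab))
                self.log_lik[c][w] = math.log(p)
        return self

    def predict(self, doc):
        # posterior is proportional to prior × likelihood; compare in log space
        scores = {c: self.log_prior[c] + sum(self.log_lik[c][w] for w in doc if w in self.vocab)
                  for c in self.classes}
        return max(scores, key=scores.get)

docs = [["loved", "it"], ["terrible", "movie"], ["not", "too", "shabby"]]
labels = ["positive", "negative", "positive"]
nb = NaiveBayes().fit(docs, labels)
print(nb.predict(["loved", "movie"]))  # positive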
• Naive Bayes’ independence assumption can be a killer.

• One instance of hate makes seeing others much more likely (so each additional mention doesn’t contribute the same amount of new information).

• We can mitigate this by reasoning not over counts of tokens but over their presence/absence.

word | Apocalypse Now | North
the | 1 | 1
of | 0 | 0
hate | 0 | 9
genius | 1 | 0
bravest | 1 | 0
stupid | 0 | 1
like | 0 | 1
Multinomial Naive Bayes

Discrete distribution for modeling count data (e.g., word counts); single parameter θ.

[bar chart: θ over the vocabulary]

word:   the | a | dog | cat | runs | to | store
counts: 3 | 1 | 0 | 1 | 0 | 2 | 0
counts: 531 | 209 | 13 | 8 | 2 | 331 | 1
Multinomial Naive Bayes

Maximum likelihood parameter estimate:

θ̂i = ni / N

word | the | a | dog | cat | runs | to | store
count n | 531 | 209 | 13 | 8 | 2 | 331 | 1
θ | 0.48 | 0.19 | 0.01 | 0.01 | 0.00 | 0.30 | 0.00
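
A quick check of θ̂i = ni / N on the counts n above:

counts = {"the": 531, "a": 209, "dog": 13, "cat": 8, "runs": 2, "to": 331, "store": 1}
N = sum(counts.values())                      # 1095
theta = {w: n / N for w, n in counts.items()}
print(round(theta["the"], 2), round(theta["to"], 2))  # 0.48 0.3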
Bernoulli Naive Bayes

• Binary event (true or false; {0, 1})
• One parameter: p (probability of an event occurring)

P(x = 1 | p) = p
P(x = 0 | p) = 1 − p

Examples:
• Probability of a particular feature being true (e.g., review contains “hate”)

p̂_MLE = (1/N) Σ_{i=1..N} xi
Bernoulli Naive Bayes

Data points x1–x8, binary features f1–f5:

feature | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8
f1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0
f2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
f3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1
f4 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1
f5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Bernoulli Naive Bayes

Per-class maximum likelihood estimates (the values are consistent with x1–x4 belonging to the Positive class and x5–x8 to the Negative class):

feature | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | pMLE,P | pMLE,N
f1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0.25 | 0.50
f2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00 | 0.25
f3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1.00 | 0.50
f4 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0.50 | 0.50
f5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00
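
A quick sketch reproducing the per-class MLEs in the table, assuming the x1–x4 / x5–x8 class split noted above:

features = {
    "f1": [1, 0, 0, 0, 1, 1, 0, 0],
    "f2": [0, 0, 0, 0, 0, 0, 1, 0],
    "f3": [1, 1, 1, 1, 1, 0, 0, 1],
    "f4": [1, 0, 0, 1, 1, 0, 0, 1],
    "f5": [0, 0, 0, 0, 0, 0, 0, 0],
}
for name, values in features.items():
    p_mle_pos = sum(values[:4]) / 4   # fraction of positive data points where the feature is 1
    p_mle_neg = sum(values[4:]) / 4
    print(name, p_mle_pos, p_mle_neg)
# f1 0.25 0.5 / f2 0.0 0.25 / f3 1.0 0.5 / f4 0.5 0.5 / f5 0.0 0.0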
Tricks for SA

• Negation in bag of words: add a negation marker to all words between the negation and the end of the clause (e.g., comma, period) to create new vocab terms [Das and Chen 2001] (see the sketch below)

• I do not [like this movie]

• I do not like_NEG this_NEG movie_NEG
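
A minimal sketch of this negation-marking trick in Python (the cue list and the clause-ending punctuation set are illustrative):

import re

NEGATIONS = {"not", "no", "never"}   # illustrative negation cues

def mark_negation(text):
    # Append _NEG to every token between a negation cue and the next
    # clause-ending punctuation (comma or period), as in Das and Chen (2001).
    tokens = re.findall(r"\w+|[.,]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in {",", "."}:
            negating = False
            continue
        out.append(tok + "_NEG" if negating else tok)
        if tok in NEGATIONS:
            negating = True
    return out

print(mark_negation("I do not like this movie."))
# ['i', 'do', 'not', 'like_NEG', 'this_NEG', 'movie_NEG']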


Sentiment Dictionaries

• General Inquirer (1966)
• MPQA subjectivity lexicon (Wilson et al. 2005), http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
• LIWC (Linguistic Inquiry and Word Count, Pennebaker 2015)
• AFINN (Nielsen 2011)
• NRC Word-Emotion Association Lexicon (EmoLex), Mohammad and Turney 2013

Example entries:

pos | neg
unlimited | lag
prudent | contortions
superb | fright
closeness | lonely
impeccably | tenuously
fast-paced | plebeian
treat | mortification
destined | outrage
blessing | allegations
steadfastly | disoriented
• Be sure to cover the reading!
