04 Textcat
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
2
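A minimal sketch of the setup just defined, with a toy instance space (strings) and a hand-written stand-in for the unknown categorization function c(x); all names and labels here are illustrative, not from the slides.

# Toy version of the categorization setup: instances x come from an
# instance space X (here, strings), C is a fixed set of categories,
# and c(x) is a categorization function from X into C.
C = {"City", "County", "Country"}

def c(x: str) -> str:
    # Hand-written stand-in for the unknown target function c(x);
    # a learned classifier h(x) would try to approximate it.
    lookup = {"Paris": "City", "Orange": "County", "France": "Country"}
    return lookup.get(x, "Country")

assert c("Paris") in C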
County vs. Country?
3
Who wrote which Federalist papers?
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
6
What is the subject of this article?
10
Example: County vs. Country?
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
11
Learning for Categorization
• A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function c.
• Given a set of training examples, D.
• Find a hypothesized categorization function h(x) that is consistent with the training data: h(x) = c(x) for all <x, c(x)> ∈ D.
13
General Learning Issues
• Many hypotheses consistent with the training data.
• Bias
– Any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy
– % of instances classified correctly
– (Measured on independent test data.)
• Training time
– Efficiency of training algorithm
• Testing time
– Efficiency of subsequent classification
14
Generalization
15
Why is Learning Possible?
16
Bias
17
Bayesian Methods
• Learning and classification methods based
on probability theory.
– Bayes theorem plays a critical role in
probabilistic learning and classification.
– Uses the prior probability of each category: the probability of a category given no information about an item.
• Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.
18
Bayes’ Rule Applied to Documents and
Classes
P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
c_MAP = argmax_{c ∈ C} P(c | d)
      = argmax_{c ∈ C} P(d | c) P(c) / P(d)     (Bayes rule)
      = argmax_{c ∈ C} P(d | c) P(c)            (dropping the denominator)
Naïve Bayes Classifier (II)
P(x1, x2, …, xn | c)
• Bag of Words assumption: Assume position doesn’t
matter
The bag of words representation
γ( "… adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c
The bag of words representation: using a subset of words
γ( "xxxx fun xxxx whimsical xxxx romantic xxxx laughing xxxx recommend xxxx several xxxx happy xxxx again xxxx" ) = c
The bag of words representation
γ( great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, … ) = c
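A small sketch of turning a document into the word-count representation shown above: position is discarded and only counts remain. The tokenization is deliberately naive and only for illustration.

from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase, split on whitespace, and count occurrences;
    # word order and position are thrown away, only counts remain.
    return Counter(text.lower().split())

counts = bag_of_words("great movie I love it great fun")
# counts["great"] == 2, counts["love"] == 1, counts["fun"] == 1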
Bag of words for document
classification
Test document "?": parser, language, label, translation, …
Candidate classes and their characteristic words:
• Machine Learning: learning, training, algorithm, shrinkage, network, …
• NLP: parser, tag, training, translation, language, …
• Garbage Collection: garbage, collection, memory, optimization, region, …
• Planning: planning, temporal, reasoning, plan, language, …
• GUI: …
Multinomial Naïve Bayes Classifier
P(x1, x2, …, xn | c)
• Bag of Words assumption: Assume position doesn't matter.
• Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c:
P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Applying Multinomial Naive Bayes
Classifiers to Text Classification
count("fantastic", positive)
P̂("fantastic" positive) = = 0
å count(w, positive)
wÎV
count(wi , c) +1
P̂(wi | c) =
å (count(w, c))+1)
wÎV
count(wi , c) +1
=
æ ö
çç å count(w, c)÷÷ + V
è wÎV ø
Multinomial Naïve Bayes: Learning
47
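A minimal sketch of the learning step implied by the formulas above: class priors from document counts and add-1 smoothed word likelihoods, with classification done in log space. Function and variable names are illustrative, not from the slides.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (list_of_words, class_label) training pairs
    class_doc_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)        # word_counts[c][w] = count(w, c)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values())
        # add-1 (Laplace) smoothing: (count(w, c) + 1) / (total + |V|)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab

def classify(words, log_prior, log_likelihood, vocab):
    # argmax over classes of log P(c) + sum of log P(w | c); unseen words skipped
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in words if w in vocab)
    return max(scores, key=scores.get)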
Easy to Implement
• But…
48
Probabilities: Important Detail!
49
Generative Model for Multinomial Naïve Bayes
[Figure: generative model in which the class (c = China) generates the words of the document]
50
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
– URL, email address, dictionaries, network features
• But if, as in the previous slides,
– we use only word features, and
– we use all of the words in the text (not a subset),
• then
– Naïve Bayes has an important similarity to language modeling.
51
Each class = a unigram language model
Sec.13.2.1
Class pos (a unigram language model): P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1, …
Sentence s = "I love this fun film":
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
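The arithmetic behind the example above, as a worked check; the numbers are the ones on the slide, and the log-probability line is just the standard trick for avoiding underflow on longer texts.

import math

# Unigram model for class "pos", taken from the slide
p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

sentence = ["I", "love", "this", "fun", "film"]
prob = 1.0
for w in sentence:
    prob *= p_pos[w]
print(prob)    # ≈ 5e-07, i.e. 0.0000005
log_prob = sum(math.log(p_pos[w]) for w in sentence)   # safer for long documents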
Naïve Bayes as a Language Model Sec.13.2.1
55
Disadvantages
• Generative model
– Generally lower effectiveness than
discriminative techniques
56
Experimental Evaluation
57
Evaluation: Cross Validation
• Partition the examples into k disjoint sets (folds)
• Now create k training sets
– Each training set is the union of all folds except one
– So each training set has (k-1)/k of the original training data
[Figure: each of the k folds serves once as the Test set while the remaining folds form the Train set]
58
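A sketch of the k-fold procedure just described; train and evaluate are assumed to be supplied by the caller (a learner and a scoring function), since the slides do not fix a particular classifier.

def k_fold_cross_validation(examples, k, train, evaluate):
    # Partition the examples into k disjoint folds.
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        # Each fold is the test set exactly once; the union of the
        # other k-1 folds ((k-1)/k of the data) is the training set.
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k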
Cross-Validation (2)
• Leave-one-out
– Use if there are fewer than ~100 examples (rough estimate)
– Hold out one example, train on the remaining examples
• 10-fold
– If you have hundreds to thousands of examples
• M × N-fold
– Repeat M times
– Each time, divide the data into N folds and do N-fold cross-validation
59
Evaluation Metrics
• Accuracy: the proportion of instances classified correctly
• Recall (for one label): measures how many instances of that label are found (equivalently, how few are missed).
60
Precision & Recall
Two-class situation:

            Predicted "P"   Predicted "N"
Actual P         TP              FN
Actual N         FP              TN

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F-measure = 2pr / (p + r)

Multi-class situation: [Figure: per-class regions of TP, FP, and FN]
61
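A direct transcription of the formulas above into code; a sketch that assumes the gold and predicted labels are given as a list of pairs.

def precision_recall_f1(pairs, positive="P"):
    # pairs: list of (gold_label, predicted_label) tuples
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f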
A typical precision-recall curve
62
Construct Better Features
• Ideas??
63
Issues in document representation
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t represent as text
– Creation dates for docs
67
Case folding
• Reduce all letters to lower case
• Exception: upper case in mid-sentence
– e.g., General Motors
– Fed vs. fed
– SAIL vs. sail
• Reducing inflectional and variant forms to a base form (lemmatization):
– am, are, is → be
– car, cars, car's, cars' → car
74
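A rough sketch of the normalization choices above: lowercase everything except a small exception list, and map a few variant forms to a base form with a lookup table (a stand-in for a real lemmatizer). The word lists are illustrative only.

KEEP_CASE = {"Fed", "SAIL"}                      # mid-sentence upper case to preserve
LEMMA = {"am": "be", "are": "be", "is": "be",
         "cars": "car", "car's": "car", "cars'": "car"}

def normalize(token: str) -> str:
    # Case folding with exceptions, then a table-driven lemma lookup.
    if token in KEEP_CASE:
        return token
    t = token.lower()
    return LEMMA.get(t, t)

# normalize("Cars") -> "car", normalize("Fed") -> "Fed", normalize("sAiL") -> "sail"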
Properties of Text
• Word frequencies - skewed distribution
• `The’ and `of’ account for 10% of all words
• Six most common words account for 40%
75
From [Croft, Metzler & Strohman 2010]
Associated Press Corpus `AP89'
76
From [Croft, Metzler & Strohman 2010]
Middle Ground
77
Word Frequency
78
TF x IDF
w_ik = tf_ik × log(N / n_k)

where
T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
N     = total number of documents in the collection C
n_k   = the number of documents in C that contain T_k
79
Inverse Document Frequency
w_ik = tf_ik · (1 + log(N / n_k)) / sqrt( Σ_{k=1..t} (tf_ik)² · [1 + log(N / n_k)]² )
81
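A sketch of tf·idf weighting as defined above, with the length (cosine) normalization of the second formula; the collection is assumed to be a list of token lists, and the simple log(N/n_k) idf from the first formula is used.

import math
from collections import Counter

def tfidf_weights(collection):
    # collection: list of documents, each a list of terms.
    # Returns one {term: weight} dict per document.
    N = len(collection)
    df = Counter()                          # n_k: number of docs containing term k
    for doc in collection:
        df.update(set(doc))
    weights = []
    for doc in collection:
        tf = Counter(doc)
        w = {t: tf[t] * math.log(N / df[t]) for t in tf}     # w_ik = tf_ik * log(N / n_k)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})  # length-normalized
    return weights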
The Real World Sec. 15.3.1
82
No training data?
Sec. 15.3.1
Manually written rules
If (wheat or grain) and not (whole or bread) then categorize as grain
83
Very little data? Sec. 15.3.1
84
A reasonable amount of data? Sec. 15.3.1
• Perfect for all the clever classifiers (later)
– SVM
– Regularized Logistic Regression
• You can even use user-interpretable decision trees
– Users like to hack
– Management likes quick fixes
85
A huge amount of data? Sec. 15.3.1
86
Real-world systems generally combine
• Automatic classification
• Manual review of uncertain/difficult/"new" cases
88
Multi-class Problems
89
Evaluation:
Classic Reuters-21578 Data Set Sec. 15.2.4
• Most (over)used data set: 21,578 docs (each about 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
90
Reuters Text Categorization data set
(Reuters-21578) document Sec. 15.2.4
[Sample Reuters-21578 document in SGML markup, ending with </BODY></TEXT></REUTERS>]
91
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine
multiple performance measures into one quantity?
• Macroaveraging
– Compute performance for each class, then average.
• Microaveraging
– Collect decisions for all classes, compute contingency table,
evaluate
93
Precision & Recall
Multi-class situation: [Figure: per-class regions; missed predictions (FN) and classifier hallucinations (FP)]
Aggregate:
Average Macro Precision = Σ pi / N
Average Macro Recall = Σ ri / N
Average Macro F-measure = 2 pM rM / (pM + rM)
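A sketch contrasting the two averaging schemes (shown here for precision): macroaveraging averages the per-class scores, while microaveraging pools all decisions into one contingency table first. Per-class counts are assumed to be given as (tp, fp, fn) triples.

def macro_micro_precision(per_class_counts):
    # per_class_counts: dict mapping class -> (tp, fp, fn)
    # Macroaveraging: compute precision per class, then average.
    per_class_p = [tp / (tp + fp) if tp + fp else 0.0
                   for tp, fp, fn in per_class_counts.values()]
    macro_p = sum(per_class_p) / len(per_class_p)
    # Microaveraging: pool all counts into one table, then compute precision once.
    tp = sum(c[0] for c in per_class_counts.values())
    fp = sum(c[1] for c in per_class_counts.values())
    micro_p = tp / (tp + fp) if tp + fp else 0.0
    return macro_p, micro_p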
• Why?
– For some ML algorithms, a direct extension to the
multiclass case may be problematic.
• How?
– Many methods
97
Methods
• One-vs-all
• All-pairs
• …
98
One-vs-all
• Idea:
• Create many 1-vs-other classifiers
– Classes = City, County, Country
– Classifier 1 = {City} vs. {County, Country}
– Classifier 2 = {County} vs. {City, Country}
– Classifier 3 = {Country} vs. {City, County}
• Training time:
– For each class cm, train a classifier clm(x)
• replace (x, y) with
(x, 1) if y = cm
(x, -1) if y ≠ cm
99
An example: training
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
for c1-vs-all: x1 1 …, x2 -1 …, x3 1 …, x4 -1 …
for c2-vs-all: x1 -1 …, x2 1 …, x3 -1 …, x4 -1 …
for c3-vs-all: x1 -1 …, x2 -1 …, x3 -1 …, x4 1 …
100
One-vs-all (cont)
101
An example: testing
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
Test data: x ?? f1 v1 …
Run the three classifiers on x:
for c1-vs-all: 1 → 0.7, -1 → 0.3
for c2-vs-all: 1 → 0.2, -1 → 0.8
for c3-vs-all: 1 → 0.6, -1 → 0.4
=> what's the system prediction for x?
102
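A sketch of the one-vs-all scheme walked through above; train_binary (a binary learner) and score (its confidence for the +1 label) are assumed helpers, not defined in the slides. With the confidences in the example (0.7, 0.2, 0.6), the prediction would be c1.

def train_one_vs_all(data, classes, train_binary):
    # data: list of (x, y) pairs. One binary classifier per class c_m,
    # trained on (x, +1) if y == c_m, else (x, -1).
    classifiers = {}
    for c in classes:
        relabeled = [(x, 1 if y == c else -1) for x, y in data]
        classifiers[c] = train_binary(relabeled)
    return classifiers

def predict_one_vs_all(x, classifiers, score):
    # Predict the class whose classifier is most confident about +1.
    return max(classifiers, key=lambda c: score(classifiers[c], x))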
All-pairs (All-vs-All (AVA))
• Idea:
– For each pair of classes build a classifier
– {City vs. County}, {City vs Country}, {County vs.
Country}
– C(k,2) = k(k-1)/2 classifiers: one classifier for each class pair.
• Training:
– For each pair (cm, cn) of classes, train a classifier clmn
• replace a training instance (x,y) with
(x, 1) if y = cm
(x, -1) if y = cn
otherwise ignore the instance
103
An example: training
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
for c1-vs-c2: x1 1 …, x2 -1 …, x3 1 …
for c1-vs-c3: x1 1 …, x3 1 …, x4 -1 …
for c2-vs-c3: x2 1 …, x4 -1 …
104
All-pairs (cont)
• Testing time: given a new example x
– Run each of the C(k,2) classifiers on x
105
An example: testing
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
Test data: x ?? f1 v1 …
Run the three classifiers on x:
for c1-vs-c2: 1 → 0.7, -1 → 0.3
for c2-vs-c3: 1 → 0.2, -1 → 0.8
for c1-vs-c3: 1 → 0.6, -1 → 0.4
=> what's the system prediction for x?
106
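A sketch of the all-pairs (AVA) scheme; train_binary and predict_binary are assumed helpers. A common decision rule, used here, is to let each pairwise classifier vote for one of its two classes and predict the class with the most votes.

from itertools import combinations
from collections import Counter

def train_all_pairs(data, classes, train_binary):
    # One classifier per unordered class pair (c_m, c_n): instances of c_m
    # become +1, instances of c_n become -1, all other instances are ignored.
    classifiers = {}
    for cm, cn in combinations(classes, 2):
        subset = [(x, 1 if y == cm else -1) for x, y in data if y in (cm, cn)]
        classifiers[(cm, cn)] = train_binary(subset)
    return classifiers

def predict_all_pairs(x, classifiers, predict_binary):
    # Each pairwise classifier votes; the class with the most votes wins.
    votes = Counter()
    for (cm, cn), clf in classifiers.items():
        votes[cm if predict_binary(clf, x) == 1 else cn] += 1
    return votes.most_common(1)[0][0]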
Error-correcting output codes
(ECOC)
• Proposed by (Dietterich and Bakiri, 1995)
• Idea:
– Each class is assigned a unique binary string of length n.
– Train n classifiers, one for each bit.
– Testing time: run n classifiers on x to get a n-bit string s,
and choose the class which is closest to s.
107
An example: Digit Recognition
108
Meaning of each column
109
Another example: 15-bit code for a
10-class problem
110
Hamming distance
• Definition: the Hamming distance between
two strings of equal length is the number of
positions for which the corresponding
symbols are different.
• Examples:
– 10111 and 10010 → distance 2
– 2143 and 2233 → distance 2
– "toned" and "roses" → distance 3
111
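A sketch of the Hamming distance just defined, together with ECOC decoding as described two slides back: run the n bit classifiers to get an n-bit string, then pick the class whose codeword is closest. The codeword table is assumed to be given.

def hamming(a, b):
    # Number of positions at which two equal-length strings differ,
    # e.g. hamming("10111", "10010") == 2 and hamming("toned", "roses") == 3.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_bits, codewords):
    # codewords: dict mapping class -> bit string of length n (rows of the code matrix);
    # predicted_bits: the n-bit string produced by the n bit classifiers.
    return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))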
How to choose a good error-
correcting code?
• Choose a code with a large minimum Hamming distance between any pair of code words.
112
Two properties of a good ECOC
113
Summary
• Different methods:
– Direct multiclass, if possible
– One-vs-all (a.k.a. one-per-class): k classifiers
– All-pairs: C(k,2) classifiers
– ECOC: n classifiers (n is the number of columns)
117