04 Textcat
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
2
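A minimal sketch of the setup just defined, with a toy instance space (strings) and a hand-written stand-in for the unknown categorization function c(x); all names and labels here are illustrative, not from the slides.

# Toy version of the categorization setup: instances x come from an
# instance space X (here, strings), C is a fixed set of categories,
# and c(x) is a categorization function from X into C.
C = {"City", "County", "Country"}

def c(x: str) -> str:
    # Hand-written stand-in for the unknown target function c(x);
    # a learned classifier h(x) would try to approximate it.
    lookup = {"Paris": "City", "Orange": "County", "France": "Country"}
    return lookup.get(x, "Country")

assert c("Paris") in C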
County vs. Country?
3
Who wrote which Federalist papers?
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
6
What is the subject of this article?
10
Example: County vs. Country?
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
11
Learning for Categorization
• A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function c.
• Given a set of training examples, D.
• Find a hypothesized categorization function h(x) that is consistent with the training data: h(x) = c(x) for all <x, c(x)> ∈ D.
13
General Learning Issues
• Many hypotheses consistent with the training data.
• Bias
– Any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy
– % of instances classified correctly
– (Measured on independent test data.)
• Training time
– Efficiency of training algorithm
• Testing time
– Efficiency of subsequent classification
14
Generalization
15
Why is Learning Possible?
16
Bias
17
Bayesian Methods
• Learning and classification methods based
on probability theory.
– Bayes theorem plays a critical role in
probabilistic learning and classification.
– Uses the prior probability of each category: the probability of a category given no information about an item.
• Categorization produces a posterior
probability distribution over the possible
categories given a description of an item.
18
Bayes’ Rule Applied to Documents and
Classes
P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
c_MAP = argmax_{c ∈ C} P(c | d)
      = argmax_{c ∈ C} P(d | c) P(c) / P(d)     (Bayes rule)
      = argmax_{c ∈ C} P(d | c) P(c)            (dropping the denominator)
Naïve Bayes Classifier (II)
P(x1, x2, …, xn | c)
• Bag of Words assumption: Assume position doesn’t
matter
The bag of words representation
γ( "… adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c
The bag of words representation: using a subset of words
γ( "xxxx fun xxxx whimsical xxxx romantic xxxx laughing xxxx recommend xxxx several xxxx happy xxxx again xxxx" ) = c
The bag of words representation
γ( great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, … ) = c
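A small sketch of turning a document into the word-count representation shown above: position is discarded and only counts remain. The tokenization is deliberately naive and only for illustration.

from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase, split on whitespace, and count occurrences;
    # word order and position are thrown away, only counts remain.
    return Counter(text.lower().split())

counts = bag_of_words("great movie I love it great fun")
# counts["great"] == 2, counts["love"] == 1, counts["fun"] == 1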
Bag of words for document
classification
Test document "?": parser, language, label, translation, …
Candidate classes and their characteristic words:
• Machine Learning: learning, training, algorithm, shrinkage, network, …
• NLP: parser, tag, training, translation, language, …
• Garbage Collection: garbage, collection, memory, optimization, region, …
• Planning: planning, temporal, reasoning, plan, language, …
• GUI: …
Multinomial Naïve Bayes Classifier
P(x1, x2, …, xn | c)
• Bag of Words assumption: Assume position doesn't matter.
• Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c:
P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Applying Multinomial Naive Bayes
Classifiers to Text Classification
count("fantastic", positive)
P̂("fantastic" positive) = = 0
å count(w, positive)
wÎV
count(wi , c) +1
P̂(wi | c) =
å (count(w, c))+1)
wÎV
count(wi , c) +1
=
æ ö
çç å count(w, c)÷÷ + V
è wÎV ø
Multinomial Naïve Bayes: Learning
47
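A minimal sketch of the learning step implied by the formulas above: class priors from document counts and add-1 smoothed word likelihoods, with classification done in log space. Function and variable names are illustrative, not from the slides.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (list_of_words, class_label) training pairs
    class_doc_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)        # word_counts[c][w] = count(w, c)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values())
        # add-1 (Laplace) smoothing: (count(w, c) + 1) / (total + |V|)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab

def classify(words, log_prior, log_likelihood, vocab):
    # argmax over classes of log P(c) + sum of log P(w | c); unseen words skipped
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in words if w in vocab)
    return max(scores, key=scores.get)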
Easy to Implement
• But…
48
Probabilities: Important Detail!
49
Generative Model for Multinomial Naïve Bayes
[Figure: generative model in which the class (c = China) generates the words of the document]
50
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature
– URL, email address, dictionaries, network features
• But if, as in the previous slides,
– we use only word features, and
– we use all of the words in the text (not a subset),
• then
– Naïve Bayes has an important similarity to language modeling.
51
Each class = a unigram language model
Sec.13.2.1
Class pos (a unigram language model): P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1, …
Sentence s = "I love this fun film":
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
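The arithmetic behind the example above, as a worked check; the numbers are the ones on the slide, and the log-probability line is just the standard trick for avoiding underflow on longer texts.

import math

# Unigram model for class "pos", taken from the slide
p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

sentence = ["I", "love", "this", "fun", "film"]
prob = 1.0
for w in sentence:
    prob *= p_pos[w]
print(prob)    # ≈ 5e-07, i.e. 0.0000005
log_prob = sum(math.log(p_pos[w]) for w in sentence)   # safer for long documents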
Naïve Bayes as a Language Model Sec.13.2.1
55
Disadvantages
• Generative model
– Generally lower effectiveness than
discriminative techniques
56
Experimental Evaluation
57
Evaluation: Cross Validation
• Partition the examples into k disjoint sets (folds)
• Now create k training sets
– Each training set is the union of all folds except one
– So each training set has (k-1)/k of the original training data
[Figure: each of the k folds serves once as the Test set while the remaining folds form the Train set]
58
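A sketch of the k-fold procedure just described; train and evaluate are assumed to be supplied by the caller (a learner and a scoring function), since the slides do not fix a particular classifier.

def k_fold_cross_validation(examples, k, train, evaluate):
    # Partition the examples into k disjoint folds.
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        # Each fold is the test set exactly once; the union of the
        # other k-1 folds ((k-1)/k of the data) is the training set.
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k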
Cross-Validation (2)
• Leave-one-out
– Use if there are fewer than ~100 examples (rough estimate)
– Hold out one example, train on the remaining examples
• 10-fold
– If you have hundreds to thousands of examples
• M × N-fold
– Repeat M times
– Each time, divide the data into N folds and do N-fold cross-validation
59
Evaluation Metrics
• Accuracy: the proportion of instances classified correctly
• Recall (for one label): measures how many instances of that label are found (equivalently, how few are missed).
60
Precision & Recall
Two-class situation:

            Predicted "P"   Predicted "N"
Actual P         TP              FN
Actual N         FP              TN

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F-measure = 2pr / (p + r)

Multi-class situation: [Figure: per-class regions of TP, FP, and FN]
61
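A direct transcription of the formulas above into code; a sketch that assumes the gold and predicted labels are given as a list of pairs.

def precision_recall_f1(pairs, positive="P"):
    # pairs: list of (gold_label, predicted_label) tuples
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f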
A typical precision-recall curve
62
Construct Better Features
• Ideas??
63
Issues in document representation
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t represent as text
– Creation dates for docs
67
Case folding
• Reduce all letters to lower case
• Exception: upper case in mid-sentence
– e.g., General Motors
– Fed vs. fed
– SAIL vs. sail
• Reducing inflectional and variant forms to a base form (lemmatization):
– am, are, is → be
– car, cars, car's, cars' → car
74
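A rough sketch of the normalization choices above: lowercase everything except a small exception list, and map a few variant forms to a base form with a lookup table (a stand-in for a real lemmatizer). The word lists are illustrative only.

KEEP_CASE = {"Fed", "SAIL"}                      # mid-sentence upper case to preserve
LEMMA = {"am": "be", "are": "be", "is": "be",
         "cars": "car", "car's": "car", "cars'": "car"}

def normalize(token: str) -> str:
    # Case folding with exceptions, then a table-driven lemma lookup.
    if token in KEEP_CASE:
        return token
    t = token.lower()
    return LEMMA.get(t, t)

# normalize("Cars") -> "car", normalize("Fed") -> "Fed", normalize("sAiL") -> "sail"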
Properties of Text
• Word frequencies - skewed distribution
• `The’ and `of’ account for 10% of all words
• Six most common words account for 40%
75
From [Croft, Metzler & Strohman 2010]
Associated Press Corpus `AP89'
76
From [Croft, Metzler & Strohman 2010]
Middle Ground
77
Word Frequency
78
TF x IDF
w_ik = tf_ik × log(N / n_k)

where
T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
N     = total number of documents in the collection C
n_k   = the number of documents in C that contain T_k
79
Inverse Document Frequency
w_ik = tf_ik · (1 + log(N / n_k)) / sqrt( Σ_{k=1..t} (tf_ik)² · [1 + log(N / n_k)]² )
81
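A sketch of tf·idf weighting as defined above, with the length (cosine) normalization of the second formula; the collection is assumed to be a list of token lists, and the simple log(N/n_k) idf from the first formula is used.

import math
from collections import Counter

def tfidf_weights(collection):
    # collection: list of documents, each a list of terms.
    # Returns one {term: weight} dict per document.
    N = len(collection)
    df = Counter()                          # n_k: number of docs containing term k
    for doc in collection:
        df.update(set(doc))
    weights = []
    for doc in collection:
        tf = Counter(doc)
        w = {t: tf[t] * math.log(N / df[t]) for t in tf}     # w_ik = tf_ik * log(N / n_k)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})  # length-normalized
    return weights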
The Real World Sec. 15.3.1
82
No training data?
Sec. 15.3.1
Manually written rules
If (wheat or grain) and not (whole or bread) then categorize as grain
83
Very little data? Sec. 15.3.1
84
A reasonable amount of data? Sec. 15.3.1
• Perfect for all the clever classifiers (later)
– SVM
– Regularized Logistic Regression
• You can even use user-interpretable decision trees
– Users like to hack
– Management likes quick fixes
85
A huge amount of data? Sec. 15.3.1
86
Real-world systems generally combine
• Automatic classification
• Manual review of uncertain/difficult/"new" cases
88
Multi-class Problems
89
Evaluation:
Classic Reuters-21578 Data Set Sec. 15.2.4
• Most (over)used data set: 21,578 docs (each about 90 types, 200 tokens)
• 9603 training, 3299 test articles (ModApte/Lewis split)
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
• Average document (with at least one category) has 1.24 classes
• Only about 10 out of 118 categories are large
Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
90
Reuters Text Categorization data set
(Reuters-21578) document Sec. 15.2.4
[Sample Reuters-21578 document in SGML markup, ending with </BODY></TEXT></REUTERS>]
91
Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine
multiple performance measures into one quantity?
• Macroaveraging
– Compute performance for each class, then average.
• Microaveraging
– Collect decisions for all classes, compute contingency table,
evaluate
93
Precision & Recall
Multi-class situation: [Figure: per-class regions; missed predictions (FN) and classifier hallucinations (FP)]
Aggregate:
Average Macro Precision = Σ pi / N
Average Macro Recall = Σ ri / N
Average Macro F-measure = 2 pM rM / (pM + rM)
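A sketch contrasting the two averaging schemes (shown here for precision): macroaveraging averages the per-class scores, while microaveraging pools all decisions into one contingency table first. Per-class counts are assumed to be given as (tp, fp, fn) triples.

def macro_micro_precision(per_class_counts):
    # per_class_counts: dict mapping class -> (tp, fp, fn)
    # Macroaveraging: compute precision per class, then average.
    per_class_p = [tp / (tp + fp) if tp + fp else 0.0
                   for tp, fp, fn in per_class_counts.values()]
    macro_p = sum(per_class_p) / len(per_class_p)
    # Microaveraging: pool all counts into one table, then compute precision once.
    tp = sum(c[0] for c in per_class_counts.values())
    fp = sum(c[1] for c in per_class_counts.values())
    micro_p = tp / (tp + fp) if tp + fp else 0.0
    return macro_p, micro_p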
• Why?
– For some ML algorithms, a direct extension to the
multiclass case may be problematic.
• How?
– Many methods
97
Methods
• One-vs-all
• All-pairs
• …
98
One-vs-all
• Idea:
• Create many 1-vs-other classifiers
– Classes = City, County, Country
– Classifier 1 = {City} vs. {County, Country}
– Classifier 2 = {County} vs. {City, Country}
– Classifier 3 = {Country} vs. {City, County}
• Training time:
– For each class cm, train a classifier clm(x)
• replace (x, y) with
(x, 1) if y = cm
(x, -1) if y ≠ cm
99
An example: training
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
for c1-vs-all: x1 1 …, x2 -1 …, x3 1 …, x4 -1 …
for c2-vs-all: x1 -1 …, x2 1 …, x3 -1 …, x4 -1 …
for c3-vs-all: x1 -1 …, x2 -1 …, x3 -1 …, x4 1 …
100
One-vs-all (cont)
101
An example: testing
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
Test data: x ?? f1 v1 …
Run the three classifiers on x:
for c1-vs-all: 1 → 0.7, -1 → 0.3
for c2-vs-all: 1 → 0.2, -1 → 0.8
for c3-vs-all: 1 → 0.6, -1 → 0.4
=> what's the system prediction for x?
102
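A sketch of the one-vs-all scheme walked through above; train_binary (a binary learner) and score (its confidence for the +1 label) are assumed helpers, not defined in the slides. With the confidences in the example (0.7, 0.2, 0.6), the prediction would be c1.

def train_one_vs_all(data, classes, train_binary):
    # data: list of (x, y) pairs. One binary classifier per class c_m,
    # trained on (x, +1) if y == c_m, else (x, -1).
    classifiers = {}
    for c in classes:
        relabeled = [(x, 1 if y == c else -1) for x, y in data]
        classifiers[c] = train_binary(relabeled)
    return classifiers

def predict_one_vs_all(x, classifiers, score):
    # Predict the class whose classifier is most confident about +1.
    return max(classifiers, key=lambda c: score(classifiers[c], x))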
All-pairs (All-vs-All (AVA))
• Idea:
– For each pair of classes build a classifier
– {City vs. County}, {City vs Country}, {County vs.
Country}
– C(k,2) = k(k-1)/2 classifiers: one classifier for each class pair.
• Training:
– For each pair (cm, cn) of classes, train a classifier clmn
• replace a training instance (x,y) with
(x, 1) if y = cm
(x, -1) if y = cn
otherwise ignore the instance
103
An example: training
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
for c1-vs-c2: x1 1 …, x2 -1 …, x3 1 …
for c1-vs-c3: x1 1 …, x3 1 …, x4 -1 …
for c2-vs-c3: x2 1 …, x4 -1 …
104
All-pairs (cont)
• Testing time: given a new example x
– Run each of the C(k,2) classifiers on x
105
An example: testing
Training data:
• x1 c1 …
• x2 c2 …
• x3 c1 …
• x4 c3 …
Test data: x ?? f1 v1 …
Run the three classifiers on x:
for c1-vs-c2: 1 → 0.7, -1 → 0.3
for c2-vs-c3: 1 → 0.2, -1 → 0.8
for c1-vs-c3: 1 → 0.6, -1 → 0.4
=> what's the system prediction for x?
106
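A sketch of the all-pairs (AVA) scheme; train_binary and predict_binary are assumed helpers. A common decision rule, used here, is to let each pairwise classifier vote for one of its two classes and predict the class with the most votes.

from itertools import combinations
from collections import Counter

def train_all_pairs(data, classes, train_binary):
    # One classifier per unordered class pair (c_m, c_n): instances of c_m
    # become +1, instances of c_n become -1, all other instances are ignored.
    classifiers = {}
    for cm, cn in combinations(classes, 2):
        subset = [(x, 1 if y == cm else -1) for x, y in data if y in (cm, cn)]
        classifiers[(cm, cn)] = train_binary(subset)
    return classifiers

def predict_all_pairs(x, classifiers, predict_binary):
    # Each pairwise classifier votes; the class with the most votes wins.
    votes = Counter()
    for (cm, cn), clf in classifiers.items():
        votes[cm if predict_binary(clf, x) == 1 else cn] += 1
    return votes.most_common(1)[0][0]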
Error-correcting output codes
(ECOC)
• Proposed by (Dietterich and Bakiri, 1995)
• Idea:
– Each class is assigned a unique binary string of length n.
– Train n classifiers, one for each bit.
– Testing time: run n classifiers on x to get a n-bit string s,
and choose the class which is closest to s.
107
An example: Digit Recognition
108
Meaning of each column
109
Another example: 15-bit code for a
10-class problem
110
Hamming distance
• Definition: the Hamming distance between
two strings of equal length is the number of
positions for which the corresponding
symbols are different.
• Examples:
– 10111 and 10010 → distance 2
– 2143 and 2233 → distance 2
– "toned" and "roses" → distance 3
111
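A sketch of the Hamming distance just defined, together with ECOC decoding as described two slides back: run the n bit classifiers to get an n-bit string, then pick the class whose codeword is closest. The codeword table is assumed to be given.

def hamming(a, b):
    # Number of positions at which two equal-length strings differ,
    # e.g. hamming("10111", "10010") == 2 and hamming("toned", "roses") == 3.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(predicted_bits, codewords):
    # codewords: dict mapping class -> bit string of length n (rows of the code matrix);
    # predicted_bits: the n-bit string produced by the n bit classifiers.
    return min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))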
How to choose a good error-
correcting code?
• Choose a code with a large minimum Hamming distance between any pair of code words.
112
Two properties of a good ECOC
113
Summary
• Different methods:
– Direct multiclass, if possible
– One-vs-all (a.k.a. one-per-class): k classifiers
– All-pairs: C(k,2) classifiers
– ECOC: n classifiers (n is the number of columns)
117