Lecture 7
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Simone.Teufel@cl.cam.ac.uk
Lent 2014
328
IR System Components
[Figure (slides 329–331): IR system architecture. Document collection, document normalisation, indexer and indexes on one side; query UI and query normalisation on the other; a ranking/matching module produces the set of relevant documents. Over the three slides a Clustering module is added, first on the results side (clustering the set of relevant documents) and then also on the document-collection side.]
Overview
332
Overview
Hierarchical clustering
333
HAC: Basic algorithm
334
A dendrogram
335
Term–document matrix to document–document matrix
Log frequency weighting and cosine normalisation (term–document matrix):

              SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0.000   0.405
  wuthering   0.000   0.000   0.588

Document–document matrix:

        SaS          PaP
  SaS   P(SaS,SaS)   P(PaP,SaS)
  PaP   P(SaS,PaP)   P(PaP,PaP)
  WH    P(SaS,WH)    P(PaP,WH)
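As a sketch of how the document–document matrix is obtained, the following plain-Python snippet (variable names are illustrative) takes the cosine-normalised columns above and fills each entry P(d_i, d_j) with the dot product of the two document vectors, which for length-normalised vectors is their cosine similarity:

docs = ["SaS", "PaP", "WH"]
tdm = {                       # term -> weights per document (SaS, PaP, WH)
    "affection": (0.789, 0.832, 0.524),
    "jealous":   (0.515, 0.555, 0.465),
    "gossip":    (0.335, 0.000, 0.405),
    "wuthering": (0.000, 0.000, 0.588),
}

def sim(i, j):
    # Dot product of two already length-normalised document vectors.
    return sum(w[i] * w[j] for w in tdm.values())

for a in range(len(docs)):
    for b in range(len(docs)):
        print(f"P({docs[a]},{docs[b]}) = {sim(a, b):.3f}")

For example, this prints P(SaS,PaP) ≈ 0.942, the entry used when the documents are clustered.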
for i := 1 to n do
    c_i := {x_i}
C := {c_1, ..., c_n}
j := n + 1
while |C| > 1 do
    (c_n1, c_n2) := arg max_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
    c_j := c_n1 ∪ c_n2
    C := (C \ {c_n1, c_n2}) ∪ {c_j}
    j := j + 1
end
337
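A minimal Python sketch of the basic algorithm above, assuming only a cluster-level similarity function (the single_link example and the toy points are illustrative, not from the slides); this is the straightforward loop whose cost is discussed on the next slide, not an optimised implementation:

def hac(points, sim):
    # Naive HAC, following the pseudocode above: start from singleton
    # clusters and repeatedly merge the most similar pair.
    clusters = [frozenset([x]) for x in points]        # c_i := {x_i}
    merges = []
    while len(clusters) > 1:                           # while |C| > 1
        cu, cv = max(
            ((a, b) for a in clusters for b in clusters if a is not b),
            key=lambda pair: sim(*pair),
        )
        cj = cu | cv                                   # c_j := c_u ∪ c_v
        clusters = [c for c in clusters if c not in (cu, cv)] + [cj]
        merges.append((cu, cv))
    return merges

# Single link as an example similarity: the (negated) smallest distance
# between any two members; toy 1-D points for illustration.
dist = lambda x, y: abs(x - y)
single_link = lambda cu, cv: -min(dist(x, y) for x in cu for y in cv)
print(hac([0.0, 0.2, 1.0, 1.1], single_link))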
Computational complexity of the basic algorithm
338
Hierarchical clustering: similarity functions
339
Example: hierarchical clustering; similarity functions

[Figure: eight points in the plane; a, b, c, d in a top row and e, f, g, h directly below them. The horizontal gaps are 1, 1.5 and 1; the vertical distance between the rows is 2.]

Distance matrix:

      a        b        c        d        e     f     g
  b   1
  c   2.5      1.5
  d   3.5      2.5      1
  e   2        √5       √10.25   √16.25
  f   √5       2        √6.25    √10.25   1
  g   √10.25   √6.25    2        √5       2.5   1.5
  h   √16.25   √10.25   √5       2        3.5   2.5   1
340
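The distance matrix can be reproduced from the point layout; the short plain-Python sketch below assumes the coordinates implied by the figure (rows a–d and e–h at vertical distance 2, horizontal positions 0, 1, 2.5, 3.5):

from math import sqrt

# Coordinates read off the figure (an assumption consistent with the matrix).
pts = {"a": (0, 2), "b": (1, 2), "c": (2.5, 2), "d": (3.5, 2),
       "e": (0, 0), "f": (1, 0), "g": (2.5, 0), "h": (3.5, 0)}

def dist(p, q):
    (x1, y1), (x2, y2) = pts[p], pts[q]
    return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Print the lower-triangular Euclidean distance matrix, as on the slide.
names = "abcdefgh"
for r in names[1:]:
    print(r, "  ".join(f"{dist(r, c):5.2f}" for c in names[:names.index(r)]))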
Single Link is O(n²)

[The slide steps through the distance matrix from the previous example under single-link merging: the closest pair a, b (distance 1) is merged first, and the entry for the new cluster a+b against each remaining point is the minimum of the old a and b entries.]
341
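The step that makes an O(n²) algorithm possible is the single-link update sketched below (plain Python, with a dict-of-dicts distance matrix as an assumed representation): after the closest pair is merged, the merged cluster's distance to every other cluster is just the minimum of the two old entries, so each merge costs O(n). The remaining ingredient, not shown here, is keeping a best-merge candidate for each cluster so that the closest pair can also be found in O(n) per step.

def merge_single_link(D, i, j):
    # Merge cluster j into cluster i under single link.
    # D is a symmetric dict-of-dicts: D[x][y] = distance between x and y.
    for k in D:
        if k not in (i, j):
            D[i][k] = D[k][i] = min(D[i][k], D[j][k])
    del D[j]
    for k in D:
        D[k].pop(j, None)
    return D

# Tiny illustration with three 1-D "clusters" at positions 0, 1 and 5:
D = {x: {y: abs(x - y) for y in (0, 1, 5)} for x in (0, 1, 5)}
print(merge_single_link(D, 0, 1))   # cluster 0+1 is now at distance 4 from 5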
Single Link
342
Clustering Result under Single Link

[Figure: on the eight-point example, single link joins the two rows; the resulting clusters are {a, b, c, d} and {e, f, g, h}. Dendrogram leaves: a b c d e f g h.]
343
Complete Link

[The slide repeats the distance matrix and applies complete-link merging, where the distance between two clusters is the largest pairwise distance between their members.]
344
Clustering result under complete link

[Figure: on the eight-point example, complete link instead joins the 2×2 blocks; the resulting clusters are {a, b, e, f} and {c, d, g, h}. Dendrogram leaves: a b c d e f g h.]
345
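The contrast between the two results can be reproduced with an off-the-shelf HAC implementation; the sketch below assumes SciPy is available and uses the eight coordinates read off the figure. Cutting each dendrogram into two clusters shows single link grouping the two rows and complete link grouping the 2×2 blocks:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

labels = list("abcdefgh")
X = np.array([[0, 2], [1, 2], [2.5, 2], [3.5, 2],      # a, b, c, d
              [0, 0], [1, 0], [2.5, 0], [3.5, 0]])     # e, f, g, h

for method in ("single", "complete"):
    Z = linkage(X, method=method)                  # HAC with the chosen linkage
    flat = fcluster(Z, t=2, criterion="maxclust")  # cut into two flat clusters
    groups = {}
    for name, cl in zip(labels, flat):
        groups.setdefault(cl, []).append(name)
    print(method, sorted(groups.values()))

# Expected: single   [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
#           complete [['a', 'b', 'e', 'f'], ['c', 'd', 'g', 'h']]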
Example: gene expression data
346
Rat CNS gene expression data (excerpt)
347
Rat CNS gene clustering – single link

[Dendrogram, single link; x-axis 0–100. Leaves (genes) include NFM, GRg1, NFL, GFAP, NMDA1, cellubrevin, nestin, GAT1, EGFR, NFH, nAChRa7, mGluR5, actin, ACHE, mAChR2, GRg2, MOG, GRb1.]
348
Rat CNS gene clustering – complete link

[Dendrogram, complete link; x-axis 0–100. Leaves (genes) include NFM, CCO1, Ins2, FGFR, GDNF, SC6, keratin, CCO2, IGFR2, cyclin A, TH, Brm.]
349
Rat CNS gene clustering – group average link

[Dendrogram, group-average link; x-axis 0–100. Leaves (genes) include NFM, NFL, NMDA1, nestin, GAT1, cellubrevin, NFH, MOG, actin, SC2, NT3, SC1, MK2, cyclin B, keratin, CCO2, IP3R2, ChAT, aFGF, mGluR3, NMDA2, 5HT1c, GRa1, bFGF.]
350
Flat or hierarchical clustering?
351
Overview
A text classification task: Email spam filtering
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
352
Topic classification
[Figure: topic classification; a test document d′ is assigned to the class China, i.e. γ(d′) = China.]
353
Statistical/probabilistic classification methods
354
Examples of how search engines use classification
355
Formal definition of TC: Training
Given:
A document space X
Documents are represented in this space – typically some type
of high-dimensional space.
A fixed set of classes C = {c_1, c_2, ..., c_J}
The classes are human-defined for the needs of an application
(e.g., spam vs. nonspam).
A training set D of labelled documents. Each labelled
document ⟨d, c⟩ ∈ X × C
Using a learning method, we then wish to learn a classifier γ that maps documents to classes:
γ : X → C
356
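As a small illustration of these objects (all names below are made up for the sketch, not taken from the slides), the setup can be written down as Python types: documents live in the space X, classes come from the fixed set C, the training set pairs them, and a learning method returns the classifier γ : X → C.

from typing import Callable, Dict, List, Tuple

Document = Dict[str, float]                  # a point in the document space X,
                                             # e.g. a sparse term-weight vector
Class = str                                  # one of the classes in C
TrainingSet = List[Tuple[Document, Class]]   # labelled pairs ⟨d, c⟩ ∈ X × C
Classifier = Callable[[Document], Class]     # the learned mapping γ : X → C

def learn(D: TrainingSet) -> Classifier:
    # Placeholder for a learning method (e.g. Naive Bayes, introduced below):
    # it consumes labelled documents and returns a classifier.
    ...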
Formal definition of TC: Application/Testing
357
The Naive Bayes classifier
358
Maximum a posteriori class
359
Taking the log
360
Naive Bayes classifier
Classification rule:

    c_map = arg max_{c ∈ C} [ log P̂(c) + Σ_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]
Simple interpretation:
Each conditional parameter log P̂(t_k|c) is a weight that
indicates how good an indicator t_k is for c.
The prior log P̂(c) is a weight that indicates the relative
frequency of c.
The sum of log prior and term weights is then a measure of
how much evidence there is for the document being in the
class.
We select the class with the most evidence.
361
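The rule translates directly into code. In the sketch below (plain Python), the parameter dictionaries log_prior and log_cond are assumed to have been estimated already, e.g. with the smoothed counts introduced on the following slides:

import math

def nb_classify(tokens, classes, log_prior, log_cond):
    # Return c_map = argmax_c [ log P(c) + sum_k log P(t_k | c) ].
    # log_prior[c] and log_cond[(t, c)] are precomputed; tokens missing
    # from the vocabulary are simply skipped here.
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]                 # evidence from the class frequency
        for t in tokens:
            if (t, c) in log_cond:
                score += log_cond[(t, c)]    # weight of t as an indicator for c
        if score > best_score:
            best_class, best_score = c, score
    return best_class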
Parameter estimation take 1: Maximum likelihood
Estimate parameters P̂(c) and P̂(t_k|c) from the training data: how?

Prior:

    P̂(c) = N_c / N

N_c: number of docs in class c; N: total number of docs

Conditional probabilities:

    P̂(t|c) = T_ct / Σ_{t′ ∈ V} T_ct′

T_ct: number of occurrences (tokens) of term t in the training documents of class c
362
The problem with maximum likelihood estimates: Zeros
C = China
363
The problem with maximum likelihood estimates: Zeros
364
To avoid zeros: Add-one smoothing
Before:

    P̂(t|c) = T_ct / Σ_{t′ ∈ V} T_ct′

Now: add one to each count to avoid zeros:

    P̂(t|c) = (T_ct + 1) / Σ_{t′ ∈ V} (T_ct′ + 1) = (T_ct + 1) / ((Σ_{t′ ∈ V} T_ct′) + B)

B: the number of terms in the vocabulary, B = |V|
365
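A minimal training sketch that uses exactly these add-one smoothed estimates (plain Python; the two toy documents at the end are invented for illustration):

import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, class) pairs. Returns log-space parameters using
    # P(c) = N_c / N and P(t|c) = (T_ct + 1) / (sum_t' T_ct' + B), B = |V|.
    N = len(docs)
    N_c = Counter(c for _, c in docs)
    T = defaultdict(Counter)                 # T[c][t] = T_ct
    vocab = set()
    for tokens, c in docs:
        T[c].update(tokens)
        vocab.update(tokens)
    B = len(vocab)
    log_prior = {c: math.log(N_c[c] / N) for c in N_c}
    log_cond = {(t, c): math.log((T[c][t] + 1) / (sum(T[c].values()) + B))
                for c in N_c for t in vocab}
    return log_prior, log_cond

# Hypothetical toy data, for illustration only:
docs = [(["cheap", "pills"], "spam"), (["meeting", "tomorrow"], "ham")]
log_prior, log_cond = train_nb(docs)
print(round(math.exp(log_cond[("cheap", "spam")]), 2))   # (1+1)/(2+4) ≈ 0.33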
Example
366
Example: Parameter estimates
Conditional probabilities:
367
Example: Classification
368
Time complexity of Naive Bayes
Derivation of NB formula
Evaluation of text classification
371
Summary: clustering and classification
372
Reading
373