Lecture 7
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Simone.Teufel@cl.cam.ac.uk
Lent 2014
328
IR System Components
[Figure (slides 329–331): IR system architecture. Document collection, document normalisation, indexer and indexes on one side; query UI and query normalisation on the other; a ranking/matching module produces the set of relevant documents. Over the three slides a Clustering module is added, first on the results side (clustering the set of relevant documents) and then also on the document-collection side.]
Overview
332
Overview
Hierarchical clustering
333
HAC: Basic algorithm
334
A dendrogram
335
Term–document matrix to document–document matrix
Log frequency weighting and cosine normalisation (term–document matrix):

              SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0.000   0.405
  wuthering   0.000   0.000   0.588

Document–document matrix:

        SaS          PaP
  SaS   P(SaS,SaS)   P(PaP,SaS)
  PaP   P(SaS,PaP)   P(PaP,PaP)
  WH    P(SaS,WH)    P(PaP,WH)
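As a sketch of how the document–document matrix is obtained, the following plain-Python snippet (variable names are illustrative) takes the cosine-normalised columns above and fills each entry P(d_i, d_j) with the dot product of the two document vectors, which for length-normalised vectors is their cosine similarity:

docs = ["SaS", "PaP", "WH"]
tdm = {                       # term -> weights per document (SaS, PaP, WH)
    "affection": (0.789, 0.832, 0.524),
    "jealous":   (0.515, 0.555, 0.465),
    "gossip":    (0.335, 0.000, 0.405),
    "wuthering": (0.000, 0.000, 0.588),
}

def sim(i, j):
    # Dot product of two already length-normalised document vectors.
    return sum(w[i] * w[j] for w in tdm.values())

for a in range(len(docs)):
    for b in range(len(docs)):
        print(f"P({docs[a]},{docs[b]}) = {sim(a, b):.3f}")

For example, this prints P(SaS,PaP) ≈ 0.942, the entry used when the documents are clustered.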
for i := 1 to n do
    c_i := {x_i}
C := {c_1, ..., c_n}
j := n + 1
while |C| > 1 do
    (c_n1, c_n2) := arg max_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
    c_j := c_n1 ∪ c_n2
    C := (C \ {c_n1, c_n2}) ∪ {c_j}
    j := j + 1
end
337
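A minimal Python sketch of the basic algorithm above, assuming only a cluster-level similarity function (the single_link example and the toy points are illustrative, not from the slides); this is the straightforward loop whose cost is discussed on the next slide, not an optimised implementation:

def hac(points, sim):
    # Naive HAC, following the pseudocode above: start from singleton
    # clusters and repeatedly merge the most similar pair.
    clusters = [frozenset([x]) for x in points]        # c_i := {x_i}
    merges = []
    while len(clusters) > 1:                           # while |C| > 1
        cu, cv = max(
            ((a, b) for a in clusters for b in clusters if a is not b),
            key=lambda pair: sim(*pair),
        )
        cj = cu | cv                                   # c_j := c_u ∪ c_v
        clusters = [c for c in clusters if c not in (cu, cv)] + [cj]
        merges.append((cu, cv))
    return merges

# Single link as an example similarity: the (negated) smallest distance
# between any two members; toy 1-D points for illustration.
dist = lambda x, y: abs(x - y)
single_link = lambda cu, cv: -min(dist(x, y) for x in cu for y in cv)
print(hac([0.0, 0.2, 1.0, 1.1], single_link))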
Computational complexity of the basic algorithm
338
Hierarchical clustering: similarity functions
339
Example: hierarchical clustering; similarity functions

[Figure: eight points in the plane; a, b, c, d in a top row and e, f, g, h directly below them. The horizontal gaps are 1, 1.5 and 1; the vertical distance between the rows is 2.]

Distance matrix:

      a        b        c        d        e     f     g
  b   1
  c   2.5      1.5
  d   3.5      2.5      1
  e   2        √5       √10.25   √16.25
  f   √5       2        √6.25    √10.25   1
  g   √10.25   √6.25    2        √5       2.5   1.5
  h   √16.25   √10.25   √5       2        3.5   2.5   1
340
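The distance matrix can be reproduced from the point layout; the short plain-Python sketch below assumes the coordinates implied by the figure (rows a–d and e–h at vertical distance 2, horizontal positions 0, 1, 2.5, 3.5):

from math import sqrt

# Coordinates read off the figure (an assumption consistent with the matrix).
pts = {"a": (0, 2), "b": (1, 2), "c": (2.5, 2), "d": (3.5, 2),
       "e": (0, 0), "f": (1, 0), "g": (2.5, 0), "h": (3.5, 0)}

def dist(p, q):
    (x1, y1), (x2, y2) = pts[p], pts[q]
    return sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Print the lower-triangular Euclidean distance matrix, as on the slide.
names = "abcdefgh"
for r in names[1:]:
    print(r, "  ".join(f"{dist(r, c):5.2f}" for c in names[:names.index(r)]))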
Single Link is O(n²)

[The slide steps through the distance matrix from the previous example under single-link merging: the closest pair a, b (distance 1) is merged first, and the entry for the new cluster a+b against each remaining point is the minimum of the old a and b entries.]
341
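The step that makes an O(n²) algorithm possible is the single-link update sketched below (plain Python, with a dict-of-dicts distance matrix as an assumed representation): after the closest pair is merged, the merged cluster's distance to every other cluster is just the minimum of the two old entries, so each merge costs O(n). The remaining ingredient, not shown here, is keeping a best-merge candidate for each cluster so that the closest pair can also be found in O(n) per step.

def merge_single_link(D, i, j):
    # Merge cluster j into cluster i under single link.
    # D is a symmetric dict-of-dicts: D[x][y] = distance between x and y.
    for k in D:
        if k not in (i, j):
            D[i][k] = D[k][i] = min(D[i][k], D[j][k])
    del D[j]
    for k in D:
        D[k].pop(j, None)
    return D

# Tiny illustration with three 1-D "clusters" at positions 0, 1 and 5:
D = {x: {y: abs(x - y) for y in (0, 1, 5)} for x in (0, 1, 5)}
print(merge_single_link(D, 0, 1))   # cluster 0+1 is now at distance 4 from 5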
Single Link
342
Clustering Result under Single Link

[Figure: on the eight-point example, single link joins the two rows; the resulting clusters are {a, b, c, d} and {e, f, g, h}. Dendrogram leaves: a b c d e f g h.]
343
Complete Link

[The slide repeats the distance matrix and applies complete-link merging, where the distance between two clusters is the largest pairwise distance between their members.]
344
Clustering result under complete link

[Figure: on the eight-point example, complete link instead joins the 2×2 blocks; the resulting clusters are {a, b, e, f} and {c, d, g, h}. Dendrogram leaves: a b c d e f g h.]
345
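The contrast between the two results can be reproduced with an off-the-shelf HAC implementation; the sketch below assumes SciPy is available and uses the eight coordinates read off the figure. Cutting each dendrogram into two clusters shows single link grouping the two rows and complete link grouping the 2×2 blocks:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

labels = list("abcdefgh")
X = np.array([[0, 2], [1, 2], [2.5, 2], [3.5, 2],      # a, b, c, d
              [0, 0], [1, 0], [2.5, 0], [3.5, 0]])     # e, f, g, h

for method in ("single", "complete"):
    Z = linkage(X, method=method)                  # HAC with the chosen linkage
    flat = fcluster(Z, t=2, criterion="maxclust")  # cut into two flat clusters
    groups = {}
    for name, cl in zip(labels, flat):
        groups.setdefault(cl, []).append(name)
    print(method, sorted(groups.values()))

# Expected: single   [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
#           complete [['a', 'b', 'e', 'f'], ['c', 'd', 'g', 'h']]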
Example: gene expression data
346
Rat CNS gene expression data (excerpt)
347
Rat CNS gene clustering – single link

[Dendrogram, single link; x-axis 0–100. Leaves (genes) include NFM, GRg1, NFL, GFAP, NMDA1, cellubrevin, nestin, GAT1, EGFR, NFH, nAChRa7, mGluR5, actin, ACHE, mAChR2, GRg2, MOG, GRb1.]
348
Rat CNS gene clustering – complete link

[Dendrogram, complete link; x-axis 0–100. Leaves (genes) include NFM, CCO1, Ins2, FGFR, GDNF, SC6, keratin, CCO2, IGFR2, cyclin A, TH, Brm.]
349
Rat CNS gene clustering – group average link

[Dendrogram, group-average link; x-axis 0–100. Leaves (genes) include NFM, NFL, NMDA1, nestin, GAT1, cellubrevin, NFH, MOG, actin, SC2, NT3, SC1, MK2, cyclin B, keratin, CCO2, IP3R2, ChAT, aFGF, mGluR3, NMDA2, 5HT1c, GRa1, bFGF.]
350
Flat or hierarchical clustering?
351
Overview
A text classification task: Email spam filtering
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
352
Topic classification
[Figure: topic classification; a test document d′ is assigned to the class China, i.e. γ(d′) = China.]
353
Statistical/probabilistic classification methods
354
Examples of how search engines use classification
355
Formal definition of TC: Training
Given:
A document space X
Documents are represented in this space – typically some type
of high-dimensional space.
A fixed set of classes C = {c_1, c_2, ..., c_J}
The classes are human-defined for the needs of an application
(e.g., spam vs. nonspam).
A training set D of labelled documents. Each labelled
document ⟨d, c⟩ ∈ X × C
Using a learning method, we then wish to learn a classifier γ that maps documents to classes:
γ : X → C
356
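As a small illustration of these objects (all names below are made up for the sketch, not taken from the slides), the setup can be written down as Python types: documents live in the space X, classes come from the fixed set C, the training set pairs them, and a learning method returns the classifier γ : X → C.

from typing import Callable, Dict, List, Tuple

Document = Dict[str, float]                  # a point in the document space X,
                                             # e.g. a sparse term-weight vector
Class = str                                  # one of the classes in C
TrainingSet = List[Tuple[Document, Class]]   # labelled pairs ⟨d, c⟩ ∈ X × C
Classifier = Callable[[Document], Class]     # the learned mapping γ : X → C

def learn(D: TrainingSet) -> Classifier:
    # Placeholder for a learning method (e.g. Naive Bayes, introduced below):
    # it consumes labelled documents and returns a classifier.
    ...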
Formal definition of TC: Application/Testing
357
The Naive Bayes classifier
358
Maximum a posteriori class
359
Taking the log
360
Naive Bayes classifier
Classification rule:

    c_map = arg max_{c ∈ C} [ log P̂(c) + Σ_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]
Simple interpretation:
Each conditional parameter log P̂(t_k|c) is a weight that
indicates how good an indicator t_k is for c.
The prior log P̂(c) is a weight that indicates the relative
frequency of c.
The sum of log prior and term weights is then a measure of
how much evidence there is for the document being in the
class.
We select the class with the most evidence.
361
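The rule translates directly into code. In the sketch below (plain Python), the parameter dictionaries log_prior and log_cond are assumed to have been estimated already, e.g. with the smoothed counts introduced on the following slides:

import math

def nb_classify(tokens, classes, log_prior, log_cond):
    # Return c_map = argmax_c [ log P(c) + sum_k log P(t_k | c) ].
    # log_prior[c] and log_cond[(t, c)] are precomputed; tokens missing
    # from the vocabulary are simply skipped here.
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]                 # evidence from the class frequency
        for t in tokens:
            if (t, c) in log_cond:
                score += log_cond[(t, c)]    # weight of t as an indicator for c
        if score > best_score:
            best_class, best_score = c, score
    return best_class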
Parameter estimation take 1: Maximum likelihood
Estimate parameters P̂(c) and P̂(t_k|c) from the training data: how?

Prior:

    P̂(c) = N_c / N

N_c: number of docs in class c; N: total number of docs

Conditional probabilities:

    P̂(t|c) = T_ct / Σ_{t′ ∈ V} T_ct′

T_ct: number of occurrences (tokens) of term t in the training documents of class c
362
The problem with maximum likelihood estimates: Zeros
C = China
363
The problem with maximum likelihood estimates: Zeros
364
To avoid zeros: Add-one smoothing
Before:

    P̂(t|c) = T_ct / Σ_{t′ ∈ V} T_ct′

Now: add one to each count to avoid zeros:

    P̂(t|c) = (T_ct + 1) / Σ_{t′ ∈ V} (T_ct′ + 1) = (T_ct + 1) / ((Σ_{t′ ∈ V} T_ct′) + B)

B: the number of terms in the vocabulary, B = |V|
365
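A minimal training sketch that uses exactly these add-one smoothed estimates (plain Python; the two toy documents at the end are invented for illustration):

import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, class) pairs. Returns log-space parameters using
    # P(c) = N_c / N and P(t|c) = (T_ct + 1) / (sum_t' T_ct' + B), B = |V|.
    N = len(docs)
    N_c = Counter(c for _, c in docs)
    T = defaultdict(Counter)                 # T[c][t] = T_ct
    vocab = set()
    for tokens, c in docs:
        T[c].update(tokens)
        vocab.update(tokens)
    B = len(vocab)
    log_prior = {c: math.log(N_c[c] / N) for c in N_c}
    log_cond = {(t, c): math.log((T[c][t] + 1) / (sum(T[c].values()) + B))
                for c in N_c for t in vocab}
    return log_prior, log_cond

# Hypothetical toy data, for illustration only:
docs = [(["cheap", "pills"], "spam"), (["meeting", "tomorrow"], "ham")]
log_prior, log_cond = train_nb(docs)
print(round(math.exp(log_cond[("cheap", "spam")]), 2))   # (1+1)/(2+4) ≈ 0.33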
Example
366
Example: Parameter estimates
Conditional probabilities:
367
Example: Classification
368
Time complexity of Naive Bayes
Derivation of NB formula
Evaluation of text classification
371
Summary: clustering and classification
372
Reading
373