IS 7118 Unit-9 Semantics
Wordform Lemma
banks bank
sung sing
duermes dormir
IS 7118:NLP Unit-9: Semantics, 4
Prof.R.K.Rao Bandaru
Lemmas have senses
• Homonymy
• Polysemy
• Metonymy
• Synonymy / Antonymy
• Hyponymy / Hypernymy
• Meronymy / Holonymy
• Troponymy
Metonymy
• Information Extraction
• Information Retrieval
• Question Answering
• Bioinformatics and Medical Informatics
• Machine Translation
There is some hierarchical information, for example with hypernymy/hyponymy
• Where it is:
http://wordnetweb.princeton.edu/perl/webwn
• Libraries
– Python: WordNet from NLTK
http://www.nltk.org/Home
– Java:
• JWNL, extJWNL on sourceforge
path-based similarity
simpath(c1,c2) = 1/pathlen(c1,c2)
simpath(nickel,coin) = 1/2 = .5
simpath(fund,budget) = 1/2 = .5
simpath(nickel,currency) = 1/4 = .25
simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17
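The path-based scores above can be reproduced with a short sketch. The hierarchy below is an assumption: a hypothetical child-to-parent fragment chosen only so that the path lengths agree with the listed examples; it is not taken from WordNet itself. `pathlen` is taken to be 1 plus the number of edges on the shortest path, which is what the values above imply.

```python
from collections import deque

# Hypothetical hierarchy fragment (child -> parent). Assumed for illustration;
# chosen so that path lengths match the simpath examples in the slides.
PARENT = {
    "nickel": "coin",
    "dime": "coin",
    "coin": "coinage",
    "coinage": "currency",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "fund": "money",
    "budget": "fund",
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
}


def build_graph(parent):
    """Turn the child -> parent map into an undirected adjacency map."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    return graph


GRAPH = build_graph(PARENT)


def pathlen(c1, c2):
    """Return 1 + the number of edges on the shortest path from c1 to c2."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return edges + 1
        for nxt in GRAPH[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, edges + 1))
    raise ValueError(f"no path between {c1!r} and {c2!r}")


def sim_path(c1, c2):
    """Path-based similarity: simpath(c1, c2) = 1 / pathlen(c1, c2)."""
    return 1.0 / pathlen(c1, c2)


for pair in [("nickel", "coin"), ("fund", "budget"), ("nickel", "currency"),
             ("nickel", "money"), ("coinage", "Richter scale")]:
    print(pair, round(sim_path(*pair), 2))
```

Note that nickel/money and coinage/Richter scale receive the same score even though the second pair is intuitively much less related, which is one motivation for the information-content measures.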
Σ count(w) …
Information content: definitions
• Information content:
IC(c) = -log P(c)
• Most informative subsumer (Lowest common subsumer)
LCS(c1,c2) = The most informative (lowest) node in the hierarchy
subsuming both c1 and c2
$$\mathrm{sim}_{Lin}(c_1,c_2) = \frac{2\log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$$
Lin similarity function
$$\mathrm{sim}_{Lin}(c_1,c_2) = \frac{2\log P(\mathrm{LCS}(c_1,c_2))}{\log P(c_1) + \log P(c_2)}$$

$$\mathrm{sim}_{Lin}(\text{hill},\text{coast}) = \frac{2\log P(\text{geological-formation})}{\log P(\text{hill}) + \log P(\text{coast})} = \frac{2\ln 0.00176}{\ln 0.0000189 + \ln 0.0000216} = .59$$
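As a quick check of the hill/coast arithmetic, here is a minimal sketch that evaluates IC(c) and simLin directly from probabilities. The function and constant names are illustrative assumptions; the three probabilities are the ones used in the worked example.

```python
import math


def information_content(p):
    """IC(c) = -log P(c), using the natural logarithm."""
    return -math.log(p)


def sim_lin(p_c1, p_c2, p_lcs):
    """Lin similarity: 2 log P(LCS) / (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))


# Probabilities taken from the worked example in the slides.
P_HILL = 0.0000189
P_COAST = 0.0000216
P_GEOLOGICAL_FORMATION = 0.00176  # LCS of hill and coast

print(round(sim_lin(P_HILL, P_COAST, P_GEOLOGICAL_FORMATION), 2))  # 0.59
```

The base of the logarithm cancels in the ratio, so the same value results with log2 or log10.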
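Jiang-Conrath Distance
• Related to simLin, but expressed as a distance instead of a similarity:
$$\mathrm{dist}_{JC}(c_1,c_2) = 2\log P(\mathrm{LCS}(c_1,c_2)) - (\log P(c_1) + \log P(c_2))$$
• The corresponding similarity is the reciprocal of the distance:
$$\mathrm{sim}_{JC}(c_1,c_2) = \frac{1}{2\log P(\mathrm{LCS}(c_1,c_2)) - (\log P(c_1) + \log P(c_2))}$$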
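The Jiang-Conrath distance and its reciprocal can be sketched the same way. The names below are illustrative assumptions, and the inputs reuse the hill/coast probabilities from the Lin example (0.0000189, 0.0000216, and 0.00176 for their lowest common subsumer); the slides do not report a Jiang-Conrath value for this pair, so the printed numbers are only what the formula yields.

```python
import math


def dist_jc(p_c1, p_c2, p_lcs):
    """Jiang-Conrath distance: 2 log P(LCS) - (log P(c1) + log P(c2))."""
    return 2 * math.log(p_lcs) - (math.log(p_c1) + math.log(p_c2))


def sim_jc(p_c1, p_c2, p_lcs):
    """Jiang-Conrath similarity: reciprocal of the distance.

    Undefined (division by zero) when c1 and c2 are the same concept,
    because the distance is then 0.
    """
    return 1.0 / dist_jc(p_c1, p_c2, p_lcs)


# Probabilities reused from the hill/coast Lin example (natural log).
P_HILL, P_COAST, P_LCS = 0.0000189, 0.0000216, 0.00176

print(round(dist_jc(P_HILL, P_COAST, P_LCS), 2))
print(round(sim_jc(P_HILL, P_COAST, P_LCS), 3))
```

Unlike simLin, the distance depends on the base of the logarithm, so scores are only comparable when the same base is used throughout.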
• WordNet::Similarity
http://wn-similarity.sourceforge.net/
– Web-based interface:
http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
For a count matrix f with W rows (words) and C columns (contexts), where N is the total count:

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$

p(w=information,c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

p(w,context)                                              p(w)
              computer  data   pinch  result  sugar
apricot       0.00      0.00   0.05   0.00    0.05        0.11
pineapple     0.00      0.00   0.05   0.00    0.05        0.11
digital       0.11      0.05   0.00   0.05    0.00        0.21
information   0.05      0.32   0.00   0.21    0.00        0.58
p(context)    0.16      0.37   0.11   0.26    0.11
$$\mathrm{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$$

where p_{i*} = p(w_i) and p_{*j} = p(c_j) are the row and column marginals of the p(w,context) table.
• pmi(information,data) = log2 (.32 / (.37*.58) ) = .58
(.57 using full precision)
PPMI(w,context)
              computer  data   pinch  result  sugar
apricot       -         -      2.25   -       2.25
pineapple     -         -      2.25   -       2.25
digital       1.66      0.00   -      0.00    -
information   0.00      0.57   -      0.47    -
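The PPMI table can be recomputed from raw counts with a short sketch. The counts below are an assumption: they are inferred from the probability table by multiplying each entry by N = 19 (consistent with 6/19, 11/19 and 7/19 in the examples). Cells with a zero count, shown as "-" in the table, are returned as 0.0 here.

```python
import math

WORDS = ["apricot", "pineapple", "digital", "information"]
CONTEXTS = ["computer", "data", "pinch", "result", "sugar"]

# Assumed co-occurrence counts, inferred from the probability table (N = 19).
COUNTS = [
    [0, 0, 1, 0, 1],  # apricot
    [0, 0, 1, 0, 1],  # pineapple
    [2, 1, 0, 1, 0],  # digital
    [1, 6, 0, 4, 0],  # information
]


def ppmi_matrix(counts):
    """Return the positive PMI matrix for a word-by-context count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out = []
        for j, f in enumerate(row):
            if f == 0:
                # PMI is undefined (log of 0); PPMI treats the cell as 0.
                out.append(0.0)
                continue
            p_ij = f / total
            p_i = row_sums[i] / total
            p_j = col_sums[j] / total
            out.append(max(math.log2(p_ij / (p_i * p_j)), 0.0))
        result.append(out)
    return result


PPMI = ppmi_matrix(COUNTS)
for word, row in zip(WORDS, PPMI):
    print(f"{word:12s}", [round(x, 2) for x in row])
```

Negative PMI values, such as (digital, data), are clipped to 0, which is what distinguishes PPMI from plain PMI.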
Weighting PMI
• PMI is biased toward infrequent events
• Various weighting schemes help alleviate this
– See Turney and Pantel (2010)
– Add-one smoothing can also help
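To illustrate how add-one smoothing dampens the bias toward rare events, here is a minimal sketch that adds a constant k to every cell before computing PMI. The count matrix is an assumption inferred from the probability table (each entry times N = 19), and the function name and `k` parameter are hypothetical.

```python
import math

# Assumed counts (rows: apricot, pineapple, digital, information;
# columns: computer, data, pinch, result, sugar), inferred from N = 19.
COUNTS = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
]


def pmi(counts, i, j, k=0.0):
    """PMI of cell (i, j) after adding k to every count (k=0: unsmoothed).

    With k=0 the cell itself must have a nonzero count, otherwise the
    logarithm is undefined.
    """
    smoothed = [[f + k for f in row] for row in counts]
    total = sum(sum(row) for row in smoothed)
    p_ij = smoothed[i][j] / total
    p_i = sum(smoothed[i]) / total
    p_j = sum(row[j] for row in smoothed) / total
    return math.log2(p_ij / (p_i * p_j))


# (apricot, pinch) is a rare event with a high unsmoothed PMI.
print(round(pmi(COUNTS, 0, 2), 2), round(pmi(COUNTS, 0, 2, k=1), 2))
# (information, data) is a frequent event and changes far less.
print(round(pmi(COUNTS, 3, 1), 2), round(pmi(COUNTS, 3, 1, k=1), 2))
```

On these counts the rare pair loses most of its score while the frequent pair moves only slightly, which is the intended effect of smoothing.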
Objects of “drink”, sorted by PMI

Object of “drink”   Count   PMI
tea                 2       11.8
liquid              2       10.5
wine                2       9.3
anything            3       5.2
it                  3       1.3
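Cosine similarity

$$\cos(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|}\cdot\frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$

• Dot product: the numerator v · w
• Unit vectors: v/|v| and w/|w|

Example count vectors:
apricot       1   0   0
digital       0   1   2
information   1   6   1

$$\cos(\text{apricot},\text{digital}) = \frac{0+0+0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0$$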
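The cosine formula translates directly into a few lines of code. This is a minimal sketch using plain Python lists; the function name is an illustrative assumption, and the three vectors are the example counts for apricot, digital and information.

```python
import math


def cosine(v, w):
    """Cosine of the angle between v and w (dot product over norms).

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)


# Example count vectors from the slides.
apricot = [1, 0, 0]
digital = [0, 1, 2]
information = [1, 6, 1]

print(cosine(apricot, digital))                 # 0.0
print(round(cosine(digital, information), 2))   # 0.58
print(round(cosine(apricot, information), 2))   # 0.16
```

Because count vectors are non-negative, the cosine here always falls between 0 and 1.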
Other possible similarity measures
• Of the four measures given below, cosine similarity is the most popular