F14 CS194 Lec 05 Natural Language
Lecture 5
Natural Language Processing
Unigrams have higher counts and can therefore detect influences
that are weak, while bigrams and trigrams capture strong
influences that are more specific.
e.g. “the white house” will generally have very different influences
from the sum of influences of “the”, “white”, “house”.
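As a minimal sketch of how unigram, bigram and trigram counts differ, the snippet below extracts n-grams from a made-up token sequence. The helper name `ngrams` and the sample sentence are assumptions for illustration, not part of the lecture.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample text, split on whitespace.
tokens = "the white house said the white paper is ready".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Shorter n-grams repeat more often; longer ones are rarer but more specific.
print(unigram_counts[("the",)])
print(bigram_counts[("the", "white")])
print(trigram_counts[("the", "white", "house")])
```

Even in this tiny sample, "the" occurs more often than the trigram "the white house", which is the trade-off between count and specificity described above.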
N-gram size
N-grams pose some challenges in feature set size.
If the original vocabulary size is |V|, the number of 2-grams is |V|^2,
while for 3-grams it is |V|^3.
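To make the growth concrete, here is a small sketch that evaluates |V|^n for an assumed vocabulary of 10,000 words. The vocabulary size and the function name `max_ngram_features` are illustrative assumptions; the value is the number of possible n-grams, an upper bound on what a real corpus contains.

```python
def max_ngram_features(vocab_size, n):
    """Number of distinct n-grams that can be formed from vocab_size words."""
    return vocab_size ** n

vocab_size = 10_000  # assumed vocabulary size, for illustration only

for n in (1, 2, 3):
    print(f"{n}-grams: {max_ngram_features(vocab_size, n):,}")
```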
The context is the set of words that occur near the word, i.e. at
displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the
word occurs.
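The definition of context above can be sketched as a counting routine. This is one possible reading, assuming tokenized sentences and a finite `window` parameter that limits the displacements considered. The function name `context_counts` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=3):
    """Count, for each word, the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # displacement 0 is the word itself, so skip it
                    contexts[word][sentence[j]] += 1
    return contexts

# Made-up sentences, already split into tokens.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
contexts = context_counts(sentences, window=2)
print(contexts["sat"].most_common())
```

Setting `window` to the sentence length recovers the unbounded displacements of the definition. A small window is a common practical choice, not something the definition requires.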
[Figure: labels "Woman" and "Queen"]
(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
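The bracketed parse above can be read into a nested structure with a few lines of code. The sketch below is a hand-rolled reader, not the API of any particular parsing library. `parse_bracketed` and `leaves` are hypothetical helper names.

```python
import re

def parse_bracketed(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a bare token is a word (leaf)
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def leaves(tree):
    """Collect the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]

tree = parse_bracketed("""
(ROOT
  (S
    (NP (DT the) (NN cat))
    (VP (VBD sat)
      (PP (IN on)
        (NP (DT the) (NN mat))))))
""")
print(" ".join(leaves(tree)))
```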
Grammars
S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
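In the rules above, S expands to a VP, the VP contains an SBAR, and the SBAR contains an S again. That loop lets a sentence such as the Nero example embed one clause inside another. A minimal sketch of detecting this kind of recursion follows, assuming the grammar is stored as a plain dictionary. The names `GRAMMAR`, `reachable` and `is_recursive` are illustrative.

```python
# The three rules from the slide, as left-hand side -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["VB", "NP", "SBAR"]],
    "SBAR": [["IN", "S"]],
}

def reachable(grammar, start):
    """Return every symbol that can appear in some expansion of `start`."""
    seen = set()
    stack = [start]
    while stack:
        symbol = stack.pop()
        for rhs in grammar.get(symbol, []):
            for child in rhs:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
    return seen

def is_recursive(grammar, symbol):
    """A symbol is recursive if expanding it can produce the symbol again."""
    return symbol in reachable(grammar, symbol)

for symbol in ("S", "VP", "SBAR", "NP"):
    print(symbol, is_recursive(GRAMMAR, symbol))
```

With only these three rules every derivation would recurse forever. A usable grammar also needs non-recursive alternatives (for example a VP without an SBAR), which are not shown on the slide.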
PCFGs
Complex sentences can be parsed in many ways, most of which
make no sense or are extremely improbable (like Groucho’s
example).
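One way to see how a PCFG ranks competing parses: each rule carries a probability, and a parse is scored by the product of the probabilities of the rules it uses. The sketch below is illustrative only. The rule probabilities are invented, word-level probabilities are ignored, and the two trees are hypothetical attachments of a prepositional phrase over the tag sequence DT NN VB DT NN IN DT NN.

```python
# Invented rule probabilities; for each left-hand side they sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("VB", "NP")): 0.7,
    ("VP", ("VB", "NP", "PP")): 0.3,
    ("NP", ("DT", "NN")): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("IN", "NP")): 1.0,
}

def tree_probability(tree):
    """Multiply the probabilities of all rules used in a (label, children) tree."""
    if isinstance(tree, str):
        return 1.0  # a part-of-speech tag; lexical probabilities are ignored here
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_probability(child)
    return prob

simple_np = ("NP", ["DT", "NN"])
pp = ("PP", ["IN", simple_np])

# Parse A: the PP attaches to the verb phrase.
verb_attach = ("S", [simple_np, ("VP", ["VB", simple_np, pp])])
# Parse B: the PP attaches to the object noun phrase.
noun_attach = ("S", [simple_np, ("VP", ["VB", ("NP", [simple_np, pp])])])

parses = {"verb attachment": verb_attach, "noun attachment": noun_attach}
best = max(parses, key=lambda name: tree_probability(parses[name]))
print(best)
```

Under these made-up numbers the verb attachment scores higher. With probabilities estimated from a corpus, the same product is what lets a parser prefer sensible readings over improbable ones.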