
Introduction to Data Science

Lecture 5
Natural Language Processing

CS 194 Fall 2014


John Canny
Outline for this Evening
• Project Suggestions Overview
• N-grams
• Grammars
• Parsing
• Dependencies
Reminders
• You should have teams created by now (if not, email us immediately).
• On Wednesday you should submit an ordered list of 5 preferences. They can be 5 of the recommended projects, or:
• You can submit a different project topic of your own design (but also submit 4 other preferences from the list).
Tour of Project Suggestions
• Please take a look at the project suggestions online.

• Here is some background on them…


Natural Language Processing (NLP)
• In a recent survey (KDnuggets blog) of data scientists, 62% reported working “mostly or entirely” with data about people. Much of this data is text.
• In the suggested CS194-16 projects (a random sample of data science projects around campus), nearly half involve natural language text processing.
• NLP is a central part of mining large datasets.

Natural Language Processing
Some basic terms:
• Syntax: the allowable structures in the language: sentences,
phrases, affixes (-ing, -ed, -ment, etc.).
• Semantics: the meaning(s) of texts in the language.
• Part-of-Speech (POS): the category of a word (noun, verb,
preposition etc.).
• Bag-of-words (BoW): a featurization that uses a vector of word
counts (or binary) ignoring order.
• N-gram: for a fixed, small N (2-5 is common), an n-gram is a
consecutive sequence of words in a text.
Bag of words Featurization
Assuming we have a dictionary mapping words to a unique integer
id, a bag-of-words featurization of a sentence could look like this:
Sentence: The cat sat on the mat
Word ids: 1 12 5 3 1 14
The BoW featurization would be the vector (positions 1–14):
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
with non-zero entries at positions 1, 3, 5, 12, and 14.
In practice this would be stored as a sparse vector of (id, count) pairs:
(1,2), (3,1), (5,1), (12,1), (14,1)
Note that the original word order is lost, replaced by the order of ids.
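As a concrete illustration, here is a minimal sketch of this featurization in plain Python; the word-to-id dictionary below is assumed purely to match the example above:

from collections import Counter

# Hypothetical word-to-id dictionary, chosen to match the example above.
word_ids = {"the": 1, "on": 3, "sat": 5, "cat": 12, "mat": 14}

def bow_featurize(sentence):
    """Return a sparse bag-of-words vector as a sorted list of (id, count) pairs."""
    ids = [word_ids[w] for w in sentence.lower().split()]
    return sorted(Counter(ids).items())

print(bow_featurize("The cat sat on the mat"))
# [(1, 2), (3, 1), (5, 1), (12, 1), (14, 1)]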
N-grams
Because word order is lost, the sentence meaning is weakened.
This sentence has quite a different meaning but the same BoW
vector:
Sentence: The mat sat on the cat
Word ids: 1 14 5 3 1 12
BoW featurization:
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1

But word order is important, especially the order of nearby words.
N-grams capture this by modeling tuples of consecutive words.


N-grams
Sentence: The cat sat on the mat
2-grams: the-cat, cat-sat, sat-on, on-the, the-mat
Notice how even these short n-grams “make sense” as linguistic
units. For the other sentence we would have different features:
Sentence: The mat sat on the cat
2-grams: the-mat, mat-sat, sat-on, on-the, the-cat
We can go still further and construct 3-grams:
Sentence: The cat sat on the mat
3-grams: the-cat-sat, cat-sat-on, sat-on-the, on-the-mat
Which capture still more of the meaning:
Sentence: The mat sat on the cat
3-grams: the-mat-sat, mat-sat-on, sat-on-the, on-the-cat
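A small sketch of n-gram extraction in plain Python (hyphen-joined strings, as in the slides):

def ngrams(sentence, n):
    """Return the n-grams of a sentence as hyphen-joined strings."""
    words = sentence.lower().split()
    return ["-".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The cat sat on the mat", 2))
# ['the-cat', 'cat-sat', 'sat-on', 'on-the', 'the-mat']
print(ngrams("The mat sat on the cat", 3))
# ['the-mat-sat', 'mat-sat-on', 'sat-on-the', 'on-the-cat']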
N-grams Features
Typically, it is advantageous to use multiple n-gram features in machine learning models for text, e.g.

unigrams + bigrams (2-grams) + trigrams (3-grams).

The unigrams have higher counts and are able to detect influences that are weak, while bigrams and trigrams capture strong influences that are more specific.

e.g. “the white house” will generally have very different influences from the sum of the influences of “the”, “white”, and “house”.
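scikit-learn is not covered in this lecture, but its CountVectorizer is one convenient way to build combined unigram + bigram + trigram features; a sketch assuming scikit-learn is installed:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The mat sat on the cat"]

# ngram_range=(1, 3) generates unigram, bigram, and trigram features together.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # 'cat', 'cat sat', 'cat sat on', ...
print(X.toarray())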
N-grams size
N-grams pose some challenges in feature set size.
If the original vocabulary size is |V|, the number of possible 2-grams is |V|², while for 3-grams it is |V|³.

Luckily, natural language n-grams (including single words) have a power-law frequency structure. This means that most of the n-gram occurrences you see come from a relatively small set of common n-grams. A dictionary that contains the most common n-grams will cover most of the n-grams you see.
Power laws for N-grams
N-grams follow a power law distribution:
N-grams size
Because of this you may see values like this:
• Unigram dictionary size: 40,000
• Bigram dictionary size: 100,000
• Trigram dictionary size: 300,000
with coverage of > 80% of the feature occurrences in the text.
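One rough way to check coverage figures like these on a real corpus (a sketch assuming NLTK and its Brown corpus are available; the exact percentages depend on the corpus and dictionary sizes):

from collections import Counter
from nltk.corpus import brown   # assumes nltk.download('brown') has been run

def coverage(tokens, n, dict_size):
    """Fraction of n-gram occurrences covered by the dict_size most common n-grams."""
    grams = list(zip(*[tokens[i:] for i in range(n)]))
    counts = Counter(grams)
    covered = sum(c for _, c in counts.most_common(dict_size))
    return covered / len(grams)

words = [w.lower() for w in brown.words()]
print(coverage(words, 1, 40_000))    # unigram coverage
print(coverage(words, 2, 100_000))   # bigram coverage
print(coverage(words, 3, 300_000))   # trigram coverage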
N-gram Language Models
N-grams can be used to build statistical models of texts. When this is done, they are called n-gram language models.

An n-gram language model associates a probability with each n-gram, such that the sum over all n-grams (for fixed n) is 1.

You can then determine the overall likelihood of a particular sentence:
“The cat sat on the mat”
is much more likely than
“The mat sat on the cat”
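A toy sketch of a bigram language model (estimated as smoothed conditional probabilities, the usual formulation); the corpus here is made up and far too small to be meaningful, but it shows the mechanics:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
vocab_size = len(unigram_counts)

def sentence_prob(sentence, alpha=0.1):
    """Product of add-alpha smoothed bigram probabilities P(w_i | w_{i-1})."""
    words = sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= (bigram_counts[(prev, cur)] + alpha) / (unigram_counts[prev] + alpha * vocab_size)
    return prob

print(sentence_prob("the cat sat on the mat"))   # higher
print(sentence_prob("the mat sat on the cat"))   # lower (contains the unseen bigram mat-sat)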
Skip-grams
We can also analyze the meaning of a particular word by looking at the contexts in which it occurs.

The context is the set of words that occur near the word, i.e. at displacements of …, -3, -2, -1, +1, +2, +3, … in each sentence where the word occurs.

A skip-gram is a set of non-consecutive words (with specified offsets) that occur in some sentence.

We can construct a BoSG (bag of skip-grams) representation for each word from the skip-gram table.
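A small sketch of extracting context words at fixed displacements (a ±2 window is assumed here just for illustration):

def context_pairs(sentence, window=2):
    """Yield (word, context_word, offset) tuples for every word in the sentence."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(words):
                yield (word, words[j], offset)

for pair in context_pairs("The cat sat on the mat"):
    print(pair)
# ('the', 'cat', 1), ('the', 'sat', 2), ('cat', 'the', -1), ('cat', 'sat', 1), ...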
Skip-grams
Then with a suitable embedding (DNN or linear projection) of the skip-gram features, we find that word meaning has an algebraic structure:

[Figure: vectors for Man, King, Woman, and Queen in the embedding space]

Man + (King – Man) + (Woman – Man) = Queen

Tomáš Mikolov et al. (2013). “Efficient Estimation of Word Representations in Vector Space”
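gensim is not part of this lecture, but it ships pre-trained embeddings that reproduce this kind of analogy; a sketch using GloVe vectors (downloaded on first use) as a stand-in for word2vec:

import gensim.downloader as api

# Downloads a small set of pre-trained GloVe vectors the first time it is run.
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))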
5-min break
Outline for this Evening
• Project Suggestions Overview
• N-grams
• Grammars
• Parsing
• Dependencies
Parts of Speech
Thrax’s original list (c. 100 B.C.):
• Noun
• Verb
• Pronoun
• Preposition
• Adverb
• Conjunction
• Participle
• Article
Parts of Speech
Thrax’s original list (c. 100 B.C.):
• Noun (boat, plane, Obama)
• Verb (goes, spun, hunted)
• Pronoun (She, Her)
• Preposition (in, on)
• Adverb (quietly, then)
• Conjunction (and, but)
• Participle (eaten, running)
• Article (the, a)
Parts of Speech (Penn Treebank 2014)
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
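NLTK's default tagger (see Systems, later) uses this Penn Treebank tagset; a small sketch, assuming the standard NLTK models have been downloaded (the exact tags can vary slightly between tagger versions):

import nltk

# Assumes: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]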
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node)
• S → NP VP
• S → NP VP PP
• NP → DT NN
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node) “the cat sat on the mat”
• S → NP VP
• S → NP VP PP (the cat) (sat) (on the mat)
• NP → DT NN (the cat), (the mat)
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
English grammars are context-free: the productions do not depend on any words before or after the production.

The reconstruction of a sequence of grammar productions from a sentence is called “parsing” the sentence.

It is most conveniently represented as a tree:
Parse Trees
“The cat sat on the mat”
Parse Trees
In bracket notation:

(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
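A minimal sketch of this grammar and parse in NLTK; the VP rule is written as VP → VBD PP here so that the example sentence produces the tree above (assumes NLTK is installed):

import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> DT NN
  VP  -> VBD PP
  PP  -> IN NP
  DT  -> 'the'
  NN  -> 'cat' | 'mat'
  VBD -> 'sat'
  IN  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree)
# (S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))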
Grammars

There are typically multiple ways to produce the same sentence.

Consider the statement by Groucho Marx:
“While I was in Africa, I shot an elephant in my pajamas”
“How he got into my pajamas, I don’t know”
Parse Trees
“…, I shot an elephant in my pajamas” (the parse people hear first: “in my pajamas” attaches to the verb “shot”)
Parse Trees
Groucho’s version (“in my pajamas” attaches to “an elephant”)
Grammars
Recursion is common in grammar rules, e.g.
NP → NP RC
Because of this, sentences of arbitrary length are possible.
Recursion in Grammars
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”.
Grammars

It’s also possible to have “sentences” inside other sentences…

S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
PCFGs
Complex sentences can be parsed in many ways, most of which make no sense or are extremely improbable (like Groucho’s example).

Probabilistic Context-Free Grammars (PCFGs) associate and learn probabilities for each rule:
S → NP VP      0.3
S → NP VP PP   0.7

The parser then tries to find the most likely sequence of productions that generates the given sentence. This adds more realistic “world knowledge” and generally gives much better results. Most state-of-the-art parsers these days use PCFGs.
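The same toy grammar as a PCFG in NLTK; the rule probabilities below are made up for illustration (real parsers learn them from treebanks), and the Viterbi parser returns the single most probable parse:

import nltk

pcfg = nltk.PCFG.fromstring("""
  S   -> NP VP [1.0]
  NP  -> DT NN [1.0]
  VP  -> VBD PP [1.0]
  PP  -> IN NP [1.0]
  DT  -> 'the' [1.0]
  NN  -> 'cat' [0.5] | 'mat' [0.5]
  VBD -> 'sat' [1.0]
  IN  -> 'on' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree.prob(), tree)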
Systems
• NLTK: Python-based NLP system. Many modules, good visualization tools, but not quite state-of-the-art performance.
• Stanford Parser: Another comprehensive suite of tools (also a POS tagger), with state-of-the-art accuracy. Has the definitive dependency module.
• Berkeley Parser: Slightly higher parsing accuracy (than Stanford) but not as many modules.
• Note: high-quality parsing is usually very slow, but see:
https://github.com/dlwh/puck
Outline for this Evening
• Project Suggestions Overview
• N-grams
• Grammars
• Parsing
• Dependencies
Dependencies
In a constituency parse, there is no direct relation between the constituents and the words of the sentence (except for leaf nodes, which produce a single word).

In dependency parsing, the idea is to decompose the sentence into relations directly between words.

This is an older, and some argue more natural, decomposition of the sentence. It also often makes semantic interpretation (based on the meanings of the words) easier.

Let’s look at a simple example:
Dependencies
“The cat sat on the mat”
[Figure: dependency tree alongside the constituency parse tree, with the constituency labels of the leaf nodes]
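One way to obtain Stanford dependencies programmatically is through NLTK's CoreNLP interface; a sketch assuming a CoreNLP server is running locally (the exact relation names depend on the CoreNLP version):

from nltk.parse.corenlp import CoreNLPDependencyParser

# Assumes a Stanford CoreNLP server is running, started roughly like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url="http://localhost:9000")

parse, = parser.raw_parse("The cat sat on the mat")
for governor, relation, dependent in parse.triples():
    print(governor, relation, dependent)
# e.g. ('sat', 'VBD') nsubj ('cat', 'NN'), ...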
Dependencies
From the dependency tree, we can obtain a “sketch” of the sentence, i.e. by starting at the root we can look down one level to get:
“cat sat on”
and then, by looking for the object of the prepositional child, we get:
“cat sat on mat”
We can easily ignore determiners (“a”, “the”).

And importantly, adjectival and adverbial modifiers generally connect to their targets:
Dependencies
“Brave Merida prepared for a long, cold winter”
Dependencies
“Russell reveals himself here as a supremely gifted director of
actors”
Dependencies
Stanford dependencies are constructed from the output of a constituency parser (so you can in principle use other parsers).

The mapping is based on hand-written regular expressions.

Dependency grammars have been widely used for sentiment analysis and for semantic embeddings of sentences.
Summary
• Project Suggestions Overview
• N-grams
• Grammars
• Parsing
• Dependencies
