NLP Course Lecture01 Short
Autumn 2024
The course is delivered at ITMO University, Saint-Petersburg &
Bauman University, Moscow
Valentin Malykh & Qun Liu (MTS AI) Natural Language Processing Autumn 2024 1 / 77
Content
About the course
Content
Acknowledgements
Logistics
Grading policy
Quizzes – 8 x 4
Assignments – 25 x 2
Final project – 60
Course description
Research questions and NLP tasks
Content
Synonyms of NLP
Computational Linguistics
Natural Language Processing
Natural Language Understanding
Human Language Technologies
Subtleties
Computational Linguistics is usually regarded as a branch of
linguistics whose main purpose is to understand the mechanisms
of human language by means of computation.
Natural Language Processing is a branch of computer science
and artificial intelligence whose main purpose is to develop
technologies that enable human-computer interaction in human
languages.
Natural Language Understanding is one of the two main
challenges in Natural Language Processing, while the other is
Natural Language Generation.
Human Language Technologies mainly refers to NLP
technologies, but may also include other language-related
technologies, such as speech technologies, optical character
recognition (OCR), and computer typesetting.
We are used to the fact that human beings can understand
each other through language.
Although language competence is a natural result of human
evolution,
it seems like a miracle because of its complexity.
No other species on this planet uses language to the degree
that humans do.
The mechanism behind human language is not yet fully understood.
Understanding human language by computer is difficult.
Research questions
Turing test
NLP Tasks
Grammars and Automata
Content
Turing machine
Text segmentation and morphology analysis
Content
Text segmentation
Thai
Chinese
http://what-when-how.com/how-to-build-a-digital-library/word-segmentation-and-sorting-digital-library/
Wang & Xu, Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation, IJCNLP 2017
Tags:
S: single character word
B: beginning character of a word
M: middle character of a word
E: end character of a word
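The tagging scheme above turns word segmentation into a per-character labelling problem. A minimal sketch of the conversion in both directions, assuming the words are already segmented; the function names and the sample sentence are illustrative and not taken from the cited paper:

```python
def words_to_tags(words):
    """Map a list of segmented words to one S/B/M/E tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single character word
        else:
            # beginning, zero or more middle characters, end
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their S/B/M/E tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):  # a word closes on S or E
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical example sentence, segmented by hand
words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
print(tags)  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(tags_to_words("".join(words), tags) == words)  # True
```

A segmenter then only has to predict one tag per character; the second function recovers the word boundaries from those predictions.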
Input
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through three
rounds of last month’s U.S. Open. H.J. Heinz Company said it completed
the sale of its Ore-Ida frozen-food business catering to the service
industry to McCain Foods Ltd. for about $500 million. It’s the first group
action of its kind in Britain and one of only a handful of lawsuits against
tobacco companies outside the U.S.
Note: Text in red: change, text in blue: Keep
Output
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through three
rounds of last month ’s U.S. Open . H.J. Heinz Company said it
completed the sale of its Ore-Ida frozen-food business catering to the
service industry to McCain Foods Ltd. for about $ 500 million . It ’s the
first group action of its kind in Britain and one of only a handful of
lawsuits against tobacco companies outside the U.S. .
Note: Text in red: change, text in blue: Keep
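Most of the changes between Input and Output can be approximated with a few rules: detach commas and sentence-final periods, split off the clitic 's, separate the dollar sign, and leave abbreviations intact. A rough sketch under those assumptions; `tokenize` and the tiny `ABBREVIATIONS` set are hypothetical, and a real tokenizer needs far more rules:

```python
import re

# Hypothetical abbreviation list; only covers the example text.
ABBREVIATIONS = {"U.S.", "H.J.", "Ltd."}


def tokenize(text):
    """Whitespace split followed by a few punctuation and clitic rules."""
    tokens = []
    for chunk in text.split():
        # Detach a leading dollar sign: "$500" -> "$", "500"
        if chunk.startswith("$") and len(chunk) > 1:
            tokens.append("$")
            chunk = chunk[1:]
        # Detach trailing punctuation unless the chunk is a known abbreviation
        trailing = []
        while len(chunk) > 1 and chunk[-1] in ",.;:!?" and chunk not in ABBREVIATIONS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Detach the clitic 's (straight or curly apostrophe)
        match = re.fullmatch(r"(.+)(['’]s)", chunk)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(tokenize("It's the first, for about $500 million."))
# ['It', "'s", 'the', 'first', ',', 'for', 'about', '$', '500', 'million', '.']
```

This sketch does not reproduce the final "U.S. ." of the Output, where an abbreviation also ends the sentence; handling that case requires sentence boundary detection.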
Morphological analysis
Stemming:
Ex: writing → writ + ing
Lemmatization:
Ex1: writing → write +V +Prog
Ex2: books → book +N +Pl
Ex3: writes → write +V +3Per +Sg
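The contrast between the two operations can be sketched in a few lines: stemming strips a suffix without consulting a dictionary, while the analysis in Ex1 to Ex3 needs lexical knowledge to recover the lemma and its features. Both the suffix list and the lexicon below are toy assumptions made up for this illustration:

```python
# Hypothetical suffix list, tried in order; real stemmers use many more rules.
SUFFIXES = ["ing", "es", "s"]

# Hypothetical lexicon covering only the three examples above.
LEXICON = {
    "writing": ("write", ["+V", "+Prog"]),
    "books": ("book", ["+N", "+Pl"]),
    "writes": ("write", ["+V", "+3Per", "+Sg"]),
}


def stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""


def analyze(word):
    """Look up the lemma and morphological features in the lexicon."""
    lemma, features = LEXICON[word]
    return " ".join([lemma] + features)


print(stem("writing"))     # ('writ', 'ing')
print(analyze("writing"))  # write +V +Prog
print(analyze("writes"))   # write +V +3Per +Sg
```

Note that the stem "writ" is not the lemma "write": the stemmer has no way to know that an "e" was dropped, which is exactly the information the lexicon supplies.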
Ambiguity in morphology
Language variation
English morphology
Three components
Rewrite rules
An example
An FST
Expanding FST
fox +N +Pl → fox ˆ s #
cat +N +Pl → cat ˆ s #
goose +N +Sg → goose #
goose +N +Pl → geese #
a slide from UW LING 570 by Fei Xia
E-insertion rule: ϵ → e / (s|x|z) ˆ _ s #
Input: ...(s|x|z) ˆ s # (intermediate level)
Output: ...(s|x|z) e s # (surface level)
The rule rejects the pair (fox ˆ s, foxs) and accepts (fox ˆ s, foxes).
a slide from UW LING 570 by Fei Xia
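The two steps, from lexical level to intermediate level and from intermediate level to surface level, can be imitated with a lookup table followed by a regular-expression rewrite. This is only a sketch of the cascade, not an actual finite-state transducer: the `LEXICON` table stands in for the lexical transducer, the function names are hypothetical, and the ASCII characters `^` and `#` are assumed for the morpheme and word boundaries.

```python
import re

# Lexical level -> intermediate level, for the four entries shown above.
LEXICON = {
    ("fox", "+N", "+Pl"): "fox^s#",
    ("cat", "+N", "+Pl"): "cat^s#",
    ("goose", "+N", "+Sg"): "goose#",
    ("goose", "+N", "+Pl"): "geese#",
}


def to_intermediate(lemma, *features):
    """Look up the intermediate form of a lemma plus feature tags."""
    return LEXICON[(lemma, *features)]


def e_insertion(intermediate):
    """Insert e between a morpheme boundary after s, x or z and a final s."""
    return re.sub(r"([sxz])\^(s#)", r"\1^e\2", intermediate)


def to_surface(intermediate):
    """Apply the spelling rule, then drop the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for entry in LEXICON:
    print(entry, "->", to_surface(to_intermediate(*entry)))
# ('fox', '+N', '+Pl') -> foxes
# ('cat', '+N', '+Pl') -> cats
# ('goose', '+N', '+Sg') -> goose
# ('goose', '+N', '+Pl') -> geese
```

The irregular plural "geese" is stored directly in the lexicon, so the spelling rule never sees it; only regular forms such as "fox^s#" reach the e-insertion step.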
Word frequency and collocations
Content
Zipf’s Law
A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias
(dumps from October 2015) in a log-log scale.
(By SergioJimenez - Own work, CC BY-SA 4.0, from Wikipedia)
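Zipf's law states that the frequency of a word is roughly inversely proportional to its rank, so the product rank × frequency stays roughly constant and the log-log plot is close to a straight line. A minimal sketch for checking this on any tokenized text; the function name and the toy sentence are illustrative only, and a corpus this small cannot show the law itself:

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Toy input; replace with the tokens of a real corpus.
tokens = "the cat and the dog and the bird".split()
for rank, word, freq in rank_frequency(tokens):
    print(rank, word, freq, rank * freq)
```

On a large corpus, plotting log(rank) against log(frequency) from this table should give a curve like the one described in the caption above.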
Examples of collocations
noun phrases like strong tea and weapons of mass destruction
phrasal verbs like to make up, and other phrases like the rich and
powerful.
Valid or invalid?
a stiff breeze but not a stiff wind (while either a strong breeze or a
strong wind is okay).
broad daylight (but not bright daylight or narrow darkness).
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999
Non-Compositionality
Non-Substitutability
Frequency
Mean and Variance of Distances between Words
Hypothesis Testing
t-test
χ2 test
likelihood ratio test
Mutual Information
Left and Right Context Entropy
C-Value
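Three of the measures listed above (frequency, the t-test, and mutual information) can be computed from unigram and bigram counts alone. The sketch below assumes adjacent-word bigrams, maximum-likelihood probability estimates, and the common approximation of the sample variance by the bigram probability in the t statistic; the function name and the toy token list are hypothetical:

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (count, pmi, t)} for every adjacent word pair."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        p1 = unigrams[w1] / n
        p2 = unigrams[w2] / n
        p12 = count / n
        # Pointwise mutual information: log of observed over expected probability
        pmi = math.log2(p12 / (p1 * p2))
        # t statistic: (observed - expected) / sqrt(variance / n), variance ~ p12
        t = (p12 - p1 * p2) / math.sqrt(p12 / n)
        scores[(w1, w2)] = (count, pmi, t)
    return scores


# Toy input; meaningful scores need a large corpus.
tokens = ["strong", "tea", "strong", "tea", "weak", "coffee"]
count, pmi, t = bigram_scores(tokens)[("strong", "tea")]
print(count, round(pmi, 3), round(t, 3))
```

Pairs with high scores are candidate collocations; raw frequency favours common function-word pairs, while mutual information tends to overrate rare pairs, which is one reason several measures are listed.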
Summary
Content