NLP Course Lecture01 Short

ods.ai NLP course lecture 1

Natural Language Processing

Lecture 01: Introduction to NLP

Valentin Malykh, MTS AI

Autumn 2024
The course is delivered at ITMO University, Saint-Petersburg &
Bauman University, Moscow
Valentin Malykh & Qun Liu (MTS AI) Natural Language Processing Autumn 2024 1 / 77
Content

1 About the course

2 Research questions and NLP tasks

3 Grammars and Automata

4 Text segmentation and morphology analysis

5 Word frequency and collocations

About the course


Acknowledgements

Dr. Qun Liu, Huawei Noah’s Ark lab


Dr. Constantin Korikov, Huawei Noah’s Ark lab
Tasnima Sadekova, Huawei Noah’s Ark lab
Salavat Garifullin, ODS
students from previous runs


Logistics

Instructor: Dr. Valentin Malykh


TAs: Salavat Garifullin for ODS.ai & others
Lecture Time: 19.00, Thursdays
Seminar Time: 19.00, Tuesdays
Location: Online
Slides: will be available at the course platform before each class.


Grading policy

Quizzes – 8 x 4
Assignments – 25 x 2
Final project – 60


Course description

Natural Language Processing (NLP) is a domain of research
whose objective is to analyze and understand human languages
and to develop technologies that enable human-machine interaction
in natural language. NLP is an interdisciplinary field involving
linguistics, computer science and artificial intelligence. The goal
of this course is to provide students with comprehensive
knowledge of NLP. Students will be equipped with the principles
and theories of NLP, as well as various NLP technologies,
including rule-based, statistical and neural network ones. After
this course, students will be able to conduct NLP research and
develop state-of-the-art NLP systems.

Research questions and NLP tasks


Natural language processing in Wikipedia

Natural language processing (NLP) is a subfield of linguistics,
computer science, information engineering, and artificial
intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program
computers to process and analyze large amounts of natural
language data.


Synonyms of NLP

Computational Linguistics
Natural Language Processing
Natural Language Understanding
Human Language Processing

Subtleties
Computational Linguistics is regarded more as a branch of
linguistics, whose main purpose is to understand the mechanisms
of human languages by means of computation.

Natural Language Processing is a branch of computer science
and artificial intelligence, whose main purpose is to develop
technologies that enable human-computer interaction using human
languages.

Natural Language Understanding is one of the two main
challenges in Natural Language Processing, while the other is
Natural Language Generation.

Human Language Technologies mainly refer to NLP
technologies, but may also include other language-related
technologies, including speech technologies, optical character
recognition (OCR), computer typesetting, etc.


NLP as an interdisciplinary study


Understanding human languages is not easy

We are used to the fact that human beings can understand
each other through language.
Although acquiring language competence is a natural result of
human evolution, it seems a miracle given its complexity.
No other species on this planet uses language to the degree
that humans do.
The mechanism behind human languages is not fully understood.
Understanding human languages by computer is difficult.


Research questions

How do humans understand each other through language
communication?

Is it possible to simulate human language behaviors without
understanding language mechanisms?


The way of NLP research

Unlike linguists, who develop numerous theories to explain
language mechanisms, NLP researchers try to simulate human
language behaviors by computation, without necessarily understanding
the language mechanisms.


A brief history of NLP

1960s-1990s: Rule-based approaches

1990s-2010s: Statistical approaches

2010s-present: Neural network (deep learning) approaches


Holy grails of NLP

Accurate machine translation between human languages

Free conversation between humans and computers


Accurate machine translation


The Tower of Babel

Oil painting by Pieter Bruegel the Elder, 1563, from Wikipedia


Free human-machine conversation


Turing test

By Juan Alberto Sánchez Margallo, CC BY 2.5, from Wikipedia


Languages, human minds and the world


NLP Tasks


                  Classification   Tagging    Generation
n-grams           TF-IDF           regex      templates
word embeddings   word2vec         word2vec   -
CNN               CNN              CNN        ConvSeq2Seq
RNN               LSTM             LSTM       LSTM
Transformers      BERT             BERT       T5
LLM               LLaMa            LLaMa      LLaMa

Grammars and Automata


How can we define a language?


A language can be defined as the set of sentences that are
accepted by the speakers of that language.
It is not possible to define a natural language by enumerating all its
sentences, because the number of sentences in a natural
language is infinite.
Two feasible ways to define a language with infinitely many sentences:
By a Grammar
By an Automaton


Define a language by an automaton (1)

An automaton A is an abstract machine which:
Takes a symbol sequence S as input, and determines whether A will
accept or reject S.
Has a finite number of states and a finite number of actions.
At each time step, A is in a state, and points to a position in S.
The current state and current symbol determine the action which
A will execute, which determines the next state of A and the next
position in S that A will point to.
Given an input S, A will run until it stops, and the final state of A
determines whether A accepts or rejects S.


Define a language by an automaton (2)

A language L can be defined by an automaton A as follows:
A word sequence S is a sentence of L if and only if, when S is
given to A as input, A stops after a finite number of time steps in an
accept state.


Turing machine


A Turing machine consists of: (to be continued)


A tape divided into cells, one next to the other. Each cell contains a
symbol from some finite alphabet. The alphabet contains a special
blank symbol and some other symbols. The tape is assumed to be
arbitrarily extendable to the left and to the right.
A read/write head that can read and write symbols on the tape and
move the tape left and right one (and only one) cell at a time.


A Turing machine consists of: (continued)


A state register that stores the state of the Turing machine, one of
finitely many. Among these is the special start state with which the
state register is initialized.
A finite table of instructions that, given the state the machine is
currently in and the symbol it is reading on the tape, tells the
machine to do the following in sequence:
Either erase or write a symbol.
Move the head to the left or right cell.
Assume the same or a new state as prescribed.
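The components listed above can be sketched as a small simulator. This is a minimal illustration, not a definition from the lecture: the blank symbol `_`, the `(write, move, next_state)` instruction format, the step limit and the bit-flipping example table are all assumptions made for the example.

```python
from collections import defaultdict

BLANK = "_"  # assumed blank symbol

def run_turing_machine(table, tape, start, accept, max_steps=1000):
    """Run a one-tape machine; return (accepted, non-blank tape contents).

    table maps (state, symbol) -> (symbol_to_write, "L" or "R", next_state).
    Writing BLANK plays the role of erasing a cell.
    """
    # The tape is extendable in both directions: unseen cells read as BLANK.
    cells = defaultdict(lambda: BLANK, enumerate(tape))
    state, head = start, 0
    for _ in range(max_steps):
        if state in accept:
            break
        key = (state, cells[head])
        if key not in table:
            break  # no applicable instruction: the machine halts
        write, move, state = table[key]
        cells[head] = write
        head += 1 if move == "R" else -1
    used = [i for i, s in cells.items() if s != BLANK]
    out = "".join(cells[i] for i in range(min(used), max(used) + 1)) if used else ""
    return state in accept, out

# Hypothetical instruction table: flip every bit, accept at the first blank.
FLIP_TABLE = {
    ("flip", "0"): ("1", "R", "flip"),
    ("flip", "1"): ("0", "R", "flip"),
    ("flip", BLANK): (BLANK, "R", "done"),
}

print(run_turing_machine(FLIP_TABLE, "0110", "flip", {"done"}))
```

The step limit is only a practical guard for the sketch; a real Turing machine may run forever.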


Linear bounded automaton

A linear bounded automaton is a Turing machine that satisfies the
following three conditions:
Its input alphabet includes two special symbols, serving as left and
right endmarkers.
Its transitions may not print other symbols over the endmarkers.
Its transitions may neither move to the left of the left endmarker nor
to the right of the right endmarker.


Finite state automaton / machine (FSA/FSM)


A Finite State Automaton (FSA), or Finite State Machine (FSM),
consists of:
A finite number of states; the FSM is in exactly one state at
any given time;
A head which reads a symbol from a sequence of symbols as the
input. The head always moves to the next symbol at the next time
step;
A transition matrix which determines the next state of the FSM
according to the current state and the current symbol.
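As an illustration of the three components, here is a minimal sketch of a deterministic FSA. The transition matrix is stored as a dictionary keyed by (state, symbol); the state names and the toy language `baa+!` are assumptions chosen for the example, not taken from the slides.

```python
def fsa_accepts(transitions, start, accept_states, symbols):
    """Return True if the FSA ends in an accept state after reading all symbols."""
    state = start
    for symbol in symbols:              # the head moves one symbol per time step
        state = transitions.get((state, symbol))
        if state is None:               # no transition defined: reject
            return False
    return state in accept_states

# Hypothetical transition table for the toy language "baa+!"
# (a "b", at least two "a"s, then "!").
SHEEP = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(fsa_accepts(SHEEP, "q0", {"q4"}, "baaa!"))
print(fsa_accepts(SHEEP, "q0", {"q4"}, "ba!"))
```

The language defined by this automaton is exactly the set of strings for which `fsa_accepts` returns True.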

Text segmentation and morphology analysis


Text segmentation

In NLP, text is segmented into units of various granularities, which
include:
Chapters and sections;
Paragraphs;
Sentences;
Clauses;
Phrases;
Words;
Morphemes (stems, suffixes, prefixes).


Text segmentation is not straightforward in many cases:


For languages like Chinese, Japanese, Tibetan, Thai, there are no
spaces between words;
For languages like Thai and Tibetan, the delimiters between
sentences, clauses or phrases are ambiguous, which makes it hard
to segment sentences;
Even for English, sentence segmentation is not a trivial task,
because the full stop mark (.) is also used for abbreviations,
decimals, etc., which may or may not terminate a sentence.


Thai

Spaces are not reliable boundaries between sentences.


Chinese

There are no spaces between words.


English sentence segmentation

Dot marks (.) are ambiguous:


Full stop: This is an apple.
Decimal: 235.6
Abbreviations: U.S. Ph.D. etc.
A dot mark can take multiple roles: in "He comes from U.S." the final dot is both part of the abbreviation and the full stop.
To segment English text into sentences, we need to determine
whether a dot mark is an end of sentence or not.
It can be solved as a classification problem.
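To make the classification view concrete, the sketch below extracts a few features for every dot and applies a hand-written decision rule. The rule stands in for a trained classifier, and the abbreviation list, feature names and example sentence are assumptions made for illustration only.

```python
ABBREVIATIONS = {"u.s.", "ph.d.", "etc.", "ltd."}  # assumed toy list

def dot_features(text, i):
    """Features describing the dot at index i of text."""
    token_start = text.rfind(" ", 0, i) + 1
    token_end = text.find(" ", i)
    token = text[token_start:] if token_end == -1 else text[token_start:token_end]
    rest = text[i + 1:].lstrip()
    return {
        "is_abbreviation": token.lower() in ABBREVIATIONS,
        "inside_number": 0 < i < len(text) - 1
                         and text[i - 1].isdigit() and text[i + 1].isdigit(),
        "next_is_upper": bool(rest) and rest[0].isupper(),
        "at_end": not rest,
    }

def is_sentence_end(features):
    """Hand-written stand-in for a classifier over the features above."""
    if features["inside_number"]:
        return False
    if features["at_end"]:
        return True
    if features["is_abbreviation"]:
        return False  # known limitation: misses "... the U.S. It is ..."
    return features["next_is_upper"]

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and is_sentence_end(dot_features(text, i)):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("He paid 235.6 dollars in the U.S. last year. This is an apple."))
```

In practice the decision function would be learned from labeled examples, which is what lets it handle a dot that is both an abbreviation marker and a full stop.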

English sentence segmentation as a classification task


Chinese word segmentation

http://what-when-how.com/how-to-build-a-digital-library/word-segmentation-and-sorting-digital-library/

Different segmentations of the same Chinese text may result in different meanings.

Chinese word segmentation as a character tagging task

Wang & Xu, Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation, IJCNLP 2017

Tags:
S: single character word
B: beginning character of a word
M: middle character of a word
E: end character of a word
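The tag scheme can be illustrated by converting between a segmented sentence and its per-character tags. This is a minimal sketch of the encoding only; the example sentence is an assumption, and the tagger that would predict the tags is not shown.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one S/B/M/E tag per character."""
    tags = []
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode per-character tags back into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):        # a word ends at S or E
            words.append(current)
            current = ""
    if current:                      # tolerate a sequence that ends mid-word
        words.append(current)
    return words

segmented = ["我", "喜欢", "自然语言"]   # assumed example: "I / like / natural language"
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

Because the encoding is reversible, predicting one tag per character is equivalent to predicting the segmentation.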


English word segmentation - Tokenization


An example from the Stanford Tokenizer

Input
Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through three
rounds of last month’s U.S. Open. H.J. Heinz Company said it completed
the sale of its Ore-Ida frozen-food business catering to the service
industry to McCain Foods Ltd. for about $500 million. It’s the first group
action of its kind in Britain and one of only a handful of lawsuits against
tobacco companies outside the U.S.

Output
Another ex-Golden Stater , Paul Stankowski from Oxnard , is contending
for a berth on the U.S. Ryder Cup team after winning his first PGA Tour
event last year and staying within three strokes of the lead through three
rounds of last month ’s U.S. Open . H.J. Heinz Company said it
completed the sale of its Ore-Ida frozen-food business catering to the
service industry to McCain Foods Ltd. for about $ 500 million . It ’s the
first group action of its kind in Britain and one of only a handful of
lawsuits against tobacco companies outside the U.S. .

Morphological analysis

To break a word down into its component morphemes and build a
structured representation.
A morpheme is the minimal meaning-bearing unit in a language.
Stem: the morpheme that forms the central meaning unit in a word
Affix: prefix, suffix, infix, circumfix
Prefix: e.g., possible → impossible
Suffix: e.g., walk → walking
Infix: e.g., hingi → humingi (Tagalog)
Circumfix: e.g., sagen → gesagt (German)
a slide from UW LING 570 by Fei Xia


Two slightly different tasks

Stemming:
Ex: writing → writ + ing
Lemmatization:
Ex1: writing → write +V +Prog
Ex2: books → book +N +Pl
Ex3: writes → write +V +3Per +Sg
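The difference between the two tasks can be shown with a toy sketch: the stemmer only strips suffixes, while the lemmatizer returns a dictionary form plus morphological features. The suffix list, the minimum-length check and the lookup table are assumptions for illustration; real stemmers and lemmatizers are considerably more elaborate.

```python
SUFFIXES = ("ing", "es", "s")  # assumed toy suffix list, tried in this order

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table covering only the examples on the slide.
LEMMA_TABLE = {
    "writing": ("write", ["+V", "+Prog"]),
    "books": ("book", ["+N", "+Pl"]),
    "writes": ("write", ["+V", "+3Per", "+Sg"]),
}

def lemmatize(word):
    """Return (lemma, features); unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, (word, []))

for w in ("writing", "books", "writes"):
    print(w, stem(w), lemmatize(w))
```

Note that the stem `writ` is not a word, whereas the lemma `write` is.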


Ambiguity in morphology

flies → fly +N +PL


flies → fly +V +3rd +Sg
a slide from UW LING 570 by Fei Xia


Language variation

Analytic languages: e.g., Chinese; English is a language with an
analytic tendency.
Synthetic fusional (inflectional) languages: e.g., Russian
Synthetic agglutinative languages: e.g., Turkish


Ways to combine morphemes to form words

Inflection: stem + gram. morpheme → same class
Ex: help + ed → helped
Derivation: stem + gram. morpheme → different class
Ex: civil + -ization → civilization
Compounding: multiple stems
Ex: cabdriver, doghouse
Cliticization: stem + clitic
Ex: they’ll, she’s (*I don’t know who she is)
a slide from UW LING 570 by Fei Xia


Finite state transducers (FSTs)

Finite State Transducers are an extension of Finite State
Machines in which an output symbol is produced for each input
symbol.
FSTs are commonly used tools for morphological analysis.
An FST can be used in the inverse direction, with the input and the
output swapped.
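A minimal sketch of a transducer and its inversion follows. Arcs are stored as `(source, input, output, target)` tuples and an empty string stands for an epsilon (no symbol); this representation, the state numbers and the toy arcs for `cat` are assumptions for illustration, and the search assumes there are no epsilon loops.

```python
def transduce(arcs, start, finals, symbols):
    """Return every output sequence the FST produces for the input symbols."""
    results = []
    stack = [(start, 0, [])]                 # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols) and state in finals:
            results.append([s for s in out if s])   # drop epsilon outputs
        for src, inp, outp, dst in arcs:
            if src != state:
                continue
            if inp == "":                    # epsilon arc: consume no input
                stack.append((dst, pos, out + [outp]))
            elif pos < len(symbols) and symbols[pos] == inp:
                stack.append((dst, pos + 1, out + [outp]))
    return results

def invert(arcs):
    """Swap input and output labels to run the FST in the inverse direction."""
    return [(src, outp, inp, dst) for src, inp, outp, dst in arcs]

# Hypothetical arcs mapping lexical symbols to an intermediate form.
CAT_ARCS = [
    (0, "cat", "cat", 1),
    (1, "+N", "", 2),
    (2, "+PL", "^s#", 3),
    (2, "+Sg", "#", 3),
]

print(transduce(CAT_ARCS, 0, {3}, ["cat", "+N", "+PL"]))
print(transduce(invert(CAT_ARCS), 0, {3}, ["cat", "^s#"]))
```

The same arcs therefore serve for generation (lexical to intermediate) and, after inversion, for analysis.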


English morphology

Affixes: prefixes, suffixes; no infixes, no circumfixes.


Inflectional:
Noun: -s
Verbs: -s, -ing, -ed, -ed
Adjectives: -er, -est
Derivational:
Ex: V + suf → N
computerize + -ation → computerization
kill + er → killer
Compound: pickup, database, heartbroken, etc.
Cliticization: ’m, ’ve, ’re, etc.
a slide from UW LING 570 by Fei Xia


Three components

Lexicon: the list of stems and affixes, with associated features.


Ex1: book: N
Ex2: -s: +PL
Morphotactics:
Ex: +PL follows a noun
Orthographic rules (spelling rules): to handle exceptions that can
be dealt with by rules.
Ex3: ϵ → e / x ˆ _ s#
a slide from UW LING 570 by Fei Xia


Rewrite rules


An example

a slide from UW LING 570 by Fei Xia


An FST

cat +N +PL → cat ˆs #


cat +N +Sg → cat #
a slide from UW LING 570 by Fei Xia


Expanding FST

fox +N + Pl → fox ˆ s #
cat +N + Pl → cat ˆ s #
goose +N +Sg → goose #
goose +N +Pl → geese #
a slide from UW LING 570 by Fei Xia


Representing orthographic rules as FSTs

ϵ → e / (s|x|z) ˆ _ s #
Input: ...(s|x|z) ˆs # intermediate level
Output: ...(s|x|z)es # surface level
To reject (fox ˆs, foxs)
a slide from UW LING 570 by Fei Xia
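The effect of the e-insertion rule can be approximated with a regular expression. This is a sketch of the input/output behavior only, not of the transducer itself; it uses the ASCII caret `^` for the morpheme boundary and keeps the word-end marker `#`, both of which are choices made for the example.

```python
import re

# Morpheme boundary "^" preceded by s, x or z and followed by "s#".
E_INSERTION = re.compile(r"(?<=[sxz])\^(?=s#)")

def to_surface(intermediate):
    """Map an intermediate form such as "fox^s#" to its surface form."""
    with_e = E_INSERTION.sub("e", intermediate)   # fox^s# -> foxes#
    return with_e.replace("^", "")                # remaining boundaries vanish

for form in ("fox^s#", "cat^s#", "fox^z#", "goose#"):
    print(form, "->", to_surface(form))
```

Replacing the boundary by `e` in the matching context gives the same surface string as inserting `e` after the boundary and then deleting the boundary.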


(fox, fox): q0, q0, q0, q1


(fox#, fox#): q0, q0, q0, q1, q0
(foxˆz#, foxz#): q0, q0, q0, q1, q2, q1, q0
(foxˆs#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0
(foxˆs, foxs): q0, q0, q0, q1, q2, q5
a slide from UW LING 570 by Fei Xia


Further reading on morphological analysis

Fei Xia, slides on morphological analysis


https://www.powershow.com/viewfl/6a39a-ZDc1Z/Morphological_analysis_powerpoint_ppt_presentation

Mans Hulden (2011), Morphological analysis with FSTs


https://fomafst.github.io/morphtut.html

Word frequency and collocations


Top 5000 words in American English

Statistics from the Corpus of Contemporary American English


http://www.wordfrequency.info/


Zipf’s Law

The frequency of any word is inversely proportional to its rank in
the frequency table:
$$p(w_r) \propto \frac{1}{r}$$
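A rank-frequency table of the kind Zipf's law describes can be computed in a few lines. This is a minimal sketch on an assumed toy token list; on a real corpus one would tokenize the text first and inspect whether rank times relative frequency stays roughly constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, relative frequency) rows, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [
        (rank, word, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

tokens = "the the the the of of a".split()   # assumed toy corpus
for rank, word, p in rank_frequency(tokens):
    # Under an exact Zipf distribution, rank * p would be the same in every row.
    print(rank, word, round(p, 3), round(rank * p, 3))
```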


A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias
(dumps from October 2015) in a log-log scale.
(By SergioJimenez - Own work, CC BY-SA 4.0, from Wikipedia)


Collocation or multi-word expression (MWE)

A COLLOCATION is an expression consisting of two or more


words that correspond to some conventional way of saying things.
The words together can mean more than the sum of their parts
The Times of India, disk drive
hot dog, mother in law


Examples of collocations
noun phrases like strong tea and weapons of mass destruction
phrasal verbs like to make up, and other phrases like the rich and
powerful.
Valid or invalid?
a stiff breeze but not a stiff wind (while either a strong breeze or a
strong wind is okay).
broad daylight (but not bright daylight or narrow darkness).
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Criteria for collocations (or MWE)

Typical criteria for collocations:


non-compositionality
non-substitutability
non-modifiability.
Collocations usually cannot be translated into other languages
word by word.
A phrase can be a collocation even if it is not consecutive (as in
the example knock ... door).
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Non-Compositionality

A phrase is compositional if the meaning can be predicted from


the meaning of the parts.
E.g. new companies
A phrase is non-compositional if the meaning cannot be predicted
from the meaning of the parts
E.g. hot dog
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Collocations are not necessarily fully compositional in that there is


usually an element of meaning added to the combination.
E.g. strong tea
Idioms are the most extreme examples of non-compositionality
E.g. to hear it through the grapevine
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Non-Substitutability

We cannot substitute near-synonyms for the components of a


collocation.
For example
We can’t say yellow wine instead of white wine even though yellow
is as good a description of the color of white wine as white is (it is
kind of a yellowish white).
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Non-Modifiability

Many collocations cannot be freely modified with additional lexical


material or through grammatical transformations
(Non-modifiability).
E.g. white wine, but not whiter wine
E.g. mother in law, but not mother in laws
Manning & Schütze, Foundations of Statistical Natural Language Processing, 1999

Metrics for Collocation or MWE Extraction

Frequency
Mean and Variance of Distances between Words
Hypothesis Testing
t-test
χ2 test
likelihood ratio test
Mutual Information
Left and Right Context Entropy
C-Value
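As one example of these metrics, the sketch below scores adjacent word pairs by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated by maximum likelihood from raw counts. The toy token list is an assumption; on small data PMI is known to overrate rare pairs, so a frequency threshold is usually combined with it.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi                       # joint probability of the pair
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "hot dog is a hot dog".split()           # assumed toy corpus
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```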


Further reading on collocation and MWE

Manning & Schütze, Foundations of Statistical Natural Language
Processing, 1999, Chapter 5 (A general introduction to collocation)
Katerina T. Frantzi, Sophia Ananiadou, Junichi Tsujii, The C-value /
NC-value Method of Automatic Recognition for Multi-word Terms, ECDL
1998: Research and Advanced Technology for Digital Libraries pp
585-604 (proposed the C-value metric)
Zhiyong Luo, Rou Song, An integrated method for Chinese unknown
word extraction, SIGHAN 2004. Barcelona, Spain. (proposed the
context entropy method)

Summary

Content

1 About the course

2 Research questions and NLP tasks

3 Grammars and Automata

4 Text segmentation and morphology analysis

5 Word frequency and collocations
