SPR 08 Algorithms
Linguistic Algorithms
Speech Recognition 1 of 41
Contents
• Normalization
• Maximum Matching Word Segmentation
• Byte Pair Encoding
• Word Distance
• Porter's Stemming Algorithm
Speech Recognition 2 of 41
Normalization
Need to normalize tokens.
Information Retrieval: indexed text & query terms must have the same form.
We implicitly define equivalence classes of tokens.
Remove periods etc.: we probably want to match U.S.A. and USA.
Asymmetric expansion is also needed sometimes:
Enter: window   Search: window, windows
Enter: windows  Search: Windows, windows, window
Enter: Windows  Search: Windows
Figure 1: Example from the article "Text Normalization Using Encoder–Decoder Networks Based on the Causal Feature Extractor", Adrián Javaloy and Ginés García-Mateos
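As a rough illustration (not part of the original slides), a minimal Python sketch of such an equivalence-class mapping could strip periods and fold case; note that plain case folding loses the asymmetric behaviour shown above:

def normalize(token):
    """Map a raw token to a canonical form (illustrative sketch only).

    - strips periods, so "U.S.A." and "USA" fall into the same equivalence class
    - lower-cases, so "Windows" and "windows" match (this ignores the asymmetric
      expansion above, where "Windows" should only match "Windows")
    """
    token = token.replace(".", "")   # U.S.A. -> USA
    return token.lower()             # USA -> usa

print(normalize("U.S.A.") == normalize("USA"))      # True
print(normalize("Windows") == normalize("window"))  # False ("windows" != "window")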
Speech Recognition 3 of 41
Issues in Tokenization
Handling of apostrophes, minus, space and periods:
Finland’s capital → Finland Finlands Finland’s ?
what’re, I’m, isn’t → What are, I am, is not
Hewlett-Packard → Hewlett Packard ?
state-of-the-art → state of the art ?
Lowercase → lower-case lowercase lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??
One commonly used tokenization standard is known as the Penn Treebank tokenization
standard, used for the parsed corpora (treebanks) released by the Linguistic Data
Consortium (LDC), the source of many useful datasets.
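For illustration only, a toy regular-expression tokenizer (a made-up pattern, not the actual Penn Treebank rules) that encodes some of the choices above, keeping abbreviations and hyphenated words intact while splitting off clitics, could look like this:

import re

# Illustrative pattern, NOT the Penn Treebank standard:
#  - keeps abbreviations such as U.S.A. and hyphenated words such as state-of-the-art intact
#  - splits off clitics such as n't, 're, 'm (after a small pre-processing step below)
PATTERN = re.compile(r"""
      (?:[A-Za-z]\.)+          # abbreviations: U.S.A., m.p.h.
    | n't                      # negation clitic separated in tokenize()
    | \w+(?:-\w+)*             # words, optionally hyphenated: state-of-the-art
    | '\w+                     # other clitics: 're, 'm, 's
    | [^\w\s]                  # any remaining punctuation mark
""", re.VERBOSE)

def tokenize(text):
    # separate common clitics from their host word first (illustrative list, not exhaustive)
    text = re.sub(r"n't\b", " n't", text)                  # isn't   -> is n't
    text = re.sub(r"'(re|m|s|ve|ll|d)\b", r" '\1", text)   # what're -> what 're
    return PATTERN.findall(text)

print(tokenize("Hewlett-Packard isn't based in the U.S.A., what're you saying?"))
# ['Hewlett-Packard', 'is', "n't", 'based', 'in', 'the', 'U.S.A.', ',', 'what', "'re", 'you', 'saying', '?']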
Speech Recognition 4 of 41
Tokenization Language Issues
French:
L’ensemble → one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs compound splitter
Asian languages (Chinese, Japanese, Thai) are also difficult to tokenize, since words are not separated by spaces
Speech Recognition 5 of 41
Maximum Matching Word Segmentation Algorithm
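The slide's figure is not reproduced here; as a reference, a minimal Python sketch of greedy maximum matching (take the longest dictionary word starting at the current position, emit it, advance, repeat; the dictionary below is a made-up example) is:

def max_match(text, dictionary):
    """Greedy maximum matching: repeatedly take the longest dictionary word
    that starts at the current position; fall back to a single character
    if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):   # try the longest substring first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:                    # unknown character: emit it alone
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# toy example dictionary (made up for illustration)
words = {"we", "can", "canon", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", words))
# ['we', 'canon', 'l', 'y', 'see', 'a', 'short', 'distance', 'ahead']

The output "we canon l y see ..." also shows why the greedy strategy works poorly for English; it works much better for languages such as Chinese, where words are short.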
Speech Recognition 6 of 41
Example
Speech Recognition 7 of 41
Byte-pair Encoding Description
Some algorithms first learn from one (training) corpus and are then applied to a test corpus.
Byte-pair encoding (BPE), which can be used for tokenization, is one such algorithm.
Besides the Jurafsky book, see https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
It was originally developed for text compression.
Used for tokenization in Transformer models such as OpenAI's GPT and Facebook's BART
Speech Recognition 8 of 41
Description of BPE
Initialization of BPE:
1) The unique set of words (= word types) - probably after some normalization and
pre-tokenization of the corpus - is computed: The dictionary.
2) Then the symbols used in words are collected: The vocabulary.
3) Each word is represented as a sequence of characters plus a special end-of-word
symbol.
Steps in BPE:
1) At each step of the algorithm, the occurrences of each adjacent symbol pair are counted.
2) Then the most frequent pair ('A', 'B') is found and replaced with the new merged
symbol ('AB').
We continue to count and merge, creating new longer and longer character strings,
until we’ve done k merges; k is a parameter of the algorithm.
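A minimal sketch of one count-and-merge step (a simplification written for these slides, not the Sennrich et al. code referenced later), using the word counts of the Jurafsky & Martin example (low 5, lowest 2, newer 6, wider 3, new 2) and the end-of-word marker _:

from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs over all word types, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair (A, B) by the merged symbol AB."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# word types as character sequences plus the end-of-word marker "_"
vocab = {("l", "o", "w", "_"): 5,
         ("l", "o", "w", "e", "s", "t", "_"): 2,
         ("n", "e", "w", "e", "r", "_"): 6,
         ("w", "i", "d", "e", "r", "_"): 3,
         ("n", "e", "w", "_"): 2}
best = count_pairs(vocab).most_common(1)[0][0]
print(best)                       # most frequent pair, here ('e', 'r') with 9 occurrences
vocab = merge_pair(vocab, best)   # 'er' is now treated as a single symbol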
Speech Recognition 9 of 41
Example for BPE (1/5)
Training corpus contains words low, new, newer, but not lower.
Speech Recognition 10 of 41
Example for BPE (2/5)
Pairs of symbols are counted
Most frequent is the pair e r (9 occurrences)
These symbols are merged, creating a new symbol er that is treated as a single symbol
Count again:
Speech Recognition 11 of 41
Example for BPE (3/5)
Speech Recognition 12 of 41
Example for BPE (4/5)
Most frequent is now the pair n e (8 occurrences) -> merge
Count again:
Speech Recognition 13 of 41
Example for BPE (5/5)
Speech Recognition 14 of 41
Apply BPE on Test Sentence
After the tokens have been learned with training data, a test sentence can be tokenized:
• The learned merges are used by the token segmenter greedily on the test data in
the order they were learned. (Thus the frequencies in the test data don’t play a
role, just the frequencies in the training data).
• E.g. think of the test sentence The newer the lower.
• First, each word in a test sentence is segmented into characters.
• Then apply first rule: replace every instance of e r in the test corpus with er.
• Then second rule: replace every instance of er _ in the test corpus with er_ ,
and so on.
• By the end, if the test corpus contained the character sequence n e w e r, it
would be tokenized as a full word.
• But the characters of a new (unknown) word like l o w e r would be merged
into the two tokens low er_ .
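A sketch of this greedy replay of the learned merges on a single test word (the helper and the merge list below are illustrative assumptions, listed in the order the merges would plausibly have been learned):

def apply_merges(word, merges):
    """Tokenize one word by replaying the learned merges in the order they were
    learned (frequencies in the test data play no role)."""
    symbols = list(word) + ["_"]          # split into characters + end-of-word marker
    for a, b in merges:                   # merges: list of learned pairs, in order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # merge in place
            else:
                i += 1
    return symbols

# merges as they might have been learned from the training corpus above
merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("new", "er_"), ("l", "o"), ("lo", "w")]
print(apply_merges("newer", merges))   # ['newer_']      known word -> one token
print(apply_merges("lower", merges))   # ['low', 'er_']  unknown word -> two tokens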
Speech Recognition 15 of 41
Byte-pair Encoding Python Code
https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
Another description is the Python code for the BPE learning algorithm from Sennrich et al. (2016): Neural Machine
Translation of Rare Words with Subword Units,
https://www.aclweb.org/anthology/P16-1162/
(Hint: OOV is short for Out-Of-Vocabulary, i.e. a test word that does not occur in the training vocabulary)
An implementation can also be found on Github:
https://github.com/rsennrich/subword-nmt
Speech Recognition 16 of 41
Calculating the Similarity of Words
Application examples:
• Spell correction
• The user typed “graffe”. Which is closest? graf, graft, grail, giraffe
• Computational Biology
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
see https://textmining.wp.hs-hannover.de/UnscharfeSuche.html
def ngram(string, n):
    """Return the list of all character n-grams of the string."""
    mylist = []
    if n <= len(string):                    # <= so that a string of length n yields itself
        for p in range(len(string) - n + 1):
            tg = string[p:p+n]              # n-gram starting at position p
            mylist.append(tg)
    return mylist
Speech Recognition 19 of 41
jaccard function
def jaccard(A, B):
    """Jaccard similarity of two n-gram lists: |intersection| / |union|."""
    intersection = 0
    for a in A:
        if a in B:
            intersection += 1
    union = len(A) + len(B) - intersection
    return float(intersection) / float(union)
Speech Recognition 20 of 41
Example Usage
# Check:
>>> str1 = 'intention'
>>> str2 = 'execution'
>>> n1 = ngram(str1,3)
>>> print(n1) # ['int', 'nte', 'ten', 'ent', 'nti', 'tio', 'ion']
>>> n2 = ngram(str2,3)
>>> print(n2) # ['exe', 'xec', 'ecu', 'cut', 'uti', 'tio', 'ion']
>>> print(jaccard(n1, n1))
1.0
>>> print(jaccard(n1, n2))
0.16666666666666666
Speech Recognition 21 of 41
Word Distance
Speech Recognition 22 of 41
Word Distance Principle
Instead of counting the operations, the two words are aligned.
At the start, one pointer is placed at the beginning of each word.
Every operation has a fixed cost, e.g. 1 (cost unit) for insertion and deletion and 2 (or alternatively 1) for substitution.
The goal is to get both pointers to the end of the words with minimum costs.
If both pointers are moved forward reading the same character, there are no costs.
If both pointers are moved forward reading different characters, the substitution costs.
If the first pointer is moved one position and the second stays, the cost for a deletion is taken.
If the second pointer is moved and the first one stays on its position, a character is inserted.
For each cell of the distance matrix, the minimum cost over the three possible operations is used.
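A minimal dynamic-programming sketch of this principle, with cost 1 for insertion and deletion and cost 2 for substitution:

def edit_distance(a, b, sub_cost=2):
    """Minimum edit distance between a and b with insertion/deletion cost 1
    and substitution cost sub_cost (matching characters cost nothing)."""
    m, n = len(a), len(b)
    # d[i][j] = minimum cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(1, n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (0 if same else sub_cost))   # match / substitution
    return d[m][n]

print(edit_distance("intention", "execution"))   # 8 (with substitution cost 2)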
Speech Recognition 23 of 41
Word Distance Visualization
• E.g. first pointer uses the word in the
first row (Geschichte), second
pointer first column (Gesichtet).
• Horizontal moves (within a row) then indicate deletions from the first word, e.g. at row 5 (s), column 6 (c), the c is deleted.
• Vertical moves (down a column) indicate insertions, e.g. at row 6, column 5, an i is inserted.
Speech Recognition 24 of 41
Stemming
Speech Recognition 25 of 41
Porter's Algorithm Introduction and Definitions
The most common stemmer for English
see https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus
Original Paper: https://tartarus.org/martin/PorterStemmer/
Definitions:
Consonant: a letter in a word other than A, E, I, O or U, and other than Y preceded by a consonant.
Vowel: a letter that is not a consonant.
Examples:
In TOY the consonants are T and Y (because the non-consonant O precedes Y).
In SYZYGY they are S, Z and G.
Speech Recognition 26 of 41
Porter's Algorithm Description
Speech Recognition 27 of 41
Porter's Algorithm Lists
A consonant will be denoted by c, a vowel by v. A list ccc… of length greater than 0 will be denoted by
C, and a list vvv… of length greater than 0 will be denoted by V. Any word, or part of a word, therefore
has one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V
These may all be represented by the single form
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written as
[C](VC){m}[V].
m will be called the measure of any word or word part when represented in this form.
Speech Recognition 28 of 41
Measures Examples
The case m = 0 covers the null word. Here are some examples:
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
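A small sketch (assumed helper functions, not Porter's original code) that computes the measure m via the [C](VC){m}[V] form:

def is_consonant(word, i):
    """Consonant per Porter's definition: not A, E, I, O, U, and not a Y
    that is preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number of VC sequences, i.e. the m in [C](VC){m}[V]."""
    # map the word to a C/V pattern, e.g. TROUBLE -> "CCVVCCV"
    pattern = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    # collapse runs of equal letters: "CCVVCCV" -> "CVCV"
    collapsed = "".join(c for i, c in enumerate(pattern) if i == 0 or c != pattern[i - 1])
    return collapsed.count("VC")

for w in ["TREE", "TROUBLE", "OATS", "PRIVATE", "ORRERY"]:
    print(w, measure(w))   # TREE 0, TROUBLE 1, OATS 1, PRIVATE 2, ORRERY 2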
Speech Recognition 29 of 41
Rules and Conditions
Speech Recognition 30 of 41
Conditions
The ‘condition’ part may also contain the following:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that
(m>1 and (*S or *T))
tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z))
tests for a stem ending with a double consonant other than L, S or Z.
Elaborate conditions like this are required only rarely.
Speech Recognition 31 of 41
Porter's Algorithm Step 1a
The rules in a step are examined in sequence, and only one rule from a step can be
applied.
In a set of rules written beneath each other, only one is obeyed, and this will be the
one with the longest matching S1 for the given word.
Step 1 deals with plurals and past participles.
Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
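As an illustration (Step 1a only, which has no m-conditions; not the full stemmer), the longest-match rule selection can be sketched as:

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_step(word, rules):
    """Apply the single rule whose suffix S1 is the longest match for the word."""
    best = None
    for s1, s2 in rules:
        if word.endswith(s1) and (best is None or len(s1) > len(best[0])):
            best = (s1, s2)
    if best is None:
        return word
    s1, s2 = best
    return word[:len(word) - len(s1)] + s2

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", apply_step(w, STEP_1A))
# caresses -> caress, ponies -> poni, ties -> ti, caress -> caress, cats -> cat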
Speech Recognition 32 of 41
Porter's Algorithm Step 1b
Speech Recognition 33 of 41
Porter's Algorithm Step 2
Subsequent steps are more straightforward:
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
Speech Recognition 34 of 41
Porter's Algorithm Step 3
Speech Recognition 35 of 41
Porter's Algorithm Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
...
The suffixes are now removed. All that remains is a little tidying up.
Speech Recognition 36 of 41
Porter's Algorithm Step 5a
Speech Recognition 37 of 41
Porter's Algorithm Step 5b and Sequence
Speech Recognition 38 of 41
Porter's Algorithm Examples
Example 1:
GENERALIZATIONS → GENERALIZATION (Step 1)
GENERALIZATION → GENERALIZE (Step 2)
GENERALIZE → GENERAL (Step 3)
GENERAL → GENER (Step 4)
Example 2:
OSCILLATORS → OSCILLATOR (Step 1)
OSCILLATOR → OSCILLATE (Step 2)
OSCILLATE → OSCILL (Step 4)
OSCILL → OSCIL (Step 5)
Speech Recognition 39 of 41
Porter's Algorithm Evaluation
see https://tartarus.org/martin/PorterStemmer/def.txt
Suffix stripping of a vocabulary of 10,000 words
------------------------------------------------
Number of words reduced in step 1: 3597
Number of words reduced in step 2:  766
Number of words reduced in step 3:  327
Number of words reduced in step 4: 2424
Number of words reduced in step 5: 1373
Number of words not reduced:       3650
Speech Recognition 40 of 41
Thank you! Questions?
Speech Recognition 41 of 41