SPR 08 Algorithms

The document discusses various linguistic algorithms used in speech recognition, including normalization, tokenization, maximum matching word segmentation, byte pair encoding, word distance, and Porter's stemming algorithm. It highlights the importance of normalization for consistent token representation and outlines methods for segmenting words and calculating similarities. Additionally, it provides examples and Python code for implementing these algorithms.


Speech Recognition

Linguistic Algorithms

Prof. Dr.-Ing. Udo Garmann

DIT Faculty of Computer Science

Speech Recognition 1 of 41
Contents

• Normalization
• Maximum Matching Word Segmentation
• Byte Pair Encoding
• Word Distance
• Porter's Stemming Algorithm

Speech Recognition 2 of 41
Normalization
Need to normalize tokens.
Information Retrieval: indexed text & query terms must have the same form.
We implicitly define equivalence classes of tokens.
Remove periods etc.: We probably want to match U.S.A. and USA.
Asymmetric expansion is also needed sometimes:
Enter: window → Search: window, windows
Enter: windows → Search: Windows, windows, window
Enter: Windows → Search: Windows

Figure 1: Example from the article “Text Normalization Using Encoder–Decoder Networks Based on the Causal Feature Extractor” by Adrián Javaloy and Ginés García-Mateos
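A minimal Python sketch (not from the original slides) of the equivalence-class idea: every token is mapped to a canonical form, here simply by case folding and removing periods; the function name and the exact rules are illustrative assumptions.

def normalize(token):
    # Map a token to a canonical representative of its equivalence class.
    # Illustrative rules only: drop periods and fold case, so that
    # 'U.S.A.'/'USA' and 'Windows'/'windows' fall into the same class.
    return token.replace(".", "").lower()

# Index terms and query terms are normalized with the same function:
print(normalize("U.S.A.") == normalize("USA"))        # True
print(normalize("Windows") == normalize("windows"))   # True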

Speech Recognition 3 of 41
Issues in Tokenization
Handling of apostrophes, minus, space and periods:
Finland’s capital → Finland Finlands Finland’s ?
what’re, I’m, isn’t → What are, I am, is not
Hewlett-Packard → Hewlett Packard ?
state-of-the-art → state of the art ?
Lowercase → lower-case lowercase lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??
One commonly used tokenization standard is known as the Penn Treebank tokenization standard, used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC), the source of many useful datasets.
Speech Recognition 4 of 41
Tokenization Language Issues
French:
L’ensemble → one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs compound splitter
Asian languages (Chinese, Japanese, Thai) are also difficult to tokenize, since words are not separated by spaces

Speech Recognition 5 of 41
Maximum Matching Word Segmentation Algorithm

Given a wordlist/dictionary of German words, and a string of a word that needs to be segmented:
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2
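A minimal Python sketch of this greedy procedure; the case-insensitive matching and the skipping of unmatched characters (such as the linking 's') are assumptions based on the example on the next slide.

def max_match(text, dictionary):
    # Greedy left-to-right maximum matching word segmentation.
    words = {w.lower() for w in dictionary}
    longest = max(len(w) for w in words)
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        match = None
        # 2) find the longest dictionary word starting at the pointer
        for length in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + length] in words:
                match = text[i:i + length]
                break
        if match:
            tokens.append(match)
            i += len(match)       # 3) move the pointer over the matched word
        else:
            i += 1                # no match: skip one character (e.g. a linking 's')
    return tokens

print(max_match("Lebensversicherungsgesellschaftsangestellter",
                ["leben", "versicherung", "gesellschaft", "angestellter", "mitarbeiter", "auto"]))
# -> ['leben', 'versicherung', 'gesellschaft', 'angestellter']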

Speech Recognition 6 of 41
Example

Dictionary: [leben, versicherung, gesellschaft, angestellter, mitarbeiter, auto, …]


String: ‘Lebensversicherungsgesellschaftsangestellter’

1) Pointer at first letter L


2) Longest Word in Dictionary (case-insensitive): leben
3) Move pointer to letter s
4) No match with dictionary words
5) Move pointer to letter v
6) Longest Word in Dictionary (case-insensitive): versicherung
7) repeat …

Modern probabilistic segmentation algorithms work even better.

Speech Recognition 7 of 41
Byte-pair Encoding Description

Some algorithms first learn from one (training) corpus and are then applied to a test corpus.
So does Byte-Pair Encoding (BPE), which can be used for tokenization.
Besides the Jurafsky book, see https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
It was originally developed for text compression.
It is used for tokenization in Transformer models at OpenAI (GPT) and Facebook (BART).

Speech Recognition 8 of 41
Description of BPE
Initialization of BPE:
1) The unique set of words (= word types) - probably after some normalization and
pre-tokenization of the corpus - is computed: The dictionary.
2) Then the symbols used in words are collected: The vocabulary.
3) Each word is represented as a sequence of characters plus a special end-of-word
symbol.
Steps in BPE:
1) At each step of the algorithm, the frequency of each adjacent symbol pair is counted.
2) Then find the most frequent pair (‘A’, ‘B’) and replace it with the new merged symbol (‘AB’).
We continue to count and merge, creating new longer and longer character strings, until we’ve done k merges; k is a parameter of the algorithm.
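A minimal sketch of this learning loop (variable names are illustrative; the reference implementation by Sennrich et al. is linked on a later slide):

from collections import Counter

def learn_bpe(word_freqs, k):
    # Learn k BPE merges from a dictionary {word: frequency}.
    # Each word is a tuple of characters plus the end-of-word symbol '_'.
    corpus = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(k):
        pairs = Counter()                         # 1) count adjacent symbol pairs
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # 2) most frequent pair
        merges.append((a, b))
        new_corpus = {}                           # replace it by the merged symbol 'AB'
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

# The toy corpus used on the following slides (ties broken by first occurrence here):
print(learn_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}, 3))
# -> [('e', 'r'), ('er', '_'), ('n', 'e')]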
Speech Recognition 9 of 41
Example for BPE (1/5)
Training corpus contains words low, new, newer, but not lower.

freq  dictionary       vocabulary
5     l o w _          _, d, e, i, l, n, o, r, s, t, w
2     l o w e s t _
6     n e w e r _
3     w i d e r _
2     n e w _

The algorithm runs inside words. Input: dictionary of words with frequencies.
Starting vocabulary of 11 letters (including the end-of-word symbol _)

Speech Recognition 10 of 41
Example for BPE (2/5)
Pairs of symbols are counted
Most frequent is the pair e r (9 occurrences)
These symbols are merged, creating a new symbol er that is treated as a single symbol
Count again:

freq  dictionary       vocabulary
5     l o w _          _, d, e, i, l, n, o, r, s, t, w, er
2     l o w e s t _
6     n e w er _
3     w i d er _
2     n e w _

Speech Recognition 11 of 41
Example for BPE (3/5)

Now the most frequent pair is er _ (9 occurrences)


Count again:

freq  dictionary       vocabulary
5     l o w _          _, d, e, i, l, n, o, r, s, t, w, er, er_
2     l o w e s t _
6     n e w er_
3     w i d er_
2     n e w _

Speech Recognition 12 of 41
Example for BPE (4/5)
Most frequent is now the pair n e (8 occurrences) -> merge
Count again:

freq  dictionary       vocabulary
5     l o w _          _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2     l o w e s t _
6     ne w er_
3     w i d er_
2     ne w _

Speech Recognition 13 of 41
Example for BPE (5/5)

The next merges of 2 symbols with the current vocabulary are:


(ne, w)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)  _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)    _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_

Speech Recognition 14 of 41
Apply BPE on Test Sentence
After the tokens have been learned with training data, a test sentence can be tokenized:
• The learned merges are used by the token segmenter greedily on the test data in
the order they were learned. (Thus the frequencies in the test data don’t play a
role, just the frequencies in the training data).
• E.g. think of the test sentence The newer the lower.
• First, each word in a test sentence is segmented into characters.
• Then apply first rule: replace every instance of e r in the test corpus with er.
• Then second rule: replace every instance of er _ in the test corpus with er_ ,
and so on.
• By the end, if the test corpus contained the character sequence n e w e r, it
would be tokenized as a full word.
• But the characters of a new (unknown) word like l o w e r would be merged
into the two tokens low er_ .
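A minimal sketch of this greedy segmenter: the learned merges are replayed, in the order they were learned, on each word of the test data. The merge list below is the complete 8-merge list from the running example.

def apply_bpe(word, merges):
    # Tokenize one test word with the learned merges, applied in learned order.
    symbols = list(word) + ["_"]
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_")]
print(apply_bpe("newer", merges))  # -> ['newer_']       (seen in training)
print(apply_bpe("lower", merges))  # -> ['low', 'er_']   (unknown word)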

Speech Recognition 15 of 41
Byte-pair Encoding Python Code

Another description: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
Python code for the BPE learning algorithm is given in Sennrich et al. (2016): Neural Machine
Translation of Rare Words with Subword Units
https://www.aclweb.org/anthology/P16-1162/
(Hint: OOV is short for Out-Of-Vocabulary word, i.e. a test word not seen in training)
An implementation can also be found on Github:
https://github.com/rsennrich/subword-nmt

Speech Recognition 16 of 41
Calculating the Similarity of Words
Application examples:
• Spell correction
• The user typed “graffe”. Which is closest? graf, graft, grail, giraffe

• Computational Biology
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Resulting alignment:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Also for Machine Translation, Information Extraction, Speech Recognition


Speech Recognition 17 of 41
NGram Overlap
see
http://textmining.wp.hs-hannover.de/UnscharfeSuche.html
Several tutorials:
http://textmining.wp.hs-hannover.de
Words may be divided into trigrams (= groups of 3 letters).
When two words have many trigrams in common, they are similar as a whole.
For the similarity of the sets of trigrams, a measure like the Jaccard coefficient may be used. It is defined as:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
Speech Recognition 18 of 41
ngram function

see https://textmining.wp.hs-hannover.de/UnscharfeSuche.html
def ngram(string, n):
    # Return the list of all n-grams (substrings of length n) of string.
    mylist = []
    if n <= len(string):  # <= so that a word of exactly n letters yields one n-gram
        for p in range(len(string) - n + 1):
            tg = string[p:p+n]
            mylist.append(tg)
    return mylist

Speech Recognition 19 of 41
jaccard function

def jaccard(A, B):
    # Jaccard coefficient of two n-gram lists: |A ∩ B| / |A ∪ B|
    intersection = 0
    for a in A:
        if a in B:
            intersection += 1
    union = len(A) + len(B) - intersection
    return float(intersection) / float(union)

Speech Recognition 20 of 41
Example Usage

# Check:
>>> str1 = 'intention'
>>> str2 = 'execution'
>>> n1 = ngram(str1,3)
>>> print(n1) # ['int', 'nte', 'ten', 'ent', 'nti', 'tio', 'ion']
>>> n2 = ngram(str2,3)
>>> print(n2) # ['exe', 'xec', 'ecu', 'cut', 'uti', 'tio', 'ion']
>>> print(jaccard(n1, n1))
1.0
>>> print(jaccard(n1, n2))
0.16666666666666666

Speech Recognition 21 of 41
Word Distance

Also known as Levenshtein distance.


Calculate costs for transforming one word into another.
Operations: Insertion, Deletion, Substitution (with “price” for each)
also see http://textmining.wp.hs-hannover.de/UnscharfeSuche.html

Speech Recognition 22 of 41
Word Distance Principle
Instead of counting the operations, the two words are aligned.
At the beginning we have 2 pointers at the beginning of the words.
Every operation has a price, e.g. 1 (cost unit) for insertion and deletion and 2 (or also 1) for substitution.
The goal is to get both pointers to the end of the words with minimum cost.
If both pointers are moved forward reading the same character, there is no cost.
If both pointers are moved forward reading different characters, the substitution cost is charged.
If the first pointer is moved one position and the second stays, the cost of a deletion is charged.
If the second pointer is moved and the first one stays on its position, a character is inserted.
For each cell of the cost matrix, the minimum over the three operations is used; a diagonal move corresponds to a match or substitution.
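A minimal dynamic-programming sketch of this computation, with cost 1 for insertion and deletion and 2 for substitution by default, as above (the function name is an assumption):

def word_distance(a, b, ins=1, dele=1, sub=2):
    # d[i][j] = minimum cost to turn the first i characters of a
    #           into the first j characters of b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * dele
    for j in range(1, len(b) + 1):
        d[0][j] = j * ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + dele,                      # delete a[i-1]
                          d[i][j - 1] + ins,                       # insert b[j-1]
                          d[i - 1][j - 1] + (0 if same else sub))  # match / substitute
    return d[len(a)][len(b)]

print(word_distance("intention", "execution"))         # -> 8 (substitution costs 2)
print(word_distance("intention", "execution", sub=1))  # -> 5 (all costs 1)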

Speech Recognition 23 of 41
Word Distance Visualization
• E.g. the first pointer uses the word in the first row (Geschichte), the second pointer the word in the first column (Gesichtet).
• Moves along a row indicate deletions in the first word, e.g. row 5 (s), column 6 (c): the c is deleted.
• Moves along a column indicate insertions, e.g. row 6, column 5: an i is inserted.

Figure 2: Red values indicate minimum costs per step; all costs are 1.

Speech Recognition 24 of 41
Stemming

Reduce terms to their stems


Stemming is a crude chopping-off of affixes and is language dependent.
E.g., automate(s), automatic, automation are all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress; the sentence
'for example compressed and compression are both accepted as equivalent to compress'
becomes
'for exampl compress and compress ar both accept as equival to compress'
see https://en.wikipedia.org/wiki/Stemming

Speech Recognition 25 of 41
Porter's Algorithm Introduction and Definitions
The most common stemmer for English
see https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus
Original Paper: https://tartarus.org/martin/PorterStemmer/
Definitions:
Consonant: a letter in a word other than A, E, I, O or U, and other than Y preceded by a consonant.
Vowel: any letter that is not a consonant.
Examples:
In TOY the consonants are T and Y (because the non-consonant O precedes Y).
In SYZYGY they are S, Z and G (each Y is preceded by a consonant).

Speech Recognition 26 of 41
Porter's Algorithm Description

Porter's stemmer applies rules.


The shortening rules consist of pairs of conditions and derivations for various suffixes
(word endings). The rules are summarized in groups that are processed one after the
other. Only one rule may be applied from each group.
Example: The first group contains the suffix shortening rules “sses” -> “ss”, “ies” -> “i” and “s” -> “”, which lead to the derivations “libraries” -> “librari” and “Wikis” -> “Wiki”. A group that follows later contains the rule “y” -> “i”, so that, for example, the word “library” is reduced to the same stem (“library” -> “librari”).

Speech Recognition 27 of 41
Porter's Algorithm Lists
A consonant will be denoted by c, a vowel by v. A list ccc… of length greater than 0 will be denoted by
C, and a list vvv… of length greater than 0 will be denoted by V. Any word, or part of a word, therefore
has one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V
These may all be represented by the single form
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written as
[C](VC){m}[V].

m will be called the measure of any word or word part when represented in this form.

Speech Recognition 28 of 41
Measures Examples

The case m = 0 covers the null word. Here are some examples:
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
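A minimal sketch of how m could be computed: reduce the word to its consonant/vowel pattern and count the V-to-C transitions (helper names are assumptions; a complete implementation is in NLTK, linked on a later slide).

def is_consonant(word, i):
    # Consonant per the definition above: not A, E, I, O, U,
    # and not a Y that is preceded by a consonant.
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    # Number m of VC sequences in the form [C](VC){m}[V].
    pattern = ["c" if is_consonant(word, i) else "v" for i in range(len(word))]
    return sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev == "v" and cur == "c")

print([measure(w) for w in ["TR", "EE", "TREE", "Y", "BY"]])      # -> [0, 0, 0, 0, 0]
print([measure(w) for w in ["TROUBLE", "OATS", "TREES", "IVY"]])  # -> [1, 1, 1, 1]
print([measure(w) for w in ["TROUBLES", "PRIVATE", "OATEN"]])     # -> [2, 2, 2]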

Speech Recognition 29 of 41
Rules and Conditions

The rules for removing a suffix will be given in the form


(condition) S1 -> S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies the
given condition, S1 is replaced by S2.
The condition is usually given in terms of m, e.g.
(m > 1) EMENT ->
Here S1 is ‘EMENT’ and S2 is null. This would map REPLACEMENT to REPLAC,
since REPLAC is a word part for which m = 2.
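A minimal sketch of applying one such rule; it reuses the measure() function sketched two slides earlier, and CEMENT is an extra illustrative word that is not from the slides.

def apply_rule(word, s1, s2, min_m):
    # Replace suffix S1 by S2 if the stem before S1 satisfies (m > min_m).
    # Relies on measure() from the sketch on the 'Measures Examples' slide.
    if word.lower().endswith(s1.lower()):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_m:
            return stem + s2
    return word

print(apply_rule("REPLACEMENT", "EMENT", "", 1))  # -> 'REPLAC'  (m of REPLAC is 2)
print(apply_rule("CEMENT", "EMENT", "", 1))       # -> 'CEMENT'  (stem 'C' has m = 0)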

Speech Recognition 30 of 41
Conditions
The ‘condition’ part may also contain the following:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that
(m>1 and (*S or *T))
tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z))
tests for a stem ending with a double consonant other than L, S or Z.
Elaborate conditions like this are required only rarely.
Speech Recognition 31 of 41
Porter's Algorithm Step 1a
The rules in a step are examined in sequence, and only one rule from a step can be applied.
In a set of rules written beneath each other, only one is obeyed, and this will be the
one with the longest matching S1 for the given word.
Step 1 deals with plurals and past participles.
Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
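A minimal sketch of Step 1a, choosing the rule with the longest matching S1 (note that Step 1a has no m-conditions; the function name is an assumption):

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    # Apply the single rule of the step whose suffix S1 is the longest match.
    for s1, s2 in sorted(STEP_1A, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
# caresses -> caress, ponies -> poni, ties -> ti, caress -> caress, cats -> cat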

Speech Recognition 32 of 41
Porter's Algorithm Step 1b

(m>0) EED -> EE feed -> feed


agreed -> agree
(*v*) ED -> plastered -> plaster
bled -> bled
(*v*) ING -> motoring -> motor
sing -> sing

More rules are given as an add-on to Step 1b, and there is also a Step 1c.

Speech Recognition 33 of 41
Porter's Algorithm Step 2
Subsequent steps are more straightforward:
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate

Speech Recognition 34 of 41
Porter's Algorithm Step 3

(m>0) ICATE -> IC triplicate -> triplic


(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good

Speech Recognition 35 of 41
Porter's Algorithm Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ

...

The suffixes are now removed. All that remains is a little tidying up.
Speech Recognition 36 of 41
Porter's Algorithm Step 5a

(m>1) E -> probate -> probat


rate -> rate
(m=1 and not *o) E -> cease -> ceas

Speech Recognition 37 of 41
Porter's Algorithm Step 5b and Sequence

(m > 1 and *d and *L) -> single letter


controll -> control
roll -> roll
Each step is applied to the result of the previous step, e.g.
# ...
stem = self._step4(stem)
stem = self._step5a(stem)
stem = self._step5b(stem)
For an implementation see e.g.
https://www.nltk.org/_modules/nltk/stem/porter.html
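For example, a short usage sketch with the NLTK implementation linked above (exact outputs can differ slightly from the original rules, since NLTK's default mode includes some later extensions):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["generalizations", "oscillators"]:
    print(word, "->", stemmer.stem(word))
# e.g. generalizations -> gener and oscillators -> oscil,
# matching the worked examples on the next slide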

Speech Recognition 38 of 41
Porter's Algorithm Examples
Example 1:
GENERALIZATIONS → GENERALIZATION (Step 1)
GENERALIZATION → GENERALIZE (Step 2)
GENERALIZE → GENERAL (Step 3)
GENERAL → GENER (Step 4)
Example 2:
OSCILLATORS → OSCILLATOR (Step 1)
OSCILLATOR → OSCILLATE (Step 2)
OSCILLATE → OSCILL (Step 4)
OSCILL → OSCIL (Step 5)

Speech Recognition 39 of 41
Porter's Algorithm Evaluation

see https://tartarus.org/martin/PorterStemmer/def.txt
Suffix stripping of a vocabulary of 10,000 words
------------------------------------------------
Number of words reduced in step 1: 3597
                           step 2:  766
                           step 3:  327
                           step 4: 2424
                           step 5: 1373
Number of words not reduced:       3650

Speech Recognition 40 of 41
Thank you! Questions?

Speech Recognition 41 of 41
