SPR 08 Algorithms
Linguistic Algorithms
Speech Recognition 1 of 41
Contents
• Normalization
• Maximum Matching Word Segmentation
• Byte Pair Encoding
• Word Distance
• Porter's Stemming Algorithm
Speech Recognition 2 of 41
Normalization
Need to normalize tokens.
Information Retrieval: indexed text & query terms must have the same form.
We implicitly define equivalence classes of tokens.
Remove periods etc.: we probably want to match U.S.A. and USA.
Asymmetric expansion is also needed sometimes:
Enter: window   Search: window, windows
Enter: windows  Search: Windows, windows, window
Enter: Windows  Search: Windows
Figure 1: Example from the article "Text Normalization Using Encoder–Decoder Networks Based on the Causal Feature Extractor", Adrián Javaloy and Ginés García-Mateos
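As a rough illustration (not part of the original slides), a minimal Python sketch of such an equivalence-class mapping could strip periods and fold case; note that plain case folding loses the asymmetric behaviour shown above:

def normalize(token):
    """Map a raw token to a canonical form (illustrative sketch only).

    - strips periods, so "U.S.A." and "USA" fall into the same equivalence class
    - lower-cases, so "Windows" and "windows" match (this ignores the asymmetric
      expansion above, where "Windows" should only match "Windows")
    """
    token = token.replace(".", "")   # U.S.A. -> USA
    return token.lower()             # USA -> usa

print(normalize("U.S.A.") == normalize("USA"))      # True
print(normalize("Windows") == normalize("window"))  # False ("windows" != "window")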
Speech Recognition 3 of 41
Issues in Tokenization
Handling of apostrophes, minus, space and periods:
Finland’s capital → Finland Finlands Finland’s ?
what’re, I’m, isn’t → What are, I am, is not
Hewlett-Packard → Hewlett Packard ?
state-of-the-art → state of the art ?
Lowercase → lower-case lowercase lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??
One commonly used tokenization standard is known as the Penn Treebank tokenization
standard, used for the parsed corpora (treebanks) released by the Linguistic Data
Consortium (LDC), the source of many useful datasets.
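For illustration only, a toy regular-expression tokenizer (a made-up pattern, not the actual Penn Treebank rules) that encodes some of the choices above, keeping abbreviations and hyphenated words intact while splitting off clitics, could look like this:

import re

# Illustrative pattern, NOT the Penn Treebank standard:
#  - keeps abbreviations such as U.S.A. and hyphenated words such as state-of-the-art intact
#  - splits off clitics such as n't, 're, 'm (after a small pre-processing step below)
PATTERN = re.compile(r"""
      (?:[A-Za-z]\.)+          # abbreviations: U.S.A., m.p.h.
    | n't                      # negation clitic separated in tokenize()
    | \w+(?:-\w+)*             # words, optionally hyphenated: state-of-the-art
    | '\w+                     # other clitics: 're, 'm, 's
    | [^\w\s]                  # any remaining punctuation mark
""", re.VERBOSE)

def tokenize(text):
    # separate common clitics from their host word first (illustrative list, not exhaustive)
    text = re.sub(r"n't\b", " n't", text)                  # isn't   -> is n't
    text = re.sub(r"'(re|m|s|ve|ll|d)\b", r" '\1", text)   # what're -> what 're
    return PATTERN.findall(text)

print(tokenize("Hewlett-Packard isn't based in the U.S.A., what're you saying?"))
# ['Hewlett-Packard', 'is', "n't", 'based', 'in', 'the', 'U.S.A.', ',', 'what', "'re", 'you', 'saying', '?']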
Speech Recognition 4 of 41
Tokenization Language Issues
French:
L’ensemble → one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
German noun compounds are not segmented:
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs compound splitter
Asian languages (Chinese, Japanese, Thai) are also difficult to tokenize, since words are not separated by spaces
Speech Recognition 5 of 41
Maximum Matching Word Segmentation Algorithm
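The slide's figure is not reproduced here; as a reference, a minimal Python sketch of greedy maximum matching (take the longest dictionary word starting at the current position, emit it, advance, repeat; the dictionary below is a made-up example) is:

def max_match(text, dictionary):
    """Greedy maximum matching: repeatedly take the longest dictionary word
    that starts at the current position; fall back to a single character
    if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):   # try the longest substring first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:                    # unknown character: emit it alone
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

# toy example dictionary (made up for illustration)
words = {"we", "can", "canon", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", words))
# ['we', 'canon', 'l', 'y', 'see', 'a', 'short', 'distance', 'ahead']

The output "we canon l y see ..." also shows why the greedy strategy works poorly for English; it works much better for languages such as Chinese, where words are short.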
Speech Recognition 6 of 41
Example
Speech Recognition 7 of 41
Byte-pair Encoding Description
Some algorithms first learn from one (training) corpus and are then applied to a test corpus.
Byte-pair encoding (BPE), which can be used for tokenization, is one such algorithm.
Besides the Jurafsky book, see https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
It was originally developed for text compression.
Used for tokenization in Transformer models such as OpenAI's GPT and Facebook's BART
Speech Recognition 8 of 41
Description of BPE
Initialization of BPE:
1) The unique set of words (= word types) - probably after some normalization and
pre-tokenization of the corpus - is computed: The dictionary.
2) Then the symbols used in words are collected: The vocabulary.
3) Each word is represented as a sequence of characters plus a special end-of-word
symbol.
Steps in BPE:
1) At each step of the algorithm, the occurrences of each adjacent symbol pair are counted.
2) Then the most frequent pair ('A', 'B') is found and replaced with the new merged
symbol ('AB').
We continue to count and merge, creating new longer and longer character strings,
until we’ve done k merges; k is a parameter of the algorithm.
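A minimal sketch of one count-and-merge step (a simplification written for these slides, not the Sennrich et al. code referenced later), using the word counts of the Jurafsky & Martin example (low 5, lowest 2, newer 6, wider 3, new 2) and the end-of-word marker _:

from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs over all word types, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair (A, B) by the merged symbol AB."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# word types as character sequences plus the end-of-word marker "_"
vocab = {("l", "o", "w", "_"): 5,
         ("l", "o", "w", "e", "s", "t", "_"): 2,
         ("n", "e", "w", "e", "r", "_"): 6,
         ("w", "i", "d", "e", "r", "_"): 3,
         ("n", "e", "w", "_"): 2}
best = count_pairs(vocab).most_common(1)[0][0]
print(best)                       # most frequent pair, here ('e', 'r') with 9 occurrences
vocab = merge_pair(vocab, best)   # 'er' is now treated as a single symbol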
Speech Recognition 9 of 41
Example for BPE (1/5)
Training corpus contains words low, new, newer, but not lower.
Speech Recognition 10 of 41
Example for BPE (2/5)
Pairs of symbols are counted
Most frequent is the pair e r (9 occurrences)
These symbols are merged, creating a new symbol er that is treated as a single symbol
Count again:
Speech Recognition 11 of 41
Example for BPE (3/5)
Speech Recognition 12 of 41
Example for BPE (4/5)
Most frequent is now the pair n e (8 occurrences) -> merge
Count again:
Speech Recognition 13 of 41
Example for BPE (5/5)
Speech Recognition 14 of 41
Apply BPE on Test Sentence
After the tokens have been learned with training data, a test sentence can be tokenized:
• The learned merges are used by the token segmenter greedily on the test data in
the order they were learned. (Thus the frequencies in the test data don’t play a
role, just the frequencies in the training data).
• E.g. think of the test sentence The newer the lower.
• First, each word in a test sentence is segmented into characters.
• Then apply first rule: replace every instance of e r in the test corpus with er.
• Then second rule: replace every instance of er _ in the test corpus with er_ ,
and so on.
• By the end, if the test corpus contained the character sequence n e w e r, it
would be tokenized as a full word.
• But the characters of a new (unknown) word like l o w e r would be merged
into the two tokens low er_ .
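A sketch of this greedy replay of the learned merges on a single test word (the helper and the merge list below are illustrative assumptions, listed in the order the merges would plausibly have been learned):

def apply_merges(word, merges):
    """Tokenize one word by replaying the learned merges in the order they were
    learned (frequencies in the test data play no role)."""
    symbols = list(word) + ["_"]          # split into characters + end-of-word marker
    for a, b in merges:                   # merges: list of learned pairs, in order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # merge in place
            else:
                i += 1
    return symbols

# merges as they might have been learned from the training corpus above
merges = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("new", "er_"), ("l", "o"), ("lo", "w")]
print(apply_merges("newer", merges))   # ['newer_']      known word -> one token
print(apply_merges("lower", merges))   # ['low', 'er_']  unknown word -> two tokens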
Speech Recognition 15 of 41
Byte-pair Encoding Python Code
https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
Another description is the Python code for the BPE learning algorithm from Sennrich et al. (2016): Neural Machine
Translation of Rare Words with Subword Units,
https://www.aclweb.org/anthology/P16-1162/
(Hint: OOV is short for Out-Of-Vocabulary, i.e. a test word that does not occur in the training vocabulary)
An implementation can also be found on Github:
https://github.com/rsennrich/subword-nmt
Speech Recognition 16 of 41
Calculating the Similarity of Words
Application examples:
• Spell correction
• The user typed “graffe”. Which is closest? graf, graft, grail, giraffe
• Computational Biology
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
see https://textmining.wp.hs-hannover.de/UnscharfeSuche.html
def ngram(string, n):
    """Return the list of all character n-grams of the string."""
    mylist = []
    if n <= len(string):                    # <= so that a string of length n yields itself
        for p in range(len(string) - n + 1):
            tg = string[p:p+n]              # n-gram starting at position p
            mylist.append(tg)
    return mylist
Speech Recognition 19 of 41
jaccard function
def jaccard(A, B):
    """Jaccard similarity of two n-gram lists: |intersection| / |union|."""
    intersection = 0
    for a in A:
        if a in B:
            intersection += 1
    union = len(A) + len(B) - intersection
    return float(intersection) / float(union)
Speech Recognition 20 of 41
Example Usage
# Check:
>>> str1 = 'intention'
>>> str2 = 'execution'
>>> n1 = ngram(str1,3)
>>> print(n1) # ['int', 'nte', 'ten', 'ent', 'nti', 'tio', 'ion']
>>> n2 = ngram(str2,3)
>>> print(n2) # ['exe', 'xec', 'ecu', 'cut', 'uti', 'tio', 'ion']
>>> print(jaccard(n1, n1))
1.0
>>> print(jaccard(n1, n2))
0.16666666666666666
Speech Recognition 21 of 41
Word Distance
Speech Recognition 22 of 41
Word Distance Principle
Instead of counting the operations, the two words are aligned.
At the start, one pointer is placed at the beginning of each word.
Every operation has a fixed cost, e.g. 1 (cost unit) for insertion and deletion and 2 (or alternatively 1) for substitution.
The goal is to get both pointers to the end of the words with minimum costs.
If both pointers are moved forward reading the same character, there are no costs.
If both pointers are moved forward reading different characters, the substitution costs.
If the first pointer is moved one position and the second stays, the cost for a deletion is taken.
If the second pointer is moved and the first one stays on its position, a character is inserted.
For each cell of the distance matrix, the minimum cost over the three possible operations is used.
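A minimal dynamic-programming sketch of this principle, with cost 1 for insertion and deletion and cost 2 for substitution:

def edit_distance(a, b, sub_cost=2):
    """Minimum edit distance between a and b with insertion/deletion cost 1
    and substitution cost sub_cost (matching characters cost nothing)."""
    m, n = len(a), len(b)
    # d[i][j] = minimum cost of transforming a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(1, n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1,                               # insertion
                          d[i - 1][j - 1] + (0 if same else sub_cost))   # match / substitution
    return d[m][n]

print(edit_distance("intention", "execution"))   # 8 (with substitution cost 2)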
Speech Recognition 23 of 41
Word Distance Visualization
• E.g. first pointer uses the word in the
first row (Geschichte), second
pointer first column (Gesichtet).
• Horizontal moves (within a row) then indicate deletions from the first word, e.g. at row 5 (s), column 6 (c), the c is deleted.
• Vertical moves (down a column) indicate insertions, e.g. at row 6, column 5, an i is inserted.
Speech Recognition 24 of 41
Stemming
Speech Recognition 25 of 41
Porter's Algorithm Introduction and Definitions
The most common stemmer for English
see https://de.wikipedia.org/wiki/Porter-Stemmer-Algorithmus
Original Paper: https://tartarus.org/martin/PorterStemmer/
Definitions:
Consonant: a letter in a word other than A, E, I, O or U, and other than Y preceded by a consonant.
Vowel: a letter that is not a consonant.
Examples:
In TOY the consonants are T and Y (because the non-consonant O precedes Y).
In SYZYGY they are S, Z and G.
Speech Recognition 26 of 41
Porter's Algorithm Description
Speech Recognition 27 of 41
Porter's Algorithm Lists
A consonant will be denoted by c, a vowel by v. A list ccc… of length greater than 0 will be denoted by
C, and a list vvv… of length greater than 0 will be denoted by V. Any word, or part of a word, therefore
has one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V
These may all be represented by the single form
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written as
[C](VC){m}[V].
m will be called the measure of any word or word part when represented in this form.
Speech Recognition 28 of 41
Measures Examples
The case m = 0 covers the null word. Here are some examples:
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
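A small sketch (assumed helper functions, not Porter's original code) that computes the measure m via the [C](VC){m}[V] form:

def is_consonant(word, i):
    """Consonant per Porter's definition: not A, E, I, O, U, and not a Y
    that is preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number of VC sequences, i.e. the m in [C](VC){m}[V]."""
    # map the word to a C/V pattern, e.g. TROUBLE -> "CCVVCCV"
    pattern = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    # collapse runs of equal letters: "CCVVCCV" -> "CVCV"
    collapsed = "".join(c for i, c in enumerate(pattern) if i == 0 or c != pattern[i - 1])
    return collapsed.count("VC")

for w in ["TREE", "TROUBLE", "OATS", "PRIVATE", "ORRERY"]:
    print(w, measure(w))   # TREE 0, TROUBLE 1, OATS 1, PRIVATE 2, ORRERY 2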
Speech Recognition 29 of 41
Rules and Conditions
Speech Recognition 30 of 41
Conditions
The ‘condition’ part may also contain the following:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that
(m>1 and (*S or *T))
tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z))
tests for a stem ending with a double consonant other than L, S or Z.
Elaborate conditions like this are required only rarely.
Speech Recognition 31 of 41
Porter's Algorithm Step 1a
The rules in a step are examined in sequence, and only one rule from a step can be
applied.
In a set of rules written beneath each other, only one is obeyed, and this will be the
one with the longest matching S1 for the given word.
Step 1 deals with plurals and past participles.
Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni
ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
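As an illustration (Step 1a only, which has no m-conditions; not the full stemmer), the longest-match rule selection can be sketched as:

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_step(word, rules):
    """Apply the single rule whose suffix S1 is the longest match for the word."""
    best = None
    for s1, s2 in rules:
        if word.endswith(s1) and (best is None or len(s1) > len(best[0])):
            best = (s1, s2)
    if best is None:
        return word
    s1, s2 = best
    return word[:len(word) - len(s1)] + s2

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", apply_step(w, STEP_1A))
# caresses -> caress, ponies -> poni, ties -> ti, caress -> caress, cats -> cat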
Speech Recognition 32 of 41
Porter's Algorithm Step 1b
Speech Recognition 33 of 41
Porter's Algorithm Step 2
Subsequent steps are more straightforward:
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition
rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
Speech Recognition 34 of 41
Porter's Algorithm Step 3
Speech Recognition 35 of 41
Porter's Algorithm Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
...
The suffixes are now removed. All that remains is a little tidying up.
Speech Recognition 36 of 41
Porter's Algorithm Step 5a
Speech Recognition 37 of 41
Porter's Algorithm Step 5b and Sequence
Speech Recognition 38 of 41
Porter's Algorithm Examples
Example 1:
GENERALIZATIONS → GENERALIZATION (Step 1)
GENERALIZATION → GENERALIZE (Step 2)
GENERALIZE → GENERAL (Step 3)
GENERAL → GENER (Step 4)
Example 2:
OSCILLATORS → OSCILLATOR (Step 1)
OSCILLATOR → OSCILLATE (Step 2)
OSCILLATE → OSCILL (Step 4)
OSCILL → OSCIL (Step 5)
Speech Recognition 39 of 41
Porter's Algorithm Evaluation
see https://tartarus.org/martin/PorterStemmer/def.txt
Suffix stripping of a vocabulary of 10,000 words
------------------------------------------------
Number of words reduced in step 1: 3597
Number of words reduced in step 2:  766
Number of words reduced in step 3:  327
Number of words reduced in step 4: 2424
Number of words reduced in step 5: 1373
Number of words not reduced:       3650
Speech Recognition 40 of 41
Thank you! Questions?
Speech Recognition 41 of 41