02 Text Processing - Regular Expressions-Text Normalization
Language Processing
Text Processing
➢ Regular expressions
➢ Text Normalization
Regular expressions
Regular expressions are used everywhere
Regular expressions
➢ Regular expressions: Formal language for defining text strings.
Regular expressions
• https://www.regexpal.com/
• https://regexr.com/
• https://www.regextester.com/
Regular Expressions Basics: Disjunction
➢ Square brackets [] match any one character from a set (e.g., [abc]).
• Note: Caret (^) signifies negation only when it's first in the list.
• Special characters like (*, ., +, ?) lose their special meaning when used inside [].
➢ Shorthand character classes:
• \d : any digit
• \s : whitespace
• \w : alphanumeric character or underscore
➢ Capitalized versions negate the match:
• \D : any non-digit
• \S : any non-whitespace
• \W : any non-alphanumeric character
➢ Combine square brackets and pipe for flexible patterns (e.g., lower/upper case and string choices).

Pattern                       Matches
groundhog|woodchuck           woodchuck
yours|mine                    yours
a|b|c (= [abc])               a, b, or c
[gG]roundhog|[Ww]oodchuck     Woodchuck
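A quick check of these patterns with Python's re module (the sample text here is invented for illustration):

import re

text = "Woodchuck, woodchuck, groundhog, Groundhog!"
# [gG] and [wW] match either case; | separates whole alternatives.
print(re.findall(r"[gG]roundhog|[wW]oodchuck", text))
# ['Woodchuck', 'woodchuck', 'groundhog', 'Groundhog']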
Regular Expressions: Wildcards, optionality, repetition
➢ Period (.): Acts as a wildcard, matching any single character.
➢ Question mark (?): Makes the previous character optional (zero or one occurrence).
➢ Kleene star (*): Zero or more occurrences of the previous character or group.
➢ Kleene plus (+): One or more occurrences of the previous character or group.
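These operators in Python (the sample strings are invented for illustration):

import re

print(re.findall(r"woodchucks?", "woodchuck woodchucks"))  # '?': optional s
print(re.findall(r"ba+!", "ba! baa! baaaa!"))              # '+': one or more a's
print(re.findall(r"beg.n", "begin began begun"))           # '.': any one character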
Regular Expressions: Grouping ()
➢ Parentheses make an operator apply to a whole sequence rather than a single character.
➢ Example: Matching a row of column labels like Column 1 Column 2 Column 3.
• In /Column [0-9]+ */, the Star (*) applies only to the space before it, not the entire sequence, so it matches just one label.
• In /(Column [0-9]+ *)*/, the Star applies to the whole group, matching the full row (see the sketch below).
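A sketch of the difference in Python (re.fullmatch requires the pattern to cover the entire string):

import re

row = "Column 1 Column 2 Column 3 "
# Without parentheses, * applies only to the space before it:
print(re.fullmatch(r"Column [0-9]+ *", row))      # None: matches only one label
# With parentheses, * applies to the whole group:
print(re.fullmatch(r"(Column [0-9]+ *)*", row))   # matches the entire row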
Regular Expressions: precedence
➢ Operator precedence, from highest to lowest:
• Parentheses ()
• Counters * + ? {}
• Sequences and anchors (e.g., the ^my end$)
• Disjunction |
➢ Example: in /the*/ the Star binds only to e, so it matches th, the, thee, ...; /(the)*/ matches the, thethe, etc.
Regular Expressions: A note about Python
➢ Use Python's built-in re module (import re) with its functions re.search(), re.match(), re.findall(), and re.sub().
➢ Write patterns as raw strings (r"...") so that backslashes such as \d and \b are not interpreted as Python string escapes.
Regular Expressions: Substitutions
➢ Substitution: replacing a string matched by a regular expression with another string.
➢ Syntax: s/regexp/pattern/
➢ Example: s/colour/color/
Regular Expressions: Substitutions
➢ In Python, we can use the built-in regular expression package re:

import re
# One pattern consistent with the output shown below: replace each space with '9'.
x = re.sub(r" ", "9", "The rain in Spain")
print(x)

Output: The9rain9in9Spain
Regular Expressions: Capture Groups
➢ Say we want to put angles around all numbers:
Example: s/([0-9]+)/<\1>/
➢ In Python (the input string here is illustrative):

x = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(x)

Output: the <35> boxes
Regular Expressions: Capture Groups
➢ In complex patterns, we'll want to use more than one register; here's an example where we first capture two strings, and then refer to them both in order:
/the (.*)er they (.*), the \1er we \2/
Matches: the faster they ran, the faster we ran
But not: the faster they ran, the faster we ate
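A quick check of the pattern above in Python:

import re

pattern = r"the (.*)er they (.*), the \1er we \2"
print(re.search(pattern, "the faster they ran, the faster we ran"))   # a match
print(re.search(pattern, "the faster they ran, the faster we ate"))   # None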
Regular Expressions: Example
➢ Task: find all instances of the word "the" in a text.
• /the/ misses capitalized cases (The)
• /[tT]he/ incorrectly matches the embedded in other words (other, theology)
• /\b[tT]he\b/ adds word boundaries so only the standalone word matches
False positives and false negatives
➢ The process we just went through was based on fixing two kinds of errors:
• Matching strings that we should not have matched (there, then, other): false positives
• Not matching strings that we should have matched (The): false negatives
False positives and false negatives
➢ Reducing the error rate for an application often involves two antagonistic efforts:
• Increasing accuracy or precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives)
Words and Corpora
How many words in a sentence?
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their
➢ Word types: Unique words, count each word only once (e.g., "the" counted once).
➢ Word tokens: Count every word occurrence (e.g., "the" counted twice).
➢ Splitting on whitespace, this sentence has 15 tokens and 13 types ("the" and "and" each occur twice).
➢ The goal of the word count affects how you count words.
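A minimal Python sketch of the distinction, using naive whitespace tokenization (counts depend on the tokenizer, e.g., whether San Francisco is one token):

sentence = ("They lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.lower().split()    # naive whitespace tokenization
types = set(tokens)                  # unique words only
print(len(tokens), len(types))       # 15 13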
How many words in a corpus?
➢ Corpus (plural corpora): a collection or body of text
➢ N = number of tokens
➢ V = vocabulary = the set of types; |V| is the size of the vocabulary
➢ Heaps' Law (= Herdan's Law): |V| = kN^β, where often 0.67 < β < 0.75
➢ i.e., vocabulary size grows roughly with the square root of the number of word tokens

Corpus                              Tokens = N     Types = |V|
Switchboard phone conversations     2.4 million    20 thousand
Shakespeare                         884,000        31 thousand
COCA                                440 million    2 million
Google N-grams                      1 trillion     13+ million
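A small Python sketch of Heaps' Law; k and β below are hypothetical values chosen to land near COCA's row, not fitted parameters:

# |V| = k * N**beta; k and beta are corpus-dependent constants.
k, beta = 2, 0.70            # hypothetical values for illustration
N = 440_000_000              # a COCA-sized token count
V = k * N ** beta
print(f"|V| is roughly {V:,.0f}")   # about 2.2 million, near COCA's 2 million types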
Corpora
Text Normalization
Issues in Tokenization
➢ Some languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!
• How do we decide where the token boundaries should be?
➢ In Chinese, it's common to just treat each character (zi) as a token.
• So, the segmentation step is very simple
➢ In other languages (like Thai and Japanese), more complex word
segmentation is required.
• The standard algorithms are neural sequence models trained by supervised
machine learning.
Another option for text tokenization
➢ Instead of:
• white-space segmentation
• single-character segmentation
➢ Some algorithms use corpus statistics to decide how to segment a
text into tokens:
• Use the data to tell us how to tokenize.
• Subword tokenization (because tokens can be parts of words as well as whole words)
Subword tokenization
➢ Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)
➢ All have 2 parts:
• A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
• A token segmenter that takes a raw test sentence and segments it into the tokens in the vocabulary.
Byte Pair Encoding
➢ Start with a vocabulary containing all individual characters.
➢ Repeat:
• Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
• Add a new merged symbol 'AB' to the vocabulary
• Replace every adjacent 'A' 'B' in the corpus with 'AB'.
➢ Until k merges have been done.
Byte Pair Encoding: algorithm
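The algorithm figure from the original slide is not reproduced here. Below is a minimal Python sketch of the token learner loop described above (ties between equally frequent pairs are broken by first occurrence; real implementations differ in such details):

from collections import Counter

def bpe_learn(corpus_words, k):
    # Represent each word as a list of symbols, with '_' marking end-of-word.
    words = [list(w) + ['_'] for w in corpus_words]
    vocab = {s for w in words for s in w}
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair in the corpus.
        pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        vocab.add(a + b)
        # Replace every adjacent 'a' 'b' with the merged symbol 'ab'.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return vocab, merges

corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
vocab, merges = bpe_learn(corpus, 3)
print(merges)   # [('e', 'r'), ('er', '_'), ('n', 'e')], as in the example below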
Byte Pair Encoding: Example
➢ Original corpus:
low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
➢ Initial vocabulary (individual characters, with _ as the end-of-word symbol):
_, d, e, i, l, n, o, r, s, t, w
Byte Pair Encoding: Example
Merge e r to er
Byte Pair Encoding: Example
Merge er _ to er_
Byte Pair Encoding: Example
Merge n e to ne
Byte Pair Encoding: Example
➢ The remaining merges in this example, in order: ne w ➔ new, l o ➔ lo, lo w ➔ low, new er_ ➔ newer_, low _ ➔ low_
Byte Pair Encoding: Example
➢ On the test data, run each merge learned from the training data:
• Greedily
• In the order we learned them
• (test frequencies don't play a role)
➢ So: merge every e r to er, then merge er _ to er_, etc.
➢ Result:
• Test set "n e w e r _" would be tokenized as a full word: “newer_”
• Test set "l o w e r _" would be two tokens: "low er_"
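A minimal Python sketch of the token segmenter, replaying the merges from the running example in training order (the merge list below is taken from the example above):

# Merges from the running example, in the order they were learned.
MERGES = [('e', 'r'), ('er', '_'), ('n', 'e'), ('ne', 'w'),
          ('l', 'o'), ('lo', 'w'), ('new', 'er_'), ('low', '_')]

def bpe_segment(word, merges):
    # Apply each learned merge greedily, in training order;
    # frequencies in the test data play no role.
    symbols = list(word) + ['_']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

print(bpe_segment("newer", MERGES))   # ['newer_']
print(bpe_segment("lower", MERGES))   # ['low', 'er_']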
Word Normalization
➢ Putting words/tokens into a standard format, choosing a single normalized form for words with multiple forms:
• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are
Word Normalization: Case folding
➢ For tasks like information retrieval, reduce all letters to lower case, since users tend to type lower case.
• Possible exception: upper case mid-sentence can carry meaning (General Motors, Fed vs. fed, US vs. us).
➢ For sentiment analysis, machine translation, and information extraction, case is helpful, so case folding is often not done.
Word Normalization: Lemmatization
➢ Represent all words as their lemma: their shared root, the dictionary headword form.
• am, are, is ➔ be
• car, cars, car's, cars' ➔ car
• He is reading detective stories ➔ He be read detective story
Word Normalization: Lemmatization
➢ Morphemes:
• The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Parts that adhere to stems, often with grammatical
functions
➢ Morphological Parsers:
• Parse cats into two morphemes cat and s
Word Normalization: Stemming
➢ Stemming reduces terms to their stems by chopping off affixes crudely. Example:

Original text:
This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

Stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note
Word Normalization: Porter Stemmer
➢ The most widely used English stemmer, based on a series of rewrite rules applied in sequence, for example:
• ATIONAL ➔ ATE (relational ➔ relate)
• ING ➔ ε if the stem contains a vowel (motoring ➔ motor)
• SSES ➔ SS (grasses ➔ grass)
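A usage sketch with NLTK's implementation (assumes the nltk package is installed):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["motoring", "grasses", "relational"]:
    print(w, "->", ps.stem(w))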
Sentence Segmentation
➢ For sentence segmentation we can use punctuation such as !, ?, or the period "."
• ! and ? are relatively unambiguous sentence-boundary markers.
• The period "." is ambiguous: it can end a sentence, mark an abbreviation (Dr., Inc.), or appear in a number (4.3).
➢ A common approach is a rule-based or machine-learned classifier that decides, for each period, whether it ends a sentence.
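A naive regex splitter (a sketch only; the sample text is invented) illustrates the ambiguity: it wrongly splits after the abbreviation Dr., while 4.3 survives only because no whitespace follows its period:

import re

text = "Dr. Smith arrived. It was 4.3 miles away! Really?"
# Split after ., !, or ? when followed by whitespace.
print(re.split(r"(?<=[.!?])\s+", text))
# ['Dr.', 'Smith arrived.', 'It was 4.3 miles away!', 'Really?']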
Thank You