02 Text Processing - Regular Expressions - Text Normalization

The document discusses the fundamentals of Natural Language Processing (NLP), focusing on regular expressions, text normalization, and the handling of words and corpora. It covers the applications of regular expressions in text processing, including preprocessing, data formatting, and text analysis, as well as the complexities of counting words and tokens in sentences and corpora. Additionally, it highlights the importance of text normalization in NLP tasks, detailing methods of tokenization and the challenges involved.

TM340 Natural Language Processing

Text Processing

Regular Expressions - Text Normalization - Words and Corpora

Based on slides by Dan Jurafsky and Chris Manning


Agenda

➢ Regular expressions

➢ Words and Corpora

➢ Text Normalization

2
Regular expressions

3
Regular expressions are used everywhere

➢ Regular expressions are fundamental to text processing.

➢ They are integrated into every NLP toolkit and library.

➢ They are essential in all programming languages and NLP applications.

4
Regular expressions are used everywhere

➢ Regular expressions are particularly useful for:

• Preprocessing: Prepare text data for advanced NLP techniques.


• Data Formatting: Standardize data for analysis.
• Text Analysis: Extract valuable insights from textual data.
• Text Search: Efficiently locate specific patterns within text.
• Web Scraping: Extract information from websites.
• Pattern Matching: Identify recurring structures in text.

5
Regular expressions
➢ Regular expressions: Formal language for defining text strings.

➢ Example: Search for mentions of woodchucks (also known as groundhogs)

➢ Consider different variations:


• woodchuck / Woodchuck
• woodchucks / Woodchucks
• groundhog / Groundhog
• groundhogs / Groundhogs
➢ Include both lower-case and upper-case forms.

➢ Ensure singular and plural forms are captured.

6
Regular expressions

➢ Regular expressions have different variations.

➢ Different tools handle them differently.

➢ You can use an online tester to experiment and learn.

• https://www.regexpal.com/
• https://regexr.com/
• https://www.regextester.com/

7
Regular Expressions Basics: Concatenation

➢ A sequence of simple characters is called Concatenation.

➢ Example: To search for woodchuck, simply type woodchuck.

➢ Search String: Can be a single character or a sequence of characters.

➢ Case Sensitivity: Lowercase and uppercase letters are treated differently.
• Example: woodchucks ≠ Woodchucks.
➢ Solution: Use Disjunction to match different cases or variations.
8
Regular Expressions Basics: Disjunctions

➢ Square brackets define character options: The string of characters inside the brackets specifies a disjunction of characters to match.

Pattern | Matches
[wW]oodchuck | Woodchuck, woodchuck
[1234567890] | Any one digit

➢ Ranges using the dash [A-Z]


Pattern | Matches | Examples
[A-Z] | An upper case letter | Drenched Blossoms
[a-z] | A lower case letter | my beans were impatient
[0-9] | A single digit | Chapter 1: Down the Rabbit Hole
9
Regular Expressions: Negation in Disjunction

➢ Caret (^) as the first character inside [ ] negates the list.

• Note: Caret (^) signifies negation only when it's first in the list.
• Special characters like *, ., +, ? lose their special meaning when used inside [ ].

Pattern | Matches | Examples
[^A-Z] | Not an upper case letter | Oyfn pripetchik
[^Ss] | Neither ‘S’ nor ‘s’ | I have no exquisite reason
[^.] | Not a period | Our resident Djinn
[e^] | Either e or ^ | Look up ^ now
10
Regular Expressions: Convenient aliases
➢ Use backslash (\) to create special character matches:

• \d : any digit
• \s : whitespace
• \w : alphanumeric character or underscore
➢ Capitalized versions negate the match:

• \D : any non-digit

Pattern | Expansion | Matches | Examples
\d | [0-9] | Any digit | Fahrenheit 451
\D | [^0-9] | Any non-digit | Blue Moon
\w | [a-zA-Z0-9_] | Any alphanumeric or _ | Daiyu
\W | [^\w] | Not alphanumeric or _ | Look!
\s | [ \r\t\n\f] | Whitespace (space, tab) | Look␣up
\S | [^\s] | Not whitespace | Look up
11
Regular Expressions: More Disjunction
➢ Pipe Symbol ( | ) acts as "or" between two strings.

➢ Square brackets select between individual characters.

➢ Pipe chooses between strings of characters (e.g., "groundhog" or "woodchuck").

➢ For disjunctions of single letters, use square brackets or pipe.

➢ Combine square brackets and pipe for flexible patterns (e.g., lower/upper case
and string choices).

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
12
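➢ A minimal Python sketch of these disjunctions using the re module (the sample sentence is invented for illustration):

import re

text = "One Woodchuck and two groundhogs met a woodchuck."
# Brackets choose between single characters, the pipe between whole strings
print(re.findall(r"[gG]roundhog|[Ww]oodchuck", text))
# ['Woodchuck', 'groundhog', 'woodchuck']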
Regular Expressions: Wildcards, optionality, repetition
➢ Period (.): Acts as a wildcard, matching any character.

➢ Question Mark (?): Makes the preceding character optional.

➢ Kleene Star (*): Matches 0 or more of the preceding character.

➢ Kleene Plus (+): Matches 1 or more of the preceding character.

Pattern | Matches | Examples
beg.n | Any char | begin begun beg3n beg n
woodchucks? | Optional s | woodchuck woodchucks
to* | 0 or more occurrences of previous char | t to too tooo
to+ | 1 or more occurrences of previous char | to too tooo toooo
to{2} | Exactly the specified number of occurrences | too tooo toooo
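➢ A quick Python illustration of these operators (the example strings are made up):

import re

# ? makes the preceding character optional, + means one or more
print(re.findall(r"woodchucks?", "woodchuck woodchucks"))  # ['woodchuck', 'woodchucks']
print(re.findall(r"to+", "t to too tooo"))                 # ['to', 'too', 'tooo']
print(re.findall(r"beg.n", "begin begun beg3n"))           # ['begin', 'begun', 'beg3n']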
13
Regular Expressions: Anchors ^ $

➢ Caret (^) inside brackets [^ ] means negation.

➢ Caret (^) outside brackets matches the start of a line.

➢ $ matches the end of a line.


Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
14
Regular Expressions: Grouping ()

➢ Parenthesis operator ( ) is helpful when using counters like Kleene *.

➢ Kleene * applies to a single character by default, not to a whole sequence.

➢ To match repeated instances of a string, use parentheses ( ).

15
Regular Expressions: Grouping ()
➢ Example: Matching column labels like Column 1 Column 2 Column 3 in a sequence.

➢ Column [0-9]+ * (note the space before the *) matches a single column followed by any number of spaces (it will not match any number of columns).

➢ The Star (*) here applies only to the space before it, not the entire sequence.

➢ With parentheses, we can rewrite the expression as (Column [0-9]+ *)* to match: the word Column, followed by a number and optional spaces, the whole pattern repeated zero or more times.
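➢ A minimal Python sketch of the difference (the row string is invented for illustration):

import re

row = "Column 1 Column 2 Column 3"
# Without parentheses, * applies only to the space before it
print(re.match(r"Column [0-9]+ *", row).group())     # 'Column 1 ' (just the first column)
# With parentheses, * repeats the whole group
print(re.match(r"(Column [0-9]+ *)*", row).group())  # 'Column 1 Column 2 Column 3'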

16
Regular Expressions: precedence

➢ Regular Expression operators are applied in a fixed order of precedence. From highest to lowest, the standard order is:

• Parenthesis ( )
• Counters * + ? {}
• Sequence (concatenation) and anchors ^ $
• Disjunction |

17
Regular Expressions: A note about Python

➢ For Python, you need to type an extra backslash!

• \n in Python means the "newline" character, not a backslash followed by an "n".
• \\d+ : to search for 1 or more digits

➢ Instead: use Python's raw string notation “r” for regex:
• Example#1: r"[wW]oodchuck" matches woodchuck or Woodchuck
• Example#2: r"\d+" matches one or more digits instead of "\\d+"
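➢ A small Python comparison of the two notations (the sample text is invented):

import re

text = "Room 101 and room 7"
# Doubled backslashes in an ordinary string ...
print(re.findall("\\d+", text))   # ['101', '7']
# ... or a raw string, which is the usual style
print(re.findall(r"\d+", text))   # ['101', '7']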

18
Regular Expressions: Substitutions

➢ Use the "S" command to replace a matched string with


a substitute.

➢ Syntax: s/regexp1/pattern/

➢ Example: we can convert "colour" to "color" using


s/colour/color/

19
Regular Expressions: Substitutions
➢ For Python, we can use the built-in Regular Expression package re:

import re

txt = "The rain in Spain"

x = re.sub(r"\s", "9", txt)
print(x)

Output: The9rain9in9Spain

20
Regular Expressions: Capture Groups
➢ Say we want to put angles around all numbers:

the 35 boxes ➔ the <35> boxes

➢ Use parens () to "capture" a pattern into a numbered register (1, 2, 3…)

➢ Use \1 to refer to the contents of the register

Example: s/([0-9]+)/<\1>/

For Python you can use:

x = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(x)

Output: the <35> boxes

21
Regular Expressions: Capture Groups
➢ In complex patterns, we'll want to use more than one register; here's an example where
we first capture two strings, and then refer to them both in order:

/the (.*)er they (.*), the \1er we \2/

Matches

the faster they ran, the faster we ran

But not

the faster they ran, the faster we ate
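➢ The same pattern can be tried in Python (a small sketch; the two test sentences are the ones above):

import re

pattern = r"the (.*)er they (.*), the \1er we \2"
print(bool(re.search(pattern, "the faster they ran, the faster we ran")))  # True
print(bool(re.search(pattern, "the faster they ran, the faster we ate")))  # False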

22
Regular Expressions: Example

➢ Write a regular expression to find all instances of the word “the” in a text:

• the : Misses capitalized examples

• [tT]he : Incorrectly also matches other or Theology
• \s[tT]he\s : Matches “the” only when surrounded by whitespace (still misses instances next to punctuation or at the start of a line)
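➢ A quick Python check of these attempts (the test sentence is invented; the word-boundary pattern \b[tT]he\b is a further refinement that also handles punctuation):

import re

text = "The other one said the theology lecture was then."
print(re.findall(r"the", text))         # 4 hits: inside other, theology, then, plus the word the
print(re.findall(r"[tT]he", text))      # also catches The, but still matches inside words
print(re.findall(r"\b[tT]he\b", text))  # ['The', 'the'] – only the word itself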

23
False positives and false negatives
➢ The process we just went through was based on fixing two kinds of errors:

1. False negatives: Not matching things that we should have matched (The)

2. False positives: Matching strings that we should not have matched (there, then, other)

24
False positives and false negatives

➢ In NLP we are always dealing with these kinds of errors.

➢ Reducing error rate requires balancing two efforts:

• Increasing coverage (recall): Reducing false negatives


• Increasing accuracy (precision): Reducing false positives

25
Words and Corpora

26
How many words in a sentence?

➢ Counting words in a sentence can be complex.

➢ Example: "I do uh main- mainly business data


processing."
• Is "uh" a word?
• Is "main" (fragment) a word?
➢ Fragments like "main" and filled pauses like "uh" may or
may not be counted in certain applications.
27
How many words in a sentence?

➢ Example: "Seuss’s cat in the hat is different from other


cats! "

➢ Lemma: Words with the same stem, part of speech, and


meaning (e.g., "cat" and "cats" as the same lemma).

➢ Wordform: The exact form of a word, including its


inflections (e.g., "cat" and "cats" as different wordforms).

28
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their

➢ Word types: Unique words, count each word only once (e.g., "the" counted once).

➢ Word tokens: Count every word occurrence (e.g., "the" counted twice).

➢ How many tokens, types in the above sentence?

• 15 tokens (or 14)


• 13 types (or 12) (or 11?)
➢ San Francisco: Is it one word or two?

➢ They and their: Different wordforms but the same lemma.

➢ The goal of the word count affects how you count words.

29
How many words in a corpus?
➢ Corpora (plural of corpus)
• Corpora: Collections or bodies of text

➢ N = number of tokens

➢ V = vocabulary = set of types, |V| is size of vocabulary

➢ Heaps' Law (also called Herdan's Law): |V| = kN^β, where often .67 < β < .75

➢ i.e., vocabulary size grows with roughly the square root of the number of word tokens

Corpus | Tokens = N | Types = |V|
Switchboard phone conversations | 2.4 million | 20 thousand
Shakespeare | 884,000 | 31 thousand
COCA | 440 million | 2 million
Google N-grams | 1 trillion | 13+ million
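➢ A minimal Python sketch for counting tokens (N) and types (|V|) with simple whitespace tokenization (the filename is a placeholder):

with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

N = len(tokens)   # number of word tokens
V = set(tokens)   # set of word types
print(N, len(V))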
Corpora

➢ Words are intentionally created; they don't appear out of nowhere!

➢ Text is produced by:


• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific purpose or function.

31
Corpora

➢ Corpora vary along dimensions such as:


• Language: 7097 languages in the world
• Variety, like African American Language varieties.
• AAE Twitter posts might include forms like "iont" (I don't)
• Code (language) switching, e.g., Spanish/English, Hindi/English:
• Spanish/English: " Por primera vez veo a @username actually being hateful! It
was beautiful:) "
• Hindi/English: " dost tha or ra- hega ... dont wory ... but dherya rakhe "
• Genre: newswire, fiction, scientific articles, Wikipedia, etc.
• Author Demographics: writer's age, gender, ethnicity, etc.
32
Corpora

➢ Corpora vary along dimensions such as:


• Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
• Situation: In what situation was the text written?
• Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
• +Annotation process, language variety, demographics, etc.
33
Text Normalization

34
Text Normalization

➢ Every NLP task requires text normalization:

• Tokenizing (segmenting) words


• Normalizing word formats
• Segmenting sentences

➢ Normalizing words and segmenting sentences are important initial steps in text processing
35
Text Tokenization

➢ Space-based tokenization is a very simple way to tokenize
• For languages that use space characters between words (writing systems based on Arabic, Cyrillic, Greek, Latin, etc.)
• Segment off a token between instances of spaces

➢ Unix provides tools for space-based tokenization
• You can try the "tr" command as described in the “Unix for Poets.pdf” file in CLMS
• Given a text file, output the word tokens and their frequencies (see the Python sketch below)
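➢ A rough Python equivalent of that tr-based pipeline (the filename is a placeholder):

import re
from collections import Counter

with open("text.txt", encoding="utf-8") as f:
    text = f.read()

# Treat maximal runs of letters as tokens, roughly what tr -sc 'A-Za-z' '\n' does
tokens = re.findall(r"[A-Za-z]+", text)
for word, count in Counter(tokens).most_common(10):
    print(count, word)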
36
Issues in Tokenization
➢ During tokenization, we can't just blindly remove punctuation:

• m.p.h., Ph.D., AT&T, cap’n


• prices ($45.55)
• dates (01/02/06)
• URLs (http://www.stanford.edu)
• hashtags (#nlproc)
• email addresses (someone@cs.colorado.edu)
➢ Clitic: a word that doesn't stand on its own

• "are" in we're, “m” in I’m, etc


➢ When should Multi-Word Expressions (MWE) be words?

• New York, rock ’n’ roll

37
Issues in Tokenization
➢ Some languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!
• How do we decide where the token boundaries should be?
➢ In Chinese it's common to just treat each character (zi) as a token.
• So, the segmentation step is very simple
➢ In other languages (like Thai and Japanese), more complex word
segmentation is required.
• The standard algorithms are neural sequence models trained by supervised
machine learning.

38
Another option for text tokenization

➢ Instead of:
• white-space segmentation
• single-character segmentation
➢ Some algorithms use corpus statistics to decide how to segment a text into tokens:
• Use the data to tell us how to tokenize.
• Subword tokenization (because tokens can be parts of words as well as whole words)
39
Subword tokenization
➢ Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)
➢ All have 2 parts:
• A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
• A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary.

40
Byte Pair Encoding

Let vocabulary be the set of all individual characters

= {A, B, C, D,…, a, b, c, d….}

Repeat:
• Choose the two symbols that are most frequently adjacent in
the training corpus (say 'A', 'B')
• Add a new merged symbol 'AB' to the vocabulary
• Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
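➢ A compact Python sketch of this token-learner loop (not an optimized implementation; the toy corpus is the one used in the example slides below):

from collections import Counter

corpus = ("low low low low low lowest lowest newer newer newer "
          "newer newer newer wider wider wider new new")

# Represent each word as a tuple of symbols plus an end-of-word marker "_"
word_freqs = Counter(tuple(w) + ("_",) for w in corpus.split())

def most_frequent_pair(word_freqs):
    # Count how often each pair of adjacent symbols occurs in the corpus
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(word, pair):
    # Replace every adjacent occurrence of the pair with the merged symbol
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)

k = 8          # number of merges to learn
merges = []
for _ in range(k):
    pair = most_frequent_pair(word_freqs)
    merges.append(pair)
    word_freqs = Counter({merge(w, pair): f for w, f in word_freqs.items()})

print(merges)  # starts with ('e', 'r'), ('er', '_'), ('n', 'e'), ...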
41
Byte Pair Encoding: algorithm

42
Byte Pair Encoding

➢ Most subword algorithms are run inside space-separated tokens.

➢ So, we commonly first add a special end-of-word symbol '_' before space in the training corpus

➢ Next, separate into letters.

43
Byte Pair Encoding: Example

➢ Original corpus:

low low low low low lowest lowest newer newer newer
newer newer newer wider wider wider new new

➢ Add end-of-word tokens, resulting in this vocabulary:

_, d, e, i, l, n, o, r, s, t, w

44
Byte Pair Encoding: Example

Merge e r to er

45
Byte Pair Encoding: Example

Merge er _ to er_

46
Byte Pair Encoding: Example

Merge n e to ne

47
Byte Pair Encoding: Example

➢ The next merges are: (ne, w), (l, o), (lo, w), (new, er_), (low, _)

48
Byte Pair Encoding: Example

➢ On the test data, run each merge learned from the training data:
• Greedily
• In the order we learned them
• (test frequencies don't play a role)
➢ So: merge every e r to er, then merge er _ to er_, etc.

➢ Result:
• Test set "n e w e r _" would be tokenized as a full word: “newer_”
• Test set "l o w e r _" would be two tokens: "low er_"
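➢ A small Python sketch of the segmenter side, reusing the merge function and the merges list from the learner sketch above (greedy, applied in the learned order):

def segment(word, merges):
    # Start from single characters plus the end-of-word marker,
    # then apply every learned merge, in order, wherever it occurs
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge(symbols, pair)
    return symbols

print(segment("newer", merges))  # ('newer_',)
print(segment("lower", merges))  # ('low', 'er_')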
49
Byte Pair Encoding

➢ BPE tokens usually include:


• frequent words
• and frequent subwords
• Which are often morphemes like -est or –er

➢ A morpheme is the smallest meaning-bearing unit of a language:
• unlikeliest has 3 morphemes un-, likely, and -est

50
Word Normalization

➢ Word Normalization: Putting words/tokens in a standard format:

• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are

51
Word Normalization: Case folding

➢ Applications like IR: reduce all letters to lower case


• Since users tend to use lower case
• Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail

➢ For sentiment analysis, machine translation, and information extraction
• Case is helpful (US versus us is important)

52
Word Normalization: Lemmatization

➢ Represent all words as their lemma, their shared root:

• am, are, is ➔ be
• car, cars, car's, cars’ ➔ car
• He is reading detective stories ➔ He be read detective story
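➢ One possible illustration with NLTK's WordNet lemmatizer (assumes the WordNet data has been downloaded; the part-of-speech must be supplied for verbs):

from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))              # car
print(wnl.lemmatize("is", pos="v"))       # be
print(wnl.lemmatize("reading", pos="v"))  # read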

53
Word Normalization: Lemmatization

➢ Lemmatization is done by Morphological Parsing

➢ Morphemes:
• The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Parts that adhere to stems, often with grammatical
functions
➢ Morphological Parsers:
• Parse cats into two morphemes cat and s
54
Word Normalization: Stemming

➢ Stemming is done by removing prefixes and suffixes in a basic way (reduce terms to stems).

Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

Stemmed text:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note

55
Word Normalization: Porter Stemmer

➢ Based on a series of rewrite rules run in series
• A cascade, in which the output of each pass is fed to the next pass

➢ Some sample rules:
• ATIONAL → ATE (e.g., relational → relate)
• ING → ε if the stem contains a vowel (e.g., motoring → motor)
• SSES → SS (e.g., grasses → grass)
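➢ For a quick check, NLTK ships an implementation of the Porter stemmer (assumes NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in "relational motoring grasses crosses notes".split():
    print(w, "->", stemmer.stem(w))
# e.g. motoring -> motor, grasses -> grass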

56
Sentence Segmentation
➢ For sentence segmentation we can use symbols like (!, ? , or period “.”)

➢ !, ? mostly unambiguous but period “.” is very ambiguous


• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
➢ Common algorithm: Tokenize first: use rules or ML to classify a period as either
(a) part of the word or (b) a sentence-boundary.
• An abbreviation dictionary can help
➢ Sentence segmentation can then often be done by rules based on this
tokenization.
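➢ For a quick illustration, NLTK's pre-trained Punkt model is one such tokenization-based sentence segmenter (assumes the punkt data has been downloaded; the sample text is invented):

from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

text = "Dr. Smith paid $4.3 million to Acme Inc. on Friday. The deal closed at 5 p.m."
for sentence in sent_tokenize(text):
    print(sentence)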

57
Thank You
