02 Text Processing - Regular Expressions - Text Normalization

The document discusses the fundamentals of Natural Language Processing (NLP), focusing on regular expressions, text normalization, and the handling of words and corpora. It covers the applications of regular expressions in text processing, including preprocessing, data formatting, and text analysis, as well as the complexities of counting words and tokens in sentences and corpora. Additionally, it highlights the importance of text normalization in NLP tasks, detailing methods of tokenization and the challenges involved.

TM340 Natural Language Processing

Text Processing

Regular Expressions - Text Normalization - Words and Corpora

Based on slides by Dan Jurafsky and Chris Manning


Agenda

➢ Regular expressions

➢ Words and Corpora

➢ Text Normalization

2
Regular expressions

3
Regular expressions are used everywhere

➢ Regular expressions are fundamental to text processing.

➢ They are integrated into every NLP toolkit and library.

➢ They are essential in all programming languages and NLP applications.

4
Regular expressions are used everywhere

➢ Regular expressions are particularly useful for:

• Preprocessing: Prepare text data for advanced NLP techniques.


• Data Formatting: Standardize data for analysis.
• Text Analysis: Extract valuable insights from textual data.
• Text Search: Efficiently locate specific patterns within text.
• Web Scraping: Extract information from websites.
• Pattern Matching: Identify recurring structures in text.

5
Regular expressions
➢ Regular expressions: Formal language for defining text strings.

➢ Example: Search for mentions of woodchucks (also known as groundhogs)

➢ Consider different variations:


• woodchuck / Woodchuck
• woodchucks / Woodchucks
• groundhog / Groundhog
• groundhogs / Groundhogs
➢ Include both lower-case and upper-case forms.

➢ Ensure singular and plural forms are captured.

6
Regular expressions

➢ Regular expressions have different variations.

➢ Different tools handle them differently.

➢ You can use an online tester to experiment and learn.

• https://www.regexpal.com/
• https://regexr.com/
• https://www.regextester.com/

7
Regular Expressions Basics: Concatenation

➢ A sequence of simple characters is called Concatenation.

➢ Example: To search for woodchuck, simply type woodchuck.

➢ Search String: Can be a single character or a sequence of characters.

➢ Case Sensitivity: Lowercase and uppercase letters are treated differently.
• Example: woodchucks ≠ Woodchucks.
➢ Solution: Use Disjunction to match different cases or variations.
8
Regular Expressions Basics: Disjunctions

➢ Square brackets define character options: The string of characters inside the brackets specifies a disjunction of characters to match.

Pattern | Matches
[wW]oodchuck | Woodchuck, woodchuck
[1234567890] | Any one digit

➢ Ranges using the dash [A-Z]


Pattern | Matches | Examples
[A-Z] | An upper case letter | Drenched Blossoms
[a-z] | A lower case letter | my beans were impatient
[0-9] | A single digit | Chapter 1: Down the Rabbit Hole
9
Regular Expressions: Negation in Disjunction

➢ Caret (^) as the first character inside [ ] negates the list.

• Note: Caret (^) signifies negation only when it's first in the list.
• Special characters like *, ., +, ? lose their special meaning when used inside [ ].

Pattern | Matches | Examples
[^A-Z] | Not an upper case letter | Oyfn pripetchik
[^Ss] | Neither ‘S’ nor ‘s’ | I have no exquisite reason
[^.] | Not a period | Our resident Djinn
[e^] | Either e or ^ | Look up ^ now
10
Regular Expressions: Convenient aliases
➢ Use backslash (\) to create special character matches:

• \d : any digit
• \s : whitespace
• \w : alphanumeric character or underscore
➢ Capitalized versions negate the match:

• \D : any non-digit

Pattern | Expansion | Matches | Examples
\d | [0-9] | Any digit | Fahrenheit 451
\D | [^0-9] | Any non-digit | Blue Moon
\w | [a-zA-Z0-9_] | Any alphanumeric or _ | Daiyu
\W | [^\w] | Not alphanumeric or _ | Look!
\s | [ \r\t\n\f] | Whitespace (space, tab) | Look␣up
\S | [^\s] | Not whitespace | Look up
11
Regular Expressions: More Disjunction
➢ Pipe Symbol ( | ) acts as "or" between two strings.

➢ Square brackets select between individual characters.

➢ Pipe chooses between strings of characters (e.g., "groundhog" or "woodchuck").

➢ For disjunctions of single letters, use square brackets or pipe.

➢ Combine square brackets and pipe for flexible patterns (e.g., lower/upper case
and string choices).

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
12
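➢ A minimal Python sketch of these disjunctions using the re module (the sample sentence is invented for illustration):

import re

text = "One Woodchuck and two groundhogs met a woodchuck."
# Brackets choose between single characters, the pipe between whole strings
print(re.findall(r"[gG]roundhog|[Ww]oodchuck", text))
# ['Woodchuck', 'groundhog', 'woodchuck']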
Regular Expressions: Wildcards, optionality, repetition
➢ Period (.): Acts as a wildcard, matching any character.

➢ Question Mark (?): Makes the preceding character optional.

➢ Kleene Star (*): Matches 0 or more of the preceding character.

➢ Kleene Plus (+): Matches 1 or more of the preceding character.

Pattern | Matches | Examples
beg.n | Any char | begin begun beg3n beg n
woodchucks? | Optional s | woodchuck woodchucks
to* | 0 or more occurrences of previous char | t to too tooo
to+ | 1 or more occurrences of previous char | to too tooo toooo
to{2} | Exactly the specified number of occurrences | too tooo toooo
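➢ A quick Python illustration of these operators (the example strings are made up):

import re

# ? makes the preceding character optional, + means one or more
print(re.findall(r"woodchucks?", "woodchuck woodchucks"))  # ['woodchuck', 'woodchucks']
print(re.findall(r"to+", "t to too tooo"))                 # ['to', 'too', 'tooo']
print(re.findall(r"beg.n", "begin begun beg3n"))           # ['begin', 'begun', 'beg3n']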
13
Regular Expressions: Anchors ^ $

➢ Caret (^) inside brackets [^ ] means negation.

➢ Caret (^) outside brackets matches the start of a line.

➢ $ matches the end of a line.


Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
14
Regular Expressions: Grouping ()

➢ Parenthesis operator ( ) is helpful when using counters like Kleene *.

➢ Kleene * applies to a single character by default, not to a whole sequence.

➢ To match repeated instances of a string, use parentheses ( ).

15
Regular Expressions: Grouping ()
➢ Example: Matching column labels like Column 1 Column 2 Column 3 in a sequence.

➢ Column [0-9]+ * (note the space before the *) matches a single column followed by any number of spaces (it will not match any number of columns).

➢ The Star (*) here applies only to the space before it, not the entire sequence.

➢ With parentheses, we can rewrite the expression as (Column [0-9]+ *)* to match: the word Column, followed by a number and optional spaces, the whole pattern repeated zero or more times.
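➢ A minimal Python sketch of the difference (the row string is invented for illustration):

import re

row = "Column 1 Column 2 Column 3"
# Without parentheses, * applies only to the space before it
print(re.match(r"Column [0-9]+ *", row).group())     # 'Column 1 ' (just the first column)
# With parentheses, * repeats the whole group
print(re.match(r"(Column [0-9]+ *)*", row).group())  # 'Column 1 Column 2 Column 3'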

16
Regular Expressions: precedence

➢ Regular Expression operators are applied in a fixed order of precedence. From highest to lowest, the standard order is:

• Parenthesis ( )
• Counters * + ? {}
• Sequence (concatenation) and anchors ^ $
• Disjunction |

17
Regular Expressions: A note about Python

➢ For Python, you need to type an extra backslash!

• \n in Python means the "newline" character, not a backslash followed by an "n".
• \\d+ : to search for 1 or more digits

➢ Instead: use Python's raw string notation “r” for regex:
• Example#1: r"[wW]oodchuck" matches woodchuck or Woodchuck
• Example#2: r"\d+" matches one or more digits instead of "\\d+"
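➢ A small Python comparison of the two notations (the sample text is invented):

import re

text = "Room 101 and room 7"
# Doubled backslashes in an ordinary string ...
print(re.findall("\\d+", text))   # ['101', '7']
# ... or a raw string, which is the usual style
print(re.findall(r"\d+", text))   # ['101', '7']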

18
Regular Expressions: Substitutions

➢ Use the "S" command to replace a matched string with


a substitute.

➢ Syntax: s/regexp1/pattern/

➢ Example: we can convert "colour" to "color" using


s/colour/color/

19
Regular Expressions: Substitutions
➢ For Python, we can use the built-in Regular Expression package re:

import re

txt = "The rain in Spain"

x = re.sub(r"\s", "9", txt)
print(x)

Output: The9rain9in9Spain

20
Regular Expressions: Capture Groups
➢ Say we want to put angles around all numbers:

the 35 boxes ➔ the <35> boxes

➢ Use parens () to "capture" a pattern into a numbered register (1, 2, 3…)

➢ Use \1 to refer to the contents of the register

Example: s/([0-9]+)/<\1>/

For Python you can use:

x = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(x)

Output: the <35> boxes

21
Regular Expressions: Capture Groups
➢ In complex patterns, we'll want to use more than one register; here's an example where
we first capture two strings, and then refer to them both in order:

/the (.*)er they (.*), the \1er we \2/

Matches

the faster they ran, the faster we ran

But not

the faster they ran, the faster we ate
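➢ The same pattern can be tried in Python (a small sketch; the two test sentences are the ones above):

import re

pattern = r"the (.*)er they (.*), the \1er we \2"
print(bool(re.search(pattern, "the faster they ran, the faster we ran")))  # True
print(bool(re.search(pattern, "the faster they ran, the faster we ate")))  # False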

22
Regular Expressions: Example

➢ Write a regular expression to find all instances of the word “the” in a text:

• the : Misses capitalized examples

• [tT]he : Incorrectly also matches other or Theology
• \s[tT]he\s : Matches “the” only when surrounded by whitespace (still misses instances next to punctuation or at the start of a line)
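➢ A quick Python check of these attempts (the test sentence is invented; the word-boundary pattern \b[tT]he\b is a further refinement that also handles punctuation):

import re

text = "The other one said the theology lecture was then."
print(re.findall(r"the", text))         # 4 hits: inside other, theology, then, plus the word the
print(re.findall(r"[tT]he", text))      # also catches The, but still matches inside words
print(re.findall(r"\b[tT]he\b", text))  # ['The', 'the'] – only the word itself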

23
False positives and false negatives
➢ The process we just went through was based on fixing two kinds of errors:

1. False negatives: Not matching things that we should have matched (The)

2. False positives: Matching strings that we should not have matched (there, then, other)

24
False positives and false negatives

➢ In NLP we are always dealing with these kinds of errors.

➢ Reducing error rate requires balancing two efforts:

• Increasing coverage (recall): Reducing false negatives


• Increasing accuracy (precision): Reducing false positives

25
Words and Corpora

26
How many words in a sentence?

➢ Counting words in a sentence can be complex.

➢ Example: "I do uh main- mainly business data


processing."
• Is "uh" a word?
• Is "main" (fragment) a word?
➢ Fragments like "main" and filled pauses like "uh" may or
may not be counted in certain applications.
27
How many words in a sentence?

➢ Example: "Seuss’s cat in the hat is different from other


cats! "

➢ Lemma: Words with the same stem, part of speech, and


meaning (e.g., "cat" and "cats" as the same lemma).

➢ Wordform: The exact form of a word, including its


inflections (e.g., "cat" and "cats" as different wordforms).

28
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their

➢ Word types: Unique words, count each word only once (e.g., "the" counted once).

➢ Word tokens: Count every word occurrence (e.g., "the" counted twice).

➢ How many tokens, types in the above sentence?

• 15 tokens (or 14)


• 13 types (or 12) (or 11?)
➢ San Francisco: Is it one word or two?

➢ They and their: Different wordforms but the same lemma.

➢ The goal of the word count affects how you count words.

29
How many words in a corpus?
➢ Corpora (plural of corpus)
• Corpora: Collections or bodies of text

➢ N = number of tokens

➢ V = vocabulary = set of types, |V| is size of vocabulary

➢ Heaps' Law (also called Herdan's Law): |V| = kN^β, where often .67 < β < .75

➢ i.e., vocabulary size grows with roughly the square root of the number of word tokens

Corpus | Tokens = N | Types = |V|
Switchboard phone conversations | 2.4 million | 20 thousand
Shakespeare | 884,000 | 31 thousand
COCA | 440 million | 2 million
Google N-grams | 1 trillion | 13+ million
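➢ A minimal Python sketch for counting tokens (N) and types (|V|) with simple whitespace tokenization (the filename is a placeholder):

with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

N = len(tokens)   # number of word tokens
V = set(tokens)   # set of word types
print(N, len(V))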
Corpora

➢ Words are intentionally created; they don't appear out of nowhere!

➢ Text is produced by:


• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific purpose or function.

31
Corpora

➢ Corpora vary along dimensions such as:


• Language: 7097 languages in the world
• Variety, like African American Language varieties.
• AAE Twitter posts might include forms like "iont" (I don't)
• Code (language) switching, e.g., Spanish/English, Hindi/English:
• Spanish/English: " Por primera vez veo a @username actually being hateful! It
was beautiful:) "
• Hindi/English: " dost tha or ra- hega ... dont wory ... but dherya rakhe "
• Genre: newswire, fiction, scientific articles, Wikipedia, etc.
• Author Demographics: writer's age, gender, ethnicity, etc.
32
Corpora

➢ Corpora vary along dimensions such as:


• Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
• Situation: In what situation was the text written?
• Collection process: If it is a subsample how was it sampled?
Was there consent? Pre-processing?
• +Annotation process, language variety, demographics, etc.
33
Text Normalization

34
Text Normalization

➢ Every NLP task requires text normalization:

• Tokenizing (segmenting) words


• Normalizing word formats
• Segmenting sentences

➢ Normalizing words and segmenting sentences are important initial steps in text processing
35
Text Tokenization

➢ Space-based tokenization is a very simple way to tokenize
• For languages that use space characters between words (writing systems based on Arabic, Cyrillic, Greek, Latin, etc.)
• Segment off a token between instances of spaces

➢ Unix provides tools for space-based tokenization
• You can try the "tr" command as described in the “Unix for Poets.pdf” file in CLMS
• Given a text file, output the word tokens and their frequencies (see the Python sketch below)
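➢ A rough Python equivalent of that tr-based pipeline (the filename is a placeholder):

import re
from collections import Counter

with open("text.txt", encoding="utf-8") as f:
    text = f.read()

# Treat maximal runs of letters as tokens, roughly what tr -sc 'A-Za-z' '\n' does
tokens = re.findall(r"[A-Za-z]+", text)
for word, count in Counter(tokens).most_common(10):
    print(count, word)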
36
Issues in Tokenization
➢ During tokenization, we can't just blindly remove punctuation:

• m.p.h., Ph.D., AT&T, cap’n


• prices ($45.55)
• dates (01/02/06)
• URLs (http://www.stanford.edu)
• hashtags (#nlproc)
• email addresses (someone@cs.colorado.edu)
➢ Clitic: a word that doesn't stand on its own

• "are" in we're, “m” in I’m, etc


➢ When should Multi-Word Expressions (MWE) be words?

• New York, rock ’n’ roll

37
Issues in Tokenization
➢ Some languages (like Chinese, Japanese, Thai) don't use spaces to
separate words!
• How do we decide where the token boundaries should be?
➢ In Chinese it's common to just treat each character (zi) as a token.
• So, the segmentation step is very simple
➢ In other languages (like Thai and Japanese), more complex word
segmentation is required.
• The standard algorithms are neural sequence models trained by supervised
machine learning.

38
Another option for text tokenization

➢ Instead of:
• white-space segmentation
• single-character segmentation
➢ Some algorithms use corpus statistics to decide how to segment a text into tokens:
• Use the data to tell us how to tokenize.
• Subword tokenization (because tokens can be parts of words as well as whole words)
39
Subword tokenization
➢ Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)
➢ All have 2 parts:
• A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
• A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary.

40
Byte Pair Encoding

Let vocabulary be the set of all individual characters

= {A, B, C, D,…, a, b, c, d….}

Repeat:
• Choose the two symbols that are most frequently adjacent in
the training corpus (say 'A', 'B')
• Add a new merged symbol 'AB' to the vocabulary
• Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
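➢ A compact Python sketch of this token-learner loop (not an optimized implementation; the toy corpus is the one used in the example slides below):

from collections import Counter

corpus = ("low low low low low lowest lowest newer newer newer "
          "newer newer newer wider wider wider new new")

# Represent each word as a tuple of symbols plus an end-of-word marker "_"
word_freqs = Counter(tuple(w) + ("_",) for w in corpus.split())

def most_frequent_pair(word_freqs):
    # Count how often each pair of adjacent symbols occurs in the corpus
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(word, pair):
    # Replace every adjacent occurrence of the pair with the merged symbol
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)

k = 8          # number of merges to learn
merges = []
for _ in range(k):
    pair = most_frequent_pair(word_freqs)
    merges.append(pair)
    word_freqs = Counter({merge(w, pair): f for w, f in word_freqs.items()})

print(merges)  # starts with ('e', 'r'), ('er', '_'), ('n', 'e'), ...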
41
Byte Pair Encoding: algorithm

42
Byte Pair Encoding

➢ Most subword algorithms are run inside space-separated tokens.

➢ So, we commonly first add a special end-of-word symbol '_' before space in the training corpus

➢ Next, separate into letters.

43
Byte Pair Encoding: Example

➢ Original corpus:

low low low low low lowest lowest newer newer newer
newer newer newer wider wider wider new new

➢ Add end-of-word tokens, resulting in this vocabulary:

_, d, e, i, l, n, o, r, s, t, w

44
Byte Pair Encoding: Example

Merge e r to er

45
Byte Pair Encoding: Example

Merge er _ to er_

46
Byte Pair Encoding: Example

Merge n e to ne

47
Byte Pair Encoding: Example

➢ The next merges are: (ne, w), (l, o), (lo, w), (new, er_), (low, _)

48
Byte Pair Encoding: Example

➢ On the test data, run each merge learned from the training data:
• Greedily
• In the order we learned them
• (test frequencies don't play a role)
➢ So: merge every e r to er, then merge er _ to er_, etc.

➢ Result:
• Test set "n e w e r _" would be tokenized as a full word: “newer_”
• Test set "l o w e r _" would be two tokens: "low er_"
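➢ A small Python sketch of the segmenter side, reusing the merge function and the merges list from the learner sketch above (greedy, applied in the learned order):

def segment(word, merges):
    # Start from single characters plus the end-of-word marker,
    # then apply every learned merge, in order, wherever it occurs
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge(symbols, pair)
    return symbols

print(segment("newer", merges))  # ('newer_',)
print(segment("lower", merges))  # ('low', 'er_')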
49
Byte Pair Encoding

➢ BPE tokens usually include:


• frequent words
• and frequent subwords
• Which are often morphemes like -est or –er

➢ A morpheme is the smallest meaning-bearing unit of a language:
• unlikeliest has 3 morphemes un-, likely, and -est

50
Word Normalization

➢ Word Normalization: Putting words/tokens in a standard format:

• U.S.A. or USA
• uhhuh or uh-huh
• Fed or fed
• am, is, be, are

51
Word Normalization: Case folding

➢ Applications like IR: reduce all letters to lower case


• Since users tend to use lower case
• Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail

➢ For sentiment analysis, machine translation, and information extraction
• Case is helpful (US versus us is important)

52
Word Normalization: Lemmatization

➢ Represent all words as their lemma, their shared root:

• am, are, is ➔ be
• car, cars, car's, cars’ ➔ car
• He is reading detective stories ➔ He be read detective story
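➢ One possible illustration with NLTK's WordNet lemmatizer (assumes the WordNet data has been downloaded; the part-of-speech must be supplied for verbs):

from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))              # car
print(wnl.lemmatize("is", pos="v"))       # be
print(wnl.lemmatize("reading", pos="v"))  # read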

53
Word Normalization: Lemmatization

➢ Lemmatization is done by Morphological Parsing

➢ Morphemes:
• The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Parts that adhere to stems, often with grammatical
functions
➢ Morphological Parsers:
• Parse cats into two morphemes cat and s
54
Word Normalization: Stemming

➢ Stemming is done by removing prefixes and suffixes in a basic way (reduce terms to stems).

Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

Stemmed text:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note

55
Word Normalization: Porter Stemmer

➢ Based on a series of rewrite rules run in series
• A cascade, in which the output of each pass is fed to the next pass

➢ Some sample rules:
• ATIONAL → ATE (e.g., relational → relate)
• ING → ε if the stem contains a vowel (e.g., motoring → motor)
• SSES → SS (e.g., grasses → grass)
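➢ For a quick check, NLTK ships an implementation of the Porter stemmer (assumes NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in "relational motoring grasses crosses notes".split():
    print(w, "->", stemmer.stem(w))
# e.g. motoring -> motor, grasses -> grass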

56
Sentence Segmentation
➢ For sentence segmentation we can use symbols like (!, ? , or period “.”)

➢ !, ? mostly unambiguous but period “.” is very ambiguous


• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
➢ Common algorithm: Tokenize first: use rules or ML to classify a period as either
(a) part of the word or (b) a sentence-boundary.
• An abbreviation dictionary can help
➢ Sentence segmentation can then often be done by rules based on this
tokenization.
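➢ For a quick illustration, NLTK's pre-trained Punkt model is one such tokenization-based sentence segmenter (assumes the punkt data has been downloaded; the sample text is invented):

from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

text = "Dr. Smith paid $4.3 million to Acme Inc. on Friday. The deal closed at 5 p.m."
for sentence in sent_tokenize(text):
    print(sentence)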

57
Thank You
