L02_Programming_RE plc
programming
CSE 307 – Principles of Programming Languages
Stony Brook University
http://www.cs.stonybrook.edu/~cse307
What are Regular Expressions?
Formal language representing a text pattern interpreted
by a regular expression processor
Used for matching, searching and replacing text
There are no variables and no arithmetic (you cannot
even add 1+1), so a regular expression language is not
a programming language
Frequently you will hear them called regex or RE for
short (or pluralized "regexes")
(c) Paul Fodor (CS Stony Brook)
What are Regular Expressions?
Usage examples:
Test if a phone number has the correct number of digits
Test if an email address has the correct format
Test if a Social Security Number is in the correct format
Search a text for words that contain digits
Find duplicate words in a text
Replace all occurrences of "Bob" and "Bobby" with "Robert"
Count the number of times "science" is preceded by
"computer" or "information"
Convert a tab-indented file to space indentation
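A few of the tasks listed above can be sketched with Python's re module, which these notes introduce later. The sample sentence is invented for illustration, and the patterns are one possible way to express each task, not the only one.

```python
import re

# Made-up sample text for illustration.
text = "Bob met Bobby at the the computer science lab."

# Replace "Bob" and "Bobby" with "Robert" (\b keeps words like "Bobcat" untouched).
renamed = re.sub(r"\bBob(by)?\b", "Robert", text)

# Find duplicate words: a word, whitespace, then the same word again (\1).
duplicates = re.findall(r"\b(\w+)\s+\1\b", text)

# Count how often "science" is preceded by "computer" or "information".
count = len(re.findall(r"\b(?:computer|information) science\b", text))

print(renamed)
print(duplicates)
print(count)
```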
What are Regular Expressions?
But what does "matches" mean?
A text matches a regular expression if it is correctly
described by the regex
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m
<re.Match object; span=(0, 12), match='Isaac Newton'>
History of Regular Expressions
grep evolved into egrep
but broke backward compatibility
Therefore, in 1986, everyone came together and defined POSIX
(Portable Operating System Interface)
Basic Regular Expressions (BREs)
Extended Regular Expressions (EREs)
1986: Henry Spencer releases the regex library in C
Many incorporated it in other languages and tools
1987: Larry Wall released Perl
Used Spencer's regex library
Added powerful features
Everybody wanted to have it in their languages: Perl Compatible RE
(PCRE) library, Java, Javascript, C#/VB/.NET, MySQL, PHP,
Python, Ruby
Regular Expressions Engines
Main versions / standards:
PCRE
POSIX BRE
POSIX ERE
Very subtle differences
Mainly older UNIX tools that use POSIX BRE for compatibility reasons
In use:
Unix (POSIX BRE, POSIX ERE)
PHP (PCRE)
Apache (v1: POSIX ERE, v2: PCRE)
MySQL (POSIX ERE)
Each of these languages is improving, so check their manuals
Python Regular Expressions
https://docs.python.org/3/library/re.html
The re module is more powerful than plain string splitting:
>>> "ab bc cd".split()
['ab', 'bc', 'cd']
Import the re module:
>>> import re
>>> re.split(" ", "ab bc cd")
['ab', 'bc', 'cd']
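The single-space example gives the same result as str.split, so the following sketch uses a made-up input with mixed separators, where a pattern is genuinely needed.

```python
import re

# str.split works with one fixed separator, so mixed separators are not handled.
naive = "ab, bc;cd".split(",")

# re.split takes a pattern: a comma or a semicolon, followed by optional whitespace.
parts = re.split(r"[,;]\s*", "ab, bc;cd")

print(naive)
print(parts)
```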
>>> m = re.search("def", "abcdef")  # assumed call; the statement that defined m is missing here
>>> m
<re.Match object; span=(3, 6), match='def'>
Online Regular Expressions
https://regexpal.com
Regular Expressions
Strings:
"car" matches "car"
"car" also matches the first three letters in "cartoon"
"car" does not match "c_a_r"
Similar to search in a word processor
Case-sensitive (by default): "car" does not match "Car"
Metacharacters:
Have a special meaning
Like mathematical operators
Transform char sequences into powerful patterns
Only a few characters to learn: \ . * + - { } [ ] ( ) ^ $ | ? : ! =
May have multiple meanings
Depend on the context in which they are used
Variation between regex engines
The wildcard character
Like a wild card in card games: one card can stand in for any other card in the
pattern
Metacharacter Meaning
. Any character except newline
Examples:
"h.t" matches "hat", "hot", "hit"
".a.a.a" matches "banana", "papaya"
"h.t" does not match "heat" or "Hot"
Common mistake:
"9.00" matches "9.00", but it also matches "9500" and "9-00"
We should write regular expressions to match what we
want and ONLY what we want (we don't want to be overly
permissive, we don't want false positives, and we don't want the
regular expression to match what we are not looking for)
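The wildcard behaviour above can be checked with a short sketch. It assumes Python's re.fullmatch, which requires the whole string to match the pattern; the word lists are only sample data.

```python
import re

# "h.t" needs exactly one character between h and t, and is case-sensitive.
words = ["hat", "hot", "heat", "Hot", "h\nt"]
matches_h_t = [w for w in words if re.fullmatch(r"h.t", w)]

# The unescaped dot in "9.00" accepts any character, so false positives slip in.
prices = ["9.00", "9500", "9-00"]
loose = [p for p in prices if re.fullmatch(r"9.00", p)]

print(matches_h_t)  # "heat", "Hot" and the newline variant are rejected
print(loose)        # all three strings are accepted
```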
Escaping Metacharacters
Allow the use of metacharacters as characters:
"\." matches "."
Metacharacter Meaning
\ Escape the next character
"9\.00" matches only "9.00", but not "9500" or "9-00"
Match a backslash by escaping it with a backslash:
"\\" matches only "\"
"C:\\Users\\Paul" matches "C:\Users\Paul"
Only for metacharacters
Literal characters should not be escaped, because the backslash can give them a special meaning, e.g., r"\n" matches a newline rather than the letter n
Sometimes we want both meanings
Example: we want to match files of the name: "1_report.txt", "2_report.txt",…
"._report\.txt" uses the first . as a wildcard and the second \. as the period itself
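The escaping rules from the Escaping Metacharacters slide can be tried out as follows. This is a minimal sketch using Python raw strings (r"...") so that the backslashes reach the regex engine unchanged; the extra file names are invented test data.

```python
import re

# Escaped dot: only a literal period is accepted.
exact = re.fullmatch(r"9\.00", "9.00") is not None
false_positive = re.fullmatch(r"9\.00", "9500") is not None

# A literal backslash is matched by an escaped backslash.
path_ok = re.fullmatch(r"C:\\Users\\Paul", r"C:\Users\Paul") is not None

# First dot is a wildcard, the escaped second dot is a literal period.
files = ["1_report.txt", "2_report.txt", "1_report_txt", "12_report.txt"]
reports = [f for f in files if re.fullmatch(r"._report\.txt", f)]

print(exact, false_positive, path_ok)
print(reports)
```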
Other special characters
Tabs: \t
Line returns: \r (carriage return), \n (newline), \r\n
Unicode codes: \u00A9
Hex character codes: \xA9 (exactly two hex digits follow \x)
Character sets
Metacharacter Meaning
[ Begin character set
] End character set
Character ranges
[a-z] = [abcdefghijklmnopqrstuvwxyz]
The range metacharacter - denotes a character range only inside a
character set; outside a set it is a literal dash
A range represents all characters between its two endpoint characters
Examples:
[0-9]
[A-Za-z]
[0-9A-Za-z]
[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] matches phone "631-632-9820"
[0-9][0-9][0-9][0-9][0-9] matches zip code "90210"
[A-Z0-9][A-Z0-9][A-Z0-9] [A-Z0-9][A-Z0-9][A-Z0-9] matches Canadian postal codes,
such as "K1A 0B1"
Caution:
What is [50-99]?
It is not {50,51,…,99}
It is the same as [0-9]: the set lists 5, the range 0-9, and 9, and the range already contains 5 and 9
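The caution about [50-99] can be verified directly. The sketch below also shows one possible pattern for the numbers 50 to 99; that pattern is an illustration added here, not part of the original slide.

```python
import re

phone = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
phone_ok = re.fullmatch(phone, "631-632-9820") is not None

# [50-99] is a set of single characters: 5, the range 0-9, and 9.
single_digit = re.fullmatch(r"[50-99]", "7") is not None    # any one digit matches
two_digits = re.fullmatch(r"[50-99]", "75") is not None     # two characters do not

# Assumed alternative for "a number from 50 to 99": constrain each digit.
fifty_to_99 = r"[5-9][0-9]"
in_range = [n for n in ["49", "50", "75", "99", "100"] if re.fullmatch(fifty_to_99, n)]

print(phone_ok, single_digit, two_digits)
print(in_range)
```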
Negative Character sets
Metacharacter Meaning
^ Negate a character set
Caret (^) = not one of several characters
Add ^ as the first character inside a character set
Still represents one character
Examples:
[^aeiou] matches any one character that is not a lower case vowel
[^aeiouAEIOU] matches any one character that is not a vowel (non-vowel)
[^a-zA-Z] matches any one character that is not a letter
see[^mn] matches "seek" and "sees", but not "seem" or "seen"
see[^mn] matches "see " because space matches [^mn]
see[^mn] does not match "see" because there is no character after see
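The see[^mn] cases above can be reproduced with re.search. This is a small sketch over the same sample words; note that a negated set still has to consume one character.

```python
import re

samples = ["seek", "sees", "seem", "seen", "see ", "see"]

# [^mn] must match exactly one character that is neither m nor n.
hits = [s for s in samples if re.search(r"see[^mn]", s)]

# [^a-zA-Z] picks out every character that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", "a1-b2")

print(hits)
print(non_letters)
```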
Metacharacters
Metacharacters inside Character sets are already escaped:
Do not need to escape them again
Examples:
h[o.]t matches "hot" and "h.t"
Exceptions: metacharacters that have to do with character sets: ]-^\
Examples:
[[\]] matches "[" or "]"
var[[(][0-9][)\]] matches "var(3)" or "var[4]" (a digit is required between the brackets)
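The set examples above can be sketched in Python as follows. One deliberate difference, which is an assumption of this sketch and not the slide's notation: the opening bracket inside the set is written as \[, because recent Python versions may emit a FutureWarning ("possible nested set") when a set begins with an unescaped [.

```python
import re

# Inside a set the dot is literal, so only "hot" and "h.t" are found.
dots = re.findall(r"h[o.]t", "hot h.t hat")

# A set containing the two bracket characters (both escaped here).
brackets = re.findall(r"[\[\]]", "a[0]")

# var, an opening ( or [, one digit, then a closing ) or ].
candidates = ["var(3)", "var[4]", "var()", "var{5}"]
calls = [c for c in candidates if re.fullmatch(r"var[\[(][0-9][)\]]", c)]

print(dots)
print(brackets)
print(calls)
```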
Shorthand character sets
Shorthand  Meaning             Equivalent
\d         Digit               [0-9]
\w         Word character      [a-zA-Z0-9_]
\s         Whitespace          [ \t\n\r]
\D         Not digit           [^0-9]
\W         Not word character  [^a-zA-Z0-9_]
\S         Not whitespace      [^ \t\n\r]
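A short sketch of the shorthands in Python, using a made-up sample string. One caveat worth knowing: for Python 3 str patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given, so the "Equivalent" column holds exactly only in ASCII mode.

```python
import re

text = "Room 42, floor_3\tok"

digits = re.findall(r"\d", text)    # single digits
words = re.findall(r"\w+", text)    # runs of word characters
chunks = re.findall(r"\S+", text)   # runs of non-whitespace

# ARABIC-INDIC DIGIT THREE: a digit in Unicode, but not in [0-9].
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None

print(digits, words, chunks)
print(unicode_digit, ascii_digit)
```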
Repetition
Metacharacter Meaning
* Preceding item zero or more times
+ Preceding item one or more times
? Preceding item zero or one time
Examples:
apples* matches "apple" and "apples" and "applessssssss"
apples+ matches "apples" and "applessssssss"
apples? matches "apple" and "apples"
\d* matches "123"
colou?r matches "color" and "colour"
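The three repetition metacharacters can be compared side by side. The sketch uses re.fullmatch so that each candidate must match in its entirety; the candidate words are sample data.

```python
import re

candidates = ["appl", "apple", "apples", "applessss"]

# * : the final "s" may appear zero or more times.
star = [c for c in candidates if re.fullmatch(r"apples*", c)]
# + : the final "s" must appear at least once.
plus = [c for c in candidates if re.fullmatch(r"apples+", c)]
# ? : the final "s" may appear at most once.
optional = [c for c in candidates if re.fullmatch(r"apples?", c)]

spellings = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

print(star)
print(plus)
print(optional)
print(spellings)
```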
Quantified Repetition
Metacharacter Meaning
{ Start quantified repetition of preceding item
} End quantified repetition of preceding item
{min,max}
min and max must be non-negative integers
min must always be included
min can be 0
max is optional
Syntaxes:
\d{4,8} matches numbers with 4 to 8 digits
\d{4} matches numbers with exactly 4 digits
\d{4,} matches numbers with minimum 4 digits
\d{0,} is equivalent to \d*
\d{1,} is equivalent to \d+
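The quantifier forms above can be exercised on a few digit strings. This is a minimal sketch using re.fullmatch; the last line rewrites the earlier phone-number pattern with quantifiers as an illustration.

```python
import re

numbers = ["123", "1234", "12345678", "123456789"]

four_to_eight = [n for n in numbers if re.fullmatch(r"\d{4,8}", n)]
exactly_four = [n for n in numbers if re.fullmatch(r"\d{4}", n)]
at_least_four = [n for n in numbers if re.fullmatch(r"\d{4,}", n)]

# The phone pattern from the character-range examples, written more compactly.
phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "631-632-9820") is not None

print(four_to_eight)
print(exactly_four)
print(at_least_four)
print(phone_ok)
```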
Greedy Expressions
Standard repetition quantifiers are greedy:
expressions try to match the longest possible string
\d* matches the entire string "1234" and not just "123", "1",
or "23"
Lazy expressions:
matches as little as possible before giving control to the next
expression part
? makes the preceding quantifier into a lazy quantifier
*?
+?
{min,max}?
??
Example:
"apples??" matches "apple" in "apples"
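The difference between greedy and lazy quantifiers is easiest to see on text with repeated delimiters. The HTML-like string below is invented for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

# The ?? quantifier prefers zero occurrences of the optional "s".
lazy_apple = re.search(r"apples??", "apples").group()
greedy_apple = re.search(r"apples?", "apples").group()

print(greedy)
print(lazy)
print(lazy_apple, greedy_apple)
```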
Grouping metacharacters
Metacharacter Meaning
( Start grouped expression
) End grouped expression
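One common use of grouping is to apply a quantifier to a whole sub-expression instead of a single character. The patterns below are illustrative examples chosen for this sketch, not taken from the slide.

```python
import re

# With the group, + repeats the whole sequence "abc".
grouped = re.fullmatch(r"(abc)+", "abcabc") is not None

# Without the group, + applies only to the final "c".
ungrouped = re.fullmatch(r"abc+", "abcabc") is not None

# An optional prefix: the group "in" may appear zero or one time.
words = [w for w in ["dependent", "independent", "iindependent"]
         if re.fullmatch(r"(in)?dependent", w)]

print(grouped, ungrouped)
print(words)
```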
Metacharacters
$ Matches the ending position of the string or the position
just before a string-ending newline.
In line-based tools, it matches the ending position of any line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
^ Matches the beginning of a line or string.
| The choice (also known as alternation or set union) operator matches
either the expression before or the expression after the operator.
For example, abc|def matches "abc" or "def".
\A Matches the beginning of a string (but not an internal line).
\z Matches the end of a string (but not an internal line); Python's re module has traditionally written this anchor as \Z.
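The anchors above behave differently depending on whether line mode is on. The sketch below assumes Python's re.MULTILINE flag as the "line-based" mode and uses \Z, the spelling Python's re module has long used for the end-of-string anchor.

```python
import re

text = "hat\ncat"

# Default mode: $ and ^ refer to the whole string.
end_default = re.findall(r"[hc]at$", text)
start_default = re.findall(r"^[hc]at", text)

# MULTILINE mode: $ and ^ also match at every line boundary.
end_lines = re.findall(r"[hc]at$", text, flags=re.MULTILINE)
start_lines = re.findall(r"^[hc]at", text, flags=re.MULTILINE)

# \A and \Z ignore MULTILINE and always refer to the whole string.
start_string = re.findall(r"\A[hc]at", text, flags=re.MULTILINE)
end_string = re.findall(r"[hc]at\Z", text, flags=re.MULTILINE)

# Alternation: either side of | may match.
either = [s for s in ["abc", "def", "abd"] if re.fullmatch(r"abc|def", s)]

print(end_default, start_default)
print(end_lines, start_lines)
print(start_string, end_string, either)
```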
Summary: Frequently Used Regular Expressions
Python match and search Functions
re.match(r, s) returns a match object if the regex r
matches at the start of string s, and None otherwise
import re
regex = r"\d{3}-\d{2}-\d{4}"   # raw string keeps the backslashes intact
ssn = input("Enter SSN: ")
match1 = re.match(regex, ssn)  # anchored at the start only; trailing text is not checked
if match1 is not None:
    print(ssn, "is a valid SSN")
    print("start position of the matched text is "
          + str(match1.start()))
    print("start and end position of the matched text is "
          + str(match1.span()))
else:
    print(ssn, "is not a valid SSN")
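Because re.match only anchors the pattern at the start, input with extra trailing characters is still accepted. A minimal sketch of a stricter check, assuming re.fullmatch is acceptable for the task; the helper name looks_like_ssn is hypothetical.

```python
import re

SSN = re.compile(r"\d{3}-\d{2}-\d{4}")

def looks_like_ssn(s: str) -> bool:
    """Hypothetical helper: True only if the whole string has the SSN shape."""
    return SSN.fullmatch(s) is not None

# match() succeeds on a prefix, even though one digit too many follows.
prefix = SSN.match("123-45-67890")

print(prefix.group())                  # the matched prefix only
print(looks_like_ssn("123-45-6789"))
print(looks_like_ssn("123-45-67890"))
```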
Python match and search Functions
re.search(r, s) returns a match object if the regex r matches
anywhere in string s, and None otherwise
import re
regex = r"\d{3}-\d{2}-\d{4}"   # raw string keeps the backslashes intact
text = input("Enter a text: ")
match1 = re.search(regex, text)
if match1 is not None:
    print(text, "contains a valid SSN")
    print("start position of the matched text is "
          + str(match1.start()))
    print("start and end position of the matched text is "
          + str(match1.span()))
else:
    print(text, "does not contain a valid SSN")
Findall
findall(pattern, string [, flags]) returns a list of
strings giving all nonoverlapping matches of pattern in string. If the
pattern contains one group, it returns a list of that group's matches, and a list of tuples if the
pattern has more than one group
>>> re.findall('<(.*?)>', '<spam> /<ham><eggs>')
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>/?<(.*?)>',
...            '<spam>/<ham> ... <eggs><cheese>')
[('spam', 'ham'), ('eggs', 'cheese')]
Sub and Compile
sub(pattern, repl, string [, count, flags])
returns the string obtained by replacing the (first count) leftmost
nonoverlapping occurrences of pattern (a string or a pattern object) in
string by repl (which may be a string with backslash escapes that may
back-reference a matched group, or a function that is passed a single match
object and returns the replacement string).
compile(pattern [, flags]) compiles a regular expression
pattern string into a regular expression pattern object, for later matching.
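The descriptions of sub and compile above can be illustrated with a short sketch. The sample strings and the helper function double are made up for this example.

```python
import re

# count limits the number of replacements (leftmost first).
first_only = re.sub(r"\bBob(by)?\b", "Robert", "Bob and Bobby", count=1)

# The replacement string may back-reference groups with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", "Isaac Newton")

# The replacement may also be a function that receives each match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# compile builds a reusable pattern object with the same methods.
word = re.compile(r"\w+")
tokens = word.findall("ab bc cd")

print(first_only)
print(swapped)
print(doubled)
print(tokens)
```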
Groups
Groups: extract substrings matched by REs in '()' parts
(R) Matches any regular expression inside (), and delimits a group (retains
matched substring)
\N Matches the contents of the group of the same number N: '(.+) \1' matches
“42 42”
import re
patt = re.compile("A(.)B(.)C(.)") # saves 3 substrings
mobj = patt.match("A0B1C2") # each '()' is a group, 1..n
print(mobj.group(1), mobj.group(2), mobj.group(3))
patt = re.compile("A(.*)B(.*)C(.*)") # saves 3 substrings
mobj = patt.match("A000B111C222") # groups() gives all groups
print(mobj.groups())
print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups())
print(re.search("(?P<a>A|X)(?P<b>B|Y)(?P<c>C|Z)D",
"..AYCD..").groupdict())
patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)")
mobj = patt.search(" # define spam 1 + 2 + 3") # parts of C #define
print(mobj.groups()) # \s is whitespace
Groups
python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
{'a': 'A', 'b': 'Y', 'c': 'C'}
('spam', '1 + 2 + 3')
Groups
When a match or search function or method is successful, you get back a
match object
group(g), group(g1, g2, ...): returns the substring that matched a
parenthesized group (or groups) in the pattern. Accepts group numbers or names.
Group numbers start at 1; group 0 is the entire string matched by the pattern. Returns
a tuple when passed multiple group numbers; the group number defaults to 0 if
omitted.
groups(): returns a tuple of all groups' substrings of the match (for
group numbers 1 and higher).
start([group]), end([group]): indices of the start and end of the
substring matched by group (or of the entire matched string, if no group is
passed).
span([group]): returns the two-item tuple (start(group),
end(group)).
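The match object methods listed above can be tried on the "Isaac Newton" example from the beginning of these notes. The group names first and last are introduced here for illustration only.

```python
import re

m = re.search(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton, physicist")

whole = m.group()           # group 0: the entire matched text
first = m.group(1)          # by number
last = m.group("last")      # by name
pair = m.group(1, 2)        # several groups at once give a tuple
all_groups = m.groups()     # groups 1 and higher

positions = (m.start(2), m.end(2))   # where group 2 starts and ends
whole_span = m.span()                # span of the entire match
first_span = m.span("first")         # span of a named group

print(whole, first, last)
print(pair, all_groups)
print(positions, whole_span, first_span)
```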