Re Expression 19 and 20
Python RegEx
A Regular Expression (RegEx) is a special sequence of characters that defines a search
pattern for finding a string or set of strings.
It can detect the presence or absence of text by matching it against a particular pattern,
and a pattern can also be divided into one or more sub-patterns.
Regex Module in Python
Python has a built-in module named “re” that is used for regular expressions in Python. We
can import this module by using the import statement.
Example: Importing re module in Python
# importing re module
import re
# Check if the string starts with "The" and ends with "India":
txt = "The rain in India"
x = re.search(r"^The.*India$", txt)
if x:
    print("YES! We have a match!")
else:
    print("No match")
RegEx Functions
The re module offers a set of functions that allows us to search a
string for a match:
Function Description
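The four functions used in the rest of these notes are findall(), search(), split() and sub(). A minimal sketch of what each one returns, reusing the sample string from the example above (the variable names are only illustrative):

```python
import re

txt = "The rain in India"

all_matches = re.findall(r"in", txt)   # every non-overlapping match, as a list
first_match = re.search(r"rain", txt)  # a Match object, or None if nothing matches
pieces = re.split(r"\s", txt)          # the string split at each match
replaced = re.sub(r"\s", "-", txt)     # every match replaced by "-"

print(all_matches)
print(first_match.span())
print(pieces)
print(replaced)
```

Each function takes the pattern first and the string to search second; sub() takes the replacement text between the two.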
P a g e 1 | 26
Microsoft Certified Power BI Data Analyst
Arindam Ghosh,9433547743
SalesForce Certified Administrator
x = re.findall("ai", txt)
print(x)
output→ ['ai']
txt = "The rain in India"
x = re.findall("in", txt)
print(x)
output→ ['in', 'in']
import re
#Return a list containing every occurrence of "dxi" (empty, because there is none):
txt = "The rain in India"
x = re.findall("dxi", txt)
print(x)
output→[]
import re
p = re.compile('[a-e]')
# Creates a pattern object that matches any single character in
# the range 'a' to 'e', inclusive.
print(p.findall("Aye, said Mr. R D Sharma"))
output→ ['e', 'a', 'd', 'a', 'a'] (the matching characters)
import re
p = re.compile(r'\w')
print(p.findall("G@@d Morn!n_ ."))
Output →['G', 'd', 'M', 'o', 'r', 'n', 'n', '_']
import re
# Compile the pattern
p = re.compile(r'\w+')
# Test the pattern on a sample string
test_string = "Hello, World! 123 _test_"
matches = p.findall(test_string)
print(matches)
Output→ ['Hello', 'World', '123', '_test_']
We are creating a pattern object that matches one or more word
characters consecutively. In regular expression terms, a word
character is defined as any alphanumeric character (letters and
digits) plus the underscore (_). The + quantifier means "one or
more" of the preceding element.
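The difference between \w and \w+ is easiest to see by running both on the same input. A minimal sketch, reusing the sample string from the listing above (the variable names are only illustrative):

```python
import re

sample = "Hello, World! 123 _test_"

single = re.findall(r"\w", sample)   # one list entry per word character
runs = re.findall(r"\w+", sample)    # one list entry per run of word characters

print(len(single))
print(runs)
```

Both patterns pick out the same characters; the + quantifier only changes how they are grouped into matches.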
# x is assumed to hold the result of re.search(r"\s", txt)
if x:
    print("Whitespace character found at position:", x.start())
else:
    print("No whitespace character found.")
Output→
Whitespace character found at position: 5
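The listing above relies on a search call that is not shown. A minimal, self-contained sketch of the same check follows. It assumes the sample string "The rain in Spain" used elsewhere in these notes, so the reported position differs from the output above:

```python
import re

txt = "The rain in Spain"   # assumed sample string, not taken from the listing above
x = re.search(r"\s", txt)   # first whitespace character, or None

if x:
    print("Whitespace character found at position:", x.start())
else:
    print("No whitespace character found.")
```

With this string the first space follows "The", so x.start() is 3.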
import re
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)
Output→None
The split() Function
The split() function returns a list where the string has been split
at each match:
import re
#Split the string at every white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
output→ ['The', 'rain', 'in', 'Spain']
import re
#Split the string at the first white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
Output→ ['The', 'rain in Spain']
from re import split
# Split on one or more non-word characters
print(split(r'\W+', 'WORDS, words , Words'))
print(split(r'\W+', 'W@rds, wor!s , Wo&ds'))
# Split on one or more word characters
print(split(r'\w+', "W@r$s"))
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
# Split on one or more digit characters
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))
Output
['WORDS', 'words', 'Words']
['W', 'rds', 'wor', 's', 'Wo', 'ds']
['', '@', '$', '']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']
The sub() Function
The sub() function replaces the matches with the text of your
choice:
import re
#Replace all white-space characters with the character "*":
txt = "The rain in Spain"
x = re.sub(r"\s", "*", txt)
print(x)
output→ The*rain*in*Spain
import re
#Replace the first two white-space characters with the character "X":
txt = "The rain in Spain"
x = re.sub(r"\s", "X", txt, count=2)
print(x)
Output→ TheXrainXin Spain
Match Object
A Match Object is an object containing information about the
search and the result.
Note: If there is no match, the value None will be returned,
instead of the Match Object.
import re
#The search() function returns a Match object:
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)
Output→ <re.Match object; span=(5, 7), match='ai'>
The Match object has properties and methods used to retrieve
information about the search and the result:
.span() returns a tuple containing the start and end positions of the match.
.string returns the string passed into the function.
.group() returns the part of the string where there was a match.
import re
# Search for an upper case "S" character at the beginning of a word,
# and print the matched word and its position:
# Sample text
txt = "The rain in Spain stays mainly in the plain."
x = re.search(r"\bS\w+", txt)
if x:
    print("Found word:", x.group())
    print("Starting position:", x.start())
    print("Span:", x.span())
    print("Original string:", x.string)
else:
    print("No matching word found.")
Output→
Found word: Spain
Starting position: 12
Span: (12, 17)
Original string: The rain in Spain stays mainly in the plain.
import re
# Group 1 captures a word (the month), group 2 captures digits (the day)
regex = r"([a-zA-Z]+) (\d+)"
match = re.search(regex, "I was born on June 24")
if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")
Output
Match at index 14, 21
Full match: June 24
Month: June
Day: 24
import re
# match is assumed to come from re.search() with a three-group date pattern
if match:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))  # Entire match
    print("Month: %s" % (match.group(1)))       # First group (month)
    print("Day: %s" % (match.group(2)))         # Second group (day)
    print("Year: %s" % (match.group(3)))        # Third group (year)
else:
    print("No match found.")
Output
Match at index 9, 19
Full match: 03/04/2025
Month: 03
Day: 04
Year: 2025
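The listing above does not include its pattern or its input string. The sketch below reproduces the printed output under two assumptions that are not in the source: the pattern (\d{2})/(\d{2})/(\d{4}) and the sample string "Today is 03/04/2025".

```python
import re

# Assumed pattern: two digits, two digits and four digits separated by "/"
regex = r"(\d{2})/(\d{2})/(\d{4})"
# Assumed input: chosen so that the date starts at index 9
match = re.search(regex, "Today is 03/04/2025")

if match:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))  # Entire match
    print("Month: %s" % (match.group(1)))       # First group (month)
    print("Day: %s" % (match.group(2)))         # Second group (day)
    print("Year: %s" % (match.group(3)))        # Third group (year)
else:
    print("No match found.")
```

Any other input in which a date of this shape starts at index 9 would print the same indices.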
re.escape()
re.escape() returns the string with every character that has a special
meaning in a regular expression escaped with a backslash. This is
useful if you want to match an arbitrary literal string that may have
regular expression metacharacters in it: the escaped string is safe to
use as a pattern, because those characters are treated as literal
characters.
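import re
print(re.escape("This is Awesome even 1 AM"))
# "\t" below is a tab character, because the string is not a raw string
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))
Output
This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
Note: In Python 3.7 and later the comma is not escaped. In the second
line, the backslash after "said\ " is followed by a literal tab character.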
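The escaped string is meant to be passed on as a pattern. A minimal sketch of that step, assuming we want to find the literal text "[a-9]" inside a longer string (the names literal and text are only illustrative):

```python
import re

literal = "[a-9]"                      # hypothetical text to search for, taken literally
text = "he typed [a-9] twice: [a-9]"   # hypothetical input

pattern = re.escape(literal)           # brackets and hyphen are escaped
found = re.findall(pattern, text)

print(pattern)
print(found)
```

Without re.escape(), "[a-9]" would be read as a character set with an invalid range, and compiling it raises re.error.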
Meta-characters
Metacharacters are characters with a special meaning in a pattern.
They are used in the patterns passed to the functions of the re
module, so they are important for understanding regular expressions.
Below is the list of metacharacters.
Metacharacter   Description
\               Used to drop the special meaning of the character
                following it
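A minimal sketch of the backslash at work, using the dot as the metacharacter being escaped (the sample string is only illustrative):

```python
import re

text = "3.14 and 2x71"

any_char = re.findall(r"\d.\d", text)    # "." matches any character
literal_dot = re.findall(r"\d\.\d", text)  # "\." matches only a real dot

print(any_char)
print(literal_dot)
```

The first pattern also accepts "2x7", because an unescaped dot stands for any character; the escaped version keeps only "3.1".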
Special Sequences
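Special sequences are shorthands written with a backslash. Some of
them, such as \A, \b, \B and \Z, do not match a character in the
string; instead they specify the location in the search string where
the match must occur. Others, such as \d or \w, match a whole class of
characters. They make it easier to write commonly used patterns.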
List of special sequences
Special   Description                                     Examples
Sequence
\A        Matches if the string begins with the given     \Afor   for seeks
          pattern.                                                for the world
\b        Matches at a word boundary. \b(string) checks   \bse    seeks
          for the beginning of a word and (string)\b              set
          checks for the end of a word.
\B        The opposite of \b: the match must not start    \Bge    together
          or end at a word boundary.                              forge
\d        Matches any decimal digit; equivalent to the    \d      123
          set class [0-9].                                        see1
\D        Matches any non-digit character; equivalent     \D      seeks
          to the set class [^0-9].                                seek1
\s        Matches any whitespace character.               \s      see ks
                                                                  a bc a
\S        Matches any non-whitespace character.           \S      a bd
                                                                  abcd
\w        Matches any alphanumeric character or           \w      123
          underscore; equivalent to the class                     seeKs4
          [a-zA-Z0-9_].
\W        Matches any non-alphanumeric character.         \W      >$
                                                                  see<>
\Z        Matches if the string ends with the given       ab\Z    abcdab
          pattern.                                                abababab
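A minimal sketch of \A with the two example strings from the table. The third string, "seeks for", is an added counterexample that is not in the table:

```python
import re

results = {}
for txt in ("for seeks", "for the world", "seeks for"):
    # \A anchors the pattern to the very start of the string
    results[txt] = re.search(r"\Afor", txt)
    print(repr(txt), results[txt])
```

The first two strings match at span (0, 3); the third returns None because "for" is not at the start.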
import re
txt = "set"
# "et" must be followed by a word boundary (here: the end of the string)
match = re.search(r"et\b", txt)
print(match)
# Output: <re.Match object; span=(1, 3), match='et'>
\B: It is the opposite of \b
import re
txt = "together"
match = re.search(r"\Bge", txt)
print(match) # Output: <re.Match object; span=(2, 4), match='ge'>
txt = "forge"
match = re.search(r"ge\B", txt)
print(match) # Output: None
txt = "123"
match = re.search(r"\d", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='1'>
txt = "see1"
match = re.search(r"\d", txt)
print(match) # Output: <re.Match object; span=(3, 4), match='1'>
txt = "seeks"
match = re.search(r"\D", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='s'>
txt = "seek1"
match = re.search(r"\D", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='s'>
\s: Matches any whitespace character
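A minimal sketch of \s, using the two example strings given for it in the table of special sequences:

```python
import re

spans = {}
for txt in ("see ks", "a bc a"):
    match = re.search(r"\s", txt)   # first whitespace character
    spans[txt] = match.span()
    print(repr(txt), spans[txt])
```

search() stops at the first whitespace character, so "a bc a" reports only the space at index 1.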
\S: Matches any non-whitespace character
import re
txt = "abcd"
match = re.search(r"\S", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='a'>
txt = "123"
match = re.search(r"\w", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='1'>
txt = "seeKs4"
match = re.search(r"\w", txt)
print(match) # Output: <re.Match object; span=(0, 1), match='s'>
txt = "see<>"
match = re.search(r"\W", txt)
print(match) # Output: <re.Match object; span=(3, 4), match='<'>
txt = "abcdab"
match = re.search(r"ab\Z", txt)
print(match) # Output: <re.Match object; span=(4, 6), match='ab'>
txt = "abababab"
match = re.search(r"ab\Z", txt)
print(match) # Output: <re.Match object; span=(6, 8), match='ab'>
Sets for character matching
A set is a group of characters enclosed in '[]' brackets. A set
matches a single character out of the characters specified between the
brackets. Below is the list of sets, together with the quantifiers and
special sequences that are often combined with them:
Set          Description
{n,}         Quantifies the preceding character or group and matches at least n occurrences.
*            Quantifies the preceding character or group and matches zero or more occurrences.
[0123]       Matches any of the specified digits (0, 1, 2, or 3).
[^arn]       Matches any character EXCEPT a, r, and n.
\d           Matches any digit (0-9).
[0-5][0-9]   Matches any two-digit number from 00 to 59.
\w           Matches any alphanumeric character (a-z, A-Z, 0-9, or _).
[a-n]        Matches any lower case letter between a and n.
\D           Matches any non-digit character.
[arn]        Matches where one of the specified characters (a, r, or n) is present.
[a-zA-Z]     Matches any letter between a and z, lower case OR upper case.
[0-9]        Matches any digit between 0 and 9.
import re
txt = "Hellooo"
pattern = r'o{2,}'
matches = re.findall(pattern, txt)
print(matches) # Output: ['ooo']
txt = "1234"
pattern = r'[0123]'
matches = re.findall(pattern, txt)
print(matches) # Output: ['1', '2', '3']
txt = "garden"
pattern = r'[^arn]'
matches = re.findall(pattern, txt)
print(matches) # Output: ['g', 'd', 'e']
txt = "a1b2c3"
pattern = r'\d'
matches = re.findall(pattern, txt)
print(matches) # Output: ['1', '2', '3']
txt = "a1b2c3"
pattern = r'\D'
matches = re.findall(pattern, txt)
print(matches) # Output: ['a', 'b', 'c']
[arn]: Matches where one of the specified characters (a, r, or n) are present
import re
txt = "garden"
pattern = r'[arn]'
matches = re.findall(pattern, txt)
print(matches) # Output: ['a', 'r', 'n']
txt = "Hello123"
pattern = r'[a-zA-Z]'
matches = re.findall(pattern, txt)
print(matches) # Output: ['H', 'e', 'l', 'l', 'o']
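The examples above do not cover every row of the table. A minimal sketch of the remaining entries follows; the strings "11:45 and 08:75" and "a_1!" are made up for illustration, while the others reuse strings from the examples above:

```python
import re

minutes = re.findall(r"[0-5][0-9]", "11:45 and 08:75")  # two digits, 00 to 59
low_letters = re.findall(r"[a-n]", "garden")            # lower case a to n
digits = re.findall(r"[0-9]", "Hello123")               # any digit
word_chars = re.findall(r"\w", "a_1!")                  # letters, digits, underscore
star = re.findall(r"lo*", "Hellooo")                    # "l" followed by zero or more "o"

for result in (minutes, low_letters, digits, word_chars, star):
    print(result)
```

In the first call "75" is skipped because its first digit is outside 0-5, and in the last call the first "l" matches on its own because * also allows zero occurrences.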