Day 3 - Regexps


Regexps

Yared Y.
• Lexers are normally constructed by lexer generators,
that transform human-readable specifications of
tokens and white-space into efficient programs.
• Specifications are traditionally written using regular
expressions: An algebraic notation for describing sets
of strings.
• The generated lexers are in a class of extremely simple
programs called finite automata.
Formal language elements
• A formal language is one that can be specified precisely
and is amenable for use with computers, whereas a
natural language is one which is normally spoken by
people.
• The syntax of Java is an example of a formal language
Symbols, Alphabets, Strings,
Language
• Let us assume that the source program is stored in a file. It
consists of a sequence of characters. Lexical analysis, i.e., the
scanner, reads this sequence from left to right and decomposes
it into a sequence of lexical units, called symbols.

• A symbol is any character or mark that represents something.
It could be a letter like 'a', a digit like '1', or any other
individual character like '@' or '#'.
• Example: In the English alphabet, 'a', 'b', 'c' are symbols.
In binary, '0' and '1' are symbols.
• These symbols are the basic building blocks from which
strings (or words) are formed.
Alphabet (∑) - sigma
• An alphabet is a finite set of symbols.
It’s a collection of characters that you can use to build
strings.

Example: The English alphabet is {a, b, c, ..., z}.


The binary alphabet is {0, 1}.
Examples
• Binary Alphabet
Σ = {0, 1}
Used in binary strings, which are fundamental in computer science.
• English Alphabet
Σ = {a, b, c, ..., z}
Used in natural language processing and text analysis.
• DNA Alphabet
Σ = {A, T, C, G}
Used in bioinformatics to represent DNA sequences.
• Numerical Alphabet
Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Used in numerical strings and arithmetic expressions.
String (or word)
• When symbols from the alphabet are combined in a
sequence, they form a string.
• A string is a finite sequence of symbols from an
alphabet. The order of the symbols matters.

• Example:
Using the alphabet {a, b, c}, the strings
"abc", "bac", and "aa" are all different strings.
In the binary alphabet {0, 1}, "101" and "011" are
different strings.
• For example, if Σ = {a, b, c}, then
"abc",
"aab", and
"cba" are strings over Σ.

• Empty String (ε)
The empty string is a string with no symbols.
It is often denoted by ε or λ.
Language (L)
• A language is a set of strings formed from the symbols of an
alphabet. It’s essentially a collection of possible sequences that you
can create with the given alphabet.

Alphabet: {0, 1}
Language: The set of strings that contain exactly one '1'.
The language is infinite; it includes strings such as
"1", "01", "10", "001", "100", and "010".

• Each of these strings has exactly one '1', which fits the definition of
the language.
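Because a language like this one is infinite, code usually represents it as a membership test rather than a listed set. The following is a minimal Python sketch of that idea; the helper names `in_language` and `strings_up_to` are hypothetical and are not part of the slides.

```python
from itertools import product

SIGMA = ("0", "1")  # the binary alphabet from the slide


def in_language(s: str) -> bool:
    """True if s is a string over SIGMA that contains exactly one '1'."""
    return all(ch in SIGMA for ch in s) and s.count("1") == 1


def strings_up_to(max_len: int):
    """Yield every string over SIGMA of length 0..max_len, shortest first."""
    for k in range(max_len + 1):
        for symbols in product(SIGMA, repeat=k):
            yield "".join(symbols)


# Members of the language with length at most 3.
members = [s for s in strings_up_to(3) if in_language(s)]
print(members)
```

Filtering a bounded enumeration only shows a finite slice of the language; the predicate itself is what defines membership.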
Example:
If you have the alphabet {0, 1}, a language might be all
the strings that contain exactly two '1's, such as "11",
"101", "011", and "110".
Another language could be all valid English words using
the alphabet {a, b, c, ..., z}.
Examples
• A (formal) language is a set of strings from a given
alphabet.

The following are examples of languages over the
alphabet {0,1}:
1. {0, 10, 1011}
2. { } (the empty language)
3. {ε, 0, 00, 000, 0000, …}
4. The set of all strings of zeros and ones having an
even number of ones
• A language L over an alphabet Σ is a set of strings
composed of symbols from Σ.
• For example,
L = {"a", "ab", "abc"} is a language over the alphabet
Σ = {a, b, c}.
Summary
• Symbol: The basic unit
• Alphabet: A set of symbols
• String: A sequence of symbols
• Set: A collection of distinct items
• Language: A set of strings
Language elements
• A string is a list of characters from a given alphabet.
• The elements of a string need not be unique, and the
order in which they are listed is important. For example,
“abc” and “cba” are different strings, as are “abb” and
“ab”.
• The string which consists of no characters is still a string
(of characters from the given alphabet), and we call it
the null string and designate it by Ɛ.
Regular Expressions and Finite
Automata
• For lexical analysis, specifications are traditionally written using
regular expressions: An algebraic notation for describing sets of
strings. The generated lexers are in a class of extremely simple
programs called finite automata.
• To identify the tokens, we need some method of describing the possible
tokens that can appear in the input stream.
• For this purpose, we introduce regular expressions, a notation that can
be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some
mechanism to recognize these in the input stream. This is done by the
token recognizers, which are designed using transition diagrams and
finite automata.
Regular expressions (regex)
• A regex is a sequence of characters that defines a
search pattern. This pattern can be used to match
strings or parts of strings.
• Regexes are used to define the patterns for various types
of tokens.

Example:
A keyword like `if` can be matched with the regex `if`.
An identifier (like variable names) can be matched with
a regex like `[a-zA-Z_][a-zA-Z0-9_]*`.
An integer literal can be matched with a regex like `[0-
• The set of all integer constants or the set of all variable
names are examples of sets of strings, where the
individual digits or letters used to form these constants
or names are taken from a particular alphabet, i.e., a
set of characters.
• A set of strings is called a language.
• For integers, the alphabet consists of the digits 0–9 and
for variable names the alphabet contains both letters
and digits (and perhaps a few other characters, such as
underscore).
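To make the token patterns concrete, here is a small Python sketch that uses the keyword and identifier regexes quoted on the slide. The `classify` function and its token names are hypothetical, introduced only for illustration; a real lexer generator would build a finite automaton instead of trying patterns one by one.

```python
import re

# Patterns taken from the slide text.
KEYWORD_IF = re.compile(r"if")
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def classify(lexeme: str) -> str:
    """Classify a whole lexeme; the keyword is checked before identifiers."""
    if KEYWORD_IF.fullmatch(lexeme):
        return "KEYWORD"
    if IDENTIFIER.fullmatch(lexeme):
        return "IDENTIFIER"
    return "UNKNOWN"


for lexeme in ["if", "iffy", "_count1", "9lives"]:
    print(lexeme, "->", classify(lexeme))
```

`fullmatch` is used so that the entire lexeme must belong to the language of the regex; checking the keyword first matters because `if` also matches the identifier pattern.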
Alphabets
• Any finite set of symbols. Typical examples of symbols
are letters, digits, and punctuation.

∑ = {0,1} is the binary alphabet,
∑ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet,
∑ = {a-z, A-Z} is the alphabet of English letters.


• An alphabet is a finite, non-empty set of input symbols.
• Σ = {0, 1} is the binary alphabet.
• A string is a finite sequence of symbols from an alphabet.
• W = {0, 1, 00, 01, 10, 11, 001, 010, … }
• W is the set of non-empty strings over the binary alphabet Σ.
• A language (L) is a set of strings; here, the set of strings accepted by a finite automaton.
• L = {0^n 1 | n ≥ 0}
• The length of a string is the number of symbols in it. It is written
with the | | operator.
• Let ω = 0101
• |ω| = 4
• The empty string contains zero symbols. It is represented by ε.
• Concatenation of two strings p and q is denoted by pq.
• Let p = 010
• And q = 001
• pq = 010001
• qp = 001010
• i.e., pq ≠ qp
• Prefix: a prefix of a string s is obtained by removing zero or more symbols from the end of s.
• (e.g.) s = balloon
• Possible prefixes include: ball, balloon
• Suffix: a suffix of a string s is obtained by removing zero or more symbols from the beginning
of s.
• (e.g.) s = balloon
• Possible suffixes include: loon, balloon
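The length and concatenation definitions above map directly onto built-in string operations. The following is a minimal Python sketch using the slide's own values; the variable name `epsilon` is just an illustrative stand-in for ε.

```python
w = "0101"
p = "010"
q = "001"
epsilon = ""  # the empty string

print(len(w))        # the length |w|
print(p + q, q + p)  # the concatenations pq and qp
print(p + q == q + p)  # concatenation is not commutative in general

# Prefix and suffix checks for the "balloon" example.
s = "balloon"
print(s.startswith("ball"), s.endswith("loon"))
```

Concatenating with the empty string leaves a string unchanged, so ε plays the role of an identity element for concatenation.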
Terms for parts of Strings
• A prefix of string s is any string obtained by removing
zero or more symbols from the end of s. For example,
ban, banana, and ε are prefixes of banana.
• Let s = abcd; its prefixes are a, ab, abc, abcd, and ε.
• A suffix of string s is any string obtained by removing
zero or more symbols from the beginning of s. For
example, nana, banana, and ε are suffixes of banana.
• A substring of s is obtained by deleting any prefix and
any suffix from s. For instance, banana, nan, and ε are
substrings of banana.
• The proper prefixes, suffixes, and substrings of a string
s are those prefixes, suffixes, and substrings,
respectively, of s that are neither ε nor equal to s itself.
• A subsequence of s is any string formed by deleting
zero or more not necessarily consecutive positions of s.
For example, baan is a subsequence of banana.
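These definitions can be checked mechanically. Below is a short Python sketch that enumerates prefixes, suffixes, and substrings by slicing and tests the subsequence relation; all function names are hypothetical helpers written for this illustration.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]


def substrings(s):
    """All contiguous pieces of s, including the empty string and s."""
    n = len(s)
    return {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def proper(parts, s):
    """Drop the empty string and s itself, per the definition of 'proper'."""
    return [t for t in parts if t != "" and t != s]


def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # Each 'ch in remaining' consumes the iterator up to the first match.
    return all(ch in remaining for ch in t)


print(prefixes("abcd"))
print(proper(suffixes("banana"), "banana"))
print("nan" in substrings("banana"), is_subsequence("baan", "banana"))
```

Note that "baan" is a subsequence of "banana" but not a substring, since its symbols are not contiguous in the original string.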
Strings
• Any finite sequence of symbols from an alphabet is called a string.
• The length of a string is the total number of symbol
occurrences in it, e.g., the length of the string ethiopia is 8
and is denoted by |ethiopia| = 8.
• A string having no symbols, i.e., a string of zero length,
is known as an empty string and is denoted by ε
(epsilon).
• Example: 01101, 111, 0001, … are strings over the
binary alphabet ∑ = { 0, 1 }.
• A language is any countable set of strings over some
fixed alphabet.
Properties of a String
• Length of a String
The length of a string is the number of symbols it
contains.
For example, the length of the string "abc" is 3.
The length of the empty string ε is 0.
• Concatenation of Strings
Concatenation is the operation of joining two strings
end-to-end.

If w1 and w2 are strings, their concatenation is
denoted by w1 ⋅ w2 or simply w1w2.
For example, if w1 = "abc" and w2 = "def", then
w1 ⋅ w2 = "abcdef".
• Substrings
A substring of a string w is a contiguous sequence of
symbols within w.
For example, if w = "abcdef", then "bcd" and "def"
are substrings of w.
• Prefixes and Suffixes
A prefix of a string w is a substring that appears at
the beginning of w.
A suffix of a string w is a substring that appears at
the end of w.
For example, if w = "abcdef", then "abc" is a prefix
and "def" is a suffix of w.
The power of an alphabet
• If Σ is an alphabet, we can express the set of all strings of a
certain length from that alphabet by using an exponential
notation.
We define Σ^k to be the set of strings of length k, each of
whose symbols is in Σ.
Σ^k denotes the set of words of length k, for k ≥ 0.
• Σ^0 = {ε}, regardless of what alphabet Σ is.
• If Σ = {0,1}, then
Σ^1 = {0, 1},
Σ^2 = {00, 01, 10, 11},
Σ^3 = {000, 001, 010, 011, 100, 101, 110, 111}
• The set of all strings over an alphabet Σ is
conventionally denoted Σ*.
Example: {0,1}* = {ε, 0, 1, 00, 01, 10, 11, 000,
…}
Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ …
• The set of nonempty strings from alphabet Σ is
denoted Σ+ (excluding the empty string from the set of
strings)
• Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ …
• Σ* = Σ+ ∪ {ε}.
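The powers of an alphabet can be generated with a Cartesian product, and the infinite sets Σ* and Σ+ can be modelled as lazy generators. This is a minimal Python sketch under that assumption; `power`, `sigma_star`, and `sigma_plus` are hypothetical names chosen for this example.

```python
from itertools import count, islice, product


def power(sigma, k):
    """The strings of length k over sigma, in the order product() yields them."""
    return ["".join(symbols) for symbols in product(sigma, repeat=k)]


def sigma_star(sigma):
    """Lazily yield the union of all powers of sigma, starting with length 0."""
    for k in count(0):
        yield from power(sigma, k)


def sigma_plus(sigma):
    """Like sigma_star, but starting at length 1, so the empty string is excluded."""
    for k in count(1):
        yield from power(sigma, k)


binary = ("0", "1")
print(power(binary, 0))  # the single empty string
print(power(binary, 2))
print(list(islice(sigma_star(binary), 7)))
print(list(islice(sigma_plus(binary), 6)))
```

The generators never terminate on their own, so `islice` is used to take a finite number of strings from them.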
With Examples
• Symbols: 0, 1, 2, a, b, c, d, ….
• Alphabets: {0,1}, {a, b,c}, ….
• Strings: a, b, 0, 1, aa, bb, ab, 01, 11, abedfg,…
• Language:
E.g., over the alphabet {0,1},
the set of all strings of length 2 = {00, 11, 10, 01}
Example
Alphabet: Σ = {0, 1} (Binary alphabet)
Set of strings of length 2 (Σ^2):
Σ^2 = {00, 01, 10, 11}
There are 2^2 = 4 strings of length 2.

Alphabet: Σ = {a, b, c}
Set of strings of length 3 (Σ^3):
Σ^3 = {aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa,
bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc,
cca, ccb, ccc}
There are 3^3 = 27 strings of length 3.
How to read it
The notation Σ^3 is read as
"Sigma to the power of three" or
"the set of all strings of length
three over the alphabet Σ."
Cardinality
• Number of elements in a set
E.g., {0,1} has 2 elements and is Σ^1
{00, 01, 10, 11} has 4 elements and is Σ^2
{000, 001, 011, 010, 100, 101, 110, 111} has 8
elements and is Σ^3

For the binary alphabet, Σ^n has 2^n elements.
In general, Σ^n has |Σ|^n elements.
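The counting rule can be confirmed by brute force: each of the n positions can hold any of the |Σ| symbols, so there are |Σ| multiplied by itself n times strings of length n. The sketch below checks this in Python for two small alphabets; `count_strings` is a hypothetical helper.

```python
from itertools import product


def count_strings(sigma, n):
    """Count the strings of length n over sigma by enumerating them."""
    return sum(1 for _ in product(sigma, repeat=n))


for sigma in [("0", "1"), ("a", "b", "c")]:
    for n in range(5):
        # Enumeration and the closed-form count must agree.
        assert count_strings(sigma, n) == len(sigma) ** n
        print(sigma, n, count_strings(sigma, n))
```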
Operations on Alphabets
• Union
Definition: The union of two alphabets Σ_1 and Σ_2 is the set
of symbols that belong to either Σ_1 or Σ_2.
Notation: Σ_1 ∪ Σ_2.
Example: If Σ_1 = {a, b} and Σ_2 = {b, c}, then
Σ_1 ∪ Σ_2 = {a, b, c}.
If Σ_1 = {0, 1, 2} and Σ_2 = {2, 3, 4}, then
Σ_1 ∪ Σ_2 = {0, 1, 2, 3, 4}.
• Intersection
Definition: The intersection of two alphabets Σ_1 and Σ_2
is the set of symbols that belong to both Σ_1 and Σ_2.
Notation: Σ_1 ∩ Σ_2.
Example: If Σ_1 = {a, b} and Σ_2 = {b, c}, then
Σ_1 ∩ Σ_2 = {b}.
• Difference
Definition: The difference of two alphabets Σ_1 and Σ_2 is
the set of symbols that belong to Σ_1 but not to Σ_2.
Notation: Σ_1 \ Σ_2.
Example: If Σ_1 = {a, b} and Σ_2 = {b, c}, then
Σ_1 \ Σ_2 = {a}.
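Since alphabets are plain sets of symbols, these three operations correspond to Python's built-in set operators. A minimal sketch using the slide's examples (the variable names are illustrative):

```python
sigma1 = {"a", "b"}
sigma2 = {"b", "c"}

union = sigma1 | sigma2         # symbols in either alphabet
intersection = sigma1 & sigma2  # symbols in both alphabets
difference = sigma1 - sigma2    # symbols in sigma1 but not in sigma2

# sorted() is used only to print the unordered sets deterministically.
print(sorted(union), sorted(intersection), sorted(difference))

digits_low = {"0", "1", "2"}
digits_high = {"2", "3", "4"}
print(sorted(digits_low | digits_high))
```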
Operation on Languages/Strings
• The various operations on languages are:
• Union of two languages L and M is written as
L ∪ M = {s | s is in L or s is in M}
• If L = {a, b} and M = {c, d}, then L ∪ M = {a, b, c, d}
• Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
• If L = {a, b} and M = {c, d}, then L ⋅ M = {ac, ad, bc, bd}
• The Kleene Closure (Kleene star) of a language L is
written as
L* = the set of strings formed by concatenating zero or
more strings from L.
• The Kleene star operation is used on languages (not
directly on alphabets) to denote the set of all possible
strings (including the empty string) that can be formed
by concatenating zero or more strings from a language.

• Notation: L* (where L is a language)
Example: If L = {a, b}, then L* includes {ε, a, b, aa, ab,
ba, bb, aaa, aab, ...}.
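For finite languages, union and concatenation can be computed directly on Python sets, and the Kleene star can be approximated by stopping after a fixed number of concatenations. The following sketch assumes that bounded approach; `concat`, `lang_power`, and `star_up_to` are hypothetical names.

```python
def concat(L, M):
    """Every string s + t with s taken from L and t taken from M."""
    return {s + t for s in L for t in M}


def lang_power(L, i):
    """L concatenated with itself i times; zero times gives {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def star_up_to(L, n):
    """Union of the powers 0..n of L: a finite approximation of L*."""
    result = set()
    for i in range(n + 1):
        result |= lang_power(L, i)
    return result


L = {"a", "b"}
M = {"c", "d"}
print(sorted(L | M))
print(sorted(concat(L, M)))
print(sorted(star_up_to(L, 2), key=lambda s: (len(s), s)))
```

The true L* is infinite whenever L contains a non-empty string, so any concrete set like the one printed here is only a prefix of it.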
• Kleene Star: L* = {ε} ∪ L ∪ L^2 ∪ L^3 ∪ ...
• Example: Language L: {"a", "b"}
Kleene Star L*: {ε, "a", "b", "aa", "ab", "ba", "bb",
"aaa", ...}
Closure
• Kleene closure (L*)
• Kleene closure refers to zero or more occurrences of input
symbols in a string, i.e., it includes the empty string ε (the set of
strings with 0 or more occurrences of input symbols).

• Positive closure (L+)
• Positive closure indicates one or more occurrences of input
symbols in a string, i.e., it excludes the empty string ε (the set of
strings with 1 or more occurrences of input symbols).
Positive closure
• Positive closure (L+) indicates one or more occurrences of
input symbols in a string, i.e., it excludes the empty string ε.
• L^3 is the set of strings each with length 3 (taking L = Σ).
• (e.g.) Let L = Σ = {a, b}
• L* = {ε, a, b, aa, ab, ba, bb, aab, aba, aaba, … }
• L+ = {a, b, aa, ab, ba, bb, aab, aaba, … }
• L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
Operation on Languages
• The concatenation of languages is all strings formed by
taking a string from the first language and a string from the
second language, in all possible ways, and concatenating
them.
• The (Kleene) closure of a language L, denoted L*, is the set
of strings you get by concatenating L zero or more times.
• Note that L^0, the "concatenation of L zero times," is defined
to be {ε}, and inductively, L^i is L^(i-1) L.
• Finally, the positive closure, denoted L+, is the same as the
Kleene closure, but without the term L^0. That is, ε will not be
in L+ unless it is in L itself.
Example:
• Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D
be the set of digits {0, 1, ..., 9}.
• L U D is the set of letters and digits - strictly speaking
the language with 62 strings of length one, each of
which strings is either one letter or one digit.
• LD is the set of 520 strings of length two, each
consisting of one letter followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the
empty string.
• L ( L U D)* is the set of all strings of letters and digits
beginning with a letter.
• D+ is the set of all strings of one or more digits
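The counts in this example can be verified by building the finite sets explicitly, and the two infinite languages can be expressed as regular expressions. This Python sketch is an illustration only; the regex spellings `[A-Za-z][A-Za-z0-9]*` and `[0-9]+` are this example's own rendering of L(L ∪ D)* and D+, not text from the slides.

```python
import re
import string

L = set(string.ascii_letters)  # the 52 upper- and lower-case letters
D = set(string.digits)         # the 10 decimal digits

LD = {letter + digit for letter in L for digit in D}
print(len(L | D), len(LD))

# L(L ∪ D)*: a letter followed by any number of letters or digits.
LETTER_THEN_ALNUM = re.compile(r"[A-Za-z][A-Za-z0-9]*")
# D+: one or more digits.
DIGITS = re.compile(r"[0-9]+")

for s in ["x", "x27", "27x", "2024", ""]:
    print(repr(s),
          LETTER_THEN_ALNUM.fullmatch(s) is not None,
          DIGITS.fullmatch(s) is not None)
```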
Example
• A = {pq, r}
B = {t, uv}
• UNION:
A U B = {pq, r, t, uv}
• Concatenation:
AB = {pqt, pquv, rt, ruv}
• Star
A* = {ε, pq, r, pqr, rpq, pqpq, rr, pqpqpq, rrr, …}
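Membership in A* can be decided without enumerating the infinite set: a string belongs to A* exactly when it can be split into pieces that are all in A. The sketch below uses a simple dynamic-programming pass over the positions of the string; `in_star` is a hypothetical helper, not something defined in the slides.

```python
def in_star(word, language):
    """True if word is a concatenation of zero or more strings from language."""
    # reachable[i] is True when word[:i] can be split into pieces from language.
    reachable = [False] * (len(word) + 1)
    reachable[0] = True  # the empty prefix uses zero pieces
    for i in range(len(word)):
        if not reachable[i]:
            continue
        for piece in language:
            # Empty pieces never advance the position, so they are skipped.
            if piece and word.startswith(piece, i):
                reachable[i + len(piece)] = True
    return reachable[len(word)]


A = {"pq", "r"}
B = {"t", "uv"}

print(sorted(A | B))
print(sorted(s + t for s in A for t in B))
for candidate in ["", "pq", "rpq", "pqpqpq", "t", "uv", "p"]:
    print(repr(candidate), in_star(candidate, A))
```

Strings such as "t" or "uv" are rejected because they are built from B, not from A; only concatenations of "pq" and "r" (and the empty string) are accepted.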
Precedence of Operators
• The unary operator (*) has the highest precedence.
• Concatenation has the second highest precedence and is left
associative.
• The union operator (| or ∪, also written +) has the lowest
precedence and is left associative.
• These precedence rules determine how a regular expression
is parsed before it is transformed into a finite automaton
when implementing a lexical analyzer.
• For each of the following regexps, list 6 strings which
are in its language.
1. (a(b + c)*)* d
Examples: d, ad, abd, acd, aad, abbcbd
2. (a + b)*(c + d)
Examples: c, d, ac, abd, babc, bad
3. (a*b*)*
Examples: ε, a, b, ab, ba, aa

Note that (a*b*)* == (a + b)*
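The exercise answers can be checked with Python's `re` module, which writes union as `|` where the slides write `+`. The sketch below tests the listed example strings and compares two patterns on every string up to a fixed length; `all_match` and `same_language_up_to` are hypothetical helpers, and the bounded comparison is evidence of equivalence, not a proof.

```python
import re
from itertools import product

# The slides write union as "+"; Python's re module writes it as "|".
EXERCISES = {
    r"(a(b|c)*)*d": ["d", "ad", "abd", "acd", "aad", "abbcbd"],
    r"(a|b)*(c|d)": ["c", "d", "ac", "abd", "babc", "bad"],
    r"(a*b*)*": ["", "a", "ab", "ba", "aa"],
}


def all_match(pattern, strings):
    """True if every string is matched in full by the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


def same_language_up_to(p1, p2, sigma, max_len):
    """Compare two patterns on every string over sigma of length <= max_len."""
    for k in range(max_len + 1):
        for symbols in product(sigma, repeat=k):
            s = "".join(symbols)
            if (re.fullmatch(p1, s) is None) != (re.fullmatch(p2, s) is None):
                return False
    return True


results = {pattern: all_match(pattern, strings)
           for pattern, strings in EXERCISES.items()}
print(results)
print(same_language_up_to(r"(a*b*)*", r"(a|b)*", "ab", 6))

# Precedence: star binds tighter than concatenation, which binds tighter
# than union, so "ab*" means a(b*) and "a|bc" means a|(bc).
print(re.fullmatch(r"ab*", "abb") is not None)
print(re.fullmatch(r"ab*", "abab") is not None)
print(re.fullmatch(r"a|bc", "bc") is not None)
```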
