CMP3008 LN4 RegularExpressions
Formal Languages and Automata Theory
Lecture Notes 4
Regular Expressions
Sources
https://eecs.wsu.edu/~ananth/CptS317/Lectures/index.htm
"Introduction to automata theory, languages and computation" by JE Hopcroft, R Motwani and JD Ullman.
"An Introduction to Formal Languages and Automata Theory" by Peter Linz
Content
• Regular Expressions
• Precedence of Operators
• Formal Definition
• Examples
• Further Properties
• Equivalence with Finite Automata
• Generalized Nondeterministic Finite Automata
Regular Expressions
• An arithmetic expression such as (5 + 3) × 4 has a number as its value, here 32. In the same way, the value of a regular expression such as (0 ∪ 1)0* is a language.
• In this case, the value is the language consisting of all strings that start with a 0 or a 1, followed by any number of 0s.
• The symbols 0 and 1 are shorthand for the sets {0} and {1}, so (0 ∪ 1) means ({0} ∪ {1}). The value of this part is the language {0,1}.
• The part 0* means {0}*; its value is the language of all strings made up of any number of 0s, including the empty string.
• (0 ∪ 1)0* is shorthand for (0 ∪ 1) ◦ 0*, where ◦ denotes concatenation.
Regular Expressions vs. Finite Automata
• Regular expressions offer a declarative way to express the pattern of any string we want to accept
• E.g., 01* + 10*
[Figure: regular expressions and the regular languages among the formal language classes]
Language Operators
• Union of two languages:
• L ∪ M = all strings that are in L, in M, or in both
• Note: A union of two languages produces a third language
Kleene Closure (the * operator)
• Kleene closure of a given language L:
• L⁰ = {ε}
• L¹ = {w | w ∈ L}
• L² = {w₁w₂ | w₁ ∈ L, w₂ ∈ L (duplicates allowed)}
• Lⁱ = {w₁w₂…wᵢ | every wⱼ is chosen from L (duplicates allowed)}
• (Note: the choice of each wⱼ is independent)
• L* = the union of Lⁱ over all i ≥ 0 (arbitrary number of concatenations)
• Here "i" is the number of strings from the parent language L that are concatenated to produce a string in Lⁱ
Example:
• Let L = { 1, 00}
• L⁰ = {ε}
• L¹ = {1, 00}
• L² = {11, 100, 001, 0000}
• L³ = {111, 1100, 1001, 10000, 000000, 00001, 00100, 0011}
• L* = L⁰ ∪ L¹ ∪ L² ∪ …
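The powers of L in the example above can be reproduced mechanically. The following is a minimal sketch, assuming languages are represented as Python sets of strings; the helper names `power` and `star_up_to` are illustrative and not part of the lecture. Since L* is infinite, only a finite prefix of the union can be computed.

```python
from itertools import product

def power(language, i):
    """Return L^i: all concatenations of exactly i strings drawn from `language`.

    Each of the i choices is independent, so duplicates are allowed.
    For i = 0 the only result is the empty string (epsilon).
    """
    return {"".join(choice) for choice in product(language, repeat=i)}

def star_up_to(language, n):
    """Return L^0 ∪ L^1 ∪ ... ∪ L^n, a finite approximation of L*."""
    result = set()
    for i in range(n + 1):
        result |= power(language, i)
    return result

L = {"1", "00"}
L2 = power(L, 2)
L3 = power(L, 3)
prefix_of_star = star_up_to(L, 3)
```

Here `power(L, 0)` is the set containing only the empty string, which is why L⁰ is {ε} rather than the empty set.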
Example: how to use these regular expression properties and language operators?
• L = { w | w is a binary string which does not contain two consecutive 0s or two consecutive 1s anywhere }
• E.g., w = 01010101 is in L, while w = 10010 is not in L
• Goal: Build a regular expression for L
• Four cases for w:
• Case A: w starts with 0 and |w| is even
• Case B: w starts with 1 and |w| is even
• Case C: w starts with 0 and |w| is odd
• Case D: w starts with 1 and |w| is odd
• Regular expressions for the four cases:
• Case A: (01)*
• Case B: (10)*
• Case C: 0(10)*
• Case D: 1(01)*
• Since L is the union of all 4 cases:
• Reg Exp for L = (01)* + (10)* + 0(10)* + 1(01)*
• If we introduce ε, the regular expression can be simplified to:
• Reg Exp for L = (ε+1)(01)*(ε+0)
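One way to gain confidence in the simplification is to compare both expressions against the definition of L on all short strings. The sketch below assumes Python's `re` syntax, where union is written `|` and (ε+1) becomes `1?`; the function names are illustrative. Agreement up to a bounded length is evidence, not a proof.

```python
import re
from itertools import product

# The four-case expression and the simplified one, in Python regex syntax.
FOUR_CASES = re.compile(r"(01)*|(10)*|0(10)*|1(01)*")
SIMPLIFIED = re.compile(r"1?(01)*0?")  # (ε+1)(01)*(ε+0)

def in_L(w):
    """Direct definition of L: no two adjacent symbols are equal."""
    return all(a != b for a, b in zip(w, w[1:]))

def all_agree(max_len=10):
    """Check both expressions against in_L on every binary string up to max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            w = "".join(chars)
            expected = in_L(w)
            if bool(FOUR_CASES.fullmatch(w)) != expected:
                return False
            if bool(SIMPLIFIED.fullmatch(w)) != expected:
                return False
    return True
```

`fullmatch` is used so that the whole string must be described by the expression, which matches the language-membership reading used in the notes.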
Examples
• The value of the expression (0 ∪ 1)* is the language consisting of all possible strings of 0s and 1s.
• If Σ = {0,1}, we can write Σ as shorthand for the regular expression (0 ∪ 1).
• Example (precedence: star binds tightest, then concatenation, then union):
• 01* + 1 = ( 0 . ((1)*) ) + 1
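The grouping in this example follows the usual precedence: star first, then concatenation, then union. Python's `re` module uses the same precedence, so the following sketch can compare the implicit and fully parenthesized forms on all short binary strings. It writes union as `|`; the helper `language` is an illustrative name, not part of the lecture.

```python
import re
from itertools import product

def language(pattern, max_len=5):
    """All binary strings up to max_len that are fully matched by `pattern`."""
    rx = re.compile(pattern)
    words = ("".join(c) for n in range(max_len + 1) for c in product("01", repeat=n))
    return {w for w in words if rx.fullmatch(w)}

implicit = language(r"01*|1")          # 01* + 1
explicit = language(r"(0((1)*))|1")    # ( 0 . ((1)*) ) + 1
wrong_grouping = language(r"(01)*|1")  # what we would get if concatenation bound tighter than star

same = implicit == explicit
differs = implicit != wrong_grouping
```

The third pattern is included only to show that a different grouping really does describe a different language.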
Algebraic Laws of Regular Expressions
• Commutative:
• E+F = F+E
• Associative:
• (E+F)+G = E+(F+G)
• (EF)G = E(FG)
• Identity:
• E+Φ = E
• εE = Eε = E
• Annihilator:
• ΦE = EΦ = Φ
Algebraic Laws…
• Distributive:
• E(F+G) = EF + EG
• (F+G)E = FE+GE
• Idempotent: E + E = E
• Involving Kleene closures:
• (E*)* = E*
• Φ* = ε
• ε* = ε
• E⁺ = EE*
• E? = ε + E
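These laws can be spot-checked on concrete languages. The sketch below assumes languages are Python sets of strings, with Φ as the empty set and ε as the set containing the empty string. Since E* is infinite, every language is truncated to strings of length at most `MAX_LEN`, an assumption of this sketch. The sample languages E, F, G are arbitrary, and passing the checks illustrates the laws without proving them.

```python
MAX_LEN = 6  # assumption: languages are compared only on strings up to this length

def concat(a, b):
    """Concatenation of two languages, truncated to MAX_LEN."""
    return {x + y for x in a for y in b if len(x + y) <= MAX_LEN}

def star(a):
    """Kleene closure of a language, truncated to MAX_LEN."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor; keep only unseen strings.
        frontier = concat(frontier, a) - result
        result |= frontier
    return result

PHI, EPS = set(), {""}
E, F, G = {"0", "01"}, {"1"}, {"10", ""}

def laws_hold():
    """Return (law name, holds?) pairs for the sample languages E, F, G."""
    return [
        ("commutative union", E | F == F | E),
        ("associative concatenation", concat(concat(E, F), G) == concat(E, concat(F, G))),
        ("identity for union", E | PHI == E),
        ("identity for concatenation", concat(EPS, E) == E == concat(E, EPS)),
        ("annihilator", concat(PHI, E) == PHI == concat(E, PHI)),
        ("left distributive", concat(E, F | G) == concat(E, F) | concat(E, G)),
        ("right distributive", concat(F | G, E) == concat(F, E) | concat(G, E)),
        ("(E*)* = E*", star(star(E)) == star(E)),
        ("PHI* = eps", star(PHI) == EPS),
        ("eps* = eps", star(EPS) == EPS),
    ]
```

Note that concatenation is not commutative in general, which is why no such law appears in the list.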
Formal Definition
Important Note
• Don’t confuse the regular expressions ε and ∅.
• The expression ε represents the language containing a single
string—namely, the empty string—whereas ∅ represents the
language that doesn’t contain any strings.
+ notation
• For convenience, we let R⁺ be shorthand for RR*. In other words, whereas R* has all strings that are 0 or more concatenations of strings from R, the language R⁺ has all strings that are 1 or more concatenations of strings from R.
• So R⁺ ∪ ε = R*.
• In addition, we let Rᵏ be shorthand for the concatenation of k R's with each other.
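Python's `re` syntax offers the same shorthands: `+` for one or more repetitions and `{k}` for exactly k repetitions. The sketch below checks the identities above on a sample expression R = (ab ∪ b), chosen arbitrarily for illustration, over all strings up to a bounded length. The helper `language` is an illustrative name.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=6):
    """All strings over `alphabet` up to max_len that are fully matched by `pattern`."""
    rx = re.compile(pattern)
    words = ("".join(c) for n in range(max_len + 1) for c in product(alphabet, repeat=n))
    return {w for w in words if rx.fullmatch(w)}

R = "(ab|b)"                       # sample expression, an assumption of this sketch
plus = language(f"{R}+")           # R+
r_rstar = language(f"{R}{R}*")     # RR*
star = language(f"{R}*")           # R*
cube = language(f"{R}{{3}}")       # R^3
r_r_r = language(f"{R}{R}{R}")     # RRR
```

On this sample, `plus == r_rstar`, `plus | {""} == star`, and `cube == r_r_r` all hold for the bounded length used.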
More Examples
Further Properties
An example from programming languages
• How can you define a numerical constant?
• Examples: 72, 13.4, 0.1, -0.3, +.02, -7.
• Not a numerical constant: -., +.-, -9+
• Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, .}
• D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
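The notes do not show a regular expression for this example here, so the following is only a sketch of one common formulation: an optional sign followed by D⁺, or D⁺.D*, or D*.D⁺, written in Python `re` syntax. The pattern and the name `NUMERIC_CONSTANT` are assumptions, not taken from the lecture. The pattern is checked against the examples listed above, reading the final "-7." as a constant with a trailing decimal point.

```python
import re

# Assumed form: (+ ∪ - ∪ ε)(D+ ∪ D+.D* ∪ D*.D+), with D = [0-9].
NUMERIC_CONSTANT = re.compile(r"[+-]?([0-9]+|[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)")

def is_numeric_constant(s):
    """True if the whole string s is described by the assumed expression."""
    return NUMERIC_CONSTANT.fullmatch(s) is not None

accepted = ["72", "13.4", "0.1", "-0.3", "+.02", "-7."]
rejected = ["-.", "+.-", "-9+"]
```

Requiring at least one digit on one side of the decimal point is what rules out strings such as "-." while still allowing "+.02" and "-7.".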
Equivalence with Finite Automata
Proof
• We need to show that if a language A is regular, a regular expression describes it.
• Because A is regular, it is accepted by a DFA. We describe a procedure for converting DFAs into equivalent regular expressions.
• We break this procedure into two parts, using a new type of finite automaton called a generalized nondeterministic finite automaton, GNFA.
• First we show how to convert DFAs into GNFAs,
• and then GNFAs into regular expressions.
Generalized Nondeterministic Finite Automata
• Generalized nondeterministic finite automata (GNFA) are simply
nondeterministic finite automata wherein the transition arrows may
have any regular expressions as labels, instead of only members of
the alphabet or ε.
• The GNFA moves along a transition arrow connecting two states by
reading a block of symbols from the input, which themselves
constitute a string described by the regular expression on that arrow.
GNFA Example
GNFAs
• For convenience, we require that GNFAs always have a special form
that meets the following conditions.
• The start state has transition arrows going to every other state but no arrows
coming in from any other state.
• There is only a single accept state, and it has arrows coming in from every
other state but no arrows going to any other state. Furthermore, the accept
state is not the same as the start state.
• Except for the start and accept states, one arrow goes from every state to
every other state and also from each state to itself.
We can easily convert a DFA into a GNFA in the special form.
• We simply add a new start state with an ε arrow to the old start state and a
new accept state with ε arrows from the old accept states.
• If any arrows have multiple labels (or if there are multiple arrows going
between the same two states in the same direction), we replace each with
a single arrow whose label is the union of the previous labels.
• Finally, we add arrows labeled ∅ between states that had no arrows. This
last step won’t change the language recognized because a transition
labeled with ∅ can never be used.
• From here on we assume that all GNFAs are in the special form.
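The conversion steps above can be sketched in code. This is a minimal illustration, assuming a DFA given as a dictionary `delta` from (state, symbol) to state, and a GNFA given as a dictionary `labels` from (q_i, q_j) to a regular expression string. The names `dfa_to_gnfa`, `q_start`, and `q_accept` are hypothetical, and `None` stands for the label ∅.

```python
EPS = "ε"     # label for the empty string
EMPTY = None  # stands for the label ∅ (such an arrow can never be used)

def dfa_to_gnfa(states, delta, start, accepts):
    """Return (gnfa_states, labels, new_start, new_accept) in the special form."""
    # Hypothetical names for the new states; assumed not to clash with `states`.
    new_start, new_accept = "q_start", "q_accept"
    labels = {}
    # Parallel arrows between the same two states become one arrow
    # whose label is the union of the previous labels.
    for (q, symbol), target in delta.items():
        old = labels.get((q, target))
        labels[(q, target)] = symbol if old is None else f"{old}∪{symbol}"
    # New start state with an ε arrow to the old start state.
    labels[(new_start, start)] = EPS
    # New accept state with ε arrows from the old accept states.
    for q in accepts:
        labels[(q, new_accept)] = EPS
    # Add ∅ arrows wherever the special form requires an arrow but none exists.
    for qi in [new_start, *states]:
        for qj in [*states, new_accept]:
            labels.setdefault((qi, qj), EMPTY)
    return [new_start, *states, new_accept], labels, new_start, new_accept

# A small hypothetical DFA over {a, b}: q1 is the start state, q2 accepts.
delta = {("q1", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "a"): "q2", ("q2", "b"): "q2"}
gnfa_states, labels, s, t = dfa_to_gnfa(["q1", "q2"], delta, "q1", ["q2"])
```

No arrows are created into the new start state or out of the new accept state, which is what the special form requires.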
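• After converting the DFA into an equivalent GNFA, we need to convert the GNFA into a regular expression. To do so, we remove states from the GNFA one at a time until only two remain (the start and accept states). The label on the single remaining arrow is then the equivalent regular expression.
• Let k be the number of states of the GNFA. The crucial step is constructing an equivalent GNFA with one fewer state when k > 2.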
• We do so by selecting a state, ripping it out of the machine, and
repairing the remainder so that the same language is still recognized.
• Any state will do, provided that it is not the start or accept state.
• We are guaranteed that such a state will exist because k > 2.
• Let's call the removed state q_rip.
• After removing q_rip we repair the machine by altering the regular expressions that label each of the remaining arrows.
• The new labels compensate for the absence of q_rip by adding back the lost computations.
• The new label going from a state q_i to a state q_j is a regular expression that describes all strings that would take the machine from q_i to q_j either directly or via q_rip.
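The repair step can be made concrete with the standard relabeling rule: if the arrow from q_i to q_rip is labeled R1, the loop on q_rip is R2, the arrow from q_rip to q_j is R3, and the direct arrow from q_i to q_j is R4, then the new label is (R1)(R2)*(R3) ∪ (R4). The sketch below assumes a GNFA stored as a dictionary from state pairs to regular expression strings, with `None` standing for ∅. The function names and the sample GNFA are illustrative. It applies the simplifications ∅E = ∅, ∅* = ε and E ∪ ∅ = E instead of writing ∅ into labels.

```python
EMPTY = None  # stands for the label ∅

def rip(states, labels, q_rip):
    """Remove q_rip and return (remaining_states, repaired_labels)."""
    rest = [q for q in states if q != q_rip]
    r2 = labels.get((q_rip, q_rip))
    new_labels = {}
    for (qi, qj), r4 in labels.items():
        if q_rip in (qi, qj):
            continue  # arrows touching q_rip disappear with it
        r1 = labels.get((qi, q_rip))
        r3 = labels.get((q_rip, qj))
        if r1 is EMPTY or r3 is EMPTY:
            via = EMPTY  # no path from qi to qj through q_rip
        else:
            loop = "" if r2 is EMPTY else f"({r2})*"  # ∅* = ε contributes nothing
            via = f"({r1}){loop}({r3})"
        if via is EMPTY:
            new_labels[(qi, qj)] = r4
        elif r4 is EMPTY:
            new_labels[(qi, qj)] = via
        else:
            new_labels[(qi, qj)] = f"{via}∪({r4})"
    return rest, new_labels

def gnfa_to_regex(states, labels, start, accept):
    """Rip out states until only start and accept remain; return the final label."""
    while len(states) > 2:
        q_rip = next(q for q in states if q not in (start, accept))
        states, labels = rip(states, labels, q_rip)
    return labels[(start, accept)]

# Sample GNFA in the special form (hypothetical): s is the start, t the accept state.
gnfa = {("s", "q1"): "ε", ("s", "q2"): EMPTY, ("s", "t"): EMPTY,
        ("q1", "q1"): "a", ("q1", "q2"): "b", ("q1", "t"): EMPTY,
        ("q2", "q1"): EMPTY, ("q2", "q2"): "a∪b", ("q2", "t"): "ε"}
regex = gnfa_to_regex(["s", "q1", "q2", "t"], gnfa, "s", "t")
```

The resulting expression is not minimal: it keeps redundant ε factors and parentheses, since the sketch only removes ∅. For this sample it is equivalent to a*b(a ∪ b)*.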
Constructing an equivalent GNFA with one fewer state
Conversion from GNFA to Regular Expression
• The stages in converting a DFA with three states to an equivalent
regular expression are shown in the following figure.
An Example Conversion
Another Example Conversion