Unit 2 Lexical Analyzer
………………………………………………………………………………………………………
Topics
2.1 Lexical Analysis: The role of the lexical analyzer, Specification and Recognition of
tokens, Input Buffer, Finite Automata relevant to compiler construction syntactic
specification of languages, Optimization of DFA based pattern matchers
………………………………………………………………………………………………………
Lexical Analysis
Lexical analysis is the first phase of a compiler, where the lexical analyzer acts as an interface
between the source program and the rest of the phases of the compiler. It reads the input
characters of the source program, groups them into lexemes, and produces a token for each
lexeme. The tokens are then sent to the parser for syntax analysis. Normally a lexical analyzer
does not return a list of tokens all at once; it returns the next token only when the parser asks
for one. A lexical analyzer may also perform auxiliary operations such as stripping comments
and the white space that merely separates tokens.
Figure: Interaction between the lexical analyzer and the parser. The lexical analyzer reads the
source program and returns a token each time the parser issues a get-next-token request; both
phases report errors and interact with the symbol table.
Examples of Non-tokens
Type Examples
Comment // This will compare 2 numbers
Pre-processor directive #include <stdio.h>
Token
A token is a word that describes a lexeme in the source program. It is generated when a lexeme
is matched against a pattern. A token is a logical building block of a language: a sequence of
characters having a collective meaning.
Example 1: Example showing lexeme, token and pattern for variables
Lexeme: A1, Sum, Total
Pattern: Starts with a letter, followed by letters or digits, and is not a keyword.
Token: ID
Example 2: Example showing lexeme, token and pattern for floating number
Lexeme: 123.45
Pattern: Digits, with an optional fraction and an optional exponent.
Token: NUM
Lexemes are sequences of characters in the source program that match the pattern for a token.
There are predefined rules for every lexeme to be identified as a valid token. These rules are
given by means of a pattern, and patterns are usually specified using regular expressions. In a
programming language, keywords, constants, identifiers, strings, numbers, operators and
punctuation symbols can be considered tokens.
A token describes a class of character patterns having the same meaning in the source program,
such as identifiers, operators, keywords, numbers, delimiters and so on. A token may have a
single attribute which holds the required information for that token. For identifiers, this
attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for
that token. The token type and its attribute uniquely identify a lexeme. Regular expressions are
widely used to specify patterns.
Attributes of Tokens
When a token can represent more than one lexeme, the lexical analyzer must provide additional
information about the particular lexeme. This additional information is called the attribute of
the token. For simplicity, a token may have a single attribute which holds the required
information for that token.
Example: the tokens and the associated attribute for the following statement.
A=B*C+2
<id, pointer to symbol table entry for A>
<Assignment operator>
<id, pointer to symbol table entry for B>
<mult_op>
<id, pointer to symbol table entry for C>
<add_op>
<num, integer value 2>
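As an illustrative sketch (not part of the original text), a minimal hand-written scanner can emit such <token, attribute> pairs for the statement above. The token specification and names below are hypothetical:

```python
import re

# Assumed minimal token specification: identifiers, numbers, and operators.
TOKEN_SPEC = [
    ("id",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("num", r"[0-9]+"),
    ("op",  r"[=*+]"),
    ("ws",  r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-type, attribute> pairs, skipping white space."""
    symbol_table = {}
    for m in PATTERN.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "id":
            # The attribute of an identifier is its symbol-table entry.
            index = symbol_table.setdefault(lexeme, len(symbol_table))
            yield ("id", index)
        elif kind == "num":
            yield ("num", int(lexeme))
        else:
            yield ("op", lexeme)

print(list(tokenize("A=B*C+2")))
# [('id', 0), ('op', '='), ('id', 1), ('op', '*'), ('id', 2), ('op', '+'), ('num', 2)]
```

Here the symbol-table "pointer" is simplified to an integer index into a dictionary.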
Input Buffering
Reading the source program character by character from secondary storage is a slow, time
consuming process. It is often necessary to look ahead several characters beyond a lexeme
before a match can be announced. One technique is to read characters one at a time and, if the
pattern does not match, push the look-ahead characters back to the input. This is time
consuming, so a buffering technique is used to eliminate the problem and increase efficiency.
The lexical analyzer scans the input from left to right one character at a time, using two
pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the
input scanned. Initially both pointers point to the first character of the input string, as shown
below:
Figure: Both bp and fp initially point at the first character of the input "int i, j; i = j + 1; j = j + 1;"
The forward pointer (fp) moves ahead to search for the end of the lexeme. As soon as a blank
space is encountered, it indicates the end of the lexeme; in the example above, when fp reaches
the blank after "int", the lexeme "int" is identified. When fp encounters white space it skips over
it, and then both bp and fp are set to the start of the next token. Reading characters one at a
time from secondary storage in this way is costly, hence buffering is used: a block of data is first
read into a buffer and then scanned by the lexical analyzer. There are two methods used in this
context: the one buffer scheme and the two buffer scheme. These are explained below.
1. One Buffer Scheme
In this scheme, only one buffer is used to store the input string. The problem with this scheme
is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the lexeme the
buffer has to be refilled, which overwrites the first part of the lexeme.
Figure: One buffer scheme storing the input string "int i = i + 1"
2. Two Buffer Scheme
To overcome the problem of the one buffer scheme, two buffers are used to store the input
string. The two buffers are scanned alternately: when the end of the current buffer is reached,
the other buffer is filled. The only remaining problem is that if a lexeme is longer than a buffer,
the input still cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves towards
the right in search of the end of the lexeme. As soon as a blank character is recognized, the
string between bp and fp is identified as the corresponding token. To identify the boundary of
the first buffer, an end-of-buffer (eof) character is placed at its end, and likewise the end of the
second buffer is recognized by the eof mark at its end. When fp encounters the first eof, the end
of the first buffer is recognized and filling of the second buffer starts. In the same way, the
second eof indicates the end of the second buffer. The two buffers are refilled alternately until
the end of the input program, and the stream of tokens is identified.
The eof character introduced at the end of each buffer is called a sentinel and is used to identify
the end of the buffer.
Figure: Two buffer scheme storing the input string "int i = i + 1; j = j + 1;" with an eof sentinel
at the end of each buffer
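The two buffer scheme above can be sketched in code. This is an illustrative simulation, not from the original text: the sentinel is modeled as a NUL character, the buffer size is deliberately tiny to force refills, and all names are hypothetical:

```python
SENTINEL = "\0"   # assumed eof/sentinel marker
BUF_SIZE = 8      # deliberately small so buffer switches happen often

class DoubleBuffer:
    """Sketch of the two-buffer scheme: two halves, each terminated by a
    sentinel; hitting a half's sentinel triggers a refill of the other half."""

    def __init__(self, text):
        self.text, self.pos = text, 0
        self.buf = [SENTINEL] * (2 * BUF_SIZE + 2)
        self.fp = 0
        self._fill(0)                       # load the first half

    def _fill(self, half):
        start = half * (BUF_SIZE + 1)
        chunk = self.text[self.pos:self.pos + BUF_SIZE]
        self.pos += len(chunk)
        for i in range(BUF_SIZE + 1):
            self.buf[start + i] = chunk[i] if i < len(chunk) else SENTINEL

    def next_char(self):
        c = self.buf[self.fp]
        if c != SENTINEL:
            self.fp += 1
            return c
        # A sentinel means either the end of one half or the real end of input.
        if self.fp == BUF_SIZE:             # end of first half
            self._fill(1)
            self.fp = BUF_SIZE + 1
            return self.next_char()
        if self.fp == 2 * BUF_SIZE + 1:     # end of second half
            self._fill(0)
            self.fp = 0
            return self.next_char()
        return None                         # real end of input

def drain(db):
    out = []
    while (c := db.next_char()) is not None:
        out.append(c)
    return "".join(out)

print(drain(DoubleBuffer("int i = i + 1; j = j + 1;")))
# int i = i + 1; j = j + 1;
```

The payoff of the sentinel is visible in `next_char`: the common case performs a single comparison against the sentinel instead of a bounds check on every character.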
Specifications of Tokens
Regular expressions are an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for sets of strings. In brief, a regular
expression is a way to specify tokens; regular expressions describe the regular languages. A
language is a set of strings, and a string is a finite sequence of symbols from an alphabet. The
following terminologies are used to specify tokens:
a. Alphabets, Strings and Languages
b. Operations on languages
c. Regular expressions
d. Regular definition
Alphabets
A finite set of symbols is called an alphabet. For example, ∑ = {0, 1} is the binary alphabet,
∑ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and ∑ = {a-z, A-Z} is the set
of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the
total number of symbols in it; e.g., the length of the string 'Kanchanpur' is 10, denoted
|Kanchanpur| = 10. A string of zero length is known as the empty string and is denoted by
ε (epsilon).
Language
A language is a set of strings over some finite alphabet. Mathematically a language is simply a
set, so set operations can be performed on languages. The regular languages are exactly those
that can be described by means of regular expressions.
Operations on languages
The following are the operations that can be applied to languages:
Union
Concatenation
Kleene closure
Positive closure
Union
The symbol ∪ denotes the union of two sets. The set A∪B, read "A union B", is the set of all
elements belonging to either A or B (or both). In regular expressions the plus (+) or pipe (|)
symbol is used to represent union. Let A and B be two languages, where A = {dog, ba, na} and
B = {house, ba}. Then,
A∪B = A+B = A|B = {dog, ba, na, house}
Concatenation
String concatenation is the operation of joining character strings end-to-end. In regular
expression dot operator (.) is used to represent concatenation operation.
Example: A.B = {doghouse, dogba, bahouse, baba, nahouse, naba}
Kleene Closure
The Kleene closure ∑* is a unary operator on a set of symbols or strings ∑ that gives the
infinite set of all possible strings of all possible lengths over ∑, including the empty string ε.
Mathematically, ∑* = ∑0 ∪ ∑1 ∪ ∑2 ∪ … where ∑n is the set of all possible strings of length n
(and ∑0 = {ε}).
Example: If ∑ = {a, b} then,
∑* = {ε, a, b, aa, ab, ba, bb, aaa, aba, bab, aab, bba, bbb, …}
a* = a0 ∪ a1 ∪ a2 ∪ a3 ∪ … = {ε, a, aa, aaa, aaaa, aaaaa, …}
b* = b0 ∪ b1 ∪ b2 ∪ b3 ∪ … = {ε, b, bb, bbb, bbbb, bbbbb, …}
a.b* = {a}.{ε, b, bb, bbb, bbbb, …}
     = {a, ab, abb, abbb, abbbb, …}
(ab)* = (ab)0 ∪ (ab)1 ∪ (ab)2 ∪ … = {ε, ab, abab, ababab, …}
a*.b* = {ε, a, aa, aaa, …}.{ε, b, bb, bbb, …}
      = {ε, b, bb, bbb, a, ab, abb, abbb, aa, aab, aabb, aaa, aaab, aaabb, …}
(a+b)* = {ε, a, b, aa, ab, ba, bb, aaa, bbb, abab, baba, aaaab, baaaa, …}
Positive closure
The set ∑+ is the infinite set of all possible strings of all possible lengths over ∑, excluding the
empty string ε. Mathematically, ∑+ = ∑1 ∪ ∑2 ∪ ∑3 ∪ … where ∑n is the set of all possible
strings of length n.
Example: If ∑ = {a, b} then,
∑+ = ∑* − {ε}
∑+ = {a, b, aa, ab, ba, bb, aaa, aba, bab, aab, bba, bbb, …}
a+ = a1 ∪ a2 ∪ a3 ∪ … = {a, aa, aaa, aaaa, …}
b+ = b1 ∪ b2 ∪ b3 ∪ … = {b, bb, bbb, bbbb, …}
a.b+ = {a}.{b, bb, bbb, …} = {ab, abb, abbb, abbbb, …}
(ab)+ = (ab)1 ∪ (ab)2 ∪ (ab)3 ∪ … = {ab, abab, ababab, …}
a+.b+ = {a, aa, aaa, …}.{b, bb, bbb, …} = {ab, abb, abbb, aab, aabb, aaab, …}
(a+b)+ = {a, b, aa, ab, ba, bb, aaa, bbb, abab, baba, aaaab, baaaa, …}
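Since ∑* and ∑+ are infinite, they can only be enumerated up to a bounded length. The following sketch (not from the original text) does exactly that:

```python
from itertools import product

def strings_up_to(alphabet, n, include_empty=True):
    """Enumerate Σ* (or Σ+ when include_empty=False) up to length n."""
    start = 0 if include_empty else 1
    out = []
    for k in range(start, n + 1):
        # product(..., repeat=k) yields every string of length exactly k.
        out.extend("".join(t) for t in product(alphabet, repeat=k))
    return out

print(strings_up_to("ab", 2))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']   (Σ* up to length 2; '' is ε)
print(strings_up_to("ab", 2, include_empty=False))
# ['a', 'b', 'aa', 'ab', 'ba', 'bb']       (Σ+ up to length 2)
```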
Regular Expressions
Regular expressions are algebraic expressions that are used to describe the tokens of a
programming language. They use three regular operations, called union (or), concatenation and
star. Parentheses ( and ) are used for grouping, just as in ordinary mathematics.
Examples: Given the alphabet A = {0, 1}
1. 1(1+0)*0 denotes the language of all strings that begin with a '1' and end with a '0'.
2. (1+0)*00 denotes the language of all strings that end with 00 (binary multiples of 4).
3. (01)* + (10)* denotes the set of all strings of alternating 1s and 0s.
4. (0* 1 0* 1 0* 1 0*) denotes the strings having exactly three 1's.
5. 1*(0+ε)1*(0+ε)1* denotes the strings having at most two 0's.
6. (A | B | … | Z | a | b | … | z | _).(A | B | … | Z | a | b | … | z | _ | 0 | 1 | … | 9)*
denotes the regular expression specifying identifiers as in the C programming language. [TU]
7. (1+0)* 001 (1+0)* denotes the strings having the substring 001.
8. 0(0+1)*0 + 1(0+1)*1 denotes the RE for the language of all binary strings of length at least 2
that begin and end with the same symbol.
9. ((0+1)* 1 + ε)(00)* 00 denotes the RE for the set of all binary strings that end with an even
nonzero number of 0's.
10. Regular expression for the declaration of a valid one dimensional array in a language like C [TU]
∑ = {0, 1, …, 9, a, b, …, z, A, B, …, Z, _, [, ]}
RE: (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)* . [ (1|2|…|9)(0|1|2|…|9)* ]
11. Regular expression for the declaration of a valid two dimensional array in a language like C [TU]
∑ = {0, 1, …, 9, a, b, …, z, A, B, …, Z, _, [, ]}
RE: (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)* . [ (1|2|…|9)(0|1|2|…|9)* ] .
[ (1|2|…|9)(0|1|2|…|9)* ]
12. Regular expression for a valid floating point number. [TU]
RE: (0|1|2|…|9)* '.' (0|1|2|…|9)+
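Several of these patterns can be checked directly with Python's `re` module. The translations below are illustrative sketches (in `re` syntax union is `|`, character ranges replace the long alternations, and the literal dot must be escaped):

```python
import re

# Assumed translations of patterns 6, 12 and 10 above into re syntax.
identifier    = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
floating      = re.compile(r"[0-9]*\.[0-9]+")
one_dim_array = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\[[1-9][0-9]*\]")

print(bool(identifier.fullmatch("Sum_1")))       # True
print(bool(identifier.fullmatch("1Sum")))        # False: cannot start with a digit
print(bool(floating.fullmatch("123.45")))        # True
print(bool(one_dim_array.fullmatch("arr[10]")))  # True
print(bool(one_dim_array.fullmatch("arr[0]")))   # False: size cannot start with 0
```

`fullmatch` is used rather than `search` because a token pattern must match the whole lexeme.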
Regular Definitions
Writing a regular expression for some languages can be difficult, because the expression can be
quite complex. In those cases we may use regular definitions: giving names to regular
expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a
regular definition is a sequence of definitions of the form,
d1 → r1
d2 → r2
…
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in
Σ ∪ {d1, d2, …, di-1}, i.e., the basic symbols and the previously defined names.
Example 1: Regular definition for specifying identifiers in a programming language like C
letter → A | B | … | Z | a | b | … | z
underscore → _
digit → 0 | 1 | 2 | … | 9
id → (letter | underscore).(letter | underscore | digit)*
If we tried to write the regular expression for identifiers without a regular definition, it would
be more complex:
(A | B | … | Z | a | b | … | z | _).(A | B | … | Z | a | b | … | z | _ | 0 | 1 | … | 9)*
Example 2: Write regular definition for specifying floating point number in a programming
language like C
digit → 0 | 1 | 2 | … | 9
num → digit* '.' digit+
Example 3: Write regular definitions for specifying an integer one dimensional array
declaration in programming language like in C
Lc → a | b | … | z
Uc → A | B | … | Z
Digit → 1 | 2 | … | 9
Zero → 0
Us → _
Lb → [
Rb → ]
Array → (Lc | Uc | Us).(Lc | Uc | Us | Digit | Zero)*.Lb.Digit.(Digit | Zero)*.Rb
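Regular definitions are easy to mirror in code by composing named sub-patterns into larger ones. The sketch below (not from the original text) builds the Array definition above out of its named parts using Python's `re`:

```python
import re

# Named sub-expressions, mirroring the regular definitions above.
Lc, Uc = "[a-z]", "[A-Z]"
Digit, Zero, Us = "[1-9]", "0", "_"
letter_or_us = f"(?:{Lc}|{Uc}|{Us})"
any_digit    = f"(?:{Digit}|{Zero})"

# Array → (Lc|Uc|Us)(Lc|Uc|Us|Digit|Zero)* [ Digit (Digit|Zero)* ]
array_decl = re.compile(
    f"{letter_or_us}(?:{letter_or_us}|{any_digit})*\\[{Digit}{any_digit}*\\]"
)

print(bool(array_decl.fullmatch("marks[25]")))  # True
print(bool(array_decl.fullmatch("_tmp[5]")))    # True
print(bool(array_decl.fullmatch("marks[05]")))  # False: size starts with 0
```

Keeping Digit (1-9) and Zero (0) separate, as the definitions do, is what forbids a leading zero in the array size.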
Figure: A regular expression is first converted to an NFA, and the NFA is then converted to a
DFA.
Example 2: An NFA that accepts any binary string that contains 00 or 11 as a substring.
Example 3: NFA over {a, b} that accepts strings starting with a and ending with b.
ε-NFA
In an NFA, if a transition can be made without consuming any input symbol, the automaton is
called an ε-NFA. We need ε-NFAs here because regular expressions are easily convertible to
ε-NFAs.
Example 2: DFA accepting all strings over Σ = {0, 1} ending with 3 consecutive 0's.
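Such a DFA can be written down as a transition table and simulated directly. The state numbering below is an assumption (state k = number of trailing 0's seen so far, capped at 3):

```python
# DFA for binary strings ending in 000: state k counts trailing zeros (max 3).
delta = {
    0: {"0": 1, "1": 0},
    1: {"0": 2, "1": 0},
    2: {"0": 3, "1": 0},
    3: {"0": 3, "1": 0},   # once three trailing 0's are seen, more 0's keep us here
}
start, accepting = 0, {3}

def accepts(w):
    state = start
    for c in w:
        state = delta[state][c]
    return state in accepting

print(accepts("10111000"))  # True: ends with three 0's
print(accepts("1000100"))   # False: only two trailing 0's
```

Note how any '1' resets the count to state 0, since it breaks the run of trailing zeros.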
1. For the regular expression ε, the NFA is a start state i with an ε-transition to an accepting
state f.
2. For each symbol a in Σ, the NFA is a start state i with an a-transition to an accepting state f.
3. If N(r1) and N(r2) are NFAs for regular expressions r1 and r2:
a. For the regular expression r1 + r2, a new start state i has ε-transitions to the start states of
N(r1) and N(r2), and the accepting states of N(r1) and N(r2) have ε-transitions to a new
accepting state f.
b. For the regular expression r1 r2, the accepting state of N(r1) is merged with the start state
of N(r2); the start state of N(r1) becomes i and the accepting state of N(r2) becomes f.
c. For the regular expression r*, a new start state i and accepting state f are added, with
ε-transitions i → f, i → start of N(r), accepting state of N(r) → f, and accepting state of
N(r) → start of N(r).
Using rules 1 and 2 we construct NFAs for each basic symbol in the expression; we then
combine these basic NFAs using rule 3 to obtain the NFA for the entire expression. For
example, the NFA for the union a|b is constructed from the individual NFAs for 'a' and 'b'
using ε-transitions as 'glue': the individual accepting states are removed and replaced with a
single overall accepting state.
Example 1: First construct the NFA of the RE (a+b)*a, then convert the resulting NFA to a DFA.
Solution: The NFA of the given RE and the DFA obtained from it are shown in the figure
(diagram omitted).
Example 2: First construct the NFA of the RE (a+b)*abb, then convert the resulting NFA to a
DFA.
Solution: The NFA of the given RE and the DFA obtained from it are shown in the figure
(diagram omitted).
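The NFA-to-DFA step used in these examples is the subset construction. The sketch below (not from the original text) applies it to a small hand-coded NFA for (a+b)*abb; the NFA's state numbering is an assumption, and ε-moves are omitted because this particular NFA does not need them:

```python
# Hand-coded NFA (no ε-moves) for (a+b)*abb: state 0 loops on a/b,
# and the sequence a, b, b leads to the accepting state 3.
nfa = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
start, accept = 0, {3}

def subset_construction(nfa, start, accept, alphabet):
    """Classic subset construction: each DFA state is a set of NFA states."""
    start_set = frozenset({start})
    dstates, unmarked = {start_set}, [start_set]
    dtran = {}
    while unmarked:
        S = unmarked.pop()
        for a in alphabet:
            U = frozenset(t for s in S for t in nfa.get((s, a), ()))
            dtran[(S, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    daccept = {S for S in dstates if S & accept}
    return dtran, start_set, daccept

def run(dtran, state, daccept, w):
    for c in w:
        state = dtran.get((state, c), frozenset())
    return state in daccept

dtran, s0, dacc = subset_construction(nfa, start, accept, "ab")
print(run(dtran, s0, dacc, "aabb"))  # True: ends in abb
print(run(dtran, s0, dacc, "abab"))  # False
```

A DFA state accepts whenever the set of NFA states it represents contains an accepting NFA state.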
Conversion from RE to DFA Directly
Important States
A state s of an ε-NFA is an important state if it has at least one non-ε out-transition, i.e., if
Move({s}, a) ≠ Φ for some input symbol a.
The direct construction works with the important states only, which is what lets it skip building
the ε-NFA.
Conversion steps
1. Augment the given regular expression by concatenating it with the special end marker #,
i.e., r → (r)#
2. Create the syntax tree for this augmented regular expression
3. In this syntax tree, all alphabet symbols (plus # and the empty string) in the augmented
regular expression will be on the leaves, and all inner nodes will be the operators in that
augmented regular expression.
4. Then each alphabet symbol (plus #) will be numbered (position numbers)
5. Compute functions nullable, firstpos, lastpos, and followpos
6. Finally Construct DFA directly from a regular expression by computing the functions
nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.
nullable(n): true if the sub-expression rooted at n can generate the empty string ε
(e.g., for a * node or a node labeled ε).
firstpos(n): the set of positions that can match the first symbol of a string generated
by the sub-expression rooted at n.
lastpos(n): the set of positions that can match the last symbol of a string generated
by the sub-expression rooted at n.
followpos(i): the set of positions that can follow position i in some string generated
by the augmented regular expression.
Rules for calculating nullable, firstpos and lastpos
Node n | nullable(n) | firstpos(n) | lastpos(n)
A leaf labeled ε | true | Φ | Φ
A leaf with position i | false | {i} | {i}
An or node n = c1 | c2 | nullable(c1) or nullable(c2) | firstpos(c1) ∪ firstpos(c2) | lastpos(c1) ∪ lastpos(c2)
A cat node n = c1.c2 | nullable(c1) and nullable(c2) | if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1) | if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
A star node n = c1* | true | firstpos(c1) | lastpos(c1)
A positive closure node n = c1+ | nullable(c1) | firstpos(c1) | lastpos(c1)
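The table rules, together with the two followpos rules (a cat node feeds firstpos(c2) into followpos of every position in lastpos(c1); a star node feeds firstpos(c1) into followpos of every position in lastpos(c1)), can be computed in one tree walk. This is an illustrative sketch for the augmented RE (a|b)*.a.# with positions a=1, b=2, a=3, #=4:

```python
followpos = {}

def analyze(node):
    """Return (nullable, firstpos, lastpos); fills followpos as a side effect."""
    kind = node[0]
    if kind == "leaf":                      # ('leaf', symbol, position)
        p = node[2]
        followpos.setdefault(p, set())
        return False, {p}, {p}
    if kind == "or":                        # ('or', c1, c2)
        n1, f1, l1 = analyze(node[1])
        n2, f2, l2 = analyze(node[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":                       # ('cat', c1, c2)
        n1, f1, l1 = analyze(node[1])
        n2, f2, l2 = analyze(node[2])
        for p in l1:                        # lastpos(c1) feeds firstpos(c2)
            followpos[p] |= f2
        return (n1 and n2,
                (f1 | f2) if n1 else f1,
                (l1 | l2) if n2 else l2)
    if kind == "star":                      # ('star', c1)
        n1, f1, l1 = analyze(node[1])
        for p in l1:                        # lastpos(c1) feeds firstpos(c1)
            followpos[p] |= f1
        return True, f1, l1

# Syntax tree of (a|b)*.a.#
tree = ("cat",
        ("cat", ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
                ("leaf", "a", 3)),
        ("leaf", "#", 4))

nullable, firstpos, lastpos = analyze(tree)
print(firstpos, lastpos)   # {1, 2, 3} {4}
print(followpos)           # 1 -> {1, 2, 3}, 2 -> {1, 2, 3}, 3 -> {4}, 4 -> empty
```

The results match the annotations on the syntax tree in Example 1 below: firstpos of the root is {1, 2, 3} and position 3 (the final 'a') is followed only by the end marker.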
Example 1: Convert the regular expression (a | b)* a into an equivalent DFA by the direct method.
Solution:
Step 1: First augment the given regular expression as,
(a | b)*.a.#
Step 2: Now construct the syntax tree of the augmented regular expression.
Figure: Syntax tree of (a|b)*.a.# annotated with firstpos and lastpos sets (diagram omitted)
Figure: Resulting DFA of the given regular expression (diagram omitted)
Example 2: Convert the regular expression (a | ε) b c* into an equivalent DFA by the direct
method.
Solution:
Step 1: First augment the given regular expression as,
(a | ε).b.c*.#
Step 2: Now construct the syntax tree of the augmented regular expression.
Figure: Syntax tree and resulting DFA for (a | ε).b.c*.# (diagram omitted)
Example 3: Convert the regular expression ba(a+b)*ab into an equivalent DFA by the direct
method.
Solution:
Step 1: First augment the given regular expression as,
b.a.(a+b)*.a.b.#
Step 2: Now construct the syntax tree of the augmented regular expression.
Figure: Syntax tree of b.a.(a+b)*.a.b.# annotated with firstpos and lastpos sets (diagram
omitted)
Step 3: Compute followpos. Number the positions in b.a.(a+b)*.a.b.# as b=1, a=2, a=3, b=4,
a=5, b=6, #=7. Then,
Followpos(1) = {2}
Followpos(2) = {3, 4, 5}
Followpos(3) = {3, 4, 5}
Followpos(4) = {3, 4, 5}
Followpos(5) = {6}
Followpos(6) = {7}
Followpos(7) = Φ
Step 4: After calculating the follow positions, we are ready to create the DFA for the regular
expression.
Starting state of DFA: S1 = firstpos(root of syntax tree) = {1}
Mark S1:
on a: no position in S1 is labeled a, so the move is Φ → S2 (dead state)
on b: followpos(1) = {2} → S3
Mark S2:
on a: Φ → S2
on b: Φ → S2
Mark S3:
on a: followpos(2) = {3, 4, 5} → S4
on b: Φ → S2
Mark S4:
on a: followpos(3) ∪ followpos(5) = {3, 4, 5, 6} → S5
on b: followpos(4) = {3, 4, 5} → S4
Mark S5:
on a: followpos(3) ∪ followpos(5) = {3, 4, 5, 6} → S5
on b: followpos(4) ∪ followpos(6) = {3, 4, 5, 7} → S6
Mark S6:
on a: followpos(3) ∪ followpos(5) = {3, 4, 5, 6} → S5
on b: followpos(4) = {3, 4, 5} → S4
No new states occur, so the construction stops.
Starting state of the DFA = S1
Accepting state of the DFA = S6, since it contains the position of #
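Step 4 can be mechanized directly from the followpos sets of Step 3. The sketch below (not from the original text) rebuilds and simulates the same DFA, using the position numbering from the text:

```python
# followpos sets from Step 3; position 7 is the end marker #.
followpos = {1: {2}, 2: {3, 4, 5}, 3: {3, 4, 5}, 4: {3, 4, 5},
             5: {6}, 6: {7}, 7: set()}
sym = {1: "b", 2: "a", 3: "a", 4: "b", 5: "a", 6: "b", 7: "#"}

def build_dfa(start_positions, alphabet):
    start = frozenset(start_positions)
    dtran, seen, work = {}, {start}, [start]
    while work:
        S = work.pop()
        for a in alphabet:
            # Union of followpos(p) over all positions p in S labeled a.
            U = frozenset(q for p in S if sym[p] == a for q in followpos[p])
            dtran[(S, a)] = U
            if U not in seen:
                seen.add(U)
                work.append(U)
    accepting = {S for S in seen if 7 in S}   # states containing # accept
    return dtran, start, accepting

def matches(w, dtran, start, accepting):
    state = start
    for c in w:
        state = dtran.get((state, c), frozenset())
    return state in accepting

dtran, s1, acc = build_dfa({1}, "ab")   # S1 = firstpos(root) = {1}
print(matches("baab", dtran, s1, acc))  # True: b a (ε) a b
print(matches("baba", dtran, s1, acc))  # False
```

Tracing "baab" visits exactly the states S1 → S3 → S4 → S5 → S6 computed by hand above.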
Figure: DFA for the above RE (diagram omitted)
DFA Minimization
Example 1: Minimize the following DFA by using the state partition method,
(DFA over {a, b} with states S0, S2 and final state S1, where δ(S0, a) = δ(S2, a) = S1 and
δ(S0, b) = δ(S2, b) = S2; diagram omitted.)
Solution:
Step 1: P0 will have two sets of states. One set contains S1, the final state of the DFA, and the
other contains the remaining states S0 and S2.
So, P0 = {{S1}, {S0, S2}}
Step 2: To calculate P1, we check whether the sets of partition P0 can be partitioned further:
i) For the set {S1}:
Since there is only one state in this set, it cannot be partitioned further.
ii) For the set {S0, S2}:
δ(S0, a) = S1 and δ(S2, a) = S1
δ(S0, b) = S2 and δ(S2, b) = S2
The moves of S0 and S2 on input symbol a both lead to S1, which lies in one set of partition P0;
similarly their moves on b both lead to S2, which also lies in one set of P0. So S0 and S2 are not
distinguishable, and
P1 = {{S1}, {S0, S2}}
Since no set was split, P1 = P0 and the algorithm stops; S0 and S2 are merged.
The minimized DFA corresponding to the DFA above has the two states {S0, S2} and S1.
Figure: Minimized DFA (diagram omitted)
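The partition refinement above can be sketched in code (Moore's algorithm). This is illustrative, not from the original text; in particular S1's own transitions are an assumption, since the text only gives the moves of S0 and S2:

```python
def minimize(states, alphabet, delta, finals):
    """Partition-refinement (Moore's algorithm) sketch of DFA minimization."""
    # P0: the accepting / non-accepting split.
    partition = [set(finals), set(states) - set(finals)]
    partition = [block for block in partition if block]
    while True:
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                # Signature: which block each input symbol leads to.
                key = tuple(next(i for i, b in enumerate(partition)
                                 if delta[s][a] in b)
                            for a in alphabet)
                groups.setdefault(key, set()).add(s)
            refined.extend(groups.values())
        if len(refined) == len(partition):   # no block was split: done
            return refined
        partition = refined

# The DFA of Example 1 (S1 final); S1's moves are assumed.
delta = {"S0": {"a": "S1", "b": "S2"},
         "S1": {"a": "S1", "b": "S2"},   # assumption
         "S2": {"a": "S1", "b": "S2"}}
print(minimize(["S0", "S1", "S2"], "ab", delta, {"S1"}))
# two blocks: {S1} and {S0, S2}, matching P1 above
```

States end up in the same block exactly when no input string can distinguish them.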
Example 2: Minimize the following DFA by using the state partition method,
Solution:
Step 1: P0 will have two sets of states. One set contains 4, the final state of the DFA, and the
other contains the remaining states 1, 2 and 3.
So, P0 = {{4}, {1, 2, 3}}
Step 2: To calculate P1, we check whether the sets of partition P0 can be partitioned further:
i) For the set {4}:
Since there is only one state in this set, it cannot be partitioned further.
ii) For the set {1, 2, 3}:
δ a b
1 2 3
2 2 3
3 4 3
States 1 and 2 have identical moves (to 2 on a and to 3 on b, both within one set of P0), but
state 3 moves to the final state 4 on a, so 3 is distinguishable from 1 and 2. Hence,
P1 = {{4}, {1, 2}, {3}}
Refining P1 again produces no further split, so the minimized DFA has the states {4}, {1, 2}
and {3}.
The minimized DFA corresponding to the DFA above is shown in the figure (diagram omitted).
Example 3: Minimize the following DFA by using the state partition method,
Solution:
Step 1: P0 will have two sets of states. One set contains q4, the final state of the DFA, and the
other contains the remaining states q0, q1, q2 and q3.
So, P0 = {{q4}, {q0, q1, q2, q3}}
Step 2: To calculate P1, we check whether the sets of partition P0 can be partitioned further:
i) For the set {q4}:
Since there is only one state in this set, it cannot be partitioned further.
ii) For the set {q0, q1, q2, q3}:
δ a b
→q0 q1 q2
q1 q1 q3
q2 q1 q2
q3 q1 q4
From the transition table, q3 moves to the final state q4 on b while q0, q1 and q2 stay inside
the non-final set, so q3 is distinguished first:
P1 = {{q4}, {q3}, {q0, q1, q2}}
Refining again, q1 moves to q3 on b while q0 and q2 move within {q0, q1, q2}, so q1 is
distinguished:
P2 = {{q4}, {q3}, {q1}, {q0, q2}}
A further pass produces no change, so states q0 and q2 are equivalent and are merged in the
minimized DFA.
Example 4: Minimize the following DFA after removing inaccessible states,
Solution:
Step 1: State q5 is inaccessible from the initial state, so we eliminate it and its associated edges
from the DFA. The resulting DFA, with q5 removed, is the one described by the transition table
below.
Step 2: Draw the state transition table (final states are marked with *):
δ 0 1
→q0 q1 q2
q1 q2 *q3
q2 q2 *q4
*q3 *q3 *q3
*q4 *q4 *q4
Now using the equivalence theorem, we have:
P0 = {q0, q1, q2} {q3, q4}
P1 = {q0} {q1, q2} {q3, q4}
P2 = {q0} {q1, q2} {q3, q4}
Since P2 = P1, we stop. From P2 we infer:
States q1 and q2 are equivalent and can be merged.
States q3 and q4 are equivalent and can be merged.
So our minimal DFA has the states {q0}, {q1, q2} and {q3, q4} (diagram omitted).
Flex: An introduction
Flex is a tool for generating scanners: programs which recognize lexical patterns in text. The
flex program reads the given input files, or its standard input if no file names are given, for a
description of the scanner to generate. The description is in the form of pairs of regular
expressions and C code, called rules. Flex generates as output a C source file, 'lex.yy.c' by
default, which defines a routine yylex(). This file can be compiled and linked with the flex
runtime library to produce an executable. When the executable is run, it analyzes its input for
occurrences of the regular expressions; whenever it finds one, it executes the corresponding
C code.
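A minimal flex specification might look like the following sketch (hypothetical and untested here; the layout — definitions, %%, rules, %%, user code — is the standard flex file structure):

```
%{
#include <stdio.h>
%}
digit   [0-9]
letter  [A-Za-z_]
%%
{digit}+                      { printf("NUM: %s\n", yytext); }
{letter}({letter}|{digit})*   { printf("ID: %s\n", yytext); }
[ \t\n]+                      { /* skip white space */ }
.                             { printf("OTHER: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }
```

Assuming this is saved as scanner.l, running `flex scanner.l` produces lex.yy.c, and compiling that (e.g. `cc lex.yy.c -o scanner`) gives an executable that tokenizes its standard input. Note how each rule pairs a regular expression with the C action to run when that pattern matches the longest prefix of the remaining input.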