Lexical Analyzer in Perspective
[Figure: source program -> lexical analyzer -> token -> parser; the parser sends "get next token" to the lexical analyzer; both use the symbol table]
Important Issue:
What are the responsibilities of each box?
Focus on the lexical analyzer and the parser
Why separate lexical analysis and parsing?
o Simplicity of design
Tokens, Patterns, and Lexemes
o A token is a pair: a token name and an optional attribute value
Example
Using a Buffer to Enhance Efficiency
[Figure: input buffer holding "E = M * C * * 2 eof", with the current token marked]
Algorithm: Buffered I/O with Sentinels
[Figure: input buffer with sentinels, current token marked]
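The algorithm itself did not survive on this slide, so what follows is only a sketch of the usual two-half buffer scheme with sentinels, not the slide's own pseudocode. The names `SentinelBuffer`, `next_char`, and `read_all`, the buffer size `N`, and the choice of "\0" as the sentinel are all assumptions made for illustration. The point is that the common case costs one comparison per character, because the test for "end of buffer half" is folded into the test for the sentinel.

```python
import io

EOF = "\0"   # sentinel character; assumed never to occur in the source text
N = 4        # size of each buffer half (tiny, for illustration only)

class SentinelBuffer:
    """Two buffer halves of N characters, each followed by a sentinel slot."""

    def __init__(self, stream):
        self.stream = stream
        # Layout: [half 0: N chars][EOF][half 1: N chars][EOF]
        self.buf = [EOF] * (2 * N + 2)
        self.forward = 0
        self._load(0)

    def _load(self, half):
        """Fill one half from the stream and put a sentinel after the data."""
        start = half * (N + 1)
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # A short read leaves the sentinel inside the half: real end of input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        if ch != EOF:                      # common case: one test per character
            self.forward += 1
            return ch
        if self.forward == N:              # sentinel closing the first half
            self._load(1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:      # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                        # sentinel inside a half: end of input

def read_all(text):
    """Drive the buffer over `text` and return everything it delivered."""
    buf = SentinelBuffer(io.StringIO(text))
    out = []
    while (ch := buf.next_char()) is not None:
        out.append(ch)
    return "".join(out)
```

A real scanner would also keep a second pointer marking the start of the current lexeme; that is left out here to keep the sketch short.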
0  Unrestricted
2  Context-Free    |LHS| = 1
3  Regular         |RHS| = 1 or 2: A → a | aB, or A → a | Ba
Formal Language Operations
OPERATION                                DEFINITION
union of L and M, written L ∪ M          L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM     LM = {st | s is in L and t is in M}
Kleene closure of L, written L*          L* = ∪_{i≥0} L^i
Formal Language Operations
Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, ..., DD}
L^4 = L^2 L^2 = ??
L* = {all possible strings over L, plus ε}
L+ = L* - {ε}
L (L ∪ D) = ??
L (L ∪ D)* = ??
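The operations above can be tried out directly with Python sets of strings. This is a small sketch, not part of the original slides; the helper names `concat`, `power`, and `closure_up_to` are made up for illustration, and since L* is infinite only a finite slice of it is computed.

```python
from itertools import product

L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

def concat(X, Y):
    """Concatenation XY = {st | s is in X and t is in Y}."""
    return {s + t for s, t in product(X, Y)}

def power(X, i):
    """X^i, with X^0 = {""} (the set holding only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, X)
    return result

def closure_up_to(X, n):
    """Finite slice of X*: the union of X^0, X^1, ..., X^n."""
    out = set()
    for i in range(n + 1):
        out |= power(X, i)
    return out

union = L | D            # L ∪ D
LD = concat(L, D)        # LD
L2 = power(L, 2)         # L^2
L4 = concat(L2, L2)      # L^4 = L^2 L^2

print(len(union), len(LD), len(L2), len(L4))
```

The same helpers answer the open questions on the slide, for example `concat(L, L | D)` for L (L ∪ D).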
Language & Regular Expressions
A regular expression is a set of rules (techniques) for constructing
sequences of symbols (strings) from an alphabet.
Rules for Specifying Regular Expressions:
Fix an alphabet Σ.
ε is a regular expression denoting {ε}
If a is in Σ, a is a regular expression that denotes {a}
For regular expressions r and s (listed in order of increasing precedence):
(a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
(b) (r)(s) is a regular expression denoting L(r) L(s)
(c) (r)* is a regular expression denoting (L(r))*
(d) (r) is a regular expression denoting L(r)
All are left-associative. Parentheses are dropped as
allowed by precedence rules.
EXAMPLES of Regular Expressions
L = {A, B, C, D}    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D) (A | B | C | D) = L^2
(A | B | C | D)* = L*
(A | B | C | D) ((A | B | C | D) | (1 | 2 | 3)) = L (L ∪ D)
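These example expressions can be checked with Python's standard `re` module, whose `|`, concatenation, and `*` behave like the operators defined here for this simple subset. The sketch below is illustrative only; the `patterns` table and the `in_language` helper are hypothetical names, and `re.fullmatch` is used so that the whole string must belong to the language.

```python
import re

L_RE = "(A|B|C|D)"
D_RE = "(1|2|3)"

# Regular expressions from the examples, keyed by the language they denote.
patterns = {
    "L": L_RE,
    "L^2": L_RE + L_RE,
    "L*": L_RE + "*",
    "L(L u D)": L_RE + "(" + L_RE + "|" + D_RE + ")",
}

def in_language(name, s):
    """True if the whole string s is matched by the named expression."""
    return re.fullmatch(patterns[name], s) is not None

print(in_language("L^2", "AB"))       # a string of two letters
print(in_language("L(L u D)", "3B"))  # starts with a digit, so not in L (L ∪ D)
```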
Algebraic Properties of Regular Expressions
AXIOM                          DESCRIPTION
r | s = s | r                  | is commutative
r | (s | t) = (r | s) | t      | is associative
(r s) t = r (s t)              concatenation is associative
r (s | t) = r s | r t
(s | t) r = s r | t r          concatenation distributes over |
εr = r
rε = r                         ε is the identity element for concatenation
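One way to build confidence in these axioms is to compare, for sample expressions, the sets of short strings that each side matches. The sketch below is a spot check, not a proof. The alphabet {a, b}, the length bound, the sample expressions `r`, `s`, `t`, and the helpers `language` and `same` are all assumptions; an empty group `()` stands in for ε.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over `alphabet` of length <= max_len that fully match `pattern`."""
    words = ("".join(p)
             for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

def same(p, q):
    """True if p and q match exactly the same short strings."""
    return language(p) == language(q)

r, s, t = "(a*)", "(ab)", "(b)"   # sample regular expressions
eps = "()"                        # empty group: matches only the empty string

checks = {
    "| is commutative": same(f"{r}|{s}", f"{s}|{r}"),
    "| is associative": same(f"{r}|({s}|{t})", f"({r}|{s})|{t}"),
    "concatenation is associative": same(f"({r}{s}){t}", f"{r}({s}{t})"),
    "left distribution": same(f"{r}({s}|{t})", f"{r}{s}|{r}{t}"),
    "right distribution": same(f"({s}|{t}){r}", f"{s}{r}|{t}{r}"),
    "identity for concatenation": same(f"{eps}{r}", r) and same(f"{r}{eps}", r),
}

for name, ok in checks.items():
    print(name, ok)
```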
Token Recognition
How can we use the concepts developed so far to assist in
recognizing the tokens of a source language?
Assume the following tokens:
if, then, else, relop, id, num
Overall
Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Note: Each token has a unique token identifier that defines its category of lexemes
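The table can be turned into a small scanner that returns (token, attribute) pairs. This is a minimal sketch under stated assumptions: the slides do not give the patterns for id and num, so letter-then-letters-or-digits and digits with an optional fraction and exponent are assumed; the symbol table is a plain list, and the "pointer to table entry" is modeled as a list index. The names `tokenize` and `install` are hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# The id and num patterns are assumptions; the table only names the tokens.
TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)

def install(symtab, lexeme):
    """Add lexeme to the symbol table if it is new; return its index."""
    if lexeme not in symtab:
        symtab.append(lexeme)
    return symtab.index(lexeme)

def tokenize(text):
    """Return (tokens, symtab); each token is a (name, attribute) pair."""
    symtab, tokens, pos = [], [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                               # whitespace yields no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))          # reserved word: no attribute
        elif kind == "relop":
            tokens.append(("relop", RELOPS[lexeme]))
        else:                                      # id or num
            tokens.append((kind, install(symtab, lexeme)))
    return tokens, symtab

tokens, symtab = tokenize("if x <= 10 then y else x")
print(tokens)
print(symtab)
```

Reserved words are matched as identifiers first and then looked up in `KEYWORDS`, which keeps the pattern list short; listing each keyword as its own pattern would work as well.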
Transition diagrams
Transition diagram for relop
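The diagram itself is not reproduced in this text, so the following is only a sketch of how a transition diagram for relop is commonly hand-coded, one branch per state. The state numbers, the function name `relop`, and the return convention are assumptions, not taken from the slide. States that accept after reading one character too many "retract" by not consuming that character.

```python
def relop(text, pos=0):
    """Scan a relational operator starting at pos.

    Returns (attribute, next_pos), or None if no relop starts at pos.
    """
    state = 0
    i = pos
    while True:
        c = text[i] if i < len(text) else ""   # "" plays the role of eof
        if state == 0:                         # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", i + 1)
            elif c == ">":
                state = 6
            else:
                return None
        elif state == 1:                       # seen "<"
            if c == "=":
                return ("LE", i + 1)
            if c == ">":
                return ("NE", i + 1)
            return ("LT", i)                   # retract: c is not part of the lexeme
        elif state == 6:                       # seen ">"
            if c == "=":
                return ("GE", i + 1)
            return ("GT", i)                   # retract: c is not part of the lexeme
        i += 1

print(relop("<= 10"))
print(relop("a < b", 2))
```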
Transition diagrams (cont.)
Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
Transition diagram for unsigned numbers
Transition diagrams (cont.)
Transition diagram for whitespace
Lexical Analyzer Generator - Lex
[Figure: lex.yy.c -> C compiler -> a.out]
Lexical errors
Some errors are beyond the power of the lexical
analyzer to recognize:
fi (a == f(x))
However, it may be able to recognize errors
like:
d = 2r
Such errors are recognized when no token pattern
matches a character sequence
Error recovery