2 - Scanner
Radu Prodan
LEXICAL ANALYSIS
[Figure: compiler phases from Source Code to Target Code: Lexical Analyser, Syntactic Analyser, Semantic Analyser, Optimiser, Code Generator, Optimiser; auxiliary components: Literal Table, Symbol Table, Error Handler]
▪ Regular expressions
▪ Conclusions
▪ Token
– Corresponds to a natural language word
– Keywords: if, while, for, int, float
– Identifiers: user-defined, variable length, beginning with a letter
– Special symbols: arithmetic or logical operators such as +, *, /, <, >, =, <>
▪ Special symbols
– PLUS, MINUS represent the character strings "+", "-"
▪ Identifiers
– ID can represent many user-defined variables or identifiers (“a”, “b”, “c”)
▪ Numbers
– NUM can represent many numbers or values (1, 2, 3, ...)
12.03.2024 R. Prodan, Compiler Construction, Summer Semester 2024 5
Attribute
▪ Attribute
– A token has one or more associated attributes
▪ ID token type
– Lexeme: “x”, “i”, “j”, “tmp”, “var” (string value)
▪ Example: a[index] = 4 + 2
▪ getToken
– Token: ID
– String value: "a"
[Figure: input buffer holding a[index] = 4 + 2; the Scanner reads the source program and returns a token on each getToken call from the Parser, which feeds semantic analysis; the Symbol Table is shared]
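The getToken interaction above can be sketched as follows. This is a minimal illustration only: the names `TokenType`, `Scanner`, `get_token` and `scan_all` are hypothetical, and the token set is limited to what the example line a[index] = 4 + 2 needs.

```python
from enum import Enum, auto


class TokenType(Enum):
    ID = auto()
    NUM = auto()
    LBRACKET = auto()
    RBRACKET = auto()
    ASSIGN = auto()
    PLUS = auto()
    EOF = auto()


# Single-character special symbols (only those used in the example).
SYMBOLS = {
    "[": TokenType.LBRACKET,
    "]": TokenType.RBRACKET,
    "=": TokenType.ASSIGN,
    "+": TokenType.PLUS,
}


class Scanner:
    """Hypothetical scanner: each get_token call returns (token type, lexeme)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_token(self):
        text = self.text
        # Skip white space between tokens.
        while self.pos < len(text) and text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(text):
            return TokenType.EOF, ""
        start = self.pos
        ch = text[start]
        if ch.isalpha():
            # Identifier: a letter followed by letters or digits.
            while self.pos < len(text) and text[self.pos].isalnum():
                self.pos += 1
            return TokenType.ID, text[start:self.pos]
        if ch.isdigit():
            while self.pos < len(text) and text[self.pos].isdigit():
                self.pos += 1
            return TokenType.NUM, text[start:self.pos]
        if ch in SYMBOLS:
            self.pos += 1
            return SYMBOLS[ch], ch
        raise ValueError(f"unexpected character {ch!r}")


def scan_all(text):
    """Call get_token repeatedly, as a parser would, until EOF."""
    scanner = Scanner(text)
    tokens = []
    while True:
        token = scanner.get_token()
        if token[0] is TokenType.EOF:
            return tokens
        tokens.append(token)
```

The first call on the example line yields the token ID with string value "a"; the lexeme is the attribute that travels with the token type.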
▪ Regular expressions
– Specify program tokens
▪ Language L(r)
– Generated by regular expression r
▪ Basic regular expression a
– a ∈ Σ
– L(a) = { a }
– Empty string: L(ε) = { ε }
– Empty set: L(∅) = { }
▪ Concatenation: rs
– L(rs) = L(r)L(s)
▪ Repetition: r*
– L(r*) = L(r)*
▪ Sub-expressions: (r)
– L((r)) = L(r)
▪ Precedence order: *, concatenation, |
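The language operations above can be tried out on small finite sets of strings. The sketch below is illustrative: `concat` and `star` are hypothetical helper names, and because L(r)* is infinite in general, `star` only builds the strings obtained with at most `max_reps` repetitions.

```python
def concat(lang1, lang2):
    """L(rs) = L(r)L(s): every string of lang1 followed by every string of lang2."""
    return {x + y for x in lang1 for y in lang2}


def star(lang, max_reps):
    """Finite approximation of L(r)*: zero up to max_reps repetitions."""
    result = {""}      # zero repetitions give the empty string
    current = {""}
    for _ in range(max_reps):
        current = concat(current, lang)
        result |= current
    return result


L_A = {"a"}        # L(a) = { a }
L_B = {"b"}        # L(b) = { b }
EPSILON = {""}     # language of the empty string
EMPTY = set()      # language of the empty set
```

Note the difference between the last two: concatenating with `EPSILON` leaves a language unchanged, whereas concatenating with `EMPTY` always yields the empty language.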
Examples of Regular Expressions
▪ Σ = { a, b, c }
▪ Keywords
– keyword = if | while | do | ...
▪ Identifiers
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
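These definitions translate almost directly into Python's `re` module. The sketch below is an illustration, not part of the slides: `is_identifier` is a hypothetical helper, and the keyword pattern contains only the three keywords spelled out above.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"

# identifier = letter(letter|digit)*
IDENTIFIER = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")

# keyword = if | while | do (only the keywords listed explicitly above)
KEYWORD = re.compile(r"if|while|do")


def is_identifier(text):
    """True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None
```

A keyword such as `if` also matches the identifier pattern, so a scanner built from these patterns needs an extra rule to decide between the two.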
▪ /* this is a C comment */
– Writing a for "*" and b for "/": ba(~(ab))*ab
• Not valid because the "not" operator is restricted to single characters
– (~(ab))* could be written as: b*(a*~(a|b)b*)*a*
▪ Disambiguating rules
– Keywords (reserved words) first
– Principle of longest substring
▪ Lookahead problem
– Token delimiters must not be consumed but returned to input stream
– Often one single lookahead character is enough, but sometimes not
▪ while x …
– Keyword while, identifier x
– Space as token delimiter
▪ xtemp=ytemp
– Identifiers xtemp and ytemp
– = as token delimiter
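The two disambiguating rules and the single-character lookahead can be sketched together. The function name `scan_word` and its return shape are hypothetical; the keyword set is the one listed earlier in these slides.

```python
KEYWORDS = {"if", "while", "for", "int", "float"}


def scan_word(text, pos):
    """Scan one keyword or identifier starting at text[pos].

    Returns (kind, lexeme, next_pos). The delimiter at next_pos is only
    inspected, never consumed, so it stays available for the next token.
    """
    if pos >= len(text) or not text[pos].isalpha():
        raise ValueError("scan_word must start at a letter")
    end = pos + 1
    # Principle of longest substring: keep going while the next character fits.
    while end < len(text) and text[end].isalnum():
        end += 1
    lexeme = text[pos:end]
    # Keywords first: a lexeme that is a reserved word is never an identifier.
    kind = "KEYWORD" if lexeme in KEYWORDS else "ID"
    return kind, lexeme, end
```

For `xtemp=ytemp` the scan stops at index 5 with the `=` still unread, and `whilex` is classified as a single identifier rather than the keyword `while` followed by `x`.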
▪ I F ( X 2 .EQ. 0 ) THE N
– Equivalent to IF(X2.EQ.0) THEN
▪ IF(IF.EQ.0)THENTHEN=1.0
– First IF and THEN are keywords
– Second IF and THEN are variables (identifiers)
▪ identifier = letter(letter|digit)*
– Transition graph with two states
– Start state: 1
– Accepting state: 2
– Transitions: arrowed lines
[Figure: state 1 goes to state 2 on letter; state 2 loops on letter and on digit]
▪ Process of recognising xtemp as identifier
– Input characters: x t e m p
– States visited: 1 2 2 2 2 2
▪ A DFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × Σ → S
– A start state s0 ∈ S
– A set of accepting states A ⊆ S
[Figure: sequence of transitions s0 → s1 → s2 → ... → sn-1 → sn on input characters c1 c2 c3 ... cn-1 cn]
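The five components of a DFA map naturally onto a small data structure. The sketch below is one possible encoding, with hypothetical names (`DFA`, `trace`, `accepts`, `classify`); it uses character classes (letter, digit) as alphabet symbols and instantiates the two-state identifier automaton.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset
    states: frozenset
    transitions: dict      # T, stored as {(state, symbol): next state}
    start: int             # s0
    accepting: frozenset   # A

    def trace(self, symbols):
        """Return the visited states s0, s1, ..., sn, or None if T is undefined."""
        state = self.start
        visited = [state]
        for symbol in symbols:
            if (state, symbol) not in self.transitions:
                return None
            state = self.transitions[(state, symbol)]
            visited.append(state)
        return visited

    def accepts(self, symbols):
        visited = self.trace(symbols)
        return visited is not None and visited[-1] in self.accepting


def classify(ch):
    """Map a character to its class; anything else is 'other'."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# identifier = letter(letter|digit)*
IDENT = DFA(
    alphabet=frozenset({"letter", "digit"}),
    states=frozenset({1, 2}),
    transitions={(1, "letter"): 2, (2, "letter"): 2, (2, "digit"): 2},
    start=1,
    accepting=frozenset({2}),
)
```

Tracing `xtemp` through `IDENT` reproduces the state sequence 1 2 2 2 2 2. Missing entries in `transitions` play the role of implicit error transitions.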
Error Transitions
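▪ letter = [a-zA-Z]
▪ other1 = ~letter
▪ other2 = ~(letter|digit)
[Figure: identifier DFA with a letter transition, extended with error transitions on other1 and other2]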
DFA Examples
▪ Set of strings that contain exactly one b
– (a|c)*b(a|c)*
[Figure: DFA with notb loops before and after the b transition]
▪ signedNat = (+|-)? nat
[Figure: DFA with digit transitions]
▪ Pascal comments { }
– {(~})*}
– other = ~}
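▪ C comments
[Figure: DFA with transitions 1 → 2 on /, 2 → 3 on *, 3 → 4 on *, 4 → 5 on /; state 3 loops on other, state 4 loops on * and returns to 3 on other]
– 1: start
– 2: entering comment
– 3: inside comment
– 4: exiting comment
– 5: finish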
while state = 1, 2, 3 or 4 do
  case state of
    1: case input character of
         "/": state := 2;
         else state := 6; { error or other }
       end case;
    2: case input character of
         "*": state := 3;
         else state := 6; { error or other }
       end case;
    3: case input character of
         "*": state := 4;
         else { stay in state 3 }
       end case;
    4: case input character of
         "/": state := 5;
         "*": { stay in state 4 }
         else state := 3;
       end case;
  end case;
  advance input;
end while;
if state = 5 then accept else error;
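A runnable counterpart of the pseudocode above, as a sketch: the function name `is_c_comment` is hypothetical, and two details not covered by the pseudocode are assumptions, namely that running out of input before state 5 counts as an error and that the whole string must be consumed.

```python
def is_c_comment(text):
    """Return True if text is exactly one C comment /* ... */."""
    state = 1
    pos = 0
    while state in (1, 2, 3, 4) and pos < len(text):
        ch = text[pos]
        if state == 1:
            state = 2 if ch == "/" else 6   # 6: error or other
        elif state == 2:
            state = 3 if ch == "*" else 6
        elif state == 3:
            if ch == "*":
                state = 4                   # possible end of comment
            # any other character: stay in state 3
        elif state == 4:
            if ch == "/":
                state = 5                   # comment closed
            elif ch != "*":
                state = 3                   # false alarm, back inside comment
            # "*": stay in state 4
        pos += 1                            # advance input
    # Assumption: accept only in state 5 with all input consumed.
    return state == 5 and pos == len(text)
```

The state is encoded explicitly in a variable here, exactly as in the pseudocode; the nested conditionals grow quickly as the automaton gets larger.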
Transition Table
▪ Transition table
– Two-dimensional array indexed by state and input character
– Expresses values of transition function T
– Extra column indicates accepting states
– Square brackets indicate "noninput-consuming" transitions

[Figure: identifier DFA: 1 → 2 on letter; state 2 loops on letter and digit; 2 → 3 on [other]]

Transition table T for identifiers:

State   letter   digit   other   Accepting
1       2                        No
2       2        2       [3]     No
3                                Yes

Transition table T for C comments:

State   /   *   other   Accepting
1       2               No
2           3           No
3       3   4   3       No
4       5   4   3       No
5                       Yes
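A table-driven scanner for the identifier table can be sketched as below. The names `TABLE`, `ACCEPTING`, `classify` and `scan_identifier` are hypothetical, each entry records whether the transition consumes input (the square brackets in the table), and treating end of input as a character of class other is an assumption.

```python
def classify(ch):
    """Map a character to the column of the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# state -> {character class: (next state, consumes input?)}
# Blank table entries are simply missing and are treated as errors.
TABLE = {
    1: {"letter": (2, True)},
    2: {"letter": (2, True), "digit": (2, True), "other": (3, False)},
    3: {},
}
ACCEPTING = {3}


def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier at text[pos], else None."""
    state = 1
    start = pos
    while state not in ACCEPTING:
        # Assumption: end of input acts as a delimiter of class "other".
        cls = classify(text[pos]) if pos < len(text) else "other"
        entry = TABLE[state].get(cls)
        if entry is None:
            return None                  # blank entry: error
        state, consumes = entry
        if consumes:
            pos += 1                     # [other] leaves the delimiter unread
    return text[start:pos], pos
```

The driver loop stays the same for any automaton; only the table changes, which is the main appeal of this approach compared with hard-coded case statements.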
[Figure: NFA fragments for operator tokens such as "<>" (return NE), "<" (return LT), ":=", "<=" and "="]
▪ ε-transitions
– Connect NFAs of all tokens
– "Spontaneous" transition without consuming any input characters
– "Match" of empty string
▪ Observations
– Any ci may be ε
– c1c2…cn may have fewer than n characters (ε removed)
– Sequence of states s1s2…sn chosen from sets of states T(s0,c1), T(s1,c2), …, T(sn-1,cn) is not always uniquely determined
– Arbitrary number of ε in input stream corresponding to any number of NFA ε-transitions
▪ To do
– Concatenation
– Or
– Repetition
[Figure: NFAs of r and s combined]
Repetition
▪ Input
– A regular expression r
– The NFA of r
▪ Goal
– NFA for regular expression r*
▪ Two new states: start and accepting, connected through ε-transitions
▪ Repetition through ε-transition from accepting to start state of r
▪ Empty string is accepted by ε-transition from start to accepting state
[Figure: NFA of r enclosed by the new start and accepting states]
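The three constructions (concatenation, or, repetition) can be sketched in a few lines. Everything here is an illustrative assumption: an NFA fragment is a tuple (start, accept, edges), `EPS` marks an ε-transition, and the wiring for concatenation and choice follows the usual Thompson-style construction rather than anything spelled out on the slides. The repetition case follows the three points listed above.

```python
from itertools import count

EPS = None            # label of an ε-transition
_fresh = count(1)     # generator of new state numbers


def symbol(c):
    """NFA for a single character c."""
    s, a = next(_fresh), next(_fresh)
    return s, a, [(s, c, a)]


def concat(r, s):
    """rs: ε-transition from the accepting state of r to the start state of s."""
    return r[0], s[1], r[2] + s[2] + [(r[1], EPS, s[0])]


def choice(r, s):
    """r|s: new start and accepting states wired to both fragments."""
    start, acc = next(_fresh), next(_fresh)
    edges = r[2] + s[2] + [(start, EPS, r[0]), (start, EPS, s[0]),
                           (r[1], EPS, acc), (s[1], EPS, acc)]
    return start, acc, edges


def star(r):
    """r*: two new states, a back edge for repetition, a bypass for ε."""
    start, acc = next(_fresh), next(_fresh)
    edges = r[2] + [(start, EPS, r[0]), (r[1], EPS, acc),
                    (r[1], EPS, r[0]),    # repetition
                    (start, EPS, acc)]    # empty string
    return start, acc, edges


def closure(states, edges):
    """All states reachable from `states` through ε-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, text):
    """Simulate the NFA by tracking the set of currently possible states."""
    start, acc, edges = nfa
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = closure(moved, edges)
    return acc in current
```

For example, `choice(concat(symbol("a"), symbol("b")), symbol("a"))` builds an NFA for ab | a, and `star(symbol("a"))` one for a*.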
Example: ab | a
▪ a
▪ b
▪ ab
▪ ab | a
[Figures: NFAs built step by step for a, b, ab and ab | a]
▪ digit
▪ letter | digit
[Figures: NFAs built step by step from letter and digit]
Subset Construction
▪ Convert an NFA into a DFA
▪ ε-closure of a state s is
– Set of states reachable by a series of zero or more ε-transitions
– Denoted as s̄ (s with an overbar)
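[Figure: NFA with states 1 to 8 and transitions on a and b]
▪ Resulting DFA states: {1, 2, 6} goes on a to {3, 4, 7, 8}, which goes on b to {5, 8}
[Figure: NFA with states 1 to 10 and transitions on letter and digit]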
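The ε-closure and the subset construction can be sketched as follows. The NFA encoded here is an assumption: it is an eight-state NFA for ab | a whose transitions were chosen so that the construction yields the DFA states {1, 2, 6}, {3, 4, 7, 8} and {5, 8}; all names (`EPS_MOVES`, `MOVES`, `eps_closure`, `subset_construction`) are hypothetical.

```python
# Assumed NFA for ab | a with states 1..8.
EPS_MOVES = {1: {2, 6}, 3: {4}, 5: {8}, 7: {8}}            # ε-transitions
MOVES = {(2, "a"): {3}, (4, "b"): {5}, (6, "a"): {7}}      # character transitions
START, ACCEPT = 1, 8


def eps_closure(states):
    """States reachable from `states` by zero or more ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for t in EPS_MOVES.get(q, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(alphabet):
    """Build the DFA whose states are sets of NFA states."""
    start = eps_closure({START})
    dfa = {}                     # DFA state -> {character: DFA state}
    worklist = [start]
    while worklist:
        subset = worklist.pop()
        if subset in dfa:
            continue             # already expanded
        dfa[subset] = {}
        for c in alphabet:
            moved = set()
            for q in subset:
                moved |= MOVES.get((q, c), set())
            if moved:
                target = eps_closure(moved)
                dfa[subset][c] = target
                worklist.append(target)
    # A DFA state is accepting if it contains the NFA accepting state.
    accepting = {subset for subset in dfa if ACCEPT in subset}
    return start, dfa, accepting
```

Calling `subset_construction("ab")` starts from the ε-closure of state 1 and only creates the subsets that are actually reachable, which keeps the DFA far smaller than the full power set in typical cases.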
Agenda
▪ Introduction
▪ Regular expressions
▪ Conclusions