Chapter 2 - Lexical Analyser
Instructor: Mohammed O.
Email: momoumer90@gmail.com
Samara University
Chapter Two
This Chapter Covers:
Role of lexical analyser
Token Specification and Recognition
NFA to DFA
Lexical Analyzer
The lexical analyzer reads the source program character by
character to produce tokens.
Normally a lexical analyzer does not return a list of tokens
in one shot; it returns one token each time the parser asks
for one.
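The on-demand behaviour described above can be sketched with a Python generator, where each call to next() plays the role of the parser asking for one token. This is a minimal sketch, not a definitive implementation: the name tokens and the pattern TOKEN_RE are assumptions made for illustration, and only identifiers, integers and single-character symbols are handled.

```python
import re
from typing import Iterator, Tuple

# Hypothetical pattern for this sketch: skip whitespace, then match an
# identifier, an integer, or any single non-space character.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\d+)|(\S))")

def tokens(source: str) -> Iterator[Tuple[str, str]]:
    """Yield one (kind, lexeme) pair each time the caller asks for a token."""
    pos = 0
    while True:
        m = TOKEN_RE.match(source, pos)
        if m is None:          # nothing but whitespace (or nothing at all) is left
            return
        pos = m.end()
        if m.group(1) is not None:
            yield ("id", m.group(1))
        elif m.group(2) is not None:
            yield ("num", m.group(2))
        else:
            yield ("op", m.group(3))

# The "parser" pulls tokens one at a time instead of receiving a full list.
stream = tokens("X = B*1")
first = next(stream)    # the scanner runs only far enough to produce this token
second = next(stream)
```

Because the generator keeps its position between calls, no token is produced before it is requested.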
Tokens/Patterns/Lexemes/Attributes
A token is a sequence of characters that represents a unit
of information in the source program.
Example: X = B*1
Token and associated attributes:
<id, attr> where attr is pointer to the symbol table for X
<assignOp> no attribute is needed (if there is only one assignment operator)
<id, attr> where attr is pointer to the symbol table for B
<multiOp> no attribute is needed
<num,val> where val is the actual value of the number.
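The token/attribute pairs listed above can be reproduced with a short sketch. The function name scan_with_attributes is hypothetical, and a list index stands in for the "pointer to the symbol table" (an assumption made to keep the example small).

```python
import re

def scan_with_attributes(source):
    """Return (tokens, symbol_table) for a tiny expression language."""
    symbol_table = []   # identifier names; the index plays the role of a pointer
    out = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            out.append(("id", symbol_table.index(lexeme)))
        elif lexeme.isdigit():
            out.append(("num", int(lexeme)))      # attribute is the actual value
        elif lexeme == "=":
            out.append(("assignOp", None))        # no attribute needed
        elif lexeme == "*":
            out.append(("multiOp", None))         # no attribute needed
        else:
            raise ValueError(f"unexpected lexeme {lexeme!r}")
    return out, symbol_table

toks, table = scan_with_attributes("X = B*1")
```

Here toks holds five tokens in source order, and table holds the two identifier names X and B.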
Scanner
A scanner groups input characters into tokens.
For example, if the input is
x = x*(b+1);
then the scanner generates the following sequence of tokens:
id(x), =, id(x), *, (, id(b), +, num(1), ), ;
The parser repeatedly calls the scanner until all the tokens
have been read from the input stream or an error is detected
(such as a syntax error).
Some tokens require some extra information.
For example, an identifier is a token (so it is represented by
some number) but it is also associated with a string that
holds the identifier name.
For example, the token id(x) is associated with the string, "x".
Similarly, the token num(1) is associated with the number, 1.
Tokens are specified by patterns, called regular expressions.
For example, the regular expression [a-z][a-zA-Z0-9]*
recognises all identifiers that begin with a lower-case letter
followed by zero or more letters or digits.
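The identifier pattern can be tried out directly with Python's re module. This is a small sketch; the helper name is_identifier is an assumption made for illustration.

```python
import re

# The pattern from the text: one lower-case letter, then letters or digits.
IDENT = re.compile(r"[a-z][a-zA-Z0-9]*")

def is_identifier(text: str) -> bool:
    """True if the whole string matches the identifier pattern."""
    return IDENT.fullmatch(text) is not None

accepted = [s for s in ["x", "x1Y", "X1", "1x", ""] if is_identifier(s)]
```

fullmatch is used rather than match so that trailing characters outside the pattern cause a rejection.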
A typical scanner:
recognises the keywords of the language (these are the
reserved words that have a special meaning in the language,
such as the word class in Java);
recognises and processes special directives (such as the
#include "file" directive in C);
recognises special characters, such as parentheses ( and ),
or groups of special characters, such as := (equal by
definition) and ==;
recognises identifiers, integers, reals, decimals, strings, etc.;
ignores whitespace and comments.
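The duties listed above can be sketched as one table of regular expressions tried in order. The keyword set, the token names and the // comment syntax are all assumptions chosen for illustration; directives are left out to keep the sketch short.

```python
import re

KEYWORDS = {"class", "if", "else", "while"}   # hypothetical keyword set

# Order matters: comments before operators, reals before integers,
# two-character operators before one-character ones.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WS",      r"\s+"),
    ("REAL",    r"\d+\.\d+"),
    ("INT",     r"\d+"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",      r":=|==|[-+*/=();]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Return the list of (kind, lexeme) tokens, skipping whitespace and comments."""
    result = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("WS", "COMMENT"):
            continue                       # ignored, never handed to the parser
        if kind == "ID" and text in KEYWORDS:
            kind = "KEYWORD"               # reserved words look like identifiers
        result.append((kind, text))
    return result

example = scan("x = x*(b+1); // done")
```

Keywords are matched as identifiers first and reclassified afterwards, which avoids one pattern per reserved word.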
Hand Implementation
There are two approaches to implementing a scanner by hand:
Input buffer approach
Transition diagram approach
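A transition diagram can be hand-coded as a loop over an explicit state variable. The sketch below assumes a two-state diagram for the identifier pattern [a-z][a-zA-Z0-9]*; the function name match_identifier and the state numbering are assumptions made for illustration.

```python
def match_identifier(text, start=0):
    """Return the offset just past the identifier at `start`, or None if there is none."""
    state, i = 0, start
    while True:
        ch = text[i] if i < len(text) else ""   # "" stands for end of input
        if state == 0:                          # start state: need a lower-case letter
            if "a" <= ch <= "z":
                state, i = 1, i + 1
            else:
                return None                     # no edge leaves state 0 on this character
        else:                                   # state 1: inside an identifier
            if ch.isascii() and ch.isalnum():
                i += 1                          # loop on letters and digits
            else:
                return i                        # accept; ch is left for the next token

end = match_identifier("x1 = 2")
```

The character that ends the identifier is examined but not consumed, so the next token starts from it.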
Input Buffering
The lexical analyser scans the characters of the source
program one at a time to discover tokens.
Often, many characters beyond the next token may have to be
examined before the next token itself can be determined.
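The need for lookahead can be sketched with a small buffer that reads the input in chunks and lets the scanner peek before it commits. The class InputBuffer, its methods and the function next_operator are assumptions made for illustration; the example decides between = and ==, which requires looking one character past the first =.

```python
import io

class InputBuffer:
    """Reads the source in fixed-size chunks and supports lookahead."""

    def __init__(self, stream, chunk_size=4096):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buf = ""
        self.pos = 0

    def _fill(self, need):
        # Read until `need` unread characters are buffered or the stream ends.
        while len(self.buf) - self.pos < need:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break
            self.buf = self.buf[self.pos:] + chunk   # drop consumed text, append new
            self.pos = 0

    def peek(self, k=1):
        """Return up to k upcoming characters without consuming them."""
        self._fill(k)
        return self.buf[self.pos:self.pos + k]

    def advance(self, k=1):
        """Consume and return up to k characters."""
        text = self.peek(k)
        self.pos += len(text)
        return text

def next_operator(buf):
    """Decide between '==' and '=' by looking one character ahead."""
    if buf.peek(2) == "==":
        return ("eqOp", buf.advance(2))
    if buf.peek(1) == "=":
        return ("assignOp", buf.advance(1))
    return None

demo = InputBuffer(io.StringIO("==="), chunk_size=1)
ops = [next_operator(demo), next_operator(demo), next_operator(demo)]
```

The chunk size of 1 in the demo forces a refill on every read, which exercises the case where a token straddles two chunks.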
For L1 = {a,b,c,d} and L2 = {1,2}:
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2} (concatenation)
L1 ∪ L2 = {a,b,c,d,1,2} (union)
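Both results can be checked with Python sets. This is a minimal sketch: L1 = {a,b,c,d} and L2 = {1,2} are the operand sets implied by the results above, and the helper names concat and union are assumptions made for illustration.

```python
L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}

def concat(A, B):
    """Concatenation: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}

def union(A, B):
    """Union: strings that belong to A or to B."""
    return A | B

L1L2 = concat(L1, L2)
L1_or_L2 = union(L1, L2)
```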
Converting an NFA into a DFA
ε-closure({0}) = {0,1,2,4,7} = S0
mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S0,a] ← S1    transfunc[S0,b] ← S2
mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1    transfunc[S1,b] ← S2
mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1    transfunc[S2,b] ← S2
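The computation above can be replayed in code. The NFA table below is an assumption: it is reconstructed from the closures and moves shown above (a nine-state NFA for (a|b)*a with state 8 accepting), since the diagram itself is not part of the text. The function names are likewise chosen for illustration.

```python
EPS = ""   # label used in this sketch for ε-transitions

# NFA reconstructed from the worked closures (an assumption, see lead-in).
NFA = {
    0: {EPS: {1, 7}},
    1: {EPS: {2, 4}},
    2: {"a": {3}},
    3: {EPS: {6}},
    4: {"b": {5}},
    5: {EPS: {6}},
    6: {EPS: {1, 7}},
    7: {"a": {8}},
    8: {},
}

def eps_closure(states):
    """All NFA states reachable from `states` using ε-transitions only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """NFA states reachable from `states` on one `symbol` transition."""
    return {t for s in states for t in NFA[s].get(symbol, ())}

def subset_construction(start, alphabet):
    start_set = eps_closure({start})
    dstates = [start_set]        # DFA states in discovery order: S0, S1, ...
    unmarked = [start_set]
    transfunc = {}
    while unmarked:
        S = unmarked.pop(0)      # "mark" S
        for a in alphabet:
            T = eps_closure(move(S, a))
            if not T:
                continue         # no transition on this symbol
            if T not in dstates:
                dstates.append(T)
                unmarked.append(T)
            transfunc[(dstates.index(S), a)] = dstates.index(T)
    return dstates, transfunc

dstates, transfunc = subset_construction(0, "ab")
accepting = [i for i, S in enumerate(dstates) if 8 in S]
```

Under this reconstruction, a DFA state is accepting when its set contains NFA state 8, which here is only S1.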
Converting an NFA into a DFA (Cont.)
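Syntax tree of (a|b)*a#
[Figure: syntax tree. The root concatenates (a|b)*a with #; the
left subtree concatenates (a|b)* with a. Leaves, left to right:
a (1), b (2), a (3), # (4).]
Each symbol is numbered (its position).
Each symbol is at a leaf.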
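The numbering of leaves can be sketched by writing the syntax tree as nested tuples and walking it left to right. The tuple encoding ("cat", "or", "star") and the function name number_leaves are assumptions made for illustration.

```python
# Syntax tree for (a|b)*a# : an inner node is (operator, child, ...),
# a leaf is the symbol itself.
TREE = ("cat",
        ("cat",
         ("star", ("or", "a", "b")),
         "a"),
        "#")

def number_leaves(tree):
    """Map positions 1, 2, 3, ... to leaf symbols in left-to-right order."""
    positions = {}

    def walk(node):
        if isinstance(node, str):                 # a leaf: give it the next position
            positions[len(positions) + 1] = node
        else:
            for child in node[1:]:                # node[0] is the operator name
                walk(child)

    walk(tree)
    return positions

positions = number_leaves(TREE)
```

The end marker # receives the last position, 4, as in the figure.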
G1 = {2}
G2 = {1,3}
State   a   b
1       2   3
2       2   3
3       4   3