Lexical Analysis: Leonidas Fegaras

CSE 5317/4305
L2: Lexical Analysis
Lexical Analysis
A scanner groups input characters into tokens
input: x = x * (acc+123)
token        value
identifier   x
equal        =
identifier   x
star         *
left-paren   (
identifier   acc
plus         +
integer      123
right-paren  )
Tokens are typically represented by numbers
Communication with the Parser

[Figure: source file --(get next character)--> scanner --(token)--> parser --> AST; the parser requests each token from the scanner (get token)]
Each time the parser needs a token, it sends a request to the scanner
the scanner reads as many characters from the input stream as necessary to construct a single token
when a single token is formed, the scanner is suspended and returns the token to the parser
the parser will repeatedly call the scanner to read all the tokens from the input stream
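The request/return protocol described here can be sketched in Java as a scanner object with a single method that the parser calls whenever it needs a token. The class names PullScanner and TokenLoop, the method nextToken(), and the simplified token rules (letters, digits, single-character symbols) are assumptions made for this illustration; they are not part of the course code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pull-style scanner: each call to nextToken() consumes just
// enough characters to form one token and then returns control to the caller.
final class PullScanner {
    private final String input;
    private int pos = 0;   // index of the next unread character

    PullScanner(String input) { this.input = input; }

    /** Returns the text of the next token, or null at end of input. */
    String nextToken() {
        // skip whitespace between tokens
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return null;
        int start = pos;
        char c = input.charAt(pos);
        if (Character.isLetter(c)) {
            // identifier: a letter followed by letters or digits
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        } else if (Character.isDigit(c)) {
            // integer: one or more digits
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        } else {
            pos++;   // single-character token such as =, *, (, +, )
        }
        return input.substring(start, pos);
    }
}

// Stand-in for the parser: it only collects tokens, requesting them one at a time.
final class TokenLoop {
    static List<String> readAll(String source) {
        PullScanner scanner = new PullScanner(source);
        List<String> tokens = new ArrayList<>();
        for (String t = scanner.nextToken(); t != null; t = scanner.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

Calling TokenLoop.readAll("x = x * (acc+123)") returns the nine tokens x, =, x, *, (, acc, +, 123, ) in that order.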
Tasks of a Scanner
A typical scanner:
recognizes the keywords of the language
these are the reserved words that have a special meaning in the language, such as the word class in Java
recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
recognizes identifiers, integers, reals, decimals, strings, etc
ignores whitespace (tabs, blanks, etc) and comments
recognizes and processes special directives (such as the #include "file" directive in C) and macros
Scanner Generators
Input: a scanner specification
describes every token using Regular Expressions (REs)
eg, the RE [a-z][a-zA-Z0-9]* recognizes all identifiers of one or more alphanumeric characters whose first character is a lower-case letter
handles whitespace and resolves ambiguities
Output: the actual scanner
Scanner generators compile regular expressions into efficient programs (finite state machines)
You will use a scanner generator for Java, called JLex, for the project
Regular Expressions
are a very convenient form of representing (possibly infinite) sets of strings, called regular sets
eg, the RE (a | b)*aa represents the infinite set
{aa,aaa,baa,abaa, ... }
a RE is one of the following:
name           RE      designation
epsilon        ε       {ε}
symbol         a       {a} for some character a
concatenation  AB      the set { rs | r ∈ A, s ∈ B }, where rs is string concatenation, and A and B designate the REs for A and B
alternation    A | B   the set A ∪ B, where A and B designate the REs for A and B
repetition     A*      the set ε | A | (AA) | (AAA) | ... (an infinite set)
eg, the RE (a | b)c designates { rs | r ∈ {a} ∪ {b}, s ∈ {c} }, which is equal to {ac,bc}
Shortcuts: P+ = PP*, P? = P | ε, [a-z] = (a|b|...|z)
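To make the idea of a regular set concrete, the following sketch lists the members of a set up to a given length by brute force. It relies on java.util.regex, whose syntax agrees with the notation above for simple REs like these; the class RegularSetDemo and the method members are names invented for the illustration, and the alphabet is fixed to {a,b}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegularSetDemo {
    /**
     * Lists the strings over {a,b} of length <= maxLen that belong to the
     * regular set of the given RE, shortest first.
     */
    static List<String> members(String re, int maxLen) {
        Pattern p = Pattern.compile(re);
        List<String> result = new ArrayList<>();
        List<String> level = new ArrayList<>();   // all strings of the current length
        level.add("");
        for (int len = 0; len <= maxLen; len++) {
            List<String> next = new ArrayList<>();
            for (String s : level) {
                if (p.matcher(s).matches()) result.add(s);   // whole-string match
                next.add(s + "a");
                next.add(s + "b");
            }
            level = next;
        }
        return result;
    }
}
```

For example, members("(a|b)*aa", 3) returns [aa, aaa, baa], the first three elements of the infinite set shown earlier, and members("a+", 4) returns the same list as members("aa*", 4), which agrees with the shortcut P+ = PP*.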
Properties
concatenation and alternation are associative
eg, ABC means (AB)C and is equivalent to A(BC)
alternation is commutative
eg, A | B = B | A
repetition is idempotent
eg, A** = A*
concatenation distributes over alternation

eg, (a | b)c = ac | bc
Examples
for-keyword = for
letter      = [a-zA-Z]
digit       = [0-9]
identifier  = letter (letter | digit)*
sign        = + | - | ε
integer     = sign (0 | [1-9]digit*)
decimal     = integer . digit*
real        = (integer | decimal) E sign digit+
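These definitions translate almost directly into java.util.regex patterns, which gives a quick way to try them on sample input. This is a sketch only: the class name TokenPatterns and the helper matches are assumptions for the illustration, the optional sign is written with ? (the P? shortcut), and the literal dot and plus are escaped because they are operators in Java regex syntax.

```java
import java.util.regex.Pattern;

final class TokenPatterns {
    static final String LETTER     = "[a-zA-Z]";
    static final String DIGIT      = "[0-9]";
    static final String IDENTIFIER = LETTER + "(" + LETTER + "|" + DIGIT + ")*";
    static final String SIGN       = "(\\+|-)?";                        // + | - | epsilon
    static final String INTEGER    = SIGN + "(0|[1-9]" + DIGIT + "*)";
    static final String DECIMAL    = INTEGER + "\\." + DIGIT + "*";
    static final String REAL       = "(" + INTEGER + "|" + DECIMAL + ")E" + SIGN + DIGIT + "+";

    /** True if the whole string s belongs to the set designated by re. */
    static boolean matches(String re, String s) {
        return Pattern.matches(re, s);
    }
}
```

With these patterns, matches(INTEGER, "123") and matches(REAL, "3.14E-10") hold, while matches(INTEGER, "012") does not, because the integer definition forbids a leading zero in a multi-digit number.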
Disambiguation Rules
1) longest match rule: from all tokens that match the input prefix, choose the one that matches the most characters
2) rule priority: if more than one token has the longest match, choose the one listed first
Examples:
for8: is it the for-keyword, the identifier f, the identifier fo, the identifier for, or the identifier for8? Use rule 1: for8 matches the most characters.
for: is it the for-keyword, the identifier f, the identifier fo, or the identifier for? Use rules 1 & 2: the for-keyword and the for identifier have the longest match, but the for-keyword is listed first.
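The two rules can be imitated with an ordered list of patterns: try every rule at the start of the input, keep the longest match, and let the earlier rule win a tie. The sketch below is not how a generated scanner works internally (that uses a single DFA); the class Disambiguator, the method firstToken, and the three rules are assumptions for the illustration. Note that java.util.regex is a backtracking matcher, so lookingAt() yields the longest prefix only for simple patterns such as these.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Disambiguator {
    // Rules in priority order (LinkedHashMap preserves insertion order).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("for-keyword", Pattern.compile("for"));
        RULES.put("identifier", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("integer", Pattern.compile("[0-9]+"));
    }

    /** Returns "tokenName:lexeme" for the best match at the start of input, or null if no rule matches. */
    static String firstToken(String input) {
        String bestName = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input (rule 1 compares lengths);
            // the strict '>' keeps the earlier rule when lengths are equal (rule 2)
            if (m.lookingAt() && m.end() > bestLen) {
                bestName = rule.getKey();
                bestLen = m.end();
            }
        }
        return bestName == null ? null : bestName + ":" + input.substring(0, bestLen);
    }
}
```

Here firstToken("for8") returns "identifier:for8" by the longest match rule, and firstToken("for") returns "for-keyword:for" because both rules match three characters and the keyword is listed first.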
How Scanner Generators Work

Translate REs into a finite state machine
Done in three steps:
1) translate REs into a nondeterministic finite automaton (NFA)
2) translate the NFA into a deterministic finite automaton (DFA)
3) optimize the DFA (optional)
Deterministic Finite Automata

A DFA represents a finite state machine that recognizes a RE
eg, the RE (abc+)+ is represented by the DFA:
A finite automaton consists of
a finite set of states
a set of transitions (moves)
one start state
a set of final states (accepting states)
a DFA has a unique transition for every state-character combination
A DFA accepts a string if, starting from the start state and moving from state to state, each time following the arrow that corresponds to the current input character, it reaches a final state when the entire input string is consumed
DFA (cont.)
The error state 0 is implied:
The transition table T gives the next state T[s,c] for a state s and a character c
     a  b  c
0    0  0  0
1    2  0  0
2    0  3  0
3    0  0  4
4    2  0  4
The DFA of a Scanner

for-keyword = for
identifier = [a-z][a-z0-9]*
Scanner Code
The scanner code that uses the transition table T:
state = initial_state;
current_character = get_next_character();
while ( true )
{   next_state = T[state,current_character];
    if (next_state == ERROR)
       break;
    state = next_state;
    current_character = get_next_character();
    if ( current_character == EOF )
       break;
};
if ( is_final_state(state) )
   `we have a valid token'
else `report an error'
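A runnable Java version of this loop, using the transition table given earlier for the RE (abc+)+, might look as follows. It is a sketch under some assumptions: state 1 is taken as the start state and state 4 as the only final state (as the table suggests), the input is a String rather than a stream, and the names TableScanner, column, and scan are invented for the illustration.

```java
final class TableScanner {
    static final int ERROR = 0, START = 1, FINAL = 4;

    // T[state][column]; the columns are a, b, c. Row 0 is the implied error state.
    static final int[][] T = {
        {0, 0, 0},
        {2, 0, 0},
        {0, 3, 0},
        {0, 0, 4},
        {2, 0, 4},
    };

    /** Maps a character to its table column, or -1 if it is outside the alphabet. */
    static int column(char c) {
        return (c >= 'a' && c <= 'c') ? c - 'a' : -1;
    }

    /**
     * Follows transitions until the error state or the end of input is reached.
     * Returns the number of characters consumed if the scanner stopped in a
     * final state (a valid token), or -1 to report an error.
     */
    static int scan(String input) {
        int state = START;
        int pos = 0;
        while (pos < input.length()) {
            int col = column(input.charAt(pos));
            int next = (col < 0) ? ERROR : T[state][col];
            if (next == ERROR) break;
            state = next;
            pos++;
        }
        return state == FINAL ? pos : -1;
    }
}
```

For instance, scan("abcx") returns 3, while scan("abcab") returns -1: the scanner stops in the non-final state 3 even though the prefix abc is a valid token, which is the limitation that longest match with backtracking addresses.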
With Longest Match

state = initial_state;
final_state = ERROR;
current_character = get_next_character();
while ( true )
{   next_state = T[state,current_character];
    if (next_state == ERROR)
       break;
    state = next_state;
    if ( is_final_state(state) )
       final_state = state;
    current_character = get_next_character();
    if (current_character == EOF)
       break;
};
if ( final_state == ERROR )
   `report an error'
else if ( state != final_state )
   `we have a valid token but need to backtrack (to put characters back into the input stream)'
else `we have a valid token'
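The longest-match loop can be sketched in Java for the same (abc+)+ table. One deliberate difference from the pseudocode: instead of remembering the last final state, this version remembers the input position at which a final state was last seen, since that position is what is needed to put characters back. The class and method names are assumptions for the illustration, and state 1 and state 4 are again taken as the start and final states.

```java
final class LongestMatchScanner {
    static final int ERROR = 0, START = 1, FINAL = 4;

    // T[state][column]; the columns are a, b, c. Row 0 is the implied error state.
    static final int[][] T = {
        {0, 0, 0},
        {2, 0, 0},
        {0, 3, 0},
        {0, 0, 4},
        {2, 0, 4},
    };

    /**
     * Returns the length of the longest prefix of input that forms a valid
     * token, or -1 if no prefix does. Characters read past that length are
     * the ones a real scanner would push back into the input stream.
     */
    static int longestToken(String input) {
        int state = START;
        int pos = 0;
        int lastFinalPos = -1;   // position just after the last final state seen
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int next = (c >= 'a' && c <= 'c') ? T[state][c - 'a'] : ERROR;
            if (next == ERROR) break;
            state = next;
            pos++;
            if (state == FINAL) lastFinalPos = pos;
        }
        return lastFinalPos;
    }
}
```

On the input abcab the loop reads all five characters but returns 3, so the trailing ab has to be backtracked over; on ab it returns -1 because no final state was ever reached.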
Alternative Scanner Code

For each transition in a DFA from state s1 to state s2 over a character c, generate code:
s1: current_character = get_next_character();
    ...
    if ( current_character == 'c' )
       goto s2;
    ...
s2: current_character = get_next_character();
    ...
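Java has no goto statement, so a direct-coded scanner in Java is usually approximated with a switch inside a loop, where each case plays the role of one labelled block. The sketch below hard-wires the DFA for (abc+)+ this way instead of consulting a table; the class name DirectCodedScanner is an assumption, state 1 is taken as the start state and state 4 as the final state, and -1 stands in for end of input.

```java
final class DirectCodedScanner {
    /** Recognizes (abc+)+ with the automaton written as control flow rather than as a table. */
    static boolean accepts(String s) {
        int pos = 0;
        int state = 1;
        while (true) {
            // read the next character; -1 marks the end of the input
            int c = pos < s.length() ? s.charAt(pos++) : -1;
            switch (state) {
                case 1:
                    if (c == 'a') { state = 2; break; }
                    return false;
                case 2:
                    if (c == 'b') { state = 3; break; }
                    return false;
                case 3:
                    if (c == 'c') { state = 4; break; }
                    return false;
                case 4:
                    if (c == 'c') { state = 4; break; }
                    if (c == 'a') { state = 2; break; }
                    return c == -1;   // final state: accept only if the input is exhausted
                default:
                    return false;
            }
        }
    }
}
```

The trade-off is the usual one: the transitions are compiled into branches, which avoids table lookups, but the code has to be regenerated whenever the token definitions change.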
Mapping a RE into an NFA

An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε (empty transitions)
The following rules construct NFAs with only one final state:
Example
The RE (a | b)c is mapped into the NFA:
Converting an NFA to a DFA

Subset construction:
assign a number to each NFA state
each DFA state will be assigned a set of numbers
the closure of a DFA state {n1,...,nk} is the DFA state that contains all the NFA states that can be reached by zero or more empty transitions (ie, ε transitions) from the NFA states n1, ..., or nk
so the closure of {n1,...,nk} is a superset of or equal to {n1,...,nk}
the initial DFA state is the closure of the initial NFA state
for every DFA state labelled by some set {n1,...,nk} and for every character c in the language alphabet, you find all the states reachable from n1, ..., or nk using c arrows and you union together the closures of these states. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label
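The steps of the subset construction can be sketched in Java as follows. Everything specific here is an assumption made for the illustration: the class and method names, the use of character 0 to mark an empty transition, and the particular seven-state NFA for (a | b)c built in example(), whose numbering is chosen for this sketch and need not match any figure.

```java
import java.util.*;

final class SubsetConstruction {
    static final char EPS = 0;   // marks an empty transition in this sketch

    // NFA transitions: state -> (symbol -> set of target states)
    private final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(symbol, k -> new TreeSet<>())
           .add(to);
    }

    private Set<Integer> targets(int state, char symbol) {
        Map<Character, Set<Integer>> edges = nfa.get(state);
        if (edges == null || !edges.containsKey(symbol)) return Collections.emptySet();
        return edges.get(symbol);
    }

    /** All NFA states reachable from the given states by zero or more empty transitions. */
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS)) {
                if (result.add(t)) work.push(t);   // newly reached state: explore it too
            }
        }
        return result;
    }

    /** Returns the DFA as: DFA state (a set of NFA states) -> (character -> DFA state). */
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> initial = closure(Collections.singleton(start));
        dfa.put(initial, new TreeMap<>());
        work.add(initial);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> moved = new TreeSet<>();
                for (int n : current) moved.addAll(targets(n, c));
                if (moved.isEmpty()) continue;        // leads to the implied error state
                Set<Integer> next = closure(moved);   // union of the closures of the reached states
                dfa.get(current).put(c, next);
                if (!dfa.containsKey(next)) {         // a label not seen so far: new DFA node
                    dfa.put(next, new TreeMap<>());
                    work.add(next);
                }
            }
        }
        return dfa;
    }

    /** A hypothetical NFA for (a | b)c: start state 0, final state 6. */
    static SubsetConstruction example() {
        SubsetConstruction n = new SubsetConstruction();
        n.addEdge(0, EPS, 1); n.addEdge(0, EPS, 3);
        n.addEdge(1, 'a', 2); n.addEdge(3, 'b', 4);
        n.addEdge(2, EPS, 5); n.addEdge(4, EPS, 5);
        n.addEdge(5, 'c', 6);
        return n;
    }
}
```

For this NFA, example().toDfa(0, "abc") produces four DFA states: {0,1,3} (initial), {2,5} reached over a, {4,5} reached over b, and {6} reached over c from either of the previous two. A DFA state is final when its set contains the NFA final state, so only {6} is final here.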
Example
(a | b)*(abb | a+b)
JLex
Regular expressions (where e and f are regular expressions):
c        any character c other than: ? * + | ( ) ^ $ . [ ] { } " \
\c       any character c, but \n is newline, \^c is control-c, etc
.        any character except \n
"..."    the concatenation of all the characters in the string
ef       concatenation
e | f    alternation
e*       Kleene closure
e+       ee*
e?       optional e
{name}   macro expansion
[...]    any character enclosed in [ ] (but only one character), from:
           c      a character c (or use \c)
           ef     any character from e or from f
           a-b    any character from a to b
           "..."  any character in the string
[^...]   any character except those enclosed by [ ]
JLex Rules
A JLex rule: RE { action } where action is Java code
typically, the action returns a token, but for whitespace and comments the action has no return statement, so they are skipped
yytext() returns the part of the input that matches the RE
JLex uses longest match and rule priority
States and state transitions can be used for better control
the initial (default) state is YYINITIAL
any other state should be declared using the %state directive
now a rule can take the form:
<s> RE { action }
which can match only if we are in state s
you jump to a state s using yybegin(s)
Case Study: The Calculator Scanner

The calculator example is available at: http://lambda.uta.edu/cse5317/calc.tar.gz
After you download it on gamma, do:
tar xfz calc.tar.gz
cd calc
build
run
then try it with some input; eg,
2*(3+8);
x:=3+4;
x+3;
define f(n) = if n=0 then 1 else n*f(n-1);
f(5);
quit;
Tokens are Defined in calc.cup

terminal LP, RP, COMMA, SEMI, ASSIGN, IF, THEN, ELSE, AND, OR, NOT, QUIT, PLUS, TIMES, MINUS, DIV, EQ, LT, GT, LE, NE, GE, FALSE, TRUE, DEFINE;
terminal String ID;
terminal Integer INT;
terminal Float REALN;
terminal String STRINGT;
The class constructor Symbol pairs together a terminal token with an optional value (a Java Object)
if a terminal is specified with a class (a subtype of Object) then an object of this class should be provided along with the token
eg, Symbol(sym.ID,x)
eg, Symbol(sym.INT,10)
The Calculator Scanner

import java_cup.runtime.Symbol;
%%
%class CalcLex
%public
%line
%char
%cup
DIGIT=[0-9]
ID=[a-zA-Z][a-zA-Z0-9_]*
%%
The Calculator Scanner (cont.)

{DIGIT}+              { return new Symbol(sym.INT,new Integer(yytext())); }
{DIGIT}+"."{DIGIT}+   { return new Symbol(sym.REALN,new Float(yytext())); }
"("                   { return new Symbol(sym.LP); }
")"                   { return new Symbol(sym.RP); }
","                   { return new Symbol(sym.COMMA); }
";"                   { return new Symbol(sym.SEMI); }
":="                  { return new Symbol(sym.ASSIGN); }
"define"              { return new Symbol(sym.DEFINE); }
"quit"                { return new Symbol(sym.QUIT); }
"if"                  { return new Symbol(sym.IF); }
"then"                { return new Symbol(sym.THEN); }
"else"                { return new Symbol(sym.ELSE); }
"and"                 { return new Symbol(sym.AND); }
"or"                  { return new Symbol(sym.OR); }
"not"                 { return new Symbol(sym.NOT); }
"false"               { return new Symbol(sym.FALSE); }
"true"                { return new Symbol(sym.TRUE); }
The Calculator Scanner (cont.)

"+"                   { return new Symbol(sym.PLUS); }
"*"                   { return new Symbol(sym.TIMES); }
"-"                   { return new Symbol(sym.MINUS); }
"/"                   { return new Symbol(sym.DIV); }
"="                   { return new Symbol(sym.EQ); }
"<"                   { return new Symbol(sym.LT); }
">"                   { return new Symbol(sym.GT); }
"<="                  { return new Symbol(sym.LE); }
"!="                  { return new Symbol(sym.NE); }
">="                  { return new Symbol(sym.GE); }
{ID}                  { return new Symbol(sym.ID,yytext()); }
\"[^\"]*\"            { return new Symbol(sym.STRINGT, yytext().substring(1,yytext().length()-1)); }
[ \t\r\n\f]           { /* ignore white spaces. */ }
.                     { System.err.println("Illegal character: "+yytext()); }
