Chapter 2 Lexical Analysis
Introduction
• The word “lexical” in the traditional sense means “pertaining to words”.
• In terms of programming languages, words are entities like variable names,
numbers, keywords etc.
• Such word-like entities are traditionally called tokens.
• A lexical analyser, also called a lexer or scanner, takes a string of
individual letters as input and divides this string into a sequence of
classified tokens.
• Additionally, it filters out whatever separates the tokens (the so-called
white-space), i.e., layout characters (spaces, newlines etc.) and comments.
• The main purpose of lexical analysis is to make life easier for the subsequent
syntax analysis phase.
Why a Separate Lexical Analysis Phase?
• Efficiency: A specialised lexer may do the simple parts of the work
faster than the parser, using more general methods, can.
• Modularity: The syntactical description of the language need not be
cluttered with small lexical details such as white-space and
comments.
• Portability: Input-device-specific peculiarities can be restricted to
the lexical analyzer, which enhances compiler portability.
• Lexers are normally constructed by lexer generators, which transform
human-readable specifications of tokens and white-space into
efficient programs.
• It is common for the lexical analyzer to interact with the symbol table
as well.
• When the lexical analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the symbol table.
• Another task of the lexical analyzer is correlating error messages
generated by the compiler with the source program.
• Sometimes, lexical analyzers are divided into a cascade of two
processes:
• a) Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
• b) Lexical analysis proper is the more complex portion, where the
scanner produces the sequence of tokens as output.
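The scanning pass described in (a) can be sketched in C. This is a minimal illustration, not a production scanner: the function name scan_compact is hypothetical, only C-style /* ... */ comments are handled, and string literals are not treated specially.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies src to dst, deleting C-style comments and
   compacting every run of whitespace into a single space. Leading and
   trailing whitespace is dropped. dst must have room for strlen(src) + 1
   bytes. Returns the number of characters written (excluding the NUL). */
size_t scan_compact(const char *src, char *dst)
{
    size_t n = 0;
    int pending_space = 0;

    assert(src != NULL && dst != NULL);
    while (*src) {
        if (src[0] == '/' && src[1] == '*') {
            /* Skip the comment body up to the closing delimiter or end of input. */
            src += 2;
            while (*src && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src)
                src += 2;
            pending_space = 1;          /* a comment separates tokens */
        } else if (isspace((unsigned char)*src)) {
            pending_space = 1;
            src++;
        } else {
            if (pending_space && n > 0)
                dst[n++] = ' ';
            pending_space = 0;
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called on the text "a  /* c */ b" followed by a newline, the function writes "a b" into dst. The second process, lexical analysis proper, would then run on this simplified text.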
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute
value.
• The token name is an abstract symbol representing a kind of lexical
unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier.
• A pattern is a description of the form that the lexemes of a token may
take.
• In the case of a keyword as a token, the pattern is just the sequence
of characters that form the keyword.
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyzer as an instance of that token.
• Example: in the C statement
printf("Total = %d\n", score);
• both printf and score are lexemes matching the pattern for
token id, and
• "Total = %d\n" is a lexeme matching literal.
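• In many programming languages, the following classes cover most or
all of the tokens:
• 1. One token for each keyword; the pattern for a keyword is the keyword itself.
• 2. Tokens for the operators, either individually or in classes such as comparison.
• 3. One token representing all identifiers.
• 4. One or more tokens representing constants, such as numbers and literal strings.
• 5. Tokens for each punctuation symbol, such as parentheses, comma, and semicolon.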
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer
must provide the subsequent compiler phases additional information
about the particular lexeme that matched.
• For example, the pattern for token number matches both 0 and 1,
but it is extremely important for the code generator to know which
lexeme was found in the source program.
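A token with an attribute can be represented as a small C struct. The sketch below is illustrative only: the names TokenName, Token and next_token are assumptions, the set of token classes is deliberately tiny, and no overflow check is made on numbers.

```c
#include <assert.h>
#include <ctype.h>

/* Token names: abstract symbols for kinds of lexical units (assumed set). */
enum TokenName { TOK_EOF, TOK_ID, TOK_NUMBER, TOK_OTHER };

/* A token is a name plus an optional attribute value. */
struct Token {
    enum TokenName name;
    int value;            /* attribute, meaningful only for TOK_NUMBER */
    const char *lexeme;   /* start of the lexeme inside the input */
    int length;           /* lexeme length in characters */
};

/* Reads one token starting at *p and advances *p past its lexeme. */
struct Token next_token(const char **p)
{
    struct Token t = { TOK_EOF, 0, 0, 0 };

    assert(p != 0 && *p != 0);
    while (isspace((unsigned char)**p))     /* filter out white-space */
        (*p)++;
    t.lexeme = *p;
    if (**p == '\0')
        return t;

    if (isdigit((unsigned char)**p)) {
        t.name = TOK_NUMBER;
        while (isdigit((unsigned char)**p)) {
            /* The attribute records which number was actually seen. */
            t.value = t.value * 10 + (**p - '0');
            (*p)++;
        }
    } else if (isalpha((unsigned char)**p) || **p == '_') {
        t.name = TOK_ID;
        while (isalnum((unsigned char)**p) || **p == '_')
            (*p)++;
    } else {
        t.name = TOK_OTHER;                 /* any other single character */
        (*p)++;
    }
    t.length = (int)(*p - t.lexeme);
    return t;
}
```

For the input "x1 = 0 + 1", the lexemes 0 and 1 both yield the token name TOK_NUMBER, and only the value field tells later phases which one was found.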
Finite Automata
• For lexical analysis, specifications are traditionally written using
regular expressions: An algebraic notation for describing sets of
strings.
• The generated lexers are in a class of extremely simple programs
called finite automata.
• Regular Expressions:
• The set of all integer constants or the set of all variable names are
examples of sets of strings, where the individual digits or letters used
to form these constants or names are taken from a particular alphabet,
i.e., a set of characters.
• A set of strings is called a language.
• For integers, the alphabet consists of the digits 0–9 and for variable
names the alphabet contains both letters and digits (and perhaps a
few other characters, such as underscore).
• Given an alphabet, we will describe sets of strings by regular
expressions, an algebraic notation that is compact and relatively easy
for humans to use and understand.
• Regular expressions are often called “regexps” for short.
• To find L(s) for a given regular expression s, we use derivation: Rules
that rewrite a regular expression into a string of letters.
• These rules allow a single regular expression to be rewritten into several
different strings, so L(s) is the set of strings that s can be rewritten to
using these rules.
• We can use the derivation rules to find the language for a regular
expression.
• As an example, L(a(b|c)) = {ab, ac}, because a(b|c) ⇒ a(b) = ab and
a(b|c) ⇒ a(c) = ac.
• L((a|b)∗) is infinite and contains any sequence of a's and b's, including
the empty sequence.
Properties of Regular Expressions
• In a regular expression, x* means zero or more occurrences of x. It can generate
{ε, x, xx, xxx, xxxx, ...}
• In a regular expression, x+ means one or more occurrences of x. It can generate
{x, xx, xxx, xxxx, ...}
• Example 3:
• Write the regular expression for the language accepting all strings
containing any number of a's and b's.
• The regular expression will be:
• r.e. = (a + b)*
• Here + between two expressions denotes alternation, which is written | in
the notation above, so this is the same as (a|b)*.
• This gives the set L = {ε, a, aa, b, bb, ab, ba, aba, bab, ...}, i.e., any
combination of a's and b's.
• Example 1:
• Write the regular expression for the language accepting all strings which
start with 1 and end with 0, over ∑ = {0, 1}.
• In the regular expression, the first symbol should be 1, and the last symbol
should be 0. The r.e. is as follows:
• R = 1 (0+1)* 0
• Example 2:
• Write the regular expression for the language of strings starting and ending
with a and having any number of b's in between.
• The regular expression will be:
• R = a b* a
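The example languages above can be checked mechanically with the POSIX regular expression functions regcomp and regexec. This sketch assumes a POSIX system that provides <regex.h> (it is not part of standard C and may be missing from some Windows toolchains). The helper name matches is hypothetical. In POSIX extended syntax, alternation is written |, and the anchors ^ and $ force the whole string to match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches the POSIX extended regular expression,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && s != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches("^1(0|1)*0$", "1010") reports a match while matches("^1(0|1)*0$", "01") does not, and matches("^ab*a$", "abba") reports a match while matches("^ab*a$", "abb") does not.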
Conversion of RE to FA
• Finite automata are used to recognize patterns.
• An automaton takes a string of symbols as input and changes its state
accordingly. Each input symbol triggers a transition.
• On a transition, the automaton can either move to another state or stay
in the same state.
• Every run ends in one of two outcomes, accept or reject. When the whole
input string has been processed and the automaton is in a final state, it
accepts the string; otherwise it rejects it.
Formal Definition of FA
• A finite automaton is a 5-tuple (Q, ∑, δ, q0, F), where:
• Q: finite set of states
• ∑: finite set of input symbols
• q0: initial state
• F: set of final states
• δ: transition function
• Finite automata can be represented by input tape and finite control.
• Input tape: It is a linear tape having some number of cells. Each input
symbol is placed in each cell.
• Finite control: The finite control decides the next state on receiving
particular input from input tape. The tape reader reads the cells one
by one from left to right, and at a time only one input symbol is read.
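The finite control can be sketched as a table lookup in C. The automaton below is an assumed example, not one taken from the slides: it recognises binary strings that start with 1 and end with 0, with states Q0 (start), Q1, Q2 (accepting) and DEAD chosen for illustration.

```c
#include <assert.h>

enum { Q0, Q1, Q2, DEAD, NUM_STATES };

/* delta[state][symbol]: next state for input symbols '0' and '1'. */
static const int delta[NUM_STATES][2] = {
    /* Q0   */ { DEAD, Q1   },   /* the first symbol must be 1 */
    /* Q1   */ { Q2,   Q1   },   /* last symbol read was 1 */
    /* Q2   */ { Q2,   Q1   },   /* last symbol read was 0 */
    /* DEAD */ { DEAD, DEAD },   /* string started with 0 */
};

/* Reads the input left to right, one symbol at a time, and reports
   whether the automaton ends in the accepting state Q2. */
int dfa_accepts(const char *input)
{
    int state = Q0;

    assert(input != 0);
    for (; *input; input++) {
        if (*input != '0' && *input != '1')
            return 0;                   /* symbol outside the alphabet */
        state = delta[state][*input - '0'];
    }
    return state == Q2;
}
```

Here the string plays the role of the input tape, and the delta table together with the state variable plays the role of the finite control.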
Types of Finite Automata
• 1. DFA
• DFA stands for deterministic finite automaton. Deterministic refers to the uniqueness of the
computation: for a particular input character, the machine goes to exactly one state.
A DFA has no null (ε) moves.
• 2. NFA
• NFA stands for non-deterministic finite automaton. For a particular input, the machine may
move to any number of states. An NFA may have null (ε) moves.
• Some important points about DFA and NFA:
• Every DFA is an NFA, but not every NFA is a DFA.
• There can be multiple final states in both NFA and DFA.
• DFAs are used in lexical analysis in compilers.
• The NFA is more of a theoretical concept.
• In the following diagram, we can see that from state q0 for input a,
there is only one path which is going to q1.
• Similarly, from q0, there is only one path for input b going to q2.
Formal Definition of DFA
• A DFA is a 5-tuple (Q, ∑, δ, q0, F), the same as described in the
definition of FA.
• Q: finite set of states
• ∑: finite set of input symbols
• q0: initial state
• F: set of final states
• δ: transition function, δ: Q × ∑ → Q
Example 1:
• Q = {q0, q1, q2}
• ∑ = {0, 1}
• Initial state: q0
• F = {q2}
Formal Definition of NFA
• An NFA is also a 5-tuple, the same as a DFA, but with a different
transition function: δ: Q × ∑ → 2^Q
• Q: finite set of states
• ∑: finite set of input symbols
• q0: initial state
• F: set of final states
• δ: transition function
Graphical Representation of an NFA
State    0         1
→q0      q0, q1    q1
q1       q2        q0
*q2      q2        q1, q2
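The NFA given by the table above can be simulated by tracking the set of states it could currently be in. The sketch below assumes that the two columns of the table are the inputs 0 and 1, that → marks the start state and * a final state. Sets of states are stored as bit masks, with bit i standing for q_i; the names nfa_delta and nfa_accepts are hypothetical.

```c
#include <assert.h>

#define N_STATES 3

/* nfa_delta[state][symbol] is the set of next states as a bit mask. */
static const unsigned nfa_delta[N_STATES][2] = {
    /* q0 */ { 0x3 /* {q0, q1} */, 0x2 /* {q1} */     },
    /* q1 */ { 0x4 /* {q2} */,     0x1 /* {q0} */     },
    /* q2 */ { 0x4 /* {q2} */,     0x6 /* {q1, q2} */ },
};
static const unsigned nfa_final = 0x4;      /* {q2} */

/* Accepts if at least one possible run ends in a final state. */
int nfa_accepts(const char *input)
{
    unsigned current = 0x1;                 /* start in {q0} */

    assert(input != 0);
    for (; *input; input++) {
        unsigned next = 0;
        int q;

        if (*input != '0' && *input != '1')
            return 0;                       /* symbol outside the alphabet */
        /* Union of the moves from every state we might be in. */
        for (q = 0; q < N_STATES; q++)
            if (current & (1u << q))
                next |= nfa_delta[q][*input - '0'];
        current = next;
    }
    return (current & nfa_final) != 0;
}
```

For example, on input 00 the set of possible states goes from {q0} to {q0, q1} and then to {q0, q1, q2}, which contains the final state q2, so the string is accepted.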
Conversion from NFA to DFA
• In NFA, when a specific input is given to the current state, the machine
goes to multiple states.
• It can have zero, one or more than one move on a given input symbol. On
the other hand, in DFA, when a specific input is given to the current state,
the machine goes to only one state.
• DFA has only one move on a given input symbol.
• Let M = (Q, ∑, δ, q0, F) be an NFA which accepts the language L(M).
• Then there exists an equivalent DFA M' = (Q', ∑', δ', q0', F') such
that L(M) = L(M').
Steps for converting NFA to DFA:
• Step 1: Initially Q' = ∅.
• Step 2: Add the start state q0 of the NFA to Q' as [q0]. Then find the
transitions from this start state.
• Step 3: For each state in Q', find the set of NFA states reachable on each
input symbol. If this set of states is not in Q', then add it to Q'.
• Step 4: The final states of the DFA are all states in Q' that contain at
least one final state of the NFA.
• Example 1:
• Convert the given NFA to DFA.
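• NFA transition table:
State        0            1
→q0          q0           q1
q1           {q1, q2}     q1
*q2          q2           {q1, q2}
• Equivalent DFA transition table:
State        0            1
→[q0]        [q0]         [q1]
[q1]         [q1, q2]     [q1]
*[q2]        [q2]         [q1, q2]
*[q1, q2]    [q1, q2]     [q1, q2]
• State [q2] cannot be reached from [q0], so it can be dropped from the DFA
without changing the language.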
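The conversion steps can be sketched in C for this example. The code is an illustration under stated assumptions: sets of NFA states are bit masks (bit i for q_i), the moves of q2 are read off the [q2] row of the DFA table, and the names nfa_move and subset_construction are hypothetical. Only subsets reachable from [q0] are generated.

```c
#include <assert.h>

#define NFA_STATES 3
#define MAX_DFA (1 << NFA_STATES)

/* nfa[state][symbol]: set of next states for inputs 0 and 1. */
static const unsigned nfa[NFA_STATES][2] = {
    /* q0 */ { 0x1 /* {q0} */,     0x2 /* {q1} */     },
    /* q1 */ { 0x6 /* {q1, q2} */, 0x2 /* {q1} */     },
    /* q2 */ { 0x4 /* {q2} */,     0x6 /* {q1, q2} */ },
};

/* Union of the NFA moves from every state in the set on one symbol. */
static unsigned nfa_move(unsigned set, int symbol)
{
    unsigned out = 0;
    int q;

    for (q = 0; q < NFA_STATES; q++)
        if (set & (1u << q))
            out |= nfa[q][symbol];
    return out;
}

/* Fills dfa_states[] with the reachable subsets and dfa_delta[i][symbol]
   with the index of the subset reached from dfa_states[i].
   Returns the number of DFA states found. */
int subset_construction(unsigned dfa_states[MAX_DFA], int dfa_delta[MAX_DFA][2])
{
    int count = 0, i, j, symbol;

    assert(dfa_states != 0 && dfa_delta != 0);
    dfa_states[count++] = 0x1;                  /* Step 2: start with [q0] */
    for (i = 0; i < count; i++) {               /* Step 3: process each state in Q' */
        for (symbol = 0; symbol < 2; symbol++) {
            unsigned target = nfa_move(dfa_states[i], symbol);

            for (j = 0; j < count; j++)
                if (dfa_states[j] == target)
                    break;
            if (j == count)                     /* new subset: add it to Q' */
                dfa_states[count++] = target;
            dfa_delta[i][symbol] = j;
        }
    }
    return count;
}
```

For this NFA the function finds three subsets, [q0], [q1] and [q1, q2]. Following Step 4, a subset is final when its mask has the bit for q2 set, which here is only [q1, q2].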
Conversion of RE to FA
• Example 1:
• Design an FA from the given regular expression 10 + (0 + 11)0*1.
• Steps 1 to 5: [figures not reproduced: the automaton is built up step by
step from the regular expression]
Example2
• Design an NFA from the given regular expression 1 (1* 01* 01*)*.
Recognition of Tokens in Lexer
• The terminals of the grammar, which are if, then, else, relop, id,
and number, are the names of tokens as far as the lexical analyzer is
concerned.
• The patterns for these tokens are described using regular definitions.
• In addition, we assign the lexical analyzer the job of stripping out
whitespace, by recognizing the "token" ws defined by:
• ws → ( blank | tab | newline )+
• Here, blank, tab, and newline are abstract symbols that we use to
express the ASCII characters of the same names.
• Token ws is different from the other tokens in that, when we
recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the whitespace.
Transition Diagram
• As an intermediate step in the construction of a lexical analyzer, we
first convert patterns into stylized flowcharts, called "transition
diagrams."
• In this section, we perform the conversion from regular-expression
patterns to transition diagrams.
• Transition diagrams have a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process
of scanning the input looking for a lexeme that matches one of several
patterns.
• Some important conventions about transition diagrams are:
• 1. Certain states are said to be accepting, or final.
• These states indicate that a lexeme has been found, although the actual
lexeme may not consist of all positions between the lexemeBegin and
forward pointers.
• We indicate an accepting state by a double circle.
• 2. In addition, if it is necessary to retract the forward pointer one position
(i.e., the lexeme does not include the symbol that got us to the accepting
state), then we shall additionally place a * near that accepting state.
• In our example, it is never necessary to retract forward by more than one
position, but if it were, we could attach any number of *'s to the accepting
state.
• 3. One state is designated the start state, or initial state; it is indicated
by an edge, labeled "start," entering from nowhere.
• The transition diagram always begins in the start state before any
input symbols have been read.
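A transition diagram translates almost directly into a loop over a state variable. The sketch below does this for the token relop, assuming its lexemes are <, <=, <>, =, > and >=. That set is an assumption, since the slides do not list them. The state numbers and the names Relop and get_relop are hypothetical. Accepting states that would carry a * in the diagram retract the forward pointer by one position.

```c
#include <assert.h>

enum Relop { RELOP_NONE, LT, LE, EQ, NE, GT, GE };

/* Simulates the diagram on the text at *forward. On success, returns the
   operator and leaves *forward just past the lexeme. On failure, returns
   RELOP_NONE and resets *forward to where the lexeme began. */
enum Relop get_relop(const char **forward)
{
    const char *lexeme_begin;
    int state = 0;                      /* the start state */

    assert(forward != 0 && *forward != 0);
    lexeme_begin = *forward;
    for (;;) {
        char c = *(*forward)++;         /* read a symbol, advance forward */

        switch (state) {
        case 0:
            if (c == '<') state = 1;
            else if (c == '=') return EQ;
            else if (c == '>') state = 2;
            else { *forward = lexeme_begin; return RELOP_NONE; }
            break;
        case 1:                         /* seen '<' */
            if (c == '=') return LE;
            if (c == '>') return NE;
            (*forward)--;               /* starred state: retract one position */
            return LT;
        case 2:                         /* seen '>' */
            if (c == '=') return GE;
            (*forward)--;               /* starred state: retract one position */
            return GT;
        }
    }
}
```

On the input "<5", the diagram reads 5 before it can decide that the lexeme is just <, so forward is retracted and the 5 is left for the next token.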
Implementation of Lex
• Download and install flex 2.5 and bison 2.4, together with a gcc toolchain
such as MinGW or Dev-C++.
• Configure the environment variables (for example PATH) so that these tools
can be run from the command line.
The Lexical-Analyzer Generator Lex
• Lex is a tool that generates lexical analyzers; Flex is the implementation
used here.
• An input file, which we call lex.l, is written in the Lex language and
describes the lexical analyzer to be generated.
• The Lex compiler transforms lex.l to a C program, in a file that is
always named lex.yy.c. The latter file is compiled by the C
compiler into a file called a.out, as always.
• The C-compiler output is a working lexical analyzer that can take a
stream of input characters and produce a stream of tokens.
Structure of Lex Programs
• declarations
• %%
• translation rules
• %%
• auxiliary functions
• The declarations section includes declarations of variables, manifest
constants (identifiers declared to stand for a constant, e.g., the name
of a token), and regular definitions.
Example 1
%{
#include <stdio.h>
%}
%%
[0-9]+    { printf("NUMBER: %s\n", yytext); }
[-+*/]    { printf("OPERATOR: %s\n", yytext); }
[()]      { printf("PARENTHESIS: %s\n", yytext); }
.|\n      { /* ignore all other characters, including newlines */ }
%%
/* Run the generated scanner on standard input until end of file. */
int main(void) {
    yylex();
    return 0;
}
/* Returning 1 tells yylex() that there is no further input. */
int yywrap(void) {
    return 1;
}
• Open cmd and type flex filename.l
• Then type gcc lex.yy.c
Flex Example
• The user code section of the input to flex defines a main program that
repeatedly calls yylex() and reports the value it returns (the identity of
the lexical token) together with a string representation of the token.
• flex provides a special variable yytext that contains the text that
matched the regular expression pattern in the rule.
• The while loop repeatedly calls yylex() and this loop terminates when
yylex() returns the value 0, signifying end of file.
• The yywrap() function is also defined here as always returning the
value 1.
• yywrap is automatically called when yylex encounters the end of the
input file.
• If yywrap returns 1, then yylex assumes that its job is done and there
are no more characters to analyse.
• If, however, yywrap returns 0, this indicates that yylex should
continue and yywrap will have opened a new file for processing.