CD Notes3
UNIT 3
Syntax Analyzer
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall
learn the basic concepts used in the construction of a parser.
The parser (syntax analyzer) receives the source code in the form of tokens from the
lexical analyzer and performs syntax analysis, which creates a tree-like intermediate
representation that depicts the grammatical structure of the token stream.
We have seen that a lexical analyzer can identify tokens with the help of regular
expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given
sentence due to the limitations of regular expressions. Regular expressions cannot
check balancing tokens, such as parentheses. Therefore, this phase uses a context-free
grammar (CFG), which is recognized by a pushdown automaton.
The syntax of a language refers to the structure of the valid programs / statements of that
language. It is specified by certain rules known as productions, and the collection of such
rules is known as a grammar.
Parsing is the process of determining whether a stream of tokens is valid according to
the rules defined by a grammar.
PCET-NMVPM’s
Nutan College of Engineering and Research, Talegaon, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND SCIENCE
What is grammar?
A grammar contains the set of rules used to construct the sentences of a language. In a
CFG we define a set of rules, and from these rules we construct strings.
Consider, for example, the language L = { a^m b^m | m ≥ 1 }.
A finite automaton cannot recognize such a language. The reason is that the final state
must be reached only when the number of 'a's and the number of 'b's in the input string
are equal. To check that, the machine has to count both, but because m can be
arbitrarily large, a finite automaton, which has only finitely many states, cannot
count that far.
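The counting argument above can be made concrete with a small sketch. A pushdown automaton gets around the limitation by using a stack as unbounded memory. The Python function below imitates that idea for L = { a^m b^m | m ≥ 1 }; the name `accepts_ambm` and the marker "A" are illustrative assumptions, not part of these notes.

```python
def accepts_ambm(s: str) -> bool:
    """Return True if s is in L = { a^m b^m | m >= 1 }."""
    stack = []
    i = 0
    # Phase 1: push one marker for every leading 'a'.
    while i < len(s) and s[i] == "a":
        stack.append("A")
        i += 1
    if not stack:
        return False  # m >= 1 requires at least one 'a'
    # Phase 2: pop one marker for every 'b' that follows.
    while i < len(s) and s[i] == "b" and stack:
        stack.pop()
        i += 1
    # Accept only if the whole input is consumed and the stack is empty.
    return i == len(s) and not stack


if __name__ == "__main__":
    for word in ["ab", "aabb", "aab", "abb", "abab"]:
        print(word, accepts_ambm(word))
```

The stack can grow as large as the input requires, which is exactly the memory a finite automaton does not have.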
G = (V, T, P, S)
Where,
G describes the grammar
V describes a finite set of non-terminal symbols
T describes a finite set of terminal symbols
P describes a set of production rules
S is the start symbol.
In a CFG, the start symbol is used to derive the string. You can derive the string
by repeatedly replacing a non-terminal by the right-hand side of one of its
productions, until all non-terminals have been replaced by terminal symbols.
Production rules:
S → aSa
S → bSb
S → c
Now check whether the string abbcbba can be derived from the given CFG.
S ⇒ aSa        (using S → aSa)
  ⇒ abSba      (using S → bSb)
  ⇒ abbSbba    (using S → bSb)
  ⇒ abbcbba    (using S → c)
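The derivation above can also be reproduced mechanically. The following is a minimal sketch, assuming the grammar S → aSa | bSb | c from this example; the function name `derive` is hypothetical. It peels matching symbols off both ends of the target string and records each sentential form.

```python
from typing import List, Optional


def derive(target: str) -> Optional[List[str]]:
    """Return the sentential forms of a derivation of `target`
    in the grammar S -> aSa | bSb | c, or None if none exists."""
    steps = ["S"]
    left, right = 0, len(target) - 1
    prefix, suffix = "", ""
    while left < right:
        ch = target[left]
        # Both ends must carry the same terminal, 'a' or 'b'.
        if ch not in ("a", "b") or target[right] != ch:
            return None
        prefix += ch
        suffix = ch + suffix
        steps.append(prefix + "S" + suffix)  # applied S -> aSa or S -> bSb
        left += 1
        right -= 1
    # Exactly one symbol must remain in the middle, and it must be 'c'.
    if left != right or target[left] != "c":
        return None
    steps.append(prefix + "c" + suffix)      # applied S -> c
    return steps


if __name__ == "__main__":
    print(" => ".join(derive("abbcbba")))
```

For a string such as abba the function returns None, because no middle c is left for the rule S → c.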
Capabilities of CFG
2. Right-most Derivation
In the right-most derivation, the right-most non-terminal of the sentential form is
replaced by a production rule at every step. So the sentential form is expanded
from right to left.
S → S + S
S → S - S
S → a | b | c
String to be derived: a - b + c
The right-most derivation is:
S ⇒ S - S
  ⇒ S - S + S
  ⇒ S - S + c
  ⇒ S - b + c
  ⇒ a - b + c
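The steps of this right-most derivation can be replayed with a few lines of Python. This is only a sketch: the function name `rightmost_derivation` and the list of chosen right-hand sides are assumptions made for illustration, and the single non-terminal is assumed to be written as the character S.

```python
def rightmost_derivation(start, choices):
    """Apply each right-hand side in `choices` to the right-most 'S'
    of the current sentential form and return all forms."""
    form = start
    steps = [form]
    for rhs in choices:
        pos = form.rfind("S")  # index of the right-most non-terminal
        if pos == -1:
            raise ValueError("no non-terminal left to expand")
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(form)
    return steps


if __name__ == "__main__":
    # Right-hand sides chosen in the order used in the notes.
    forms = rightmost_derivation("S", ["S-S", "S+S", "c", "b", "a"])
    print(" => ".join(forms))
```

Replacing `rfind` by `find` would expand the left-most non-terminal instead, which gives a left-most derivation.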
Parse tree
o A parse tree is a graphical representation of a derivation. Each node is labelled
with a grammar symbol, which can be a terminal or a non-terminal.
o In parsing, the string is derived from the start symbol, so the root of the
parse tree is the start symbol.
Steps 2 to 5 of the parse tree construction (figures not reproduced).
Ambiguity
Example:
S → aSb | SS
S → ε
For the string aabb, the above grammar generates two parse trees:
If a grammar is ambiguous, it is not good for compiler construction.
No general method can automatically detect and remove ambiguity, but you can
remove it by rewriting the grammar so that it is no longer ambiguous.
A production of the form A → A α | β is left recursive.
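Left recursion is removed by rewriting the production as
1) A → βA′   2) A′ → αA′ | ε
In right recursion, the procedure A′() first does some work α (it matches α against
the input) and only then calls itself recursively. So α acts as a condition check,
and there is no way to fall into an infinite loop. A right-recursive rule
A → αA | β generates the language α*β, whereas the left-recursive rule
A → A α | β generates βα*.
We cannot write A → βα* directly, because a grammar should not contain the * symbol.
So we write it as
1) A → βA′
2) A′ → αA′ | ε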
E → E + T | T
T → T * F | F
F → (E) | id
Eliminate immediate left recursion from the Grammar.
Solution
Comparing E → E + T | T with A → A α | β:
E → E + T | T
A → A α | β
∴ A = E, α = +T, β = T
A → A α | β is changed to A → βA′ and A′ → αA′ | ε
∴ A → βA′ means E → TE′
A′ → αA′ | ε means E′ → +TE′ | ε
Comparing T → T * F | F with A → A α | β:
T → T * F | F
A → A α | β
∴ A = T, α = * F, β = F
∴ A → βA′ means T → FT′
A′ → αA′ | ε means T′ → * FT′ | ε
Production F → (E) | id does not have any left recursion.
∴ Combining all the productions, we get
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → * FT′ | ε
F → (E) | id
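The rewriting rule A → A α | β  ⇒  A → βA′, A′ → αA′ | ε can be sketched as a short Python function. The representation is an assumption made for this example: each right-hand side is a list of symbols, an empty list stands for ε, and the new non-terminal is named by appending an apostrophe. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... as
       A  -> b1 A' | ...
       A' -> a1 A' | ... | epsilon
    and return the new productions as a dict."""
    # The alpha parts: alternatives that start with A itself, without the leading A.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    # The beta parts: alternatives that do not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}  # no immediate left recursion
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],  # [] is epsilon
    }


if __name__ == "__main__":
    grammar = {
        "E": [["E", "+", "T"], ["T"]],
        "T": [["T", "*", "F"], ["F"]],
        "F": [["(", "E", ")"], ["id"]],
    }
    for head, alts in grammar.items():
        for new_head, new_alts in eliminate_immediate_left_recursion(head, alts).items():
            rhs = " | ".join(" ".join(a) if a else "ε" for a in new_alts)
            print(new_head, "->", rhs)
```

The sketch handles only immediate left recursion, as in the example above; indirect left recursion (A → B …, B → A …) needs an additional substitution step that is not shown here.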
Left Factoring-
A grammar in which at least two productions with the same L.H.S. (non-terminal
symbol in the CFG) share a common prefix is known as a non-deterministic grammar,
because for one non-terminal there are several productions that begin in the same
way. With a non-deterministic grammar the parser cannot decide on a unique
production by looking at the next input symbol, and it often needs to backtrack
to find the correct production. Backtracking is time consuming, and predictive
top-down parsers do not allow it. The process of converting a non-deterministic
grammar into a deterministic grammar is known as left factoring.
In left factoring,
We make one production for each common prefix, and move the differing remainders
into the productions of a new non-terminal.
Example :
S → iEtS / iEtSeS / a
E → b
Solution-
The left factored grammar is-
S → iEtSS’ / a
S’ → eS / ε
E → b
Example 2: Do left factoring in the following grammar-
A → aAB / aBc / aAc
Solution :
A → aA’
A’ → AB / Bc / Ac
Again, this grammar has the common prefix A in two alternatives of A’.
A → aA’
A’ → AA’’ / Bc
A’’ → B / c
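One left-factoring step can be sketched in Python as follows. The data layout (lists of symbols, empty list for ε) and the names `common_prefix` and `left_factor_once` are assumptions made for illustration. The sketch factors only the first group of alternatives that share a first symbol, so it may have to be applied repeatedly, as in Example 2.

```python
from collections import defaultdict


def common_prefix(seqs):
    """Longest common prefix of several symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor_once(nt, alternatives, new_nt):
    """Factor the first group of alternatives of `nt` that share a first symbol.
    Returns a dict with the new productions of `nt` and, if used, `new_nt`."""
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0] if alt else None].append(alt)
    result = {nt: []}
    factored = False
    for first, group in groups.items():
        if factored or first is None or len(group) == 1:
            result[nt].extend(group)  # nothing to factor in this group
            continue
        prefix = common_prefix(group)
        result[nt].append(prefix + [new_nt])
        # Remainders after the prefix; an empty remainder is epsilon.
        result[new_nt] = [alt[len(prefix):] for alt in group]
        factored = True
    return result


if __name__ == "__main__":
    s_alts = [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]]
    print(left_factor_once("S", s_alts, "S'"))
```

Applied to A’ → AB / Bc / Ac with the new name A’’, the same function returns A’ → AA’’ / Bc and A’’ → B / c.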
produced by this production rule, and is same as the current character of the
input string which is also ‘a’).
Hence it is validated that if the compiler/parser knows about first character of
the string that can be obtained by applying a production rule, then it can
wisely apply the correct production rule to get the correct syntax tree for the
given input string.
Why FOLLOW?
The parser faces one more problem. Let us consider below grammar to
understand this problem.
A -> aBb
B -> c | ε
And suppose the input string is “ab” to parse.
As the first character in the input is a, the parser applies the rule A->aBb.
      A
    / | \
   a  B  b
Now the parser checks for the second character of the input string which is
b, and the Non-Terminal to derive is B, but the parser can’t get any string
derivable from B that contains b as first character.
But the Grammar does contain a production rule B -> ε, if that is applied
then B will vanish, and the parser gets the input “ab”, as shown below. But
the parser can apply it only when it knows that the character that follows B
in the production rule is same as the current character in the input.
In RHS of A -> aBb, b follows Non-Terminal B, i.e. FOLLOW(B) = {b}, and the
current input character read is also b. Hence the parser applies this rule.
And it is able to get the string “ab” from the given grammar.
      A                A
    / | \            /   \
   a  B  b    =>    a     b
      |
      ε
So FOLLOW tells the parser when a non-terminal may vanish (derive ε) in order to
generate the string from the parse tree.
The conclusion is that we need to find the FIRST and FOLLOW sets for a given
grammar so that the parser can apply the needed rule at the
correct position.
FIRST
FIRST(X) for a grammar symbol X is the set of terminals that begin the strings
derivable from X.
E -> TE’
E’ -> +T E’|Є
T -> F T’
T’ -> *F T’ | Є
F -> (E) | id
FIRST sets
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, Є }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, Є }
FIRST(F) = { ( , id }
Example 2:
Production Rules of Grammar
S -> ACB | Cbb | Ba
A -> da | BC
B -> g | Є
C -> h | Є
FIRST sets
FIRST(S) = FIRST(ACB) U FIRST(Cbb) U FIRST(Ba)
= { d, g, h, b, a, Є }
FIRST(A) = { d } U FIRST(BC)
= { d, g, h, Є }
FIRST(B) = { g , Є }
FIRST(C) = { h , Є }
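The FIRST sets of Example 2 can be checked with a small fixed-point computation. This is a sketch under stated assumptions: the grammar is stored as a dict of lists of symbols, an empty list stands for an ε-production, ε is written as "ε" (the notes write Є), and the function name `compute_first` is hypothetical.

```python
EPS = "ε"


def compute_first(grammar):
    """grammar: dict non-terminal -> list of right-hand sides (lists of symbols).
    Returns dict non-terminal -> FIRST set."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat until no FIRST set grows any more
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                before = len(first[nt])
                all_nullable = True
                for sym in rhs:
                    if sym not in grammar:  # terminal: it begins the string
                        first[nt].add(sym)
                        all_nullable = False
                        break
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        all_nullable = False
                        break
                if all_nullable:  # every symbol can vanish (or rhs is empty)
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first


if __name__ == "__main__":
    grammar = {
        "S": [["A", "C", "B"], ["C", "b", "b"], ["B", "a"]],
        "A": [["d", "a"], ["B", "C"]],
        "B": [["g"], []],
        "C": [["h"], []],
    }
    for nt, symbols in compute_first(grammar).items():
        print(f"FIRST({nt}) =", sorted(symbols))
```

The loop keeps re-scanning the productions because FIRST(A) depends on FIRST(B) and FIRST(C), which may not be complete on the first pass.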
FOLLOW
FOLLOW(X) is defined to be the set of terminals that can appear immediately to the
right of Non-Terminal X in some sentential form.
Rules to compute FOLLOW set:
1) FOLLOW(S) = { $ } // where S is the starting Non-Terminal
2) If A -> pBq is a production, where B is a non-terminal and p and q are any
strings of grammar symbols, then everything in FIRST(q) except Є is in FOLLOW(B).
3) If A -> pB is a production, then everything in FOLLOW(A) is in FOLLOW(B).
4) If A -> pBq is a production and FIRST(q) contains Є,
then FOLLOW(B) contains { FIRST(q) – Є } U FOLLOW(A).
Example :
Production Rules:
E -> TE’
E’ -> +T E’|Є
T -> F T’
T’ -> *F T’ | Є
F -> (E) | id
FIRST set
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, Є }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, Є }
FIRST(F) = { ( , id }
FOLLOW Set
FOLLOW(E) = { $ , ) } // Note ')' is there because of the 5th production, F -> (E)
FOLLOW(E’) = FOLLOW(E) = { $, ) } // See the 1st production rule
FOLLOW(T) = { FIRST(E’) – Є } U FOLLOW(E’) U FOLLOW(E) = { + , $ , ) }
FOLLOW(T’) = FOLLOW(T) = {+,$,)}
FOLLOW(F) = { FIRST(T’) – Є } U FOLLOW(T’) U FOLLOW(T) = { *, +, $, ) }
Example 2:
Production Rules:
S -> aBDh
B -> cC
C -> bC | Є
D -> EF
E -> g | Є
F -> f | Є
FIRST set
FIRST(S) = { a }
FIRST(B) = { c }
FIRST(C) = { b , Є }
FIRST(D) = FIRST(E) U FIRST(F) = { g, f, Є }
FIRST(E) = { g , Є }
FIRST(F) = { f , Є }
FOLLOW Set
FOLLOW(S) = { $ }
FOLLOW(B) = { FIRST(D) – Є } U FIRST(h) = { g , f , h }
FOLLOW(C) = FOLLOW(B) = { g , f , h }
FOLLOW(D) = FIRST(h) = { h }
FOLLOW(E) = { FIRST(F) – Є } U FOLLOW(D) = { f , h }
FOLLOW(F) = FOLLOW(D) = { h }
Example 3:
Production Rules:
S -> ACB|Cbb|Ba
A -> da|BC
B-> g|Є
C-> h| Є
FIRST set
FIRST(S) = { d, g, h, b, a, Є }
FIRST(A) = { d, g, h, Є }
FIRST(B) = { g, Є }
FIRST(C) = { h, Є }
FOLLOW Set
FOLLOW(S) = { $ }
FOLLOW(A) = { h, g, $ }
FOLLOW(B) = { a, $, h, g }
FOLLOW(C) = { b, g, $, h }
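The four FOLLOW rules can be applied mechanically until nothing changes any more. The sketch below is one possible way to do this, not a definitive implementation: the grammar layout (dict of lists of symbols, empty list for ε), the symbol "ε" for the notes' Є, and all function names are assumptions made for illustration.

```python
EPS, END = "ε", "$"


def first_of_sequence(symbols, first, grammar):
    """FIRST of a string of grammar symbols."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in grammar else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every symbol can vanish, or the string is empty
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first, grammar)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)  # rule 1
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # FOLLOW is defined for non-terminals only
                    rest = first_of_sequence(rhs[i + 1:], first, grammar)
                    new = rest - {EPS}        # rule 2
                    if EPS in rest:           # rules 3 and 4
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


if __name__ == "__main__":
    grammar = {
        "S": [["A", "C", "B"], ["C", "b", "b"], ["B", "a"]],
        "A": [["d", "a"], ["B", "C"]],
        "B": [["g"], []],
        "C": [["h"], []],
    }
    for nt, symbols in compute_follow(grammar, "S").items():
        print(f"FOLLOW({nt}) =", sorted(symbols))
```

Rule 3 is covered by the same test as rule 4: when nothing follows B in the production, the rest of the right-hand side is the empty string, whose FIRST set is { ε }.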
Note :
LL(1) Parser
Here the first L means that the input is scanned from left to right, and the
second L means that the parser produces a left-most derivation. Finally, the 1
is the number of look-ahead symbols, that is, how many input symbols the parser
examines when it has to make a decision.
Step 1: First check the grammar for left recursion. If there is left recursion
in the grammar, remove it and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
First(): If we derive all possible strings from a variable, the terminal
symbols that can begin those strings form the First set of that variable.
Follow(): The terminal symbols that can immediately follow a variable in the
process of derivation.
Step 3: For each production A –> α (A derives alpha):
Find First(α) and, for each terminal in First(α), make the entry A –> α in the
table.
If First(α) contains ε (epsilon), then find Follow(A) and,
for each terminal in Follow(A), make the entry A –> α in the table.
If First(α) contains ε and Follow(A) contains $, then
make the entry A –> α in the table for $.
To construct the parsing table, we use two functions: First() and Follow().
In the table, the rows contain the Non-Terminals and the columns
contain the Terminal Symbols. All the null productions of the grammar
go under the elements of the Follow set, and the remaining productions lie under
the elements of the First set.
Example-1:
Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
*ε denotes epsilon
Production        First        Follow
E –> TE’          { id, ( }    { $, ) }
E’ –> +TE’/ε      { +, ε }     { $, ) }
Parsing table columns: id  +  *  (  )  $
Here you can also write production numbers rather than the production rules.
As you can see, all the null productions are put under the Follow set of
that symbol, and all the remaining productions lie under the First set of that
symbol.
Note: Not every grammar is suitable for an LL(1) parsing table. It is possible
that one cell contains more than one production. If each cell contains at most
one production, then the grammar is LL(1) and can be accepted by an LL(1) parser.
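Step 3 and the note above can be illustrated with a short Python sketch that builds the LL(1) table for the grammar of Example-1 and then uses it to parse a token list. The FIRST and FOLLOW sets are entered by hand from the values computed earlier in these notes; the data layout (tuples of symbols, empty tuple for ε) and the function names are assumptions made for illustration.

```python
EPS, END = "ε", "$"

# Grammar of Example-1; an empty tuple stands for ε.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("id",), ("(", "E", ")")],
}
# FIRST and FOLLOW sets as computed earlier in these notes.
FIRST = {"E": {"id", "("}, "E'": {"+", EPS}, "T": {"id", "("},
         "T'": {"*", EPS}, "F": {"id", "("}}
FOLLOW = {"E": {END, ")"}, "E'": {END, ")"}, "T": {"+", END, ")"},
          "T'": {"+", END, ")"}, "F": {"*", "+", END, ")"}}


def first_of(rhs):
    """FIRST of a right-hand side (a tuple of symbols)."""
    out = set()
    for sym in rhs:
        sym_first = FIRST.get(sym, {sym})  # a terminal is its own FIRST set
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def build_table():
    """Return dict (non-terminal, terminal) -> right-hand side."""
    table = {}
    for head, alternatives in GRAMMAR.items():
        for rhs in alternatives:
            f = first_of(rhs)
            lookaheads = f - {EPS}
            if EPS in f:  # null production goes under FOLLOW(head)
                lookaheads |= FOLLOW[head]
            for terminal in lookaheads:
                if (head, terminal) in table:
                    raise ValueError(f"conflict at {(head, terminal)}: not LL(1)")
                table[(head, terminal)] = rhs
    return table


def parse(tokens, table, start="E"):
    """Table-driven predictive parse; returns True if tokens are accepted."""
    stack = [END, start]
    tokens = list(tokens) + [END]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in GRAMMAR:
            rhs = table.get((top, look))
            if rhs is None:
                return False  # empty cell: syntax error
            stack.extend(reversed(rhs))  # left-most symbol ends up on top
        elif top == look:
            pos += 1  # terminal (or $) matched
        else:
            return False
    return pos == len(tokens)


if __name__ == "__main__":
    table = build_table()
    for (head, terminal), rhs in sorted(table.items()):
        print(f"M[{head}, {terminal}] = {head} -> {' '.join(rhs) or EPS}")
    print(parse(["id", "+", "id", "*", "id"], table))
```

A cell that would receive two productions raises an error in `build_table`, which corresponds to the case in the note where the grammar is not LL(1).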
Operator Grammar:
Examples –
This is an example of operator grammar:
o E->E+E/E*E/id
However, the grammar given below is not an operator grammar
because two non-terminals are adjacent to each other:
o S->SAS/a
o A->bSb/b
We can convert it into an operator grammar, though:
o S->SbSbS/SbS/a
o A->bSb/b
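The check used in these examples can be written as a few lines of Python. This is a sketch under the usual textbook conditions for an operator grammar: no right-hand side contains two adjacent non-terminals (the condition named in the example above) and no right-hand side is ε (a standard condition that is an assumption here, since these notes do not state it). The function name and data layout are hypothetical.

```python
def is_operator_grammar(grammar):
    """grammar: dict non-terminal -> list of right-hand sides (lists of symbols)."""
    for alternatives in grammar.values():
        for rhs in alternatives:
            if not rhs:
                return False  # epsilon production
            for left, right in zip(rhs, rhs[1:]):
                if left in grammar and right in grammar:
                    return False  # two adjacent non-terminals
    return True


if __name__ == "__main__":
    g1 = {"E": [["E", "+", "E"], ["E", "*", "E"], ["id"]]}
    g2 = {"S": [["S", "A", "S"], ["a"]], "A": [["b", "S", "b"], ["b"]]}
    g3 = {"S": [["S", "b", "S", "b", "S"], ["S", "b", "S"], ["a"]],
          "A": [["b", "S", "b"], ["b"]]}
    print(is_operator_grammar(g1), is_operator_grammar(g2), is_operator_grammar(g3))
```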
There are two methods for determining what precedence relations should
hold between a pair of terminals:
The first method is to use the conventional associativity and precedence of the operators.
The second method of selecting operator-precedence relations is first to
construct an unambiguous grammar for the language, a grammar that reflects
the correct associativity and precedence in its parse trees.
This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b This means a “yields precedence to” b.
a ⋗ b This means a “takes precedence over” b.
a ≐ b This means a “has same precedence as” b.
$ id +
6) Now the top of stack (TOS) is “+” and the look-ahead is “id”, so the pair (+, id) is
compared. The table shows the relation <, so we push “id” onto the stack again.
7) We continue in the same way for all the remaining steps.
So, the operator relation table has a disadvantage: if we have n operators, the
table has n × n entries, so its size grows as O(n²). In order to decrease
the size of the table, we use an operator function table, which needs only 2n entries.
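Operator precedence parsers usually do not store the precedence table
with the relations; rather, the table is implemented in a special way. Operator
precedence parsers use precedence functions that map terminal symbols
to integers, so that the precedence relation between two symbols can be
determined by numerical comparison.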
Since there is no cycle in the graph, we can make the function table.
For each node of the graph we have to find the longest path starting at that node. For example:
f_id -> g_* -> f_+ -> g_+ -> f_$
From such paths we fill the function table. Starting from f_id, the longest path
to the end has 4 arrows, so we write 4 for f_id.
For f_+ the longest path has 2 arrows, so we fill in 2.
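The longest-path computation can be sketched in Python. Since the precedence graph itself is not reproduced in these notes, the relation table below is an assumption: it is the conventional table for the terminals id, +, * and $. An edge f_a -> g_b is added when a ⋗ b, and an edge g_b -> f_a when a ⋖ b. The sketch does not handle the ≐ relation and assumes the graph has no cycle; all names are hypothetical.

```python
from functools import lru_cache

TERMINALS = ["id", "+", "*", "$"]

# Assumed precedence relations: (a, b) -> "<" means a yields precedence to b,
# ">" means a takes precedence over b. Missing pairs are error entries.
REL = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}


def build_graph(rel):
    """Nodes are ('f', a) and ('g', a) for every terminal a."""
    edges = {(kind, t): [] for kind in "fg" for t in TERMINALS}
    for (a, b), r in rel.items():
        if r == ">":
            edges[("f", a)].append(("g", b))  # a > b: edge f_a -> g_b
        elif r == "<":
            edges[("g", b)].append(("f", a))  # a < b: edge g_b -> f_a
    return edges


def precedence_functions(rel):
    edges = build_graph(rel)

    @lru_cache(maxsize=None)
    def longest(node):
        # Number of arrows on the longest path starting at `node`.
        return max((1 + longest(nxt) for nxt in edges[node]), default=0)

    f = {t: longest(("f", t)) for t in TERMINALS}
    g = {t: longest(("g", t)) for t in TERMINALS}
    return f, g


if __name__ == "__main__":
    f, g = precedence_functions(REL)
    print("f:", f)
    print("g:", g)
```

With this table the sketch gives f(id) = 4 and f(+) = 2, matching the values filled in above. During parsing, a ⋖ b is then tested as f(a) < g(b) and a ⋗ b as f(a) > g(b).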