CD Digital Notes


(MR20-1CS0112) COMPILER DESIGN

Course Objectives:
 To introduce the Finite Automata, NFA and DFA.
 To gain insight into the Context Free Language.
 To study the Phases of a Compiler and Lexical Analysis and Syntax Analysis.
 To acquaint students with Intermediate Code Generation.
 To acquaint students with Code Optimization and Code Generation.

Course Outcomes:

After completion of the course, the students will be able to


 Understand the concept of Finite Automata, NFA and DFA
 Understand about Context Free Language
 Explain the concept of Phases of a Compiler, Lexical Analysis and Syntax Analysis.
 Describe Intermediate Code Generation.
 Understand the concept of Code Optimization and Code Generation.

UNIT –I
Finite Automata and Regular Expressions: Finite Automata- Examples and Definitions - Accepting the
Union, Intersection, Difference of Two Languages. Regular Expressions: Regular Languages and Regular
Expressions– Conversion from Regular Expression to NFA and Deterministic Finite Automata. Context free
grammar: Derivation trees and ambiguity – Simplified forms and Normal forms.

UNIT –II
Introduction to Compiler: Introduction to Compilers: Definition of compiler – Interpreter- bootstrapping –
phases of compiler.
Lexical Analysis: Roles of Lexical analyzer –Input buffering – specification of tokens – Recognition of
Tokens – A language for specifying lexical analyzers – design of a Lexical analyzer.

UNIT –III
Parsing: Role of parser - Top Down Parser: Backtracking, Recursive Descent Parsing and Predictive Parsers.
Bottom up Parser: Shift Reduce Parsing – LR parsers : SLR Parser, CLR parser and LALR Parsers.

UNIT –IV
Syntax Directed Translation and Intermediate code Generator: Syntax Directed Definitions –
construction of Syntax tree. Intermediate code Generation : Abstract syntax tree – three address code – types
of three address statements – syntax directed translations into three address code. Boolean expression and
flow of control statements.

UNIT –V
Code optimization and Code generation: Basic blocks and flow graphs – optimization of basic blocks –
principal sources of optimization – loop optimization – DAG representation of basic blocks. Simple code
generator – register allocation and assignments – peephole optimization.
REFERENCE BOOKS:

1. John E. Hopcroft and Jeffrey D. Ullman, Introduction to Automata Theory, Languages and Computation, 3rd
Edition, Pearson Education.
2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers: Principles, Techniques and Tools”, Addison-
Wesley, 2nd Edition, 2007.
3. K.L.P. Mishra, “Theory of Computer Science: Automata, Languages and Computation”, PHI Press, 2006.
4. “Theory of Computation & Applications – Automata Theory Formal Languages”, by Anil Malviya, Malabika
Datta, BPB Publications, 2015.
UNIT I

Finite Automata and Regular Expressions

 Finite Automata:

A Finite Automaton is the mathematical model of a digital computer. Finite Automata are used as string

or language acceptors. They are mainly used in pattern matching tools like LEX and in text editors.

 The Finite State System represents a mathematical model of a system with certain inputs.
 The model finally gives a certain output. The input given to the machine is processed by various
states. These states are called intermediate states.
 A good example of a finite state system is the control mechanism of an elevator. This mechanism only
remembers the current floor number pressed; it does not remember all the previously pressed numbers.
 Finite state systems are useful in the design of text editors, lexical analyzers and natural
language processing. The word “automaton” is singular and “automata” is plural.
 An automaton in which the output depends only on the input is called an automaton without memory.
 An automaton in which the output depends on both the input and the state is called an automaton with memory.
Finite Automaton Model:
 Informally, a Finite Automaton (FA) is a simple machine that reads an input string – one symbol at a
time – and then, after the input has been completely read, decides whether to accept or reject the input.
As the symbols are read from the tape, the automaton can change its state, to reflect how it reacts to
what it has seen so far.
 The Finite Automata can be represented as,

i) Input Tape: Input tape is a linear tape having some cells which can hold an input symbol from ∑.
ii) Finite Control: It indicates the current state and decides the next state on receiving a particular input from
the input tape. The tape reader reads the cells one by one from left to right and at any instance only one input
symbol is read. The reading head examines the read symbol and moves to the right with or without
changing the state. When the entire string has been read, if the finite control is in a final state the string is
accepted, otherwise it is rejected. The finite automaton can be represented by a transition diagram in which the vertices
represent the states and the edges represent transitions.
 A Finite Automaton (FA) consists of a finite set of states and a set of transitions among states in response to
inputs.
• Always associated with a FA is a transition diagram, which is nothing but a ‘directed graph’.
• The vertices of the graph correspond to the states of the FA.
• The FA accepts a string x of symbols from Σ if the sequence of transitions corresponding to the
symbols in x leads from the start state to an accepting state.
 Finite Automata can be classified into two types:
1. FA without output, or Language Recognizers (e.g. DFA and NFA)
2. FA with output, or Transducers (e.g. Moore and Mealy machines)

Finite Automata Definition:

A finite automaton (FA) is a collection of states among which we make transitions based upon input symbols.

Definition: A Finite Automaton

A finite automaton (FA) is a 5-tuple (Q,Σ,q0,A,δ) where

 Q is a finite set of states;


 Σ is a finite input alphabet;
 q0∈Q is the initial state;
 A⊆Q is the set of accepting states; and
 δ: Q×Σ→Q is the transition function.

For any element q of Q and any symbol σ∈Σ, we interpret δ(q,σ) as the state to which the FA moves, if it is in
state q and receives the input σ.

For example, for the FA shown here, we would say that:


 Q={q0,q1,q2,0,1,2}
o These are simply labels for states, so we can use any symbol that is convenient.
 Σ={0,1}
o These are the characters that we can supply as input.
 q0 is, well, q0 because we chose to use that matching label for the state.
 A={q1,0}
o The accepting states
 δ={((q0,0),q1), ((q0,1),1), ((q1,0),q2), ((q1,1),q2), ((q2,0),q2), ((q2,1),q2), ((0,0),0), ((0,1),1),
((1,0),2), ((1,1),0), ((2,0),1), ((2,1),2)}
o Functions are, underneath it all, sets of pairs. The first element in each pair is the input to the
function and the second element is the result. Hence when you see ((q0,0),q1) it means that "if
the input is (q0,0) the result is q1".

The input is itself a pair because δ was defined as a function of the form Q×Σ→Q, so the input
has the form Q×Σ, the set of all pairs in which the first element is taken from the set Q and the
second element from the set Σ.

o Of course, there may be easier ways to visualize δ. In particular, we could do it via a table with
the input state on one axis and the input character on another:

                   Starting state
Input    q0    q1    q2    0     1     2

  0      q1    q2    q2    0     2     1
  1      1     q2    q2    1     0     2
o The table representation is particularly useful because it suggests an efficient implementation. If
we numbered our states instead of using arbitrary labels:

                   Starting state
Input    0     1     2     3     4     5

  0      1     2     2     3     5     4
  1      4     2     2     4     3     5
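
This numbered table maps directly onto an array-indexed implementation. Below is a minimal sketch, not part of the original notes, of a table-driven DFA simulator in C for this six-state machine; the renumbering follows the table above (q0, q1, q2, 0, 1, 2 become 0..5), and the accepting set A = {q1, 0} becomes {1, 3}.

#include <stdio.h>

/* Transition table from the section above: delta[state][symbol]. */
static const int delta[6][2] = {
    /* on 0, on 1 */
    { 1, 4 },   /* state 0 (was q0)  */
    { 2, 2 },   /* state 1 (was q1)  */
    { 2, 2 },   /* state 2 (was q2)  */
    { 3, 4 },   /* state 3 (was "0") */
    { 5, 3 },   /* state 4 (was "1") */
    { 4, 5 }    /* state 5 (was "2") */
};

/* Accepting states A = {q1, 0}, i.e. {1, 3} after renumbering. */
static int accepting(int s) { return s == 1 || s == 3; }

/* Run the DFA on a string of '0'/'1' characters; return 1 if accepted. */
int dfa_accepts(const char *w)
{
    int state = 0;                        /* q0 is the initial state */
    for (; *w; ++w) {
        if (*w != '0' && *w != '1')
            return 0;                     /* symbol outside the alphabet */
        state = delta[state][*w - '0'];
    }
    return accepting(state);
}

int main(void)
{
    printf("%d\n", dfa_accepts("0"));     /* q0 --0--> q1 : accepted */
    printf("%d\n", dfa_accepts("01"));    /* ends in q2  : rejected */
    return 0;
}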

 Accepting the Union:


 L1={ab,aab,abab,abb,…….}
 L2={ba,bba,baa,baba,…….}
Here,

L1 = the set of strings that start with a and end with b


 L2 = the set of strings that start with b and end with a
Therefore,
L=L1 U L2
Or
L=L1+L2

State transition diagram for L1


 The state transition diagram for the language L1 is given below −

The above diagram accepts all strings starting with a and ending with b. Here,

 q0 is the initial state.


 q1 is an intermediate state.
 q2 is the final state.
 q3 is the dead state.
State transition diagram for L2
 The state transition diagram for language L2 is as follows −

The above diagram accepts all strings starting with b and ending with a. Here,

 q0: Initial state.


 q1: Intermediate state.
 q2: Final state.
 q3: Dead state.
Now the union of L1 and L2 gives the language of strings which start and end with different symbols.

The state transition diagram of L1 U L2 is as follows −


 Accepting Intersection:
Let’s understand the intersection of two DFAs with an example.
Designing a DFA for the set of strings over {0, 1} such that each string ends with 01 and has an even number of 1’s.
The two desired languages are:
L1= {01, 001, 101, 0101, 1001, 1101, .... }

L2= {11, 011, 101, 110, 0011, 1100, ..... }

L = L1 and L2 = L1 ∩ L2

State Transition Diagram for the language L1


This is a FA for language L1.

It accepts all strings that end with 01.

State Transition Diagram for the language L2


This is a FA for language L2.

It accepts all strings that contain an even number of 1’s.
State Transition Diagram for L1 ∩ L2
The intersection of L1 and L2 is the language of strings over {0, 1} that end with 01 and contain an even
number of 1’s.
L = L1 ∩ L2
= {1001, 0101, 01001, 10001, ....}
Thus L1 and L2 have been combined through the intersection process, and the final FA accepts exactly the
strings that have an even number of 1’s and end with 01.
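
The intersection machine itself can be built mechanically with the product construction: run both DFAs in parallel on the same input and accept only when both components are in accepting states. The following is a brief sketch, assuming each DFA is given as a small transition table; the struct layout, state numbering and names are illustrative, not taken from the notes.

#include <stdio.h>

typedef struct {
    int nstates;
    int delta[4][2];     /* delta[state][symbol], alphabet {0,1}; 4 states suffice here */
    int accept[4];       /* accept[state] = 1 if the state is accepting */
} DFA;

/* Simulate the product DFA of d1 and d2 on w: accept iff both DFAs accept. */
int product_accepts(const DFA *d1, const DFA *d2, const char *w)
{
    int s1 = 0, s2 = 0;                  /* both components start in state 0 */
    for (; *w; ++w) {
        int a = *w - '0';
        s1 = d1->delta[s1][a];
        s2 = d2->delta[s2][a];
    }
    return d1->accept[s1] && d2->accept[s2];
}

int main(void)
{
    /* L1: strings ending in 01 (0 = nothing useful seen, 1 = seen 0, 2 = seen 01). */
    DFA ends01 = { 3, { {1,0}, {1,2}, {1,0} }, {0,0,1} };
    /* L2: strings with an even number of 1's (0 = even so far, 1 = odd so far). */
    DFA even1s = { 2, { {0,1}, {1,0} },          {1,0}   };
    printf("%d\n", product_accepts(&ends01, &even1s, "1001"));  /* 1: in L1 ∩ L2 */
    printf("%d\n", product_accepts(&ends01, &even1s, "001"));   /* 0: odd number of 1's */
    return 0;
}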

 Regular Language:
The set of regular languages over an alphabet Σ is defined recursively as below. Any language belonging to
this set is a regular language over Σ.
Definition of the set of Regular Languages:
Basis Clause: ∅, {ε} and {a} for any symbol a ∈ Σ are regular languages.
Inductive Clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs and Lr* are regular languages.

Nothing is a regular language unless it is obtained from the above two clauses.

For example, let Σ = {a, b}. Then, since {a} and {b} are regular languages, {a, b} ( = {a} ∪ {b} ) and {ab}
( = {a}{b} ) are regular languages. Also, since {a} is regular, {a}* is a regular language, which is the set of
strings consisting of a's such as ε, a, aa, aaa, aaaa etc. Note also that Σ*, which is the set of strings consisting
of a's and b's, is a regular language because {a, b} is regular.

Regular Expression:
Regular expressions are used to denote regular languages. They can represent regular languages and
operations on them succinctly.
The set of regular expressions over an alphabet Σ is defined recursively as below. Any element of that set is
a regular expression.

Basis Clause: ∅, ε and a are regular expressions corresponding to the languages ∅, {ε} and {a}, respectively,
where a is an element of Σ.
Inductive Clause: If r and s are regular expressions corresponding to the languages Lr and Ls, then ( r + s ),
( rs ) and ( r* ) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs and Lr*, respectively.

Nothing is a regular expression unless it is obtained from the above two clauses.

FA is characterized into two types:

1.Deterministic Finite Automata (DFA):


A DFA consists of a 5-tuple (Q, Σ, q, F, δ).

Q : set of all states.

Σ : set of input symbols. ( Symbols which machine takes as input )

q : Initial state. ( Starting state of a machine )

F : set of final state.

δ : Transition Function, defined as δ : Q X Σ --> Q.

In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also in DFA null (or ε) move is not allowed, i.e., DFA
cannot change state without any input character.

For example, below DFA with Σ = {0, 1} accepts all strings ending with 0.

Figure: DFA with Σ = {0, 1}


One important thing to note is, there can be many possible DFAs for a pattern. A DFA with a minimum
number of states is generally preferred.
2.Nondeterministic Finite Automata(NFA):

NFA is similar to DFA except for the following additional features:


1. Null (or ε) moves are allowed, i.e., it can move forward without reading symbols.
2. Ability to transition to any number of states for a particular input.
However, these above features don’t add any power to NFA. If we compare both in terms of power, both are
equivalent.

Due to the above additional features, NFA has a different transition function, the rest is the same as DFA.

δ: Transition Function

δ: Q X (Σ U ε ) --> 2 ^ Q.

As the transition function shows, for any input including null (or ε), the NFA can go to any number of
states. For example, below is an NFA for the above problem.

NFA
One important thing to note is, in NFA, if any path for an input string leads to a final state, then the input
string is accepted. For example, in the above NFA, there are multiple paths for the input string “00”. Since
one of the paths leads to a final state, “00” is accepted by the above NFA.
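
This "accept if any path accepts" rule can be simulated deterministically by keeping track of the whole set of states the NFA could be in after each symbol. A minimal sketch, assuming the NFA is stored as a bit-set transition table and has no ε-moves (handling ε-moves additionally needs the ε-closure operation described later in this unit); all names here are illustrative.

#define NSTATES 32                       /* states fit in one 32-bit word */

typedef unsigned int StateSet;           /* bit i set  =>  state i is possible */

typedef struct {
    StateSet delta[NSTATES][2];          /* delta[state][symbol] = set of next states */
    StateSet accepting;                  /* set of accepting states */
} NFA;

/* Accept iff some sequence of nondeterministic choices ends in an accepting state. */
int nfa_accepts(const NFA *n, const char *w)
{
    StateSet cur = 1u << 0;              /* start in state 0 */
    for (; *w; ++w) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; ++s)
            if (cur & (1u << s))
                next |= n->delta[s][*w - '0'];
        cur = next;
    }
    return (cur & n->accepting) != 0;
}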

Algorithm for the conversion of Regular Expression to NFA

A regular expression is a representation of tokens. But to recognize a token we need a token recognizer,
which is nothing but a finite automaton (NFA). So we must convert the regular expression into an NFA.

Input − A Regular Expression R


Output − NFA accepting language denoted by R
Method
For ε, the NFA is

For a, the NFA is

For a + b (i.e. a | b), the NFA is

For ab, the NFA is

For a*, the NFA is
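
These composition rules are Thompson's construction. The sketch below shows only the fragment-combining step in C, assuming each NFA fragment is represented by a start state and an accept state over a global graph of ε- and symbol-labelled edges; the data structures and helper names are illustrative, and a real tool would also need a regular-expression parser driving these calls.

#include <stdlib.h>

#define EPS (-1)                                   /* label for an epsilon edge */

typedef struct Edge { int sym; int to; struct Edge *next; } Edge;
typedef struct { Edge *out; } State;

static State states[1024];
static int   nstates = 0;

static int new_state(void) { return nstates++; }

static void add_edge(int from, int sym, int to)
{
    Edge *e = malloc(sizeof *e);
    e->sym = sym; e->to = to; e->next = states[from].out;
    states[from].out = e;
}

typedef struct { int start, accept; } Frag;        /* one NFA fragment */

/* Single symbol a:  start --a--> accept */
Frag frag_symbol(int a)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, a, f.accept);
    return f;
}

/* Concatenation rs: glue r's accept state to s's start state with an epsilon edge. */
Frag frag_concat(Frag r, Frag s)
{
    add_edge(r.accept, EPS, s.start);
    return (Frag){ r.start, s.accept };
}

/* Union r + s (r | s): new start and accept states with epsilon edges around both. */
Frag frag_union(Frag r, Frag s)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, r.start);   add_edge(f.start, EPS, s.start);
    add_edge(r.accept, EPS, f.accept); add_edge(s.accept, EPS, f.accept);
    return f;
}

/* Kleene star r*: allow skipping r entirely or repeating it any number of times. */
Frag frag_star(Frag r)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, r.start);   add_edge(f.start, EPS, f.accept);
    add_edge(r.accept, EPS, r.start);  add_edge(r.accept, EPS, f.accept);
    return f;
}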

Example1 − Draw NFA for the Regular Expression a(a+b)*ab


Solution
ε−closure(s) − It is the set of states that can be reached from state s on ε−transitions alone.
 If s, t, u are states, then initially ε−closure(s) = {s}.
If s→t, then ε−closure(s) = {s,t}.

 If s→t→u, then ε−closure(s) = {s,t,u}
It will be repeated until all states are covered.
Algorithm: ε−closure(T)
T is the set of states whose ε−closure is to be found.
push all states in T onto the stack;
ε−closure(T) = T;
while (stack not empty) {
    pop s, the top element of the stack;
    for each state t with an ε-edge s→t {
        if t is not present in ε−closure(T) {
            ε−closure(T) = ε−closure(T) ∪ {t};
            push t onto the stack;
        }
    }
}
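
A direct C translation of this stack-based algorithm, using a bit set to hold a set of NFA states (a sketch with illustrative names; eps[s] is assumed to be the set of states reachable from s by a single ε-edge):

typedef unsigned int StateSet;           /* bit i set  =>  state i is in the set */

StateSet eps_closure(StateSet T, const StateSet eps[], int nstates)
{
    int stack[64], top = 0;
    StateSet closure = T;

    for (int s = 0; s < nstates; ++s)    /* push every state already in T */
        if (T & (1u << s))
            stack[top++] = s;

    while (top > 0) {
        int s = stack[--top];            /* pop s, the top element of the stack */
        for (int t = 0; t < nstates; ++t) {
            if ((eps[s] & (1u << t)) && !(closure & (1u << t))) {
                closure |= 1u << t;      /* add t to the closure */
                stack[top++] = t;        /* and later explore its epsilon edges */
            }
        }
    }
    return closure;
}
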
Example: Convert (a|b)*abb to an NFA and then to a DFA

[NFA for (a|b)*abb with states 0–10: 0 -ε-> 1, 0 -ε-> 7; 1 -ε-> 2, 1 -ε-> 4; 2 -a-> 3; 4 -b-> 5;
3 -ε-> 6, 5 -ε-> 6; 6 -ε-> 1, 6 -ε-> 7; 7 -a-> 8; 8 -b-> 9; 9 -b-> 10 (accepting).]

Start the Conversion


1. Begin with the start state 0 and calculate ε-closure(0). a. the set of states reachable by ε-transitions
which includes 0 itself is { 0,1,2,4,7}. This defines a new state A in the DFA A = {0,1,2,4,7}
2. We must now find the states that A connects to. There are two symbols in the language (a, b) so in the
DFA we expect only two edges: from A on a and from A on b. Call these states B and C:
We find B and C in the following way:
Find the state B that has an edge on a from A
a. start with A{0,1,2,4,7}. Find which states in A have states reachable by a transitions. This set is called
move(A,a) The set is {3,8}: move (A,a) = {3,8}
b. now do an ε-closure on move(A,a). Find all the states in move(A,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So
ε-closure(move(A,a)) =B ={1,2,3,4,6,7,8}

This defines the new state B that has an edge on a from A


Find the state C that has an edge on b from A
c. start with A{0,1,2,4,7}. Find which states in A have states reachable by b transitions. This set is called
move(A,b) The set is {5}: move(A,b) = {5}
d. now do an ε-closure on move(A,b). Find all the states in move(A,b) which are reachable with ε-transitions. We
have only state 5 to consider. From 5 we can get to 5, 6, 7, 1, 2, 4. So the complete set is {1,2,4,5,6,7}. So
ε-closure(move(A,b)) = C = {1,2,4,5,6,7}
This defines the new state C that has an edge on b from A
A={0,1,2,4,7} B={1,2,3,4,6,7,8} C={1,2,4,5,6,7}
Now that we have B and C we can move on to find the states that have a and b transitions from B and C.
Find the state that has an edge on a from B
e. start with B{1,2,3,4,6,7,8}. Find which states in B have states reachable by a transitions. This set is called
move(B,a) The set is {3,8}: move(B,a) = {3,8}
f. now do an ε-closure on move(B,a). Find all the states in move(B,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So
ε-closure(move(B,a)) ={1,2,3,4,6,7,8}= B

which is the same as the state B itself. In other words, we have a repeating edge to B:
A={0,1,2,4,7} B={1,2,3,4,6,7,8} C={1,2,4,5,6,7}
Find the state D that has an edge on b from B
g. start with B{1,2,3,4,6,7,8}. Find which states in B have states reachable by b transitions. This set is called
move(B,b) The set is {5,9}: move(B,b) = {5,9}
h. now do an ε-closure on move(B,b). Find all the states in move(B,b) which are reachable with ε-transitions.
From 5 we can get to 5, 6, 7, 1, 2, 4. From 9 we get to 9 itself. So the complete set is {1,2,4,5,6,7,9}. So
ε-closure(move(B,b)) = D = {1,2,4,5,6,7,9} This defines the new state D that has an edge on b from B
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state that has an edge on a from D
i. start with D{1,2,4,5,6,7,9}. Find which states in D have states reachable by a transitions. This set is called
move(D,a) The set is {3,8}: move(D,a) = {3,8}
j. now do an ε-closure on move(D,a). Find all the states in move(D,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So ε-closure(move(D,a)) =
{1,2,3,4,6,7,8} =B
This is a return edge to B:
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state E that has an edge on b from D
k. start with D={1,2,4,5,6,7,9}. Find which states in D have states reachable by b transitions. This set is called
move(D,b). The set is {5,10}: move(D,b) = {5,10}
l. now do an ε-closure on move(D,b). Find all the states in move(D,b) which are reachable with ε-transitions. From
5 we can get to 5, 6, 7, 1, 2, 4. From 10 we get to 10 itself. So the complete set is {1,2,4,5,6,7,10}. So
ε-closure(move(D,b)) = E = {1,2,4,5,6,7,10}
This defines the new state E that has an edge on b from D. Since it contains an accepting state, it is also an accepting
state.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
We should now examine state C
Find the state that has an edge on a from C
m. start with C{1,2,4,5,6,7}. Find which states in C have states reachable by a transitions. This set is called move(C,a)
The set is {3,8}:
move(C,a) = {3,8}; ε-closure(move(C,a)) = {1,2,3,4,6,7,8} = B
we have seen this before. It’s the state B.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
Find the state that has an edge on b from C
n. start with C{1,2,4,5,6,7}. Find which states in C have states reachable by b transitions. This set is called move(C,b)
The set is {5}:
o. move(C,b) = {5}
p. now do an ε-closure on move(C,b). Find all the states in move(C,b) which are reachable with ε-transitions. From 5
we can get to 5,6,7,1,2,4. which is C itself So
ε-closure(move(C,b)) = C
This defines a loop on C

Finally we need to look at E. Although this is an accepting state, the regular expression allows us to repeat adding in
more a’s and b’s as long as we return to the accepting E state finally. So
Find the state that has an edge on a from E
q. start with E{1,2,4,5,6,7,10}. Find which states in E have states reachable by a transitions. This set is called
move(E,a) The set is {3,8}:
move(E,a) = {3,8}; ε-closure(move(E,a)) = B. We saw this before, it’s B.

Find the state that has an edge on b from E


r. start with E{1,2,4,5,6,7,10}. Find which states in E have states reachable by b transitions. This set is called
move(E,b) The set is {5}:
move(E,b) = {5}
We’ve seen this before. It’s C. Finally

[Resulting DFA: start state A; A -a-> B, A -b-> C; B -a-> B, B -b-> D; C -a-> B, C -b-> C;
D -a-> B, D -b-> E; E -a-> B, E -b-> C; E is the accepting state.]

That’s it! There is only one edge from each state for a given input character, so it is a DFA. Disregard the fact that
each of these states is actually a group of NFA states; we can regard them as single states in the DFA. In a complete
scanner there would also be an edge on any other input beyond E, leading to the ultimate accepting action. Also, the
DFA is not yet minimized (it could have fewer states).

However, we can make the transition table so far. Here it is:


DFA:
ε-closure(0)        = {0,1,2,4,7}                                = A
Move(A,a) = {3,8};  ε-closure(Move(A,a)) = {1,2,3,4,6,7,8}       = B
Move(A,b) = {5};    ε-closure(Move(A,b)) = {1,2,4,5,6,7}         = C
Move(B,a) = {3,8};  ε-closure(Move(B,a)) = {1,2,3,4,6,7,8}       = B
Move(B,b) = {5,9};  ε-closure(Move(B,b)) = {1,2,4,5,6,7,9}       = D
Move(C,a) = {3,8};  ε-closure(Move(C,a)) = {1,2,3,4,6,7,8}       = B
Move(C,b) = {5};    ε-closure(Move(C,b)) = {1,2,4,5,6,7}         = C
Move(D,a) = {3,8};  ε-closure(Move(D,a)) = {1,2,3,4,6,7,8}       = B
Move(D,b) = {5,10}; ε-closure(Move(D,b)) = {1,2,4,5,6,7,10}      = E
Move(E,a) = {3,8};  ε-closure(Move(E,a)) = {1,2,3,4,6,7,8}       = B
Move(E,b) = {5};    ε-closure(Move(E,b)) = {1,2,4,5,6,7}         = C

State    Input a    Input b

  A         B          C
  B         B          D
  C         B          C
  D         B          E
 *E         B          C
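
The whole conversion just carried out by hand (start from the ε-closure of the NFA start state, repeatedly apply move followed by ε-closure, and give each new set a DFA name) is a simple worklist loop. A sketch reusing the bit-set helpers from the earlier fragments; all names are illustrative.

#define MAXDSTATES 64

typedef unsigned int StateSet;

/* Assumed available from the earlier sketches:
 *   StateSet eps_closure(StateSet T, const StateSet eps[], int nnfa);
 *   trans[s][a] : set of NFA states reachable from s on symbol a (a = 0 for 'a', 1 for 'b').
 * Returns the number of DFA states; dstates[i] is the NFA-state set behind DFA state i,
 * and dtran[i][a] is the resulting DFA transition table.
 */
int subset_construction(StateSet start, const StateSet eps[],
                        const StateSet trans[][2], int nnfa,
                        StateSet dstates[MAXDSTATES], int dtran[][2])
{
    int ndfa = 0, next = 0;
    dstates[ndfa++] = eps_closure(start, eps, nnfa);         /* state A */

    while (next < ndfa) {                                    /* unmarked DFA states remain */
        StateSet T = dstates[next];
        for (int a = 0; a < 2; ++a) {
            StateSet m = 0;
            for (int s = 0; s < nnfa; ++s)                   /* m = move(T, a) */
                if (T & (1u << s))
                    m |= trans[s][a];
            StateSet U = eps_closure(m, eps, nnfa);

            int j;
            for (j = 0; j < ndfa; ++j)                       /* have we seen this set before? */
                if (dstates[j] == U) break;
            if (j == ndfa)
                dstates[ndfa++] = U;                         /* a brand new DFA state */
            dtran[next][a] = j;                              /* record the DFA edge */
        }
        ++next;
    }
    return ndfa;
}
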
 Context free grammar

Context free grammar is a formal grammar which is used to generate all possible strings in a given formal
language. Context free grammar G can be defined by four tuples as:

G= (V, T, P, S)

Where,

G describes the grammar

T describes a finite set of terminal symbols.

V describes a finite set of non-terminal symbols

P describes a set of production rules

S is the start symbol.

In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly replacing a non-
terminal by the right-hand side of a production, until all non-terminals have been replaced by terminal symbols.

S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbcbba

By applying the production S → aSa, S → bSb recursively and finally applying the production S → c, we get
the string abbcbba.

Capabilities of CFG

There are the various capabilities of CFG:

o Context free grammar is useful to describe most of the programming languages.


o If the grammar is properly designed then an efficient parser can be constructed automatically.
o Using the features of associativity & precedence information, suitable grammars for expressions can
be constructed.
o Context free grammar is capable of describing nested structures like: balanced parentheses, matching begin-end,
corresponding if-then-else's & so on.
Derivations trees:
 Derivation tree is a graphical representation for the derivation of the given production rules of the
context free grammar (CFG).
 It is a way to show how the derivation can be done to obtain some string from a given set of production
rules. It is also called as the Parse tree.
 The Parse tree follows the precedence of operators.
 The deepest subtree is traversed first. So, the operator in the parent node has lower precedence than the
operator in the subtree.

Properties

The properties of the derivation tree are given below −

 The root node is always a node indicating the start symbol.


 The derivation is read from left to right.
 The leaf node is always the terminal node.
 The interior nodes are always the non-terminal nodes.

Example

The production rules for the derivation tree are as follows −


E → E + E
E → E * E
E → a | b | c
Here, let the input be a*b+c

[Parse tree construction for a*b+c, shown in Steps 1–5.]


Ambiguous Grammar:
If a context free grammar G has more than one derivation tree for some string w ∈ L(G), it is called
an ambiguous grammar. There exist multiple right-most or left-most derivations for some string generated
from that grammar.

Problem

Check whether the grammar G with production rules −


X → X+X | X*X |X| a
is ambiguous or not.

Solution

Let’s find out the derivation tree for the string "a+a*a". It has two leftmost derivations.
Derivation 1 − X → X+X → a +X → a+ X*X → a+a*X → a+a*a
Parse tree 1 −

Derivation 2 − X → X*X → X+X*X → a+ X*X → a+a*X → a+a*a


Parse tree 2 −

Since there are two parse trees for a single string "a+a*a", the grammar G is ambiguous.

 Simplified Forms:

As we have seen, various languages can efficiently be represented by a context-free grammar. Not every
grammar is optimized, that is, a grammar may contain extra symbols (non-terminals), and having extra symbols
unnecessarily increases the length of the grammar. Simplification of a grammar means reduction of the grammar
by removing useless symbols.
The properties of reduced grammar are given below:

1. Each variable (i.e. non-terminal) and each terminal of G appears in the derivation of some word in L.

2. There should not be any production as X → Y where X and Y are non-terminal.


3. If ε is not in the language L then there need not to be the production X → ε.

Let us study reduction process

Removal of Useless Symbols

A symbol is useless if it cannot be reached from the start symbol (it never appears in any sentential form) or if
it does not take part in the derivation of any terminal string. Such a symbol is known as a useless symbol, and
such a variable is known as a useless variable.

For Example:
1. T → aaB | abA | aaT
2. A → aA
3. B → ab | b
4. C → ad

In the above example, the variable 'C' will never occur in the derivation of any string, so the production C → ad
is useless. So we will eliminate it, and the other productions are written in such a way that variable C can never
be reached from the starting variable 'T'.

Production A → aA is also useless because there is no way to terminate it. If it never terminates, then it can
never produce a string. Hence this production can never take part in any derivation.

To remove this useless production A → aA, we will first find all the variables which will never lead to a
terminal string, such as variable 'A'. Then we will remove all the productions in which the variable 'A' occurs.
Elimination of ε Production

The productions of type S → ε are called ε productions. These type of productions can only be removed from
those grammars that do not generate ε.

Step 1: First find out all nullable non-terminal variables, i.e. those which derive ε.

Step 2: For each production A → α, construct all productions A → x, where x is obtained from α by removing
one or more of the nullable non-terminals found in step 1.

Step 3: Now combine the result of step 2 with the original production and remove ε productions.
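
Step 1, finding the nullable non-terminals, is itself a small fixed-point computation: keep marking a variable as nullable whenever it has a production whose entire right-hand side is already nullable (or is ε). A sketch in C over a simple production list; the encoding of productions (upper-case letters for non-terminals, the empty string for ε) is just an assumption made for this illustration.

#include <string.h>

typedef struct { char lhs; const char *rhs; } Prod;   /* production lhs -> rhs */

/* Set nullable['X' - 'A'] = 1 for every non-terminal X that derives ε. */
void find_nullable(const Prod *p, int nprods, int nullable[26])
{
    memset(nullable, 0, 26 * sizeof nullable[0]);
    int changed = 1;
    while (changed) {                     /* iterate until nothing new is marked */
        changed = 0;
        for (int i = 0; i < nprods; ++i) {
            if (nullable[p[i].lhs - 'A']) continue;
            int all_nullable = 1;         /* is every symbol on the rhs nullable? */
            for (const char *s = p[i].rhs; *s; ++s)
                if (*s < 'A' || *s > 'Z' || !nullable[*s - 'A']) { all_nullable = 0; break; }
            if (all_nullable) { nullable[p[i].lhs - 'A'] = 1; changed = 1; }
        }
    }
}

For the grammar in the example below, the rules X → ε and Y → ε immediately mark X and Y as nullable.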

Example:

Remove the ε productions from the following CFG while preserving its meaning.

1. S → XYX
2. X → 0X | ε
3. Y → 1Y | ε
Solution:

Now, while removing the ε productions, we delete the rules X → ε and Y → ε. To preserve the meaning of the CFG
we consider ε in place of X and Y wherever they appear on the right-hand side.

Let us take

S → XYX

If the first X on the right-hand side is ε, then

S → YX

Similarly, if the last X on the right-hand side is ε, then

S → XY

If Y = ε then

S → XX
If Y and X are ε then,

S→X

If both X's are replaced by ε,

S → Y. Now,

S → XY | YX | XX | X | Y

Now let us consider

X → 0X

If we place ε at right-hand side for X then,

X→0

X → 0X | 0

Similarly Y → 1Y | 1

Collectively we can rewrite the CFG with removed ε production as

S → XY | YX | XX | X | Y

X → 0X | 0

Y → 1Y | 1

Removing Unit Productions

The unit productions are the productions in which one non-terminal gives another non-terminal. Use the
following steps to remove unit production:

Step 1: To remove X → Y, add production X → a to the grammar rule whenever Y → a occurs in the grammar.

Step 2: Now delete X → Y from the grammar.

Step 3: Repeat step 1 and step 2 until all unit productions are removed.
For example:

S → 0A | 1B | C
A → 0S | 00
B→1|A
C → 01
Solution:

S → C is a unit production. But while removing S → C we have to consider what C gives. So, we can add a rule
to S.

S → 0A | 1B | 01

Similarly, B → A is also a unit production so we can modify it as

B → 1 | 0S | 00

Thus finally we can write CFG without unit production as

S → 0A | 1B | 01

A → 0S | 00

B → 1 | 0S | 00

C → 01

 Normal Forms
Chomsky's Normal Form (CNF):
CNF stands for Chomsky normal form. A CFG(context free grammar) is in CNF(Chomsky normal form) if all
production rules satisfy one of the following conditions:
o The start symbol generating ε. For example, S → ε.
o A non-terminal generating two non-terminals. For example, S → AB.
o A non-terminal generating a terminal. For example, S → a.

For example:
G1 = {S → AB, S → c, A → a, B → b}
G2 = {S → aA, A → a, B → c}

The production rules of Grammar G1 satisfy the rules specified for CNF, so the grammar G1 is in CNF.
However, the production rules of Grammar G2 do not satisfy the rules specified for CNF, as S → aA contains a
terminal followed by a non-terminal. So the grammar G2 is not in CNF.
Steps for converting CFG into CNF

Step 1: Eliminate the start symbol from the RHS. If the start symbol S appears on the right-hand side of any
production, create a new production as:

S1 → S

Where S1 is the new start symbol.

Step 2: In the grammar, remove the null, unit and useless productions. You can refer to the Simplification of
CFG.

Step 3: Eliminate terminals from the RHS of the production if they exist with other non-terminals or terminals.
For example, production S → aA can be decomposed as:

S → RA
R→a

Step 4: Eliminate RHS with more than two non-terminals. For example, S → ASB can be decomposed as:

S → RB
R → AS

Example:

Convert the given CFG to CNF. Consider the given grammar G1:

S → a | aA | B
A → aBB | ε
B → Aa | b

Solution:

Step 1: We will create a new production S1 → S, as the start symbol S appears on the RHS. The grammar will
be:

S1 → S
S → a | aA | B
A → aBB | ε
B → Aa | b

Step 2: As grammar G1 contains A → ε null production, its removal from the grammar yields:

S1 → S
S → a | aA | B
A → aBB
B → Aa | b | a
Now, as grammar G1 contains Unit production S → B, its removal yield:

S1 → S
S → a | aA | Aa | b
A → aBB
B → Aa | b | a

Also remove the unit production S1 → S, its removal from the grammar yields:

S1 → a | aA | Aa | b
S → a | aA | Aa | b
A → aBB
B → Aa | b | a

Step 3: In the production rule S1 → aA | Aa, S → aA | Aa, A → aBB and B → Aa, terminal a exists on RHS
with non-terminals. So we will replace terminal a with X:

S1 → a | XA | AX | b
S → a | XA | AX | b
A → XBB
B → AX | b | a
X→a

Step 4: In the production rule A → XBB, RHS has more than two symbols, removing it from grammar yield:

S1 → a | XA | AX | b
S → a | XA | AX | b
A → RB
B → AX | b | a
X→a
R → XB Hence, for the given grammar, this is the required CNF.
Greibach Normal Form (GNF):

GNF stands for Greibach normal form. A CFG(context free grammar) is in GNF(Greibach normal form) if all
the production rules satisfy one of the following conditions:

o A start symbol generating ε. For example, S → ε.


o A non-terminal generating a terminal. For example, A → a.
o A non-terminal generating a terminal which is followed by any number of non-terminals. For example, S
→ aASB.

For example:

G1 = {S → aAB | aB, A → aA| a, B → bB | b}


G2 = {S → aAB | aB, A → aA | ε, B → bB | ε}

The production rules of Grammar G1 satisfy the rules specified for GNF, so the grammar G1 is in GNF.
However, the production rule of Grammar G2 does not satisfy the rules specified for GNF as A → ε and B → ε
contains ε(only start symbol can generate ε). So the grammar G2 is not in GNF.
Steps for converting CFG into GNF

Step 1: Convert the grammar into CNF.

If the given grammar is not in CNF, convert it into CNF. You can refer the following topic to convert the CFG
into CNF: Chomsky normal form

Step 2: If the grammar contains left recursion, eliminate it.

If the context free grammar contains left recursion, eliminate it. You can refer the following topic to eliminate
left recursion: Left Recursion

Step 3: In the grammar, convert the given production rule into GNF form.

If any production rule in the grammar is not in GNF form, convert it.

Example:
S → XB | AA
A → a | SA
B→b
X→a

Solution:

As the given grammar G is already in CNF and there is no left recursion, so we can skip step 1 and step 2 and
directly go to step 3.

The production rule A → SA is not in GNF, so we substitute S → XB | AA in the production rule A → SA as:

S → XB | AA
A → a | XBA | AAA
B→b
X→a

The production rules S → XB and A → XBA are not in GNF, so we substitute X → a in the production rules S →
XB and A → XBA as:

S → aB | AA
A → a | aBA | AAA
B→b
X→a

Now we will remove left recursion (A → AAA), we get:

S → aB | AA
A → aC | aBAC
C → AAC | ε
B→b
X→a
Now we will remove null production C → ε, we get:

S → aB | AA
A → aC | aBAC | a | aBA
C → AAC | AA
B→b
X→a

The production rule S → AA is not in GNF, so we substitute A → aC | aBAC | a | aBA in production rule S →
AA as:

S → aB | aCA | aBACA | aA | aBAA


A → aC | aBAC | a | aBA
C → AAC
C → aCA | aBACA | aA | aBAA
B→b
X→a

The production rule C → AAC is not in GNF, so we substitute A → aC | aBAC | a | aBA in production rule C →
AAC as:

S → aB | aCA | aBACA | aA | aBAA


A → aC | aBAC | a | aBA
C → aCAC | aBACAC | aAC | aBAAC
C → aCA | aBACA | aA | aBAA
B→b
X→a

Hence, this is the GNF form for the grammar G.


UNIT –II
Introduction to Compiler

INTRODUCTION TO LANGUAGE PROCESSING

As computers became an inevitable and indigenous part of human life, and as several languages
with different and more advanced features evolved to help the user communicate with the machine,
the development of translators or mediator software became essential to bridge the huge gap between
human and machine understanding.

This process is called Language Processing, to reflect its goal and intent. To understand this process
better, we have to be familiar with some key terms and concepts explained in the following lines.

LANGUAGE TRANSLATORS
A language translator is a computer program which translates a program written in one (source) language
to its equivalent program in another (target) language. The source program is in a high level language,
whereas the target language can be anything from the machine language of a target machine (from a
microprocessor to a supercomputer) to another high level language.

Two commonly Used Translators are Compiler and Interpreter


1. Compiler: A compiler is a program that reads a program in one language, called the Source
Language, and translates it into its equivalent program in another language, called the Target
Language; in addition, it presents error information to the user.

If the target program is an executable machine-language program, it can then be called


by the users to process inputs and produce outputs.
2. Interpreter: An interpreter is another commonly used language processor. Instead of
producing a target program as a single translation unit, an interpreter appears to directly
execute the operations specified in the source program on inputs supplied by the user.

LANGUAGE PROCESSING SYSTEM:

Based on the input the translator takes and the output it produces, a language translator
can be called as any one of the following.

Preprocessor: A preprocessor takes the skeletal source program as input and produces an
extended version of it, which is the resultant of expanding the Macros, manifest constants if any,
and including header files etc in the source file. For example, the C preprocessor is a macro
processor that is used automatically by the C compiler to transform our source before actual
compilation. Over and above a preprocessor performs the following activities:

 Collects all the modules, files in case if the source program is divided into different
modules stored at different files.

 Expands short hands / macros into source language statements.

Compiler: Is a translator that takes as input a source program written in high level language and
converts it into its equivalent target program in machine language. In addition to above the
compiler also

Reports to its user the presence of errors in the source program.

Facilitates the user in rectifying the errors, and execute the code.

Assembler: Is a program that takes as input an assembly language program and converts it into
its equivalent machine language code.

Loader / Linker: This is a program that takes as input a relocatable code and collects the library
functions and relocatable object files, and produces its equivalent absolute machine code.

Specifically,

Loading consists of taking the relocatable machine code, altering the relocatable
addresses, and placing the altered instructions and data in memory at the proper
locations.
Linking allows us to make a single program from several files of relocatable machine
code. These files may have been the result of several different compilations, and one or more
may be library routines provided by the system, available to any program that
needs them.

In addition to these translators, programs like interpreters, text formatters etc., may be used
in language processing system. To translate a program in a high level language program to an
executable one, the Compiler performs by default the compile and linking functions.

Normally the steps in a language processing system includes Preprocessing the skeletal
Source program which produces an extended or expanded source program or a ready to compile
unit of the source program, followed by compiling the resultant, then linking / loading , and
finally its equivalent executable code is produced. As said earlier, not all these steps are
mandatory. In some cases, the compiler performs the linking and loading functions
implicitly.

TYPES OF COMPILERS:

Based on the specific input it takes and the output it produces, the Compilers can be
classified into the following types;

Traditional Compilers(C, C++, Pascal): These Compilers convert a source program in a


HLL into its equivalent in native machine code or object code.

Interpreters(LISP, SNOBOL, Java1.0): These Compilers first convert Source code into
intermediate code, and then interprets (emulates) it to its equivalent machine code.

Cross-Compilers: These are the compilers that run on one machine and produce code for
another machine.

Incremental Compilers: These compilers separate the source into user-defined steps,
compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g. COBOL to C++): These programs compile from one high level language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from
intermediate language (byte code, MSIL) to executable code or native machine code. These
perform type-based verification, which makes the executable code more trustworthy.

Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are the pre-compilers to the native
code for Java and .NET

Binary Compilation: These compilers will be compiling object code of one platform into object
code of another platform.

BOOTSTRAPING

o Bootstrapping is widely used in the compilation development.


o Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type of
compiler that can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this compiled compiler
to compile everything else as well as future versions of itself.

A compiler can be characterized by three languages:


1. Source Language
2. Target Language
3. Implementation Language

The T-diagram shows a compiler SCIT for source S and target T, implemented in I.

Follow some steps to produce a new language L for machine A:

1. Create a compiler SCAA for a subset S of the desired language L, using language "A"; that compiler runs
on machine A.

2. Create a compiler LCSA for the full language L, written in the subset S of L.


3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L,
which runs on machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.

PHASES OF A COMPILER:
 Due to the complexity of compilation task, a Compiler typically proceeds in a Sequence of
compilation phases. The phases communicate with each other via clearly defined interfaces.
Generally an interface contains a Data structure (e.g., tree), Set of exported functions. Each
phase works on an abstract intermediate representation of the source program, not the
source program text itself (except the first phase)
 Compiler Phases are the individual modules which are chronologically executed to perform
their respective Sub-activities, and finally integrate the solutions to give target code.
 It is desirable to have relatively few phases, since it takes time to read and write intermediate
files. The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes
during compilation. Therefore, a typical compiler has the following phases:

1. Lexical Analyzer (Scanner),


2. Syntax Analyzer (Parser),
3. Semantic Analyzer,
4. Intermediate Code Generator(ICG),
5. Code Optimizer(CO) , and
6. Code Generator(CG)
In addition to these, it also has Symbol table management and Error handler phases. Not all the phases
are mandatory in every compiler, e.g., the Code Optimizer phase is optional in some cases. The description is given in
the next section. The phases of a compiler are divided into two parts: the first three phases are called the Analysis
part and the remaining three the Synthesis part.

PHASE, PASSES OF A COMPILER:


In some applications we can have a compiler that is organized into what are called passes,
where a pass is a collection of phases that converts the input from one representation to a
completely different representation. Each pass makes a complete scan of the input and produces
its output to be processed by the subsequent pass, for example a two-pass assembler.
THE FRONT-END & BACK-END OF A COMPILER
 All of these phases of a general Compiler are conceptually divided into The Front-end, and
The Back-end. This division is due to their dependence on either the Source Language or the
Target machine. This model is called an Analysis & Synthesis model of a compiler.

 The Front-end of the compiler consists of phases that depend primarily on the Source
language and are largely independent of the target machine. For example, the front-end of the
compiler includes the Scanner, Parser, creation of the Symbol table, Semantic Analyzer, and the
Intermediate Code Generator.

 The Back-end of the compiler consists of phases that depend on the target machine; those
portions do not depend on the source language, just on the intermediate language. In
this part we have the different aspects of the Code Optimization phase and code generation, along with the
necessary Error handling and Symbol table operations.

LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as interface
between the compiler and the Source language program and performs the following functions:

 Reads the characters in the Source program and groups them into a stream of tokens in
which each token specifies a logically cohesive sequence of characters, such as an
identifier , a Keyword , a punctuation mark, a multi character operator like := .
 The character sequence forming a token is called a lexeme of the token.

 The Scanner generates a token-id, and also enters that identifier's name in the
Symbol table if it doesn't already exist.
 Also removes the Comments, and unnecessary spaces.

The format of the token is < Token name, Attribute value>

SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner, and its
subsequent phase Semantic Analyzer and performs the following functions:

 Groups the received and recorded token stream into syntactic structures,
usually into a structure called a Parse Tree whose leaves are tokens.
 The interior node of this tree represents the stream of tokens that logically
belongs together.
 It means it checks the syntax of program elements.

SEMANTIC ANALYZER: This phase receives the syntax tree as input, and checks the
semantic correctness of the program. Though the tokens are valid and syntactically correct, it
may happen that they are not correct semantically. Therefore the semantic analyzer checks the
semantics (meaning) of the statements formed.

o The syntactically and semantically correct structures are produced here in the form of
a syntax tree or DAG or some other sequential representation like a matrix.

INTERMEDIATE CODE GENERATOR (ICG): This phase takes the syntactically and
semantically correct structure as input, and produces its equivalent intermediate notation of the
source program. The Intermediate Code should have two important properties specified below:
o It should be easy to produce, and Easy to translate into the target program.
Example intermediate code forms are:
o Three address codes,
o Polish notations, etc.

CODE OPTIMIZER: This phase is optional in some Compilers, but so useful and beneficial in
terms of saving development time, effort, and cost. This phase performs the following specific
functions:
 Attempts to improve the IC so as to have a faster machine code. Typical functions
include –Loop Optimization, Removal of redundant computations, Strength reduction,
Frequency reductions etc.
 Sometimes the data structures used in representing the intermediate forms may also be
changed.

CODE GENERATOR: This is the final phase of the compiler and generates the target code,
normally consisting of the relocatable machine code or Assembly code or absolute machine
code.
 Memory locations are selected for each variable used, and assignment of variables
to registers is done.
 Intermediate instructions are translated into a sequence of machine instructions.
The Compiler also performs the Symbol table management and Error handling throughout
the compilation process. Symbol table is nothing but a data structure that stores different source

language constructs and tokens generated during the compilation. These two interact with
all phases of the Compiler.
For example, if the source program is an assignment statement, the following figure shows how the phases
of the compiler will process the program.

The input source program is position = initial + rate * 60
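
To make the figure concrete, the classic walkthrough of this statement (following Aho et al., the second reference above) proceeds roughly as below; the exact temporaries, token numbering and target instructions are only illustrative.

Lexical analysis produces the token stream  <id,1> <=> <id,2> <+> <id,3> <*> <60>,
with position, initial and rate entered into the symbol table as entries 1, 2 and 3.

Syntax and semantic analysis build the tree for  id1 = id2 + id3 * inttofloat(60),
the conversion being inserted because 60 is an integer while rate is a float.

Intermediate (three-address) code:
    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3

After optimization:
    t1 = id3 * 60.0
    id1 = id2 + t1

Code generation (one possible target sequence):
    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1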


LEXICAL ANALYSIS:

As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output tokens for
each lexeme in the source program. This stream of tokens is sent to the parser for syntax
analysis. It is common for the lexical analyzer to interact with the symbol table as well.

When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter
that lexeme into the symbol table. This process is shown in the following figure.

When the lexical analyzer identifies the first token it sends it to the parser; the parser
receives the token and calls the lexical analyzer for the next token by issuing the
getNextToken() command. This process continues until the lexical analyzer identifies all the
tokens. During this process the lexical analyzer will neglect or discard the white spaces and
comment lines.

TOKENS, PATTERNS AND LEXEMES:


A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. The token names are the input symbols that
the parser processes. In what follows, we shall generally write the name of a token in boldface.
We will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take.


A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example: In the following C language statement,


printf ("Total = %d\n", score);

both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is
a lexeme matching the pattern for token literal (or string).

LEXICAL ANALYSIS Vs PARSING

There are a number of reasons why the analysis portion of a compiler is normally separated
into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of Lexical and
Syntactic analysis often allows us to simplify at least one of these tasks. For example, a
parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have
already been removed by the lexical analyzer.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply


specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.

3. Compiler portability is enhanced: Input-device-specific peculiarities can be


restricted to the lexical analyzer.
INPUT BUFFERING:
Before discussing the problem of recognizing lexemes in the input, let us examine some
ways in which the simple but important task of reading the source program can be sped up. This task
is made difficult by the fact that we often have to look one or more characters beyond the next
lexeme before we can be sure we have the right lexeme. There are many situations where we
need to look at least one additional character ahead. For instance, we cannot be sure we've seen
the end of an identifier until we see a character that is not a letter or digit, and therefore is not
part of the lexeme for id. In C, single-character operators like -, =, or < could also be the
beginning of a two-character operator like ->, ==, or <=. Thus, we shall introduce a two- buffer
scheme that handles large look ahead safely. We then consider an improvement involving
"sentinels" that saves time checking for the ends of buffers.

Buffer Pairs
Because of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source program, specialized
buffering techniques have been developed to reduce the amount of overhead required to process
a single input character. An important scheme involves two buffers that are alternately reloaded.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters in to a buffer, rather than using
one system call per character. If fewer than N characters remain in the input file, then a special
character, represented by eof, marks the end of the source file and is different from any possible
character of the source program. Two pointers to the input are maintained:

1. The Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this
chapter.

Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, 1exemeBegin
is set to the character immediately after the lexeme just found. In Fig, we see forward has passed
the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted
one position to its left.

Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.

As long as we never need to look so far ahead of the actual lexeme that the sum of the
lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the
lexeme in its buffer before determining it.

Sentinels to Improve Scanner Performance:

If we use the above scheme as described, we must check, each time we advance
forward, that we have not moved off one of the buffers; if we do, then we must also reload the
other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and
one to determine what character is read (the latter may be a multi way branch).

We can combine the buffer-end test with the test for the current character if we extend
each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot
be part of the source program, and a natural choice is the character eof. Figure shows the same
arrangement as Figure above but with the sentinels added. Note that eof retains its use as a
marker for the end of the entire input.

Any eof that appears other than at the end of a buffer means that the input is at an end. The figure below
summarizes the algorithm for advancing forward. Notice how the first test, which can be part
of a multiway branch based on the character pointed to by forward, is the only test we make,
except in the case where we actually are at the end of a buffer or the end of the input.

switch ( *forward++ ) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
SPECIFICATION OF TOKENS:

Let us understand how the language theory undertakes the following terms:
1. Alphabets
2. Strings
3. Special symbols
4. Language
5. Longest match rule
6. Operations
7. Notations
8. Representing valid tokens of a language in regular expression
9. Finite automata
1. Alphabets: Any finite set of symbols
o {0,1} is a set of binary alphabets,
o {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets,
o {a-z, A-Z} is a set of English language alphabets.

2. Strings: Any finite sequence of alphabets is called a string.


3. Special symbols: A typical high-level language contains the following symbols:

Arithmetic symbols     Addition (+), Subtraction (-), Multiplication (*), Division (/)

Punctuation            Comma (,), Semicolon (;), Dot (.)

Assignment             =

Special assignment     +=, -=, *=, /=

Comparison             ==, !=, <, <=, >, >=

Preprocessor           #

4. Language: A language is a set of strings over some finite set of alphabets.
5. Operations: The various operations on languages are:
 Union of two languages L and M is written as, L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as, LM = {st | s is in L and tis in M}
o The Kleene Closure of a language L is written as, L* = Zero or more occurrence of
language L.
6. Notations: If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : L(r)UL(s)
Concatenation : L(r)L(s)
Kleene closure : (L(r))*
7.Representing valid tokens of a language in regular expression: If x is a regular expression,
o x* means zero or more occurrence of x.
o x+ means one or more occurrence of x.
8.Finite automata: Finite automata is a state machine that takes a string of symbols as input
and changes its state accordingly. If the input string is successfully processed and the automata
reaches its final state, it is accepted. The mathematical model of finite automata consists of:
o Finite set of states (Q)
o Finite set of input symbols (Σ)
o One Start state (q0)
o Set of final states (qf)
o Transition function (δ)
RECOGNITION OF TOKENS

Starting point is the language grammar to understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term | term
term -> id | number

The next step is to formalize the patterns:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>

We also need to handle whitespaces: ws -> (blank | tab | newline)+


Transition diagram for relop

Transition diagram for reserved words and identifiers

Transition diagram for unsigned numbers

Transition diagram for whitespace
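
As an illustration of how such a transition diagram is turned into code by hand, here is a minimal sketch of a recognizer for relop in C; the states are folded into nested ifs, the token names are illustrative, and a full scanner would also perform the retraction marked with * in the diagram for the other token classes.

typedef enum { LT, LE, NE, EQ, GT, GE, RELOP_NONE } Relop;

/* Try to recognize one relop starting at **p; on success advance *p past the
 * lexeme and return the token, otherwise return RELOP_NONE and leave *p alone. */
Relop scan_relop(const char **p)
{
    const char *s = *p;
    Relop tok = RELOP_NONE;

    if (*s == '<') {                          /* states for <, <=, <> */
        s++;
        if      (*s == '=') { s++; tok = LE; }
        else if (*s == '>') { s++; tok = NE; }
        else                  tok = LT;       /* '*' state: next char not consumed */
    } else if (*s == '>') {                   /* states for >, >= */
        s++;
        if (*s == '=') { s++; tok = GE; }
        else            tok = GT;
    } else if (*s == '=') {
        s++; tok = EQ;
    }

    if (tok != RELOP_NONE) *p = s;
    return tok;
}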


A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
There is a wide range of tools for constructing lexical analyzers.
Lex
YACC
Lex is a computer program that generates lexical analyzers. Lex is commonly used
with the yacc parser generator.
LEX

o Lex is a program that generates lexical analyzer. It is used with YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence
of tokens.
o It reads the input stream and produces as output the source code implementing the
lexical analyzer as a C program.

The function of Lex is as follows:

o Firstly, the lexical analyzer specification is written as a program lex.l in the Lex language. Then the Lex
compiler runs the lex.l program and produces a C program lex.yy.c.
o Finally, the C compiler compiles the lex.yy.c program and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of variables, constants, and regular definitions

Rules are statements of the form

p1 {action1}
p2 {action2}
...
pn {actionn}

where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.

User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.
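To make the three parts concrete, here is a tiny illustrative Lex specification (a sketch, not from the original notes; the file name count.l is only an example). It counts lines and words, and can be built with the commands lex count.l followed by cc lex.yy.c -ll:

%{
#include <stdio.h>
/* definitions section: C declarations and regular definitions */
int lines = 0, words = 0;
%}
word    [A-Za-z0-9]+
%%
\n          { lines++; }
{word}      { words++; }
.           { /* ignore any other character */ }
%%
/* user subroutines section: auxiliary C code used with the analyzer */
int main(void)
{
    yylex();
    printf("lines = %d, words = %d\n", lines, words);
    return 0;
}

Here the rules section contains the pattern/action pairs, and the generated lex.yy.c implements yylex(), the scanner that applies them to the input stream.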

YACC- YET ANOTHER COMPILER-COMPILER

o YACC stands for Yet Another Compiler Compiler.


o YACC provides a tool to produce a parser for a given grammar.
o YACC is a program designed to compile an LALR (1) grammar.
o It is used to produce the source code of the syntactic analyzer of the language described by an LALR (1) grammar.
o The input to YACC is a set of rules (a grammar) and the output is a C program.

DESIGN OF LEXICAL ANALYSER:


1 The Structure of the Generated Analyzer
2 Pattern Matching Based on NFA's
3 DFA's for Lexical Analyzers
4 Implementing the Lookahead Operator
In this section we shall see how a lexical-analyzer generator such as Lex is architected. We discuss two approaches,
based on NFA's and DFA's; the latter is essentially the implementation of Lex.
1. The Structure of the Generated Analyzer
The figure below gives an overview of the architecture of a lexical analyzer generated by Lex. The program that serves as the lexical analyzer includes a fixed program that simulates an automaton; at this point we leave open whether that automaton is deterministic or nondeterministic. The rest of the lexical analyzer consists of components that are created from the Lex program by Lex itself.
A Lex program is turned into a transition table and actions, which are used by a finite-automaton simulator. These components are:
1. A transition table for the automaton.
2. Those functions that are passed directly through Lex to the output.
3. The actions from the input program, which appear as fragments of code to be invoked at the appropriate time by the automaton simulator.
To construct the automaton, we begin by taking each regular-expression pattern in the Lex program and converting it, using the standard algorithm, to an NFA. We need a single automaton that will recognize lexemes matching any of the patterns in the program, so we combine all the NFA's into one by introducing a new start state with ε-transitions to each of the start states of the NFA's Ni for pattern pi. This construction is shown in the figure. We shall illustrate the ideas of this section with the following simple, abstract example of three patterns:

a     { action A1 for pattern p1 }
abb   { action A2 for pattern p2 }
a*b+  { action A3 for pattern p3 }

Note that these three patterns present some conflicts of the type discussed in another section. In particular, the string abb matches both the second and third patterns, but we shall consider it a lexeme for pattern p2, since that pattern is listed first in the above Lex program. Then, input strings such as aabbb... have many prefixes that match the third pattern. The Lex rule is to take the longest, so we continue reading b's until another a is met, whereupon we report the lexeme to be the initial a's followed by as many b's as there are.
2. Pattern Matching Based on NFA's
If the lexical analyzer simulates an NFA such as that of Fig. 3.52, then it must read input beginning at the
point on its input which we have referred to as lexemeBegin. As it moves the pointer called forward ahead in the
input, it calculates the set of states it is in at each point, following Algorithm.
Eventually, the NFA simulation reaches a point on the input where there are no next states. At that point,
there is no hope that any longer prefix of the input would ever get the NFA to an accepting state; rather, the set of
states will always be empty. Thus, we are ready to decide on the longest prefix that is a lexeme matching some
pattern.
We look backwards in the sequence of sets of states, until we find a set that includes one or more accepting
states. If there are several accepting states in that set, pick the one associated with the earliest pattern pi in the list
from the Lex program. Move the forward pointer back to the end of the lexeme, and perform the
action Ai associated with pattern pi.
Example : Suppose we have the patterns of the previous example and the input begins aaba. Figure 3.53 shows the sets of states of the NFA of Fig. 3.52 that we enter, starting with the ε-closure of the initial state 0, which is {0,1,3,7}, and proceeding from there. After reading the fourth input symbol, we are in an empty set of states, since in Fig. 3.52 there are no transitions out of state 8 on input a.
Thus, we need to back up, looking for a set of states that includes an accepting state. Notice that, as
indicated in Fig. 3.53, after reading a we are in a set that includes state 2 and therefore indicates that the pattern a
has been matched. However, after reading aab, we are in state 8, which indicates that a * b + has been matched;
prefix aab is the longest prefix that gets us to an accepting state. We therefore select aab as the lexeme, and execute
action A3, which should include a return to the parser indicating that the token whose pattern is p3 = a * b + has
been found. •
3. DFA's for Lexical Analyzers
Another architecture, resembling the output of Lex, is to convert the NFA for all the patterns into an
equivalent DFA, using the subset construction of Algorithm 3.20. Within each DFA state, if there are one or more
accepting NFA states, determine the first pattern whose accepting state is represented, and make that pattern the
output of the DFA state.
Example : Figure 3.54 shows a transition diagram based on the DFA that is constructed by the subset
construction from the NFA in Fig. 3.52. The accepting states are labeled by the pattern that is identified by that
state. For instance, the state {6,8} has two accepting states, corresponding to patterns abb and a * b + . Since the
former is listed first, that is the pattern associated with state {6,8} . •
We use the DFA in a lexical analyzer much as we did the NFA. We simulate the DFA until at some point
there is no next state (or strictly speaking, the next state is 0, the dead state corresponding to the empty set of NFA
states). At that point, we back up through the sequence of states we entered and, as soon as we meet an accepting
DFA state, we perform the action associated with the pattern for that state.
Example : Suppose the DFA of Fig. 3.54 is given input abba. The sequence of states entered is 0137, 247, 58, 68, and at the final a there is no transition out of state 68. Thus, we consider the sequence from the end, and in this case, 68 itself is an accepting state that reports pattern p2 = abb. •
4. Implementing the Lookahead Operator
The Lex lookahead operator / in a Lex pattern r1/r2 is sometimes necessary, because the pattern r1 for a particular token may need to describe some trailing context r2 in order to correctly identify the actual lexeme. When converting the pattern r1/r2 to an NFA, we treat the / as if it were ε, so we do not actually look for a / on the input. However, if the NFA recognizes a prefix xy of the input buffer as matching this regular expression, the end of the lexeme is not where the NFA entered its accepting state. Rather, the end occurs when the NFA enters a state s such that
1. s has an ε-transition on the (imaginary) /,
2. there is a path from the start state of the NFA to state s that spells out x,
3. there is a path from state s to the accepting state that spells out y, and
4. x is as long as possible for any xy satisfying the conditions above.
UNIT III

 THE ROLE OF PARSER

The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer
and verifies that the string can be generated by the grammar for the source language. It
reports any syntax errors in the program. It also recovers from commonly occurring errors so
that it can continue processing its input.

1. It verifies the structure generated by the tokens based on the grammar.


2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.

Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Variable initialization before use
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.

Syntax error handling:


Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling an identifier, keyword or operator.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.
Functions of error handler:
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.

Error recovery strategies:


The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction

Panic mode recovery:


On discovering an error, the parser discards input symbols one at a time until a
synchronizing token is found. The synchronizing tokens are usually delimiters, such
as semicolon or end. It has the advantage of simplicity and does not go into an infinite loop.
When multiple errors in the same statement are rare, this method is quite useful.

Phrase level recovery:


On discovering an error, the parser performs local correction on the remaining input
that allows it to continue. Example: Insert a missing semicolon or delete an extraneous
semicolon etc.

Error productions:
The parser is constructed using augmented grammar with error productions. If an
error production is used by the parser, appropriate error diagnostics can be generated to
indicate the erroneous constructs recognized by the input.

Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens is
as small as possible. However, these methods are in general too costly in terms of time and
space.

CONTEXT-FREE GRAMMARS
A Context-Free Grammar is a quadruple that consists of
terminals,
Non-terminals,
start symbol and
productions.

Terminals: These are the basic symbols from which strings are formed.
Non-Terminals: These are the syntactic variables that denote a set of strings.
These help to define the language generated by the grammar.
Start Symbol: One non-terminal in the grammar is denoted as the “Start-symbol” and the
setof strings it denotes is the language defined by the grammar.
Productions : It specifies the manner in which terminals and non-terminals can be
combinedto form strings. Each production consists of a non-terminal,
followed by an arrow, followed by a string of non-terminals and terminals.

Example of context-free grammar:

The following grammar defines simple arithmetic expressions


: expr → expr op expr expr → (expr)

expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑

In this grammar,
id + - * / ↑ ( ) are terminals.

expr , op are non-terminals.


expr is the start symbol.
Each line is a production.
Derivations:
Two basic requirements for a grammar are :
1. To generate a valid string.
2. To recognize a valid string.

Derivation is a process that generates a valid string with the help of grammar by
replacing the non-terminals on the left with the string on the right side of the production.

Example : Consider the following grammar for arithmetic expressions :

E→E+E|E*E|(E)|-E| id

To generate a valid string - ( id+id ) from the grammar the steps are
1. E → - E
2. E → - ( E )
3. E → - ( E+E )
4. E → - ( id+E )
5. E → - ( id+id )
In the above derivation,
E is the start symbol
-(id+id) is the required sentence(only terminals).
Strings such as E, -E, -(E), . . . are called sentential forms.

Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
 In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for replacement.
 In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first for replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | - E | id
Sentence to be derived : - (id+id)

Left Most Derivation


E→-E
E→-(E)
E→-(E+E)
E→-(id+E)
E→-(id+id)

Right Most Derivation


E→-E
E→-(E)
E→-(E+E)
E→-(E+id)
E→-(id+id)
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
Sentential forms:
Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or terminals, then α is called a sentential form of G.
Yield or frontier of tree:
Each interior node of a parse tree is a non-terminal, and the children of a node may be terminals or non-terminals. The sentential form obtained by reading the leaves of the parse tree from left to right is called the yield or frontier of the tree.
Ambiguity:
A grammar that produces more than one parse for some sentence is said to be
ambiguous
grammar.
Example : Given grammar G : E → E+E | E*E | ( E ) | - E | id
The sentence id+id*id has the following two distinct leftmost derivations:
First leftmost derivation:
E → E + E
  → id + E
  → id + E * E
  → id + id * E
  → id + id * id

Second leftmost derivation:
E → E * E
  → E + E * E
  → id + E * E
  → id + id * E
  → id + id * id

The two corresponding trees are,

WRITING A GRAMMAR

A grammar consists of a number of productions. Each production has an abstract


symbol called a nonterminal as its left-hand side, and a sequence of one or more
nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal
symbols are drawn from a specified alphabet.

There are four categories in writing a grammar :


1. Regular Expression Vs Context Free Grammar
2. Eliminating ambiguous grammar.
3. Eliminating left-recursion
4. Left-factoring.

Eliminating ambiguity:
Ambiguity of the grammar that produces more than one parse tree for leftmost or
rightmost derivation can be eliminated by re-writing the grammar.

Consider this example,

G : stmt → if expr then stmt
         | if expr then stmt else stmt
         | other

This grammar is ambiguous since the string if E1 then if E2 then S1 else S2 has the following two parse trees for leftmost derivation (Fig. 2.3)

To eliminate ambiguity, the following grammar may be used:

stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
Eliminating Left Recursion:
A grammar is said to be left recursive if it has a non-terminal A such that there is a derivation A => Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars. Hence, left recursion must be eliminated, as follows:
If there is a production A → Aα | β, it can be replaced with
A → βA’
A’ → αA’ | ε
without changing the set of strings derivable from A.

Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar
suitable for predictive parsing. When it is not clear which of two alternative productions to
use to expand a non-terminal A, we can rewrite the A-productions to defer the decision until
we have seen enough of the input to make the right choice.
If there is a production A → αβ1 | αβ2, it can be rewritten as
A → αA’

A’ → β1 | β2

Consider the grammar ,


G : S → iEtS | iEtSeS | a
E→b

Left factored, this grammar becomes


S → iEtSS’ | a

S’ → eS | ε
E → b
PARSING
It is the process of analyzing a continuous stream of input in order to determine its
grammatical structure with respect to a given formal grammar.

Parse tree:

Graphical representation of a derivation or deduction is called a parse tree. Each


interior node of the parse tree is a non-terminal; the children of the node can be terminals or
non-terminals.

Types of parsing:

1. Top down parsing


2. Bottom up parsing

o Top-down parsing : A parser can start with the start symbol and try to transform it to the input string. Example : LL Parsers.
o Bottom-up parsing : A parser can start with the input and attempt to rewrite it into the start symbol. Example : LR Parsers.

TOP-DOWN PARSING

It can be viewed as an attempt to find a left-most derivation for an input string or an attempt
to construct a parse tree for the input starting from the root to the leaves.

Types – of top down parsing :

1. Recursive descent parsing


2. Predictive parsing

RECURSIVE DESCENT PARSING

Typically, top-down parsers are implemented as a set of recursive functions that descend through a parse tree for a string. This approach is known as recursive descent parsing, also known as LL(k) parsing, where the first L stands for left-to-right, the second L stands for leftmost-derivation, and k indicates k-symbol lookahead.

Therefore, a parser using the single-symbol lookahead method and top-down parsing without backtracking is called an LL(1) parser. In the following sections, we will also use an extended BNF notation in which some regular-expression operators are incorporated.
This parsing method may involve backtracking.
Example for :Backtracking

Consider the grammar G : S → cAd


A→ab|a

and the input string w=cad.


The parse tree can be constructed using the following top-down approach :
Step1:
Initially create a tree with single node labeled S. An input pointer points to ‘c’, the first
symbol of w. Expand the tree with the production of S.

Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.
Step3:
The second symbol ‘a’ of w also matches the second leaf of the tree. So advance the input pointer to the third symbol of w, ‘d’. But the third leaf of the tree is b, which does not match the input symbol d. Hence discard the chosen production, reset the input pointer to the second symbol of w, and backtrack.

Step4:

Now try the second alternative for A.

Now we can halt and announce the successful completion of parsing.
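The backtracking behaviour just described can be seen in a short C sketch (not part of the original notes) of a recursive descent parser for the same grammar S → cAd, A → ab | a; the input pointer is saved before trying an alternative for A and restored if that alternative fails:

#include <stdio.h>

static const char *input;          /* the string w being parsed */
static int pos;                    /* the input pointer         */

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int save = pos;                            /* remember the pointer */
    if (match('a') && match('b')) return 1;    /* first try A -> ab    */
    pos = save;                                /* backtrack            */
    return match('a');                         /* then try A -> a      */
}

static int S(void) {
    return match('c') && A() && match('d');    /* S -> cAd             */
}

int main(void) {
    input = "cad"; pos = 0;
    if (S() && input[pos] == '\0')
        printf("parsing of \"%s\" succeeded\n", input);
    else
        printf("parsing of \"%s\" failed\n", input);
    return 0;
}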

Predictive parsing

It is possible to build a non recursive predictive parser by maintaining a stack


explicitly, rather than implicitly via recursive calls. The key problem during predictive
parsing is that of determining the production to be applied for a nonterminal . The non
recursive parser in figure looks up the production to be applied in parsing table. In what
follows, we shall see how the table can be constructed directly from certain grammars.
A table-driven predictive parser has an input buffer, a stack, a parsing table, and an
output stream. The input buffer contains the string to be parsed, followed by $, a symbol used
as a right end marker to indicate the end of the input string. The stack contains a sequence of
grammar symbols with $ on the bottom, indicating the bottom of the stack. Initially, the stack
contains the start symbol of the grammar on top of $. The parsing table is a two dimensional
array M[A,a] where A is a nonterminal, and a is a terminal or the symbol $. The parser is
controlled by a program that behaves as follows. The program considers X, the symbol on the
top of the stack, and a, the current input symbol. These two symbols determine the action of
the parser. There are three possibilities.

1. If X = a = $, the parser halts and announces successful completion of parsing.

2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.

3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X,a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top). As output, we shall assume that the parser just prints the production used; any other code could be executed here. If M[X,a] = error, the parser calls an error recovery routine.
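A compact sketch of this driver loop in C is shown below (not from the original notes). It hard-codes the LL(1) table for the expression grammar E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id that is worked out in the example later in this section; the single characters A, B and i stand for E’, T’ and id respectively, and rhs(X, a) plays the role of the table entry M[X, a].

#include <stdio.h>
#include <string.h>

/* Returns the right-hand side to push for M[X, a], "" for an ε-production,
   or NULL for an error entry. */
static const char *rhs(char X, char a) {
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "TA" : NULL;      /* E  -> T E'   */
    case 'A': if (a == '+') return "+TA";                        /* E' -> + T E' */
              if (a == ')' || a == '$') return "";               /* E' -> ε      */
              return NULL;
    case 'T': return (a == 'i' || a == '(') ? "FB" : NULL;       /* T  -> F T'   */
    case 'B': if (a == '*') return "*FB";                        /* T' -> * F T' */
              if (a == '+' || a == ')' || a == '$') return "";   /* T' -> ε      */
              return NULL;
    case 'F': if (a == 'i') return "i";                          /* F  -> id     */
              if (a == '(') return "(E)";                        /* F  -> ( E )  */
              return NULL;
    }
    return NULL;
}

int main(void) {
    const char *w = "i+i*i$";        /* input id + id * id followed by $     */
    char stack[100] = "$E";          /* $ on the bottom, start symbol on top */
    int top = 1, ip = 0;

    while (1) {
        char X = stack[top], a = w[ip];
        if (X == '$' && a == '$') { printf("accepted\n"); return 0; }
        if (X == a) { top--; ip++; continue; }            /* match a terminal  */
        const char *r = rhs(X, a);                        /* consult M[X, a]   */
        if (r == NULL) { printf("syntax error\n"); return 1; }
        printf("%c -> %s\n", X, *r ? r : "epsilon");      /* print production  */
        top--;                                            /* pop X             */
        for (int k = (int)strlen(r) - 1; k >= 0; k--)     /* push RHS reversed */
            stack[++top] = r[k];
    }
}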
Predictive parsing table construction:
The construction of a predictive parser is aided by two functions associated with a grammar G:
1. FIRST
2. FOLLOW
Rules for first( ):
1. If X is terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is non-terminal and X → aα is a production then add a to FIRST(X).
4. If X is non-terminal and X → Y1 Y2…Yk is a production, then place a in
FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1),…,FIRST(Yi-
1);that is, Y1,….Yi-1=> ε. If ε is in FIRST(Yj) for all j=1,2,..,k, then add ε to
FIRST(X).
Rules for follow( ):
1. If S is a start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in
follow(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β)
contains
ε, then everything in FOLLOW(A) is in FOLLOW(B).
Algorithm for construction of predictive parsing table:

Input : Grammar G

Output : Parsing table M

Method :
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε
is in FIRST(α) and $ is in FOLLOW(A) , add A → α to M[A, $].
4. Make each undefined entry of M be error.
Example:
Consider the following grammar :
E→E+T|T
T→T*F|F
F→(E)|id

After eliminating left-recursion the grammar is


E →TE’
E’ → +TE’ | ε
T →FT’
T’ → *FT’ | ε
F → (E)|id

First( ) :

FIRST(E) = { ( , id}
FIRST(E’) ={+ , ε }
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }

Predictive parsing Table
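Worked out from the FIRST and FOLLOW sets above, the LL(1) parsing table is (blank entries are error entries):

Non-terminal    id           +            *             (            )           $
E               E → TE’                                 E → TE’
E’                           E’ → +TE’                               E’ → ε      E’ → ε
T               T → FT’                                 T → FT’
T’                           T’ → ε       T’ → *FT’                  T’ → ε      T’ → ε
F               F → id                                  F → (E)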

Stack Implementation
LL(1) grammar:
The parsing table entries are single entries, i.e. each location has at most one entry.
This type of grammar is called an LL(1) grammar.

Consider this following grammar:


S→iEtS | iEtSeS| a

E→b

After left factoring, we have


S→iEtSS’|a
S’→ eS | ε
E→b

To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}

FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}

Parsing table:
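Filling the table from these sets (blank entries are errors):

Non-terminal    a            b            e                       i                t          $
S               S → a                                             S → iEtSS’
S’                                        S’ → eS  and  S’ → ε                                S’ → ε
E                            E → b

The entry M[S’, e] receives two productions: S’ → eS from FIRST(eS), and S’ → ε because e is in FOLLOW(S’).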

Since there are more than one production, the grammar is not LL(1) grammar.
 BOTTOM-UP PARSING

Constructing a parse tree for an input string beginning at the leaves and going towards
the root is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce
parser.

SHIFT-REDUCE PARSING

Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse


tree for an input string beginning at the leaves (the bottom) and working up towards the root
(the top). The reductions trace out the right-most derivation in reverse.

Example:
Consider the grammar:
E→E+E
E→E*E
E→(E)
E→id

And the input string id1+id2*id3

The rightmost derivation is :


E→E+E
→ E+E*E
→ E+E*id3
→ E+id2*id3
→ id1+id2* id3

In the above derivation the underlined substrings are called handles.


Handle pruning:
A rightmost derivation in reverse can be obtained by “handle pruning”, i.e. if w is a sentence or string of the grammar at hand, then w = γn, where γn is the nth right sentential form of some rightmost derivation.

Actions in shift-reduce parser:

• shift - The next input symbol is shifted onto the top of the stack.
• reduce - The parser replaces the handle within a stack with a non-terminal.
• accept - The parser announces successful completion of parsing.
• error - The parser discovers that a syntax error has occurred and calls an error recovery
routine.
Conflicts in shift-reduce parsing:
There are two conflicts that occur in shift-reduce parsing:
1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.

Stack implementation of shift-reduce parsing :

1. Shift-reduce conflict:
Example:
Consider the grammar:
E→E+E | E*E | id and input id+id*id
2. Reduce-reduce conflict:
Consider the grammar: M→R+R|R+c|R
R→c

and input c+c

Viable prefixes:

The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes. Equivalently, α is a viable prefix of the grammar if there is some w such that αw is a right sentential form.
The set of viable prefixes is a regular language.

LR PARSERS

An efficient bottom-up syntax analysis technique that can be used for a large class of CFGs is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols of lookahead. When ‘k’ is omitted, it is assumed to be 1.

Advantages of LR parsing:
1. It recognizes virtually all programming language constructs for which CFG can be
written.
2. It is an efficient non-backtracking shift-reduce parsing method.
3. The class of grammars that can be parsed using the LR method is a proper superset of the class of grammars that can be parsed with predictive parsers.
4. It detects a syntactic error as soon as possible.

Drawbacks of LR method:

It is too much work to construct an LR parser by hand for a programming language grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.

Types of LR parsing method:


1. SLR- Simple LR
Easiest to implement, least powerful.
2. CLR- Canonical LR
Most powerful, most expensive.
3. LALR- Look-Ahead LR
Intermediate in size and cost between the other two methods.

The LR parsing algorithm:


The schematic form of an LR parser is as follows:
It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto).
o The driver program is the same for all LR parsers.
o The parsing program reads characters from an input buffer one at a time.
o The program uses a stack to store a string of the form s0X1s1X2s2…Xmsm, where sm is on top. Each Xi is a grammar symbol and each si is a state.
o The parsing table consists of two parts : action and goto functions.

Action : The parsing program determines sm, the state currently on top of stack, and ai,
the current input symbol. It then consults action[sm,ai] in the action table which can have one
of four values:

1. shift s, where s is a state,


2. reduce by a grammar production A → β,

3. accept,

4. error.
Goto : The function goto takes a state and grammar symbol as arguments and produces a
state.
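The driver loop itself is short; the sketch below (not from the original notes) hard-codes, as the C functions action() and go_to(), the CLR (1) table that is constructed later in this unit for the grammar (1) S → AA, (2) A → aA, (3) A → b, and runs it on the input abb$.

#include <stdio.h>

typedef struct { char kind; int num; } Act;   /* kind: 's' shift, 'r' reduce, 'a' accept, 0 error */

static Act action(int s, char c) {
    Act t = {0, 0};
    switch (s) {
    case 0: if (c == 'a') t = (Act){'s', 3}; else if (c == 'b') t = (Act){'s', 4}; break;
    case 1: if (c == '$') t = (Act){'a', 0}; break;
    case 2: if (c == 'a') t = (Act){'s', 6}; else if (c == 'b') t = (Act){'s', 7}; break;
    case 3: if (c == 'a') t = (Act){'s', 3}; else if (c == 'b') t = (Act){'s', 4}; break;
    case 4: if (c == 'a' || c == 'b') t = (Act){'r', 3}; break;
    case 5: if (c == '$') t = (Act){'r', 1}; break;
    case 6: if (c == 'a') t = (Act){'s', 6}; else if (c == 'b') t = (Act){'s', 7}; break;
    case 7: if (c == '$') t = (Act){'r', 3}; break;
    case 8: if (c == 'a' || c == 'b') t = (Act){'r', 2}; break;
    case 9: if (c == '$') t = (Act){'r', 2}; break;
    }
    return t;
}

static int go_to(int s, char N) {              /* the goto part of the table */
    if (s == 0) return N == 'S' ? 1 : 2;
    if (s == 2) return 5;
    if (s == 3) return 8;
    if (s == 6) return 9;
    return -1;
}

static const char lhs[] = { 0, 'S', 'A', 'A' };   /* left side of productions 1..3 */
static const int  len[] = { 0,  2,   2,   1  };   /* length of their right sides   */

int main(void) {
    const char *w = "abb$";                    /* input string followed by $ */
    int stack[100], top = 0, ip = 0;
    stack[0] = 0;                              /* state 0 on top of the stack */
    while (1) {
        Act t = action(stack[top], w[ip]);
        if (t.kind == 's') { stack[++top] = t.num; ip++; }            /* shift  */
        else if (t.kind == 'r') {                                      /* reduce */
            top -= len[t.num];
            stack[top + 1] = go_to(stack[top], lhs[t.num]);
            top++;
            printf("reduce by production %d\n", t.num);
        }
        else if (t.kind == 'a') { printf("accepted\n"); return 0; }    /* accept */
        else { printf("syntax error\n"); return 1; }                   /* error  */
    }
}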

LR Parser:

LR parsing is one type of bottom up parsing. It is used to parse the large class of grammars.

In the LR parsing, "L" stands for left-to-right scanning of the input.

"R" stands for constructing a right most derivation in reverse.

"K" is the number of input symbols of the look ahead used to make number of parsing
decision.
LR algorithm:

The LR algorithm requires stack, input, output and parsing table. In all type of LR parsing,
input, output and stack are same but parsing table is different.

Fig: Block diagram of LR parser

Input buffer is used to indicate end of input and it contains the string to be parsed followed by
a $ Symbol.

A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the stack.

Parsing table is a two dimensional array. It contains two parts: Action part and Go To part.

LR (1) Parsing

Various steps involved in the LR (1) Parsing:

o For the given input string write a context free grammar.


o Check the ambiguity of the grammar.
o Add Augment production in the given grammar.
o Create Canonical collection of LR (0) items.
o Draw the DFA (deterministic finite automaton) for the collection of item sets.
o Construct a LR (1) parsing table.

Augment Grammar

Augmented grammar G` will be generated if we add one more production in the given
grammar G. It helps the parser to identify when to stop the parsing and announce the
acceptance of the input.

Example

Given grammar

1. S → AA
2. A → aA | b
The Augment grammar G` is represented by

1. S`→ S
2. S → AA
3. A → aA | b
SLR (1) Parsing

SLR (1) refers to simple LR parsing. It is the same as LR (0) parsing; the only difference is in the parsing table. To construct the SLR (1) parsing table, we use the canonical collection of LR (0) items.

In SLR (1) parsing, we place a reduce move only under the symbols in the FOLLOW set of the left-hand side of the production.

Various steps involved in the SLR (1) Parsing:

o For the given input string write a context free grammar


o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (0) items
o Draw the DFA (deterministic finite automaton) for the collection of item sets
o Construct a SLR (1) parsing table

SLR (1) Table Construction

The steps used to construct the SLR (1) table are given below:

SLR ( 1 ) Grammar

E→E+T|T
T→T*F|F
F → (E) | id

Add Augment Production and insert '•' symbol at the first position for every production in G
E`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F→•(E)
F → •id
Add Augment production to the I0 State and Compute the Closure
I0:E`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F→•(E)
F → •id
Go to (I0, E) = closure (E` → E•, E → E• + T) =I1
Go to (I0, T) = closure (E → T•, T→ T• * F) =I2
Go to (I0, F) = Closure ( T → F• ) = T → F•= I3
Go to (I0, ( ) = Closure (F→(•E),E→•E+T, E→•T, T→•T*F, T→•F, F→•(E), F → •id ) = I4
Go to (I0, id) = closure ( F → id•) = F → id• = I5

Go to (I1, +) = Closure (E → E +•T, T→•T*F, T→•F, F→•(E), F → •id) = I6

Go to (I2, *) = Closure(T→ T *• F, F→•(E), F → •id) = I7


Go to (I4, E) = Closure(F→(E•), E→E•+T) = I8
Go to (I4, T) = Closure(E → T•, T→ T• * F) =I2
Go to (I4, F) = Closure( T → F• ) = T → F•= I3
Go to (I4, ( ) = Closure(F→(•E),E→•E+T, E→•T, T→•T*F, T→•F, F→•(E), F → •id ) = I4
Go to (I4, id) = Closure( F → id•) = F → id• = I5

Go to (I6, T) = Closure(E→E+T•, T→T•*F) = I9


Go to (I6, F) = Closure( T → F• ) = T → F•= I3
Go to (I6, ( ) = Closure(F→(•E),E→•E+T, E→•T, T→•T*F, T→•F, F→•(E), F → •id ) = I4
Go to (I6, id) = Closure( F → id•) = F → id• = I5

Go to (I7, F) = Closure(T→T*F•)= I10


Go to (I7, ( ) = Closure(F→(•E),E→•E+T, E→•T, T→•T*F, T→•F, F→•(E), F → •id ) = I4
Go to (I7, id ) = Closure( F → id•) = F → id• = I5

Go to (I8, ) ) = Closure (F → (E)•) = I11


Go to (I8, + ) = Closure (E → E +•T, T→•T*F, T→•F, F→•(E), F → •id) = I6

Go to (I9, * ) = Closure(T→ T *• F, F→•(E), F → •id) = I7

SLR (1) Table

State     +       *       (       )       id      $       |   E    T    F
I0                        S4              S5              |   1    2    3
I1        S6                                      Accept  |
I2        R2      S7              R2              R2      |
I3        R4      R4              R4              R4      |
I4                        S4              S5              |   8    2    3
I5        R6      R6              R6              R6      |
I6                        S4              S5              |        9    3
I7                        S4              S5              |             10
I8        S6                      S11                     |
I9        R1      S7              R1              R1      |
I10       R3      R3              R3              R3      |
I11       R5      R5              R5              R5      |

Explanation:

First(F) = { ( , id }
First(T) = { ( , id }
First(E) = { ( , id }
Follow(E) = First(+T) ∪ {$} = { +, ), $ }
Follow(T) = First(*F) ∪ Follow(E) = { *, +, ), $ }
Follow(F) = { *, +, ), $ }

o I1 contains the final item which drives E` → E• and follow (E`) = {$}, so action
{I1, $} = Accept
o I2 contains the final item which drives E → T• and follow (E) = {+,), $}, so action
{I2, +} = R2, action {I2, $} = R2, action {I2, )} = R2
o I3 contains the final item which drives T → F• and follow (T) = {+, *,), $}, so
action {I3, +} = R4, action {I3, *} = R4, action {I3, $} = R4, action {I3, )} = R4
o I5 contains the final item which drives F → id• and follow (F) = {+, *, $,)}, so
action {I5, +} = R6, action {I5, *} = R6, action {I5, $} = R6, action {I5, )} = R6
o I9 contains the final item which drives E → E + T• and follow (E) = {+,), $}, so
action {I9, +} = R1, action {I9, $} = R1, action {I9, )} = R1
o I10 contains the final item which drives T → T * F• and follow (T) = {+, *,), $}, so
action {I10, +} = R3, action {I10, *} = R3, action {I10, $} = R3,action {I10, )} = R3
o I11 contains the final item which drives F → (E)• and follow (F) = {+, *, ), $}, so
action {I11, +} = R5, action {I11, *} = R5, action {I11, $} = R5, action {I11, )} = R5.

Parsing:

Stack                Input          Action

0                    id*id+id$
0 id 5               *id+id$        action(0, id) = S5 : shift id and 5
0 F 3                *id+id$        action(5, *) = r6 : reduce F → id ; goto(0, F) = 3
0 T 2                *id+id$        action(3, *) = r4 : reduce T → F ; goto(0, T) = 2
0 T 2 * 7            id+id$         action(2, *) = S7 : shift * and 7
0 T 2 * 7 id 5       +id$           action(7, id) = S5 : shift id and 5
0 T 2 * 7 F 10       +id$           action(5, +) = r6 : reduce F → id ; goto(7, F) = 10
0 T 2                +id$           action(10, +) = r3 : reduce T → T*F ; goto(0, T) = 2
0 E 1                +id$           action(2, +) = r2 : reduce E → T ; goto(0, E) = 1
0 E 1 + 6            id$            action(1, +) = S6 : shift + and 6
0 E 1 + 6 id 5       $              action(6, id) = S5 : shift id and 5
0 E 1 + 6 F 3        $              action(5, $) = r6 : reduce F → id ; goto(6, F) = 3
0 E 1 + 6 T 9        $              action(3, $) = r4 : reduce T → F ; goto(6, T) = 9
0 E 1                $              action(9, $) = r1 : reduce E → E+T ; goto(0, E) = 1
0 E 1                $              action(1, $) = Accept

CLR ( 1 ) Grammar

CLR refers to canonical LR parsing. CLR parsing uses the canonical collection of LR (1) items to build the CLR (1) parsing table. The CLR (1) parsing table produces more states as compared to the SLR (1) parsing table.

In CLR (1), we place a reduce move only under the lookahead symbols of the completed item.

Various steps involved in the CLR (1) Parsing:

o For the given input string write a context free grammar


o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (0) items
o Draw the DFA (deterministic finite automaton) for the collection of item sets
o Construct a CLR (1) parsing table

LR (1) item

An LR (1) item is an LR (0) item together with a lookahead symbol:

LR (1) item = LR (0) item + lookahead

The lookahead is used to determine where we place the reduce move for a completed (final) item.

The lookahead for the augment production is always the $ symbol.
Example

CLR ( 1 ) Grammar

S → AA

A → aA

A→b

Add Augment Production, insert '•' symbol at the first position for every production in G and
also add the lookahead.

S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "." is followed by the non-terminal.
So, the I0 State becomes

I0= S`→•S,$
S → •AA, $

Add all productions starting with A in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.

I0= S`→•S,$
S→•AA,$
A→•aA,a/b
A → •b, a/b

I1= Goto(I0,S)=closure(S`→S•,$)=S`→S•,$
I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "." is followed by the non-terminal.
So, the I2 State becomes

I2= S→A•A,$
A→•aA,$
A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "." is followed by the non-terminal.
So, the I3 State becomes

I3= A→a•A,a/b
A→•aA,a/b
A → •b, a/b

Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure (A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) = S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "." is followed by the non-terminal.
So, the I6 State becomes

I6= A→a•A,$
A→•aA,$
A → •b, $

Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $

CLR (1) Parsing table:

State     a        b        $        |   S    A
0         S3       S4                |   1    2
1                           Accept   |
2         S6       S7                |        5
3         S3       S4                |        8
4         r3       r3                |
5                           r1       |
6         S6       S7                |        9
7                           r3       |
8         r2       r2                |
9                           r2       |
Productions are numbered as follows:

1. S → AA ... (1)
2. A → aA ....(2)
3. A → b ... (3)

The placement of the shift moves in the CLR (1) parsing table is the same as in the SLR (1) parsing table; the only difference is in the placement of the reduce moves.

I4 contains the final item which drives ( A → b•, a/b), so action {I4, a} = R3, action {I4, b} =
R3.
I5 contains the final item which drives ( S → AA•, $), so action {I5, $} = R1.
I7 contains the final item which drives ( A → b•,$), so action {I7, $} = R3.
I8 contains the final item which drives ( A → aA•, a/b), so action {I8, a} = R2, action {I8, b}
=R2.
I9 contains the final item which drives ( A → aA•, $), so action {I9, $} = R2.

LALR (1) Parsing:

LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the
canonical collection of LR (1) items.

In the LALR (1) parsing, the LR (1) items which have same productions but different look
ahead are combined to form a single set of items

LALR (1) parsing is same as the CLR (1) parsing, only difference in the parsing table.

Example
S → AA
A → aA
A→b

Add Augment Production, insert '•' symbol at the first position for every production in G and
also add the look ahead.

S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the Closure


I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "•" is followed by the non-terminal.
So, the I0 State becomes

I0 = S` → •S, $

S → •AA, $

Add all productions starting with A in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.

I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b

I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $


I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "•" is followed by the non-terminal.
So, the I2 State becomes

I2= S → A•A, $
A → •aA, $
A → •b, $

I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "•" is followed by the non-terminal.
So, the I3 State becomes

I3= A → a•A, a/b


A → •aA, a/b
A → •b, a/b

Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)


Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b


I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "•" is followed by the non-terminal.
So, the I6 State becomes

I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $


I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) A → aA•, $

If we analyze then LR (0) items of I3 and I6 are same but they differ only in their lookahead.

I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}

I6= { A → a•A, $
A → •aA, $
A → •b, $
}

Clearly I3 and I6 are same in their LR (0) items but differ in their lookahead, so we can
combine them and called as I36.

I36 = { A → a•A, a/b/$


A → •aA, a/b/$
A → •b, a/b/$
}

The I4 and I7 are same but they differ only in their look ahead, so we can combine them and
called as I47.

I47 = {A → b•, a/b/$}

The I8 and I9 are same but they differ only in their look ahead, so we can combine them and
called as I89.

I89 = {A → aA•, a/b/$}

LALR (1) Parsing table:

State     a        b        $        |   S    A
0         S36      S47               |   1    2
1                           Accept   |
2         S36      S47               |        5
36        S36      S47               |        89
47        r3       r3       r3       |
5                           r1       |
89        r2       r2       r2       |
Difference between LL and LR parser:
o The first L of LL stands for left-to-right scanning and the second L for leftmost derivation; in LR, the L stands for left-to-right scanning and the R for rightmost derivation (in reverse).
o An LL parser follows the leftmost derivation; an LR parser follows the reverse of the rightmost derivation.
o With an LL parser the parse tree is constructed in a top-down manner; with an LR parser it is constructed in a bottom-up manner.
o In an LL parser, non-terminals are expanded; in an LR parser, handles are compressed (reduced) to non-terminals.
o An LL parser starts with the start symbol (S); an LR parser ends with the start symbol (S).
o An LL parser ends when the stack becomes empty; an LR parser starts with an empty stack.
o LL corresponds to a pre-order traversal of the parse tree; LR corresponds to a post-order traversal of the parse tree.
o In LL, a terminal is read after popping it out of the stack; in LR, a terminal is read before pushing it onto the stack.
o LL parsing may use backtracking or dynamic programming; LR parsing uses dynamic programming.
o LL parsers are easier to write; LR parsers are difficult to write.
o Examples: LL(0), LL(1) versus LR(0), SLR(1), LALR(1), CLR(1).
UNIT IV
 Syntax Directed Definition

Syntax Directed Definition (SDD) is a kind of abstract specification. It is a generalization of context free grammar in which each grammar production X --> a is associated with a set of semantic rules of the form s = f(b1, b2, ……bk), where s is an attribute obtained from the function f. An attribute can be a string, number, type or a memory location. Semantic rules are fragments of code which are usually embedded at the end of the production and enclosed in curly braces ({ }).
Example:
E --> E1 + T { E.val = E1.val + T.val}

Annotated Parse Tree – The parse tree containing the values of attributes at each node for
given input string is called annotated or decorated parse tree.
Features –
 High level specification
 Hides implementation details
 Explicit order of evaluation is not specified

Types of attributes – There are two types of attributes:

1. Synthesized Attributes – These are those attributes which derive their values from their
children nodes i.e. value of synthesized attribute at node is computed from the values of
attributes at children nodes in parse tree.
Example:
E --> E1 + T { E.val = E1.val + T.val}
In this, E.val derives its value from E1.val and T.val
Computation of Synthesized Attributes –
 Write the SDD using appropriate semantic rules for each production in given grammar.
 The annotated parse tree is generated and attribute values are computed in bottom up
manner.
 The value obtained at root node is the final output.
Example: Consider the following grammar
 S --> E
 E --> E1 + T
 E --> T
 T --> T1 * F
 T --> F
 F --> digit
 The SDD for the above grammar can be written as follow
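Since the table itself is not reproduced in these notes, one standard way (an assumed reconstruction) to write the semantic rules for this grammar is:

Production          Semantic Rules
S --> E             print(E.val)
E --> E1 + T        E.val = E1.val + T.val
E --> T             E.val = T.val
T --> T1 * F        T.val = T1.val * F.val
T --> F             T.val = F.val
F --> digit         F.val = digit.lexval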

Let us assume an input string 4 * 5 + 6 for computing synthesized attributes. The annotated
parse tree for the input string is
For computation of attributes we start from leftmost bottom node.
 The rule F –> digit is used to reduce digit to F and the value of digit is obtained
from lexical analyzer which becomes value of F i.e. from semantic action F.val =
digit.lexval.
 Hence, F.val = 4 and since T is parent node of F so, we get T.val = 4 from semantic
action T.val = F.val.
 Then, for T –> T1 * F production, the corresponding semantic action is T.val =
T1.val * F.val . Hence, T.val = 4 * 5 = 20
 Similarly, the combination of E1.val + T.val becomes E.val, i.e. E.val = E1.val + T.val = 20 + 6 = 26. Then the production S --> E is applied, and the semantic action associated with it prints the result E.val. Hence, the output will be 26.

2. Inherited Attributes – These are the attributes which derive their values from their parent or sibling nodes, i.e. the value of an inherited attribute is computed from the values of the parent or sibling nodes.

Example:
L --> T { L.in = T.type }

Computation of Inherited Attributes –


 Construct the SDD using semantic actions.
 The annotated parse tree is generated and attribute values are computed in top down
manner.
Example: Consider the following grammar
D--> T L

T --> int

T --> float

T --> double

L --> L1, id
L --> id

The SDD for the above grammar can be written as follow
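Again the table is not reproduced here; a usual reconstruction of the semantic rules, using the Enter_type function mentioned below, is:

Production          Semantic Rules
D --> T L           L.in = T.type
T --> int           T.type = int
T --> float         T.type = float
T --> double        T.type = double
L --> L1, id        L1.in = L.in ;  Enter_type(id.entry, L.in)
L --> id            Enter_type(id.entry, L.in)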


Let us assume an input string int a, c for computing inherited attributes. The annotated
parse tree for the input string is

The value of L nodes is obtained from T.type (sibling) which is basically lexical value
obtained as int, float or double. Then L node gives type of identifiers a and c. The
computation of type is done in top down manner or preorder traversal. Using function
Enter_type the type of identifiers a and c is inserted in symbol table at corresponding
id.entry.

 Syntax Tree
Tree in which each leaf node describes an operand & each interior node an operator. The
syntax tree is shortened form of the Parse Tree.

Example1 − Draw Syntax Tree for the string a + b ∗ c − d.


Rules for constructing a syntax tree

Each node in a syntax tree can be executed as data with multiple fields.

In the node for an operator, one field recognizes the operator and the remaining field
includes a pointer to the nodes for the operands.

The operator is known as the label of the node. The following functions are used to create
the nodes of the syntax tree for the expressions with binary operators.

Each function returns a pointer to the recently generated node.

 mknode (op, left, right) − It generates an operator node with label op and two field
including pointers to left and right.
 mkleaf (id, entry) − It generates an identifier node with label id and the field
including the entry, a pointer to the symbol table entry for the identifier.
 mkleaf (num, val) − It generates a number node with label num and a field including
val, the value of the number.
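A possible C rendering of these constructor functions is sketched below (not from the original notes; since C has no overloading, mkleaf is split into mkleaf_id and mkleaf_num):

#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char label[8];               /* operator symbol, "id" or "num"    */
    struct Node *left, *right;   /* children; NULL for leaf nodes     */
    void *entry;                 /* symbol-table entry for an id leaf */
    int   val;                   /* numeric value for a num leaf      */
} Node;

Node *mknode(const char *op, Node *l, Node *r) {   /* interior operator node */
    Node *p = calloc(1, sizeof *p);
    strncpy(p->label, op, sizeof p->label - 1);
    p->left = l;
    p->right = r;
    return p;
}

Node *mkleaf_id(void *entry) {                     /* identifier leaf */
    Node *p = calloc(1, sizeof *p);
    strcpy(p->label, "id");
    p->entry = entry;
    return p;
}

Node *mkleaf_num(int val) {                        /* number leaf */
    Node *p = calloc(1, sizeof *p);
    strcpy(p->label, "num");
    p->val = val;
    return p;
}

The p1 … p5 sequence in the example that follows maps directly onto calls of these functions.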
For example, construct a syntax tree for an expression a − 4 + c. In the following sequence, p1, p2, ….. p5 are pointers to the nodes created, while entry a and entry c are pointers to the symbol-table entries for the identifiers 'a' and 'c' respectively.

 p1− mkleaf (id, entry a);


 p2− mkleaf (num, 4);
 p3− mknode ( ′−′, p1, p2)
 p4− mkleaf(id, entry c)
 p5− mknode(′+′, p3, p4);
The tree is generated in a bottom-up fashion.

 The function calls mkleaf (id, entry a) and mkleaf (num 4) construct the leaves for a
and 4.
 The pointers to these nodes are stored using p1and p2. The call mknodes (′−′, p1, p2 )
then make the interior node with the leaves for a and 4 as children. The syntax tree
will be

Syntax Directed Translation of Syntax Trees

Production Semantic Action

E → E(1) + E(2) {E. VAL = Node (+, E(1). VAL, E(2). VAL)}

E → E(1) ∗ E(2) {E. VAL = Node (∗, E(1). VAL, E(2). VAL)})

E → (E(1)) {E. VAL = E(1). VAL}

E →- E(1) {E. VAL = UNARY (−, E(1). VAL}

E → id {E. VAL = Leaf (id)}

Node (+, E(1).VAL, E(2).VAL) will create a node labeled +.
E(1).VAL and E(2).VAL are the left and right children of this node.

Similarly, Node (∗, E(1).VAL, E(2).VAL) will create a node labeled ∗ with E(1).VAL and E(2).VAL as its children.

Function UNARY (−, E(1).VAL) will create a node – (unary minus), and E(1).VAL will be its only child.

Function LEAF (id) will create a leaf node with label id.

Example2 − Construct a syntax tree for the expression.

 a = b ∗ −c + d
 Solution

Example3 − Construct a syntax tree for a statement.

If a = b then b = 2 * c

Solution

Example4 − Consider the following code. Draw its syntax Tree

If x > 0 then x = 2 * (a + 1) else x = x + 1.


Example5 − Draw syntax tree for following arithmetic expression a * (b + c) – d /2. Also,
write given expression in postfix form.
 Intermediate code Generation

Intermediate code is used to translate the source code into the machine code. Intermediate
code lies between the high-level language and the machine language.

Fig: Position of intermediate code generator

o If the compiler directly translates source code into the machine code without
generating intermediate code then a full native compiler is required for each new
machine.
o The intermediate code keeps the analysis portions same for all the compilers that's
why it doesn't need a full compiler for every unique machine.
o Intermediate code generator receives input from its predecessor phase and semantic
analyzer phase. It takes input in the form of an annotated syntax tree.
o Using the intermediate code, the second phase of the compiler synthesis phase is
changed according to the target machine.

Intermediate representation

Intermediate code can be represented in two ways:

1. High Level intermediate code:

High level intermediate code can be represented as source code. To enhance performance of
source code, we can easily apply code modification. But to optimize the target machine, it is
less preferred.

2. Low Level intermediate code

Low level intermediate code is close to the target machine, which makes it suitable for
register and memory allocation etc. it is used for machine-dependent optimizations.

The following are commonly used intermediate code representations:


1. Postfix Notation: Also known as reverse Polish notation or suffix notation. The
ordinary (infix) way of writing the sum of a and b is with an operator in the middle: a +
b The postfix notation for the same expression places the operator at the right end as ab
+. In general, if e1 and e2 are any postfix expressions, and + is any binary operator, the
result of applying + to the values denoted by e1 and e2 is postfix notation by e1e2 +. No
parentheses are needed in postfix notation because the position and arity (number of
arguments) of the operators permit only one way to decode a postfix expression. In
postfix notation, the operator follows the operand.

Example 1: The postfix representation of the expression (a + b) * c is : ab + c *


Example 2: The postfix representation of the expression (a – b) * (c + d) + (a – b) is:
ab–cd+*ab-+

2. Three-Address Code: A statement involving no more than three references (two for
operands and one for result) is known as a three address statement. A sequence of three
address statements is known as a three address code. Three address statement is of form
x = y op z, where x, y, and z will have address (memory location). Sometimes a
statement might contain less than three references but it is still called a three address
statement.
Example: The three address code for the expression a + b * c + d is:
T1 = b * c
T2 = a + T1
T3 = T2 + d
where T1, T2, T3 are temporary variables.
There are 3 ways to represent a Three-Address Code in compiler design:
i)Quadruples
ii)Triples
iii)Indirect Triples

3. Syntax Tree: A syntax tree is nothing more than a condensed form of a parse tree. The
operator and keyword nodes of the parse tree are moved to their parents and a chain of
single productions is replaced by the single link in the syntax tree the internal nodes are
operators and child nodes are operands. To form a syntax tree put parentheses in the
expression, this way it’s easy to recognize which operand should come first.
 Abstract Syntax Tree

Syntax trees are called as Abstract Syntax Trees because-

 They are abstract representation of the parse trees.


 They do not provide every characteristic information from the real syntax.
 For example- no rule nodes, no parenthesis etc.

Example 1

Considering the following grammar-

E→E+T|T

T→TxF|F

F → ( E ) | id

Generate the following for the string id + id x id

1. Parse tree
2. Syntax tree
3. Directed Acyclic Graph (DAG)

Parse Tree-
Syntax Tree-

Directed Acyclic Graph-

Example 2

Construct a syntax tree for the following arithmetic expression-

( a + b ) * ( c – d ) + ( ( e / f ) * ( a + b ))

Solution-
Step-01:

We convert the given arithmetic expression into a postfix expression as-

(a+b)*(c–d)+((e/f)*(a+b))

ab+ * ( c – d ) + ( ( e / f ) * ( a + b ) )

ab+ * cd- + ( ( e / f ) * ( a + b ) )

ab+ * cd- + ( ef/ * ( a + b ) )

ab+ * cd- + ( ef/ * ab+ )


ab+ * cd- + ef/ab+*

ab+cd-* + ef/ab+*

ab+cd-*ef/ab+*+

Step-02:

We draw a syntax tree for the above postfix expression.

Steps Involved
Start pushing the symbols of the postfix expression into the stack one by one.

When an operand is encountered,

 Push it into the stack.


When an operator is encountered,

 Pop the two operand nodes on top of the stack.
 Create a new node with the operator as its label and the two popped nodes as its children.
 Push a pointer to this new node back into the stack.
Continue in a similar manner and draw the syntax tree simultaneously.
The required syntax tree is-

 Three address code:

o Three-address code is an intermediate code. It is used by the optimizing compilers.


o In three-address code, the given expression is broken down into several separate
instructions. These instructions can easily translate into assembly language.
o Each Three address code instruction has at most three operands. It is a combination of
assignment and a binary operator.

Example

Given Expression:

a := (-c * b) + (-c * d)

Three-address code is as follows:


t1 := -c
t2 := b*t1
t3 := -c
t4 := d * t3
t5 := t2 + t4
a := t5

The temporaries t1 … t5 act like registers; they are mapped to registers or memory locations in the target program.

The three address code can be represented in two forms: quadruples and triples.

Quadruples
The quadruples have four fields to implement the three address code. The field of quadruples
contains the name of the operator, the first source operand, the second source operand and the
result respectively.

Fig: Quadruples field

Example
a := -b * c + d
Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
These statements are represented by quadruples as follows:

Operator Argument 1 Argument 2 Result


(0) Uminus b - t1
(1) + c d t2
(2) * t1 t2 t3
(3) = t3 - a
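A quadruple record of this kind might be declared in C as sketched below (an illustration, not from the original notes); the array shows the four quadruples of the example above:

struct quad {
    char op[8];        /* operator, e.g. "uminus", "+", "*", "=" */
    char arg1[8];      /* first source operand                   */
    char arg2[8];      /* second source operand, or "" if unused */
    char result[8];    /* result name or temporary               */
};

struct quad code[4] = {
    { "uminus", "b",  "",   "t1" },
    { "+",      "c",  "d",  "t2" },
    { "*",      "t1", "t2", "t3" },
    { "=",      "t3", "",   "a"  }
};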

Triples:

The triples have three fields to implement the three address code. The field of triples contains
the name of the operator, the first source operand and the second source operand.

In triples, the results of respective sub-expressions are denoted by the position of expression.
Triple is equivalent to DAG while representing expressions.

Fig: Triples field


Example
a := -b * c + d
Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
These statements are represented by triples follows:

Operator Argument 1 Argument 2


(0) Uminus b -
(1) + c d
(2) * (0) (1)
(3) = a (2)

Indirect Triples:

It uses Indirect Addressing

a := -b * c + d
Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
These statements are represented by Indirect triples follows:

Address
(50) (0)
(51) (1)
(52) (2)
(53) (3)

Operator Argument 1 Argument 2


(50) Uminus b -
(51) + c d
(52) * (50) (51)
(53) = a (52)
 SDT to Three address code:

1. Translation of Assignment Statements

In syntax directed translation, the assignment statement mainly deals with expressions. The expression can be of type real, integer, array or record.

Consider the grammar

1. S → id := E
2. E → E1 + E2
3. E → E1 * E2
4. E → (E1)
5. E → id

The translation scheme of above grammar is given below:

Production rule Semantic actions

S → id :=E {gen (id.place=E.place);}

E → E1 + E2 {E.place=newtemp();
Emit (E.place = E1.place '+' E2.place)
}

E → E1 * E2 {E.place=newtemp();
Emit (E.place = E1.place '*' E2.place)
}

E → (E1) {E.place = E1.place}

E → id {E.place = id.place}

o The Emit function is used for appending the three address code to the output file; if a name used in the statement is not found in the symbol table, an error is reported instead.
o The newtemp() function is used to generate new temporary variables.
o E.place holds the name of the location that holds the value of E.
For example, for the statement a := x + y * z the three address code generated is:
t1 = y * z
t2 = x + t1
a = t2

2. Translation of Boolean Expression Numerical Representation

Boolean expressions have two primary purposes. They are used for computing the logical
values. They are also used as conditional expression using if-then-else or while-do.

Consider the grammar

1. E → E OR E
2. E → E AND E
3. E → NOT E
4. E → (E)
5. E → id relop id
6. E → TRUE
7. E → FALSE

The relop stands for any of the relational operators <, <=, =, <>, >, >=.

The AND and OR operators are left associative.

NOT has the highest precedence, then AND, and lastly OR.
Production rule Semantic actions

E → E1 OR E2 {E.place=newtemp();
Emit (E.place ':=' E1.place 'OR' E2.place)
}
E → E1 AND E2 {E.place=newtemp();
Emit (E.place ':=' E1.place 'AND' E2.place)
}

E → NOT E1 {E.place=newtemp();
Emit (E.place ':=' 'NOT' E1.place)
}

E → (E1) {E.place = E1.place}

E → id1 relop id2 {E.place = newtemp();

Emit ('if' id1.place relop.op id2.place 'goto' nextstate + 3);
Emit (E.place ':=' '0');
Emit ('goto' nextstate + 2);
Emit (E.place ':=' '1');
}
E → TRUE {E.place:=newtemp();
Emit(E.place':=''1')
}
E → FALSE {E.place:=newtemp();
Emit(E.place':=''0')
}

The EMIT function is used to generate the three address code and the newtemp( ) function is
used to generate the temporary variables.

In the rule for E → id1 relop id2, nextstate gives the index of the next three address statement in the output sequence.

Here is the example which generates the three address code using the above translation
scheme:
Three address code for a < b OR c < d AND e < f:
1. 100: if a<b goto 103
2. 101: t1:=0
3. 102: goto 104
4. 103: t1:=1
5. 104: if c<d goto 107
6. 105: t2:=0
7. 106: goto 108
8. 107: t2:=1
9. 108: if e<f goto 111
10. 109: t3:=0
11. 110: goto 112
12. 111: t3:= 1
13. 112: t4:= t2 AND t3
14. 113: t5:= t1 OR t4
3. Translation of Boolean Expression control flow statements

Control statements are the statements that change the flow of execution of statements.
Consider the Grammar
S → if E then S1
|if E then S1 else S2
|while E do S1
In this grammar, E is the Boolean expression depending upon which S1 or S2 will be
executed.
The following representation shows the order of execution of the instructions for if-then, if-then-else, and while-do.

 S → if E then S1

E.CODE & S.CODE are a sequence of statements which generate three address code.
E.TRUE is the label to which control flow if E is true.
E.FALSE is the label to which control flow if E is false.
The code for E generates a jump to E.TRUE if E is true and a jump to S.NEXT if E is false.
∴ E.FALSE=S.NEXT in the following table.
In the following table, a new label is allocated to E.TRUE.
When S1.CODE will be executed, and the control will be jumped to statement following S,
i.e., to S1.NEXT.
∴ S1. NEXT = S. NEXT.

Syntax Directed Translation for "If E then S1."

Production                  Semantic Rule

S → if E then S1            E.TRUE = newlabel;
                            E.FALSE = S.NEXT;
                            S1.NEXT = S.NEXT;
                            S.CODE = E.CODE || GEN(E.TRUE ':') || S1.CODE
 S → if E then S1 else S2

If E is true, control will go to E.TRUE, i.e., S1.CODE will be executed and after that
S.NEXT appears after S1.CODE.

If E.CODE will be false, then S2.CODE will be executed.

Initially, both E.TRUE and E.FALSE are taken as new labels. When S1.CODE at label E.TRUE is executed, control will jump to S.NEXT.

Therefore, after S1, control will jump to the next statement of complete statement S.
S1.NEXT=S.NEXT

Similarly, after S2.CODE, the next statement of S will be executed.


∴ S2.NEXT=S.NEXT
Syntax Directed Translation for "If E then S1 else S2."

Production Semantic Rule


S → if E then S1 else S2        E.TRUE = newlabel;
E.FALSE = newlabel;
S1.NEXT = S.NEXT;
S2.NEXT = S.NEXT;
S.CODE = E.CODE || GEN(E.TRUE ':') || S1.CODE ||
GEN('goto' S.NEXT) ||
GEN(E.FALSE ':') || S2.CODE
 S → while E do S1

Another important control statement is while E do S1: statement S1 is executed as long as
expression E is true, and control leaves the loop once E becomes false.

A label S.BEGIN is created which points to the first instruction for E. Label E.TRUE is
attached to the first instruction for S1. If E is true, control jumps to the label E.TRUE
and S1.CODE is executed. If E is false, control jumps to E.FALSE. After S1.CODE,
control jumps back to S.BEGIN, which evaluates E.CODE again.
∴ S1.NEXT = S.BEGIN
If E.CODE is false, control jumps to E.FALSE, which causes the next statement after S
to be executed.
∴ E.FALSE = S.NEXT

Syntax Directed Translation for "S → while E do S1"

Production Semantic Rule


S → while E do S1        S.BEGIN = newlabel;
E.TRUE = newlabel;
E.FALSE = S.NEXT;
S1.NEXT = S.BEGIN;
S.CODE = GEN(S.BEGIN ':') ||
E.CODE || GEN(E.TRUE ':') ||
S1.CODE || GEN('goto' S.BEGIN)

Example:
While a < b do
If c < d then
x:=y+z
Else
x := y - z
Then the translation is
L1: if a < b goto L2
goto Lnext
L2: if c < d goto L3
goto L4
L3: t1 := y + z
x := t1
goto L1
L4: t2 := y - z
x := t2
goto L1
Lnext:
UNIT V
 Basic Blocks:
A basic block is a sequence of consecutive statements in which the flow of control enters at
the beginning and leaves at the end without any halt or possibility of branching except at
the end. The instructions of a basic block therefore always execute as a sequence.
The first step is to divide a group of three-address codes into basic blocks. A new basic
block begins with the first instruction and keeps adding instructions until a jump or a
label is reached. If no jumps or labels are found, control flows from one instruction to
the next in sequential order.

The algorithm for the construction of the basic block is described below step by step:
Algorithm: The algorithm used here is partitioning the three-address code into basic
blocks.
Input: A sequence of three-address codes will be the input for the basic blocks.
Output: A list of basic blocks with each three address statements, in exactly one block, is
considered as the output.
Method: We’ll start by identifying the intermediate code’s leaders. The following are some
guidelines for identifying leaders:

1. The first instruction of the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a leader.
Each leader's basic block contains all of the instructions from that leader up to, but not
including, the next leader (or the end of the program). A code sketch of these rules is
given after the examples below.

Example of basic block:


Three Address Code for the expression a = b + c – d is:
T1 = b + c
T2 = T1 - d
a = T2
This represents a basic block in which all the statements execute in a sequence one after the
other.
Basic Block Construction:

Let us understand the construction of basic blocks with an example:


Example:
1. PROD = 0
2. I = 1
3. T2 = addr(A) – 4
4. T4 = addr(B) – 4
5. T1 = 4 x I
6. T3 = T2[T1]
7. T5 = T4[T1]
8. T6 = T3 x T5
9. PROD = PROD + T6
10. I = I + 1
11. IF I <=20 GOTO (5)
Using the algorithm given above, we can identify the number of basic blocks in the above
three-address code easily-
There are two Basic Blocks in the above three-address code:
 B1 – Statement 1 to 4
 B2 – Statement 5 to 11
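Here is a minimal C++ sketch of the leader-finding rules applied to the eleven statements above. The string encoding of instructions and the "GOTO (n)" target notation are assumptions made for illustration; a real front end would work on a structured three-address representation.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Returns the set of leader indices (1-based) for a sequence of three address instructions.
// A jump target is assumed to be written as "GOTO (n)", as in the example above.
std::set<int> findLeaders(const std::vector<std::string>& instr) {
    std::set<int> leaders;
    if (instr.empty()) return leaders;
    leaders.insert(1);                                             // rule 1: the first instruction
    for (int i = 0; i < (int)instr.size(); ++i) {
        std::size_t pos = instr[i].find("GOTO (");
        if (pos != std::string::npos) {
            leaders.insert(std::stoi(instr[i].substr(pos + 6)));   // rule 2: target of the jump
            if (i + 2 <= (int)instr.size())
                leaders.insert(i + 2);                             // rule 3: instruction after the jump
        }
    }
    return leaders;
}

int main() {
    std::vector<std::string> instr = {
        "PROD = 0", "I = 1", "T2 = addr(A) - 4", "T4 = addr(B) - 4",
        "T1 = 4 * I", "T3 = T2[T1]", "T5 = T4[T1]", "T6 = T3 * T5",
        "PROD = PROD + T6", "I = I + 1", "IF I <= 20 GOTO (5)"
    };
    for (int leader : findLeaders(instr))
        std::cout << "leader at statement " << leader << "\n";     // prints 1 and 5
    // Each basic block runs from one leader up to, but not including, the next leader.
}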
 Flow Graph:

A flow graph is simply a directed graph. For a set of basic blocks, a flow graph shows how
control passes among the blocks. It is used to illustrate the flow of control between basic
blocks once the intermediate code has been partitioned into basic blocks. There is an edge
from block X to block Y if control can flow from X to Y: either the first instruction of Y
immediately follows the last instruction of X in program order (fall-through), or the last
instruction of X is a jump to the leader of Y.
In the example used for basic block formation, there is an edge from B1 to B2 (fall-through)
and an edge from B2 to itself (the conditional jump back to statement 5).
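The edge-adding rule can be sketched in C++ as follows; the BasicBlock structure and its fields are illustrative assumptions, not a prescribed representation.

#include <iostream>
#include <vector>

struct BasicBlock {
    int first, last;        // 1-based statement numbers covered by the block
    int jumpTarget;         // statement number targeted by a jump at the end, 0 if none
    bool fallsThrough;      // true if control can continue to the following statement
};

int main() {
    // Blocks from the running example: B1 = statements 1-4, B2 = statements 5-11.
    std::vector<BasicBlock> blocks = {
        {1, 4, 0, true},    // B1 simply falls through into B2
        {5, 11, 5, true}    // B2 ends with a conditional jump back to statement 5
    };
    // Edge X -> Y if Y starts right after X ends (fall-through),
    // or if the last statement of X jumps to the leader of Y.
    for (std::size_t x = 0; x < blocks.size(); ++x)
        for (std::size_t y = 0; y < blocks.size(); ++y) {
            bool fall = blocks[x].fallsThrough && blocks[y].first == blocks[x].last + 1;
            bool jump = blocks[x].jumpTarget != 0 && blocks[x].jumpTarget == blocks[y].first;
            if (fall || jump)
                std::cout << "edge B" << x + 1 << " -> B" << y + 1 << "\n"; // B1 -> B2 and B2 -> B2
        }
}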

 Optimization of Basic Blocks

Optimization is applied to the basic blocks after the intermediate code generation phase of
the compiler. Optimization is the process of transforming a program so that it consumes
fewer resources and runs faster while producing the same results. In basic block
optimization, less efficient code sequences are replaced by equivalent, more efficient ones.
Optimization of basic blocks can be machine-dependent or machine-independent. These
transformations are useful for improving the quality of code that will ultimately be
generated from a basic block.
There are two types of basic block optimizations:
1. Function preserving transformations/ Structure preserving transformations
2. Algebraic transformations
Structure-Preserving Transformations:

The structure-preserving transformation on basic blocks includes:


1. Dead Code Elimination
2. Common Sub expression Elimination
3. Renaming of Temporary variables
4. Interchange of two independent adjacent statements

1.Dead Code Elimination:

Dead code is that part of the code which can never execute during the program's execution.
Such code is eliminated during optimization: it occupies space and translation time without
contributing to the result, so removing it reduces code size and lets the compiler skip
translating it.
Example:
// Program with dead code
#include <iostream>
using namespace std;
int main()
{
    int x = 2;
    if (x > 2)
        cout << "code";          // Dead code: the condition is never true
    else
        cout << "Optimization";
    return 0;
}
// Optimized program without dead code
#include <iostream>
using namespace std;
int main()
{
    int x = 2;
    cout << "Optimization";      // Dead code eliminated
    return 0;
}

2.Common Sub expression Elimination:

In this technique, sub-expressions that appear more than once are computed only once and
the saved value is reused wherever it is needed.
Example:

a = (x+y)+z; b = x+y;

A temporary saves the value of x+y for later use, so the code becomes:

t = x+y; a = t+z; b = t;

3.Renaming of Temporary Variables:

Statements containing instances of a temporary variable can be changed to instances of a
new temporary variable without changing the value of the basic block.
Example: The statement t = a + b can be changed to x = a + b, where t is a temporary variable
and x is a new temporary variable, without changing the value of the basic block.

4.Interchange of Two Independent Adjacent Statements:

If a block has two adjacent statements that are independent, they can be interchanged without
affecting the value of the basic block.
Example:
t1 = a + b
t2 = c + d
These two independent statements of a block can be interchanged without affecting the
value of the block.
Algebraic Transformation:

Countless algebraic transformations can be used to change the set of expressions
computed by a basic block into an algebraically equivalent set. Some of the algebraic
transformations on basic blocks include:
1. Constant Folding
2. Copy Propagation
3. Strength Reduction
1. Constant Folding:
Evaluate constant sub-expressions at compile time so that the compiler does not have to
generate code to compute them at run time.
Example:
x = 2 * 3 + y ⇒ x = 6 + y (Optimized code)

2. Copy Propagation:
It is of two types, Variable Propagation, and Constant Propagation.
Variable Propagation:
x = y; z = x + 2  ⇒  z = y + 2 (optimized code)
Constant Propagation:
x = 3; z = x + a  ⇒  z = 3 + a (optimized code)
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
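A minimal C++ sketch of constant folding and strength reduction applied to a single three address instruction follows. The Quad structure and the simplify function are illustrative assumptions; only the two identities shown above (2 * 3 ⇒ 6 and 2 * y ⇒ y + y) are handled.

#include <cctype>
#include <iostream>
#include <string>

// A toy three address instruction: result := left op right
struct Quad {
    std::string result, left, right;
    char op;
};

bool isConst(const std::string& s) {
    return !s.empty() && std::isdigit(static_cast<unsigned char>(s[0]));
}

// Apply the two algebraic transformations shown above to one instruction (sketch only).
Quad simplify(Quad q) {
    // Constant folding: both operands are constants, e.g. x := 2 * 3  =>  x := 6
    if (isConst(q.left) && isConst(q.right)) {
        int a = std::stoi(q.left), b = std::stoi(q.right);
        int v = (q.op == '+') ? a + b : (q.op == '*') ? a * b : a - b;
        return {q.result, std::to_string(v), "", '='};
    }
    // Strength reduction: x := 2 * y  =>  x := y + y
    if (q.op == '*' && q.left == "2")  return {q.result, q.right, q.right, '+'};
    if (q.op == '*' && q.right == "2") return {q.result, q.left, q.left, '+'};
    return q;
}

void print(const Quad& q) {
    if (q.op == '=') std::cout << q.result << " := " << q.left << "\n";
    else             std::cout << q.result << " := " << q.left << " " << q.op << " " << q.right << "\n";
}

int main() {
    print(simplify({"x", "2", "3", '*'}));   // x := 6       (constant folding)
    print(simplify({"x", "2", "y", '*'}));   // x := y + y   (strength reduction)
}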

 Principal Sources of Optimization:

A transformation of a program is called local if it can be performed by looking only at the
statements in a basic block; otherwise, it is called global.
Many transformations can be performed at both the local and global levels. Local
transformations are usually performed first.
1.Function-Preserving Transformations

• There are a number of ways in which a compiler can improve a program without
changing the function it computes.
• Function preserving transformations examples:
– Common sub expression elimination
– Copy propagation,
– Dead-code elimination
– Constant folding

Common Sub expressions elimination:

• An occurrence of an expression E is called a common sub-expression if E was
previously computed, and the values of the variables in E have not changed since the
previous computation. We can avoid re-computing the expression if we can use the
previously computed value.
• For example

t1: = 4*i

t2: = a [t1]

t3: = 4*j

t4: = 4*i

t5: = n

t6: = b [t4] +t5

• The above code can be optimized using the common sub-expression elimination as

t1: = 4*i

t2: = a [t1]

t3: = 4*j

t5: = n

t6: = b [t1] +t5

• The common sub-expression t4:=4*i is eliminated because its value is already computed
in t1, and the value of i has not changed between that definition and this use.

Copy Propagation:

• Copy propagation means use of one variable instead of another. This may not appear
to be an improvement, but as we shall see it gives us an opportunity to eliminate x.
• For example:

x=Pi;
A=x*r*r;

• The optimization using copy propagation can be done as follows: A=Pi*r*r;


• Here the variable x is eliminated

Dead-Code Eliminations:

• A variable is live at a point in a program if its value can be used subsequently;
otherwise, it is dead at that point.
• A related idea is dead or useless code, statements that compute values that never get
used.
• While the programmer is unlikely to introduce any dead code intentionally, it may
appear as the result of previous transformations.
• Example:

i=0;

if(i==1)

a=b+5;

• Here, the 'if' statement is dead code because its condition can never be satisfied.

Constant folding:

• Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
• For example,

a=6/2 can be replaced by

a=3, thereby eliminating a division operation.

2.Loop Optimizations:

• In loops, especially in the inner loops, programs tend to spend the bulk of their time.
• The running time of a program may be improved if the number of instructions in an
inner loop is decreased, even if we increase the amount of code outside that loop.

Three techniques are important for loop optimization:

1. Code motion, which moves loop-invariant code outside a loop;

2. Induction-variable elimination, which we apply to eliminate or reduce the number of
induction variables used in an inner loop;

3. Reduction in strength, which replaces an expensive operation by a cheaper one,
such as a multiplication by an addition.

Code Motion:

• An important modification that decreases the amount of code in a loop is code
motion.
• This transformation takes an expression that yields the same result independent of the
number of times a loop is executed (a loop-invariant computation) and places the
expression before the loop.
• Note that the notion “before the loop” assumes the existence of an entry for the loop.
• For example, evaluation of limit-2 is a loop-invariant computation in the following
while-statement:

while (i <= limit-2) /* statement does not change limit*/

Code motion will result in the equivalent of

t= limit-2;

while (i<=t) /* statement does not change limit or t */

Induction Variables :

• Loops are usually processed inside out.


• For example, consider the loop around B3. Note that the values of j and t4 remain in
lock-step; every time the value of j decreases by 1, that of t4 decreases by 4, because
4*j is assigned to t4. Such identifiers are called induction variables.
• For the inner loop around B3 in Fig.5.3 we cannot get rid of either j or t4 completely;
t4 is used in B3 and j in B4.
• In B3, just after the statement j:=j-1, the relationship t4 = 4*j + 4 must hold, since
t4 still contains the value computed from the previous value of j.
• We may therefore replace the assignment t4:=4*j by t4:=t4-4. The only problem is
that t4 does not have a value when we enter block B3 for the first time, so t4 must be
initialised so that the relationship t4 = 4*j holds on entry to block B3.
• The replacement of a multiplication by a subtraction will speed up the object code if
multiplication takes more time than addition or subtraction, as is the case on many
machines.
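The effect of this transformation can be illustrated with a small C++ sketch: the multiplication 4 * j inside the loop is replaced by a running subtraction. The variable names j and t4 follow the discussion above, but the loop itself is an invented example, not the flow graph of Fig. 5.3.

#include <iostream>

int main() {
    const int N = 10;
    int a[N + 1] = {0, 5, 1, 9, 3, 7, 2, 8, 4, 6, 0};

    // Before: t4 is recomputed with a multiplication on every iteration.
    for (int j = N; j >= 1; --j) {
        int t4 = 4 * j;                       // t4 stays in lock-step with j
        std::cout << "offset " << t4 << " -> a[" << j << "] = " << a[j] << "\n";
    }

    // After strength reduction: t4 is initialised once so that t4 = 4*j holds on entry,
    // and the multiplication inside the loop is replaced by t4 := t4 - 4.
    int t4 = 4 * N;
    for (int j = N; j >= 1; --j) {
        std::cout << "offset " << t4 << " -> a[" << j << "] = " << a[j] << "\n";
        t4 = t4 - 4;
    }
}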

Reduction In Strength:

• Reduction in strength replaces expensive operations by equivalent cheaper ones on
the target machine.
• Certain machine instructions are considerably cheaper than others and can often be
used as special cases of more expensive operators.
• For example, x² is invariably cheaper to implement as x*x than as a call to an
exponentiation routine.

 DAG representation for basic blocks

A DAG for basic block is a directed acyclic graph with the following labels on nodes:

1. The leaves of the graph are labelled by unique identifiers, which can be variable
names or constants.
2. The interior nodes of the graph are labelled by operator symbols.
3. Nodes may also carry a list of attached identifiers that hold the computed value.

o DAGs are a type of data structure. They are used to implement transformations on
basic blocks.
o A DAG provides a good way to determine common sub-expressions.
o It gives a picture of how the value computed by a statement is used in subsequent
statements.
Algorithm for construction of DAG

Input:It contains a basic block

Output: It contains the following information:

o Each node contains a label. For leaves, the label is an identifier.
o Each node contains a list of attached identifiers to hold the computed values.

Case (i) x:= y OP z


Case (ii) x:= OP y
Case (iii) x:= y

Method:

Step 1:

If node(y) is undefined, create a leaf node(y). For case (i), if node(z) is undefined, create
a leaf node(z) as well.

Step 2:

For case (i), check whether there is already a node labelled OP whose left child is node(y)
and right child is node(z); if not, create such a node. Let n be this node.

For case (ii), check whether there is a node labelled OP with the single child node(y); if
not, create it, and let n be this node.

For case (iii), let n be node(y).

Output:

Delete x from the list of attached identifiers of node(x), if it appears there. Append x to
the list of attached identifiers of the node n found in Step 2, and set node(x) to n.
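A minimal C++ sketch of Steps 1 and 2 for case (i) is shown below; reusing an existing node(OP) with the same children is what detects common sub-expressions. The node table, the key encoding, and the function names are illustrative assumptions, and the deletion of x from its previous node's identifier list is omitted for brevity.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string label;               // operator, identifier, or constant
    int left;                        // index of left child, -1 for a leaf
    int right;                       // index of right child, -1 for a leaf
    std::vector<std::string> ids;    // attached identifiers holding this value
};

std::vector<Node> dag;
std::map<std::string, int> nodeOf;   // node(x): the current node for identifier x
std::map<std::string, int> opNode;   // existing interior node for a given (op, left, right)

// Step 1: create node(y) / node(z) if it is undefined.
int leaf(const std::string& x) {
    if (nodeOf.count(x)) return nodeOf[x];
    dag.push_back({x, -1, -1, {x}});
    return nodeOf[x] = (int)dag.size() - 1;
}

// Case (i): x := y OP z
void assign(const std::string& x, const std::string& y, const std::string& op, const std::string& z) {
    int ny = leaf(y), nz = leaf(z);
    std::string key = op + "#" + std::to_string(ny) + "#" + std::to_string(nz);
    int n;
    if (opNode.count(key)) {
        n = opNode[key];             // Step 2: reuse node(OP) with children node(y) and node(z)
    } else {
        dag.push_back({op, ny, nz, {}});
        n = opNode[key] = (int)dag.size() - 1;
    }
    dag[n].ids.push_back(x);         // Output: append x to the identifiers attached to n
    nodeOf[x] = n;                   // node(x) := n (removal of x from its old node is omitted here)
}

int main() {
    assign("S1", "4", "*", "i");
    assign("S3", "4", "*", "i");     // reuses the node built for S1: a common sub-expression
    std::cout << "S1 and S3 share node " << nodeOf["S1"] << " and " << nodeOf["S3"] << "\n";
}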

Example:

Consider the following three address statement:

1. S1:= 4 * i
2. S2:= a[S1]
3. S3:= 4 * i
4. S4:= b[S3]
5. S5:= S2 * S4
6. S6:= prod + S5
7. prod:= S6
8. S7:= i + 1
9. i:= S7
10. if i <= 20 goto (1)

Stages in DAG Construction:


Application of DAGs:

1. We can automatically detect common sub-expressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.

 Simple Code Generator

Code generator is used to produce the target code for three-address statements. It uses
registers to store the operands of the three address statement.

Example:

Consider the three address statement x:= y + z. It can be translated into the following sequence of target code:

MOV y, R0
ADD z, R0
MOV R0, x
Register and Address Descriptors:
o A register descriptor contains the track of what is currently in each register. The
register descriptors show that all the registers are initially empty.
o An address descriptor is used to store the location where current value of the name
can be found at run time.

A code-generation algorithm:

The algorithm takes a sequence of three-address statements as input. For each three address
statement of the form x:= y op z it performs the following actions:

1. Invoke a function getreg to find out the location L where the result of the computation
y op z should be stored.
2. Consult the address descriptor for y to determine y', the current location of y. If the
value of y is currently in both memory and a register, prefer the register as y'. If the
value of y is not already in L, generate the instruction MOV y', L to place a copy of y in L.
3. Generate the instruction OP z', L where z' is the current location of z. If z is in both
a register and memory, prefer the register. Update the address descriptor of x to indicate
that x is now in location L. If x was in L, update its descriptor and remove x from all
other descriptors.
4. If the current values of y or z have no next uses, are not live on exit from the block,
and are in registers, alter the register descriptors to indicate that, after execution of
x := y op z, those registers no longer contain y or z.
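A minimal C++ sketch of the descriptor bookkeeping for one statement x := y op z is given below. The getreg shown here trivially returns R0; a real getreg consults the descriptors and next-use information, so the whole sketch is an illustrative assumption rather than the full algorithm.

#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> regDesc;    // register descriptor: register -> name it holds
std::map<std::string, std::string> addrDesc;   // address descriptor: name -> current location

// A deliberately trivial getreg: always choose R0.
// (A real getreg consults the descriptors and next-use information.)
std::string getreg() { return "R0"; }

void genAssign(const std::string& x, const std::string& y,
               const std::string& op, const std::string& z) {
    std::string L = getreg();                            // 1. location for the result of y op z
    if (addrDesc[y] != L)                                // 2. bring y into L if it is not already there
        std::cout << "MOV " << y << ", " << L << "\n";
    std::cout << op << " " << z << ", " << L << "\n";    // 3. OP z', L
    regDesc[L] = x;                                      //    L now holds x ...
    addrDesc[x] = L;                                     //    ... and x currently lives in L
}

int main() {
    genAssign("t", "a", "SUB", "b");   // t := a - b  ->  MOV a, R0  then  SUB b, R0
}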

Generating Code for Assignment Statements:

The assignment statement d:= (a-b) + (a-c) + (a-c) can be translated into the following
sequence of three address code:

t:= a-b
u:= a-c
v:= t +u
d:= v+u

Code sequence for the example is as follows:

Statement     Code Generated     Register descriptor              Address descriptor
                                 Registers empty
t:= a - b     MOV a,R0           R0 contains t                    t in R0
              SUB b,R0
u:= a - c     MOV a,R1           R0 contains t, R1 contains u     t in R0, u in R1
              SUB c,R1
v:= t + u     ADD R1,R0          R0 contains v, R1 contains u     u in R1, v in R0
d:= v + u     ADD R1,R0          R0 contains d                    d in R0 and memory
              MOV R0,d

 REGISTER ALLOCATION AND ASSIGNMENT

What is Register Allocation?


Register allocation deals with various strategies for deciding what values in a program should
reside in registers

Why is Register Allocation Important?


Generally, the instructions involving only register operands are faster than those involving
memory operands. So, efficient utilization of available registers is important in generating
good code, and register allocation plays a vital role in optimization.

When is Register Allocation Done?


 Register allocation can be done in the intermediate language prior to machine code
generation, or it can be done on the machine language.
 In the latter case, the machine code initially uses symbolic names for registers, and
register allocation turns these symbolic names into register numbers.
 Register allocation in the intermediate language has the advantage that the same
register allocator can be used for several target machines.

What are the approaches for Register Allocation?


There are two approaches, which are
 Local allocators: These are based on usage counts
 Global or intraprocedural allocators: These are based on the concept of graph
coloring

Local register allocation


Register allocation is only within a basic block. It follows top-down approach.
Assign registers to the most heavily used variables
 Traverse the block

 Count uses
 Use count as a priority function

 Assign registers to higher priority variables first

• Usage Count

The usage count of a variable x in a loop is the sum, over the blocks B of the loop, of
use(x, B) + 2 * live(x, B), where use(x, B) is the number of uses of x in B before any
definition of x, and live(x, B) is 1 if x is assigned a value in B and is live on exit from
B, and 0 otherwise. For example (with reference to a flow graph not shown here):

use(a, B2) + 2 * live(a, B2) = 1 + 2 * 0 = 1

use(b, B3) + 2 * live(b, B3) = 1 + 2 * 1 = 3

If the usage counts are a = 4, b = 6, c = 4, d = 5, e = 4, f = 4 and four registers are
available, the most heavily used variables are kept in registers:

R0: b    R1: d    R2, R3: chosen from a/c/f ……

Advantage
Heavily used values reside in registers

Disadvantage
Does not consider non-uniform distribution of uses
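A minimal C++ sketch of the usage-count computation and the resulting register assignment follows. The per-block use and live values are made-up numbers for illustration; the priority formula is the one stated above.

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct BlockInfo { int use; int liveOnExit; };   // per-variable information for one block

int main() {
    // usageCount(x) = sum over blocks B of use(x, B) + 2 * live(x, B)
    std::vector<std::pair<std::string, std::vector<BlockInfo>>> vars = {
        {"a", {{1, 0}, {1, 1}, {1, 0}}},         // the numbers here are purely illustrative
        {"b", {{2, 1}, {1, 1}, {0, 0}}},
        {"d", {{1, 1}, {1, 1}, {1, 0}}},
    };
    std::vector<std::pair<int, std::string>> priority;
    for (const auto& v : vars) {
        int count = 0;
        for (const auto& b : v.second) count += b.use + 2 * b.liveOnExit;
        priority.push_back({count, v.first});
    }
    std::sort(priority.rbegin(), priority.rend());         // highest usage count first
    std::vector<std::string> regs = {"R0", "R1"};           // only two registers available here
    for (std::size_t i = 0; i < regs.size() && i < priority.size(); ++i)
        std::cout << regs[i] << " <- " << priority[i].second
                  << " (usage count " << priority[i].first << ")\n";
}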

Global register allocation

The method is also very simple and involves two passes:

 In the first pass, target machine instructions or intermediate code instructions are selected
as though there were an infinite number of symbolic registers.
 In the second pass, for each procedure, a register interference graph is constructed in which
the nodes are symbolic registers and an edge connects two nodes if one is live at a point where
the other is defined, i.e., if both are live at the same time.

Local allocation does not take into account that some instructions (e.g. those in loops)
execute more frequently. It forces us to store/load at basic block endpoints since each block
has no knowledge of the context of others.
To find out the live range(s) of each variable and the area(s) where the variable is
used/defined global allocation is needed. Cost of spilling will depend on frequencies and
locations of uses.

Register allocation depends on:


 Size of live range
 Number of uses/definitions
 Frequency of execution
 Number of loads/stores needed.
 Cost of loads/stores needed.

Register allocation by graph colouring


Once the interference graph has been built, an attempt is made to color the graph using k
colors, where k is the number of available registers. A graph is said to be colored if each
node has been assigned a color in such a way that no two adjacent nodes have the same color.
Once the coloring is done, register allocation can be made: each color corresponds to a register.

Global register allocation can be seen as a graph colouring problem.


Basic idea:
1. Identify the live range of each variable

2. Build an interference graph that represents conflicts between live ranges (two
nodes are connected if the variables they represent are live at the same
moment)
3. Try to assign as many colours to the nodes of the graph as there are registers
so that two neighbours have different colours
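A minimal C++ sketch of greedy colouring of an interference graph with k colours is shown below. A production allocator uses Chaitin-style simplification and spill-cost heuristics, which this sketch deliberately omits; the small graph is an invented example.

#include <iostream>
#include <vector>

int main() {
    const int k = 2;                           // number of available registers (colours)
    // Interference graph: nodes are live ranges; i and j are adjacent if they are live at the same time.
    std::vector<std::vector<int>> adj = {
        {1, 2},     // live range 0 interferes with 1 and 2
        {0, 2},     // 1 interferes with 0 and 2
        {0, 1},     // 2 interferes with 0 and 1: this triangle needs three colours
        {}          // 3 interferes with nothing
    };
    std::vector<int> colour(adj.size(), -1);
    for (std::size_t v = 0; v < adj.size(); ++v) {
        std::vector<bool> used(k, false);
        for (int u : adj[v])
            if (colour[u] != -1) used[colour[u]] = true;   // colours taken by neighbours
        for (int c = 0; c < k; ++c)
            if (!used[c]) { colour[v] = c; break; }
        if (colour[v] == -1)
            std::cout << "live range " << v << " cannot be coloured: spill to memory\n";
        else
            std::cout << "live range " << v << " -> R" << colour[v] << "\n";
    }
}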

 PEEPHOLE OPTIMIZATION

A statement-by-statement code-generation strategy often produces target code that


contains redundant instructions and suboptimal constructs. The quality of such target code
can be improved by applying “optimizing” transformations to the target program.
A simple but effective technique for improving the target code is peephole
optimization, a method for trying to improve the performance of the target program by
examining a short sequence of target instructions (called the peephole) and replacing these
instructions by a shorter or faster sequence, whenever possible.
The peephole is a small, moving window on the target program. The code in the
peephole need not be contiguous, although some implementations do require this. It is
characteristic of peephole optimization that each improvement may spawn opportunities for
additional improvements.
Objectives of Peephole Optimization:
The objective of peephole optimization is as follows:
1. To improve performance
2. To reduce memory footprint
3. To reduce code size
Characteristics of peephole optimizations:
 Redundant-instructions elimination
 Flow-of-control optimizations
 Algebraic simplifications
 Use of machine idioms
 Eliminating Unreachable code

Redundant Loads And Stores:

If we see the instructions sequence


(1) MOV R0,a
(2) MOV a,R0

we can delete instruction (2), because whenever (2) is executed, (1) will have ensured that
the value of a is already in register R0. If (2) had a label, we could not be sure that (1)
was always executed immediately before (2), and so we could not remove (2).
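A minimal C++ sketch of this peephole rule follows: it scans adjacent instruction pairs and drops a load that immediately follows the matching store, provided the load carries no label. The instruction encoding and the restriction to register R0 are illustrative assumptions.

#include <iostream>
#include <string>
#include <vector>

struct Instr {
    std::string text;
    bool hasLabel;       // a labelled instruction may be reached from elsewhere, so keep it
};

int main() {
    std::vector<Instr> in = {
        {"MOV R0, a", false},    // store R0 into a
        {"MOV a, R0", false},    // reload a into R0 -- redundant
        {"ADD b, R0", false}
    };
    std::vector<Instr> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        // Peephole of size 2: "MOV R0, x" followed by an unlabelled "MOV x, R0"
        // makes the second instruction redundant.
        if (!out.empty() && !in[i].hasLabel) {
            const std::string& prev = out.back().text;
            if (prev.rfind("MOV R0, ", 0) == 0 &&
                in[i].text == "MOV " + prev.substr(8) + ", R0")
                continue;                            // drop the redundant load
        }
        out.push_back(in[i]);
    }
    for (const Instr& ins : out) std::cout << ins.text << "\n";   // MOV R0, a  then  ADD b, R0
}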

Unreachable Code:

Another opportunity for peephole optimization is the removal of unreachable
instructions. An unlabelled instruction immediately following an unconditional jump may be
removed. This operation can be repeated to eliminate a sequence of instructions. For
example, for debugging purposes, a large program may have within it certain segments that
are executed only if a variable debug is 1. In C, the source code might look like:

#define debug 0
….

If ( debug ) {
Print debugging information

}
In the intermediate representation the if-statement may be translated as:

if debug = 1 goto L1
goto L2
L1: print debugging information
L2: …………………………                  (a)

One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter
what the value of debug, (a) can be replaced by:

if debug ≠ 1 goto L2
print debugging information
L2: ……………………………                 (b)

Since debug is set to 0, constant propagation replaces debug by its value, and (b) becomes:

if 0 ≠ 1 goto L2
print debugging information
L2: ……………………………                 (c)
As the condition of the first statement of (c) evaluates to the constant true, it can be
replaced by goto L2. Then all the statements that print debugging aids are manifestly
unreachable and can be eliminated one at a time.

Flows-Of-Control Optimizations:
The unnecessary jumps can be eliminated in either the intermediate code or the target
code by the following types of peephole optimizations. We can replace the jump sequence

goto L1
….
L1: goto L2                      (d)

by the sequence

goto L2
….
L1: goto L2

If there are now no jumps to L1, then it may be possible to eliminate the statement
L1:goto L2 provided it is preceded by an unconditional jump .Similarly, the sequence

if a < b goto L1
….

L1: goto L2 (e)

can be replaced by
If a < b goto L2

….

L1: goto L2

Finally, suppose there is only one jump to L1 and L1 is preceded by an
unconditional goto. Then the sequence

goto L1

L1: if a < b goto L2
L3:                              (f)

may be replaced by

if a < b goto L2
goto L3

…….

L3:
While the number of instructions in (f) and its replacement is the same, we sometimes skip
the unconditional jump in the replacement, but never in (f). Thus the replacement is superior
to (f) in execution time.

Algebraic Simplification:

There is no end to the amount of algebraic simplification that can be attempted
through peephole optimization. Only a few algebraic identities occur frequently enough that
it is worth considering implementing them. For example, statements such as
x := x+0 or
x := x * 1

are often produced by straightforward intermediate code-generation algorithms, and they can
be eliminated easily through peephole optimization.

Reduction in Strength:

Reduction in strength replaces expensive operations by equivalent cheaper ones on
the target machine. Certain machine instructions are considerably cheaper than others and can
often be used as special cases of more expensive operators.

For example, x² is invariably cheaper to implement as x*x than as a call to an
exponentiation routine. Fixed-point multiplication or division by a power of two is cheaper to
implement as a shift. Floating-point division by a constant can be implemented as
multiplication by a constant, which may be cheaper.

X2 → X*X

Use of Machine Idioms:

The target machine may have hardware instructions to implement certain specific
operations efficiently. For example, some machines have auto-increment and auto-decrement
addressing modes. These add or subtract one from an operand before or after using its value.
The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like i :
=i+1.

i:=i+1 → i++
i:=i-1 → i--
