CD Digital Notes
Course Objectives:
To introduce the Finite Automata, NFA and DFA.
To gain insight into the Context Free Language.
To study the Phases of a Compiler and Lexical Analysis and Syntax Analysis.
To acquaint students with Intermediate Code Generation.
To acquaint students with Code Optimization and Code Generation.
Course Outcomes:
UNIT –I
Finite Automata and Regular Expressions: Finite Automata- Examples and Definitions - Accepting the
Union, Intersection, Difference of Two Languages. Regular Expressions: Regular Languages and Regular
Expressions– Conversion from Regular Expression to NFA and Deterministic Finite Automata. Context free
grammar: Derivations trees and ambiguity – Simplified forms and Normal forms.
UNIT –II
Introduction to Compilers: Definition of compiler – Interpreter – Bootstrapping –
Phases of a compiler.
Lexical Analysis: Roles of Lexical analyzer –Input buffering – specification of tokens – Recognition of
Tokens – A language for specifying lexical analyzers – design of a Lexical analyzer.
UNIT –III
Parsing: Role of parser - Top Down Parser: Backtracking, Recursive Descent Parsing and Predictive Parsers.
Bottom up Parser: Shift Reduce Parsing – LR parsers : SLR Parser, CLR parser and LALR Parsers.
UNIT –IV
Syntax Directed Translation and Intermediate code Generator: Syntax Directed Definitions –
construction of Syntax tree. Intermediate code Generation : Abstract syntax tree – three address code – types
of three address statements – syntax directed translations into three address code. Boolean expression and
flow of control statements.
UNIT –V
Code optimization and Code generation: Basic blocks and flow graphs – optimization of basic blocks –
principal sources of optimization – loop optimization – DAG representation of basic blocks. Simple code
generator – register allocation and assignments – peephole optimization.
REFERENCE BOOKS:
1. John E. Hopcroft and Jeffrey D. Ullman, Introduction to Automata Theory, Languages and Computation, 3rd
Edition, Pearson Education.
2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, "Compilers: Principles, Techniques and Tools", Addison-
Wesley, 2nd Edition, 2007.
3. Mishra K.L.P, "Theory of Computer Science: Automata, Languages and Computation", PHI Press, 2006.
4. “Theory of Computation & Applications – Automata Theory Formal Languages”, by Anil Malviya, Malabika
Datta, BPB Publications, 2015.
UNIT I
Finite Automata:
A finite automaton is a mathematical model of a digital computer. Finite automata are used as string
or language acceptors. They are mainly used in pattern-matching tools like LEX and text editors.
A finite state system represents a mathematical model of a system with certain input.
The model finally gives a certain output. The input given to the machine is processed by various
states. These states are called intermediate states.
A good example of a finite state system is the control mechanism of an elevator. This mechanism only
remembers the current floor number pressed; it does not remember all the previously pressed numbers.
Finite state systems are useful in the design of text editors, lexical analyzers and natural
language processing. The word "automaton" is singular and "automata" is plural.
An automaton in which the output depends only on the input is called an automaton without memory.
An automaton in which the output depends on the input and the state is called an automaton with memory.
Finite Automaton Model:
Informally, a Finite Automaton (FA) is a simple machine that reads an input string -- one symbol at a
time -- and then, after the input has been completely read, decides whether to accept or reject the input.
As the symbols are read from the tape, the automaton can change its state to reflect how it reacts to
what it has seen so far.
The Finite Automata can be represented as,
i) Input Tape: Input tape is a linear tape having some cells which can hold an input symbol from ∑.
ii) Finite Control: It indicates the current state and decides the next state on receiving a particular input from
the input tape. The tape reader reads the cells one by one from left to right, and at any instant only one input
symbol is read. The reading head examines the symbol read, and the head moves to the right with or without
changing the state. When the entire string has been read, the string is accepted if the finite control is in a final state,
otherwise rejected. The finite automaton can be represented by a transition diagram in which the vertices
represent the states and the edges represent transitions.
A Finite Automaton (FA) consists of a finite set of states and a set of transitions among states in response to
inputs.
• Always associated with an FA is a transition diagram, which is nothing but a 'directed graph'.
• The vertices of the graph correspond to the states of the FA.
• The FA accepts a string x of symbols from Σ if the sequence of transitions corresponding to the
symbols in x leads from the start state to an accepting state.
Finite automata can be classified into two types:
1. FA without output or Language Recognizers ( e.g. DFA and NFA)
2. FA with output or Transducers ( e.g. Moore and Mealy machines)
Finite Automaton (FA), a collection of states in which we make transitions based upon input symbols.
For any element q of Q and any symbol σ∈Σ, we interpret δ(q,σ) as the state to which the FA moves, if it is in
state q and receives the input σ.
The input is itself a pair, because δ was defined as a function of the form Q×Σ→Q; so the input
has the form Q×Σ, the set of all pairs in which the first element is taken from the set Q and the
second element from the set Σ.
o Of course, there may be easier ways to visualize δ. In particular, we could do it via a table with
the input state on one axis and the input character on another:
Starting State
  Input    q0    q1    q2
    0      q1    q2    q2
    1      q1    q2    q2
o The table representation is particularly useful because it suggests an efficient implementation. If
we numbered our states instead of using arbitrary labels:
Starting State
  Input    0     1     2
    0      1     2     2
    1      1     2     2
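Such a transition table maps directly onto an array in code. Below is a minimal sketch in C of a
table-driven DFA (an illustration, not part of the original notes); the table follows the numbered
table above, and the choice of state 2 as the accepting state is an assumption for the example.

#include <stdio.h>

/* Hypothetical 3-state DFA over {0,1}; delta[state][symbol] is the next state. */
static const int delta[3][2] = {
    /* on '0'  on '1' */
    {  1,  1 },   /* state 0 (q0) */
    {  2,  2 },   /* state 1 (q1) */
    {  2,  2 }    /* state 2 (q2) */
};

/* Run the DFA over the input; accept if we finish in the assumed final state 2. */
int dfa_accepts(const char *input) {
    int state = 0;                        /* start state q0 */
    for ( ; *input; input++) {
        int sym = *input - '0';           /* map '0'/'1' to a column index */
        if (sym < 0 || sym > 1) return 0; /* reject symbols outside the alphabet */
        state = delta[state][sym];
    }
    return state == 2;
}

int main(void) {
    printf("%d\n", dfa_accepts("01"));    /* prints 1: q0 -0-> q1 -1-> q2 */
    return 0;
}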
The first diagram (omitted) accepts all strings starting with a and ending with b; the second
accepts all strings starting with b and ending with a.
Intersection of two languages, L = L1 ∩ L2, can be explained by an example. Let L1 be the set of
strings over {0, 1} that end with 01, and let L2 be the set of strings with an even number of 1's.
State transition diagram for L1 ∩ L2 (omitted):
L = L1 ∩ L2
= {1001, 0101, 01001, 10001, ....}
Thus L1 and L2 have been combined through intersection, and the final FA accepts exactly the
strings that have an even number of 1's and end with 01.
Regular Language:
The set of regular languages over an alphabet Σ is defined recursively as below. Any language belonging to
this set is a regular language over Σ.
Definition of the set of Regular Languages:
Basis Clause: Ø, {ε} and {a} for any symbol a ∈ Σ are regular languages.
Inductive Clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs and Lr* are regular languages.
Nothing is a regular language unless it is obtained from the above two clauses.
For example, let Σ = {a, b}. Then since {a} and {b} are regular languages, {a, b} ( = {a} ∪ {b} ) and {ab} (
= {a}{b} ) are regular languages. Also, since {a} is regular, {a}* is a regular language, which is the set of
strings consisting of a's such as ε, a, aa, aaa, aaaa etc. Note also that Σ*, which is the set of strings consisting
of a's and b's, is a regular language because {a, b} is regular.
Regular Expression:
Regular expressions are used to denote regular languages. They can represent regular languages and
operations on them succinctly.
The set of regular expressions over an alphabet is defined recursively as below. Any element of that set is
a regular expression.
Basis Clause: Ø, ε and a are regular expressions corresponding to the languages Ø, {ε} and {a}, respectively,
where a is an element of Σ.
Inductive Clause: If r and s are regular expressions corresponding to the languages Lr and Ls, then ( r + s ),
( rs ) and ( r* ) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs and Lr*, respectively.
Nothing is a regular expression unless it is obtained from the above two clauses.
In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also in DFA null (or ε) move is not allowed, i.e., DFA
cannot change state without any input character.
For example, below DFA with Σ = {0, 1} accepts all strings ending with 0.
An NFA, unlike a DFA, may have several transitions from a state on the same input symbol, and may
also have ε-moves. Due to these additional features, the NFA has a different transition function; the rest is the same as a DFA.
δ: Transition Function
δ: Q × (Σ ∪ {ε}) → 2^Q
As you can see, the transition function is defined for any input including null (ε), and an NFA can go
from one state to any number of states. For example, below is an NFA for the above problem.
NFA
One important thing to note is, in NFA, if any path for an input string leads to a final state, then the input
string is accepted. For example, in the above NFA, there are multiple paths for the input string “00”. Since
one of the paths leads to a final state, “00” is accepted by the above NFA.
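Simulating an NFA in code means tracking the set of states the automaton could currently be in.
The following C sketch (an illustration, not from the notes) does this with a bitmask; the
transitions encode a small hypothetical NFA that accepts strings ending with 01, and the input is
assumed to contain only 0's and 1's.

#include <stdio.h>

#define NSTATES 3   /* hypothetical NFA: states 0,1,2; state 2 accepting */

/* trans[state][symbol] is a bitmask of the possible next states. */
static const unsigned trans[NSTATES][2] = {
    /* on '0'            on '1'   */
    { (1u<<0)|(1u<<1),   (1u<<0) },  /* state 0: on 0, stay or guess the suffix */
    { 0u,                (1u<<2) },  /* state 1: on 1, reach the final state    */
    { 0u,                0u      }   /* state 2: accepting, no moves            */
};

int nfa_accepts(const char *input) {
    unsigned current = 1u << 0;               /* start in state 0 */
    for ( ; *input; input++) {
        unsigned next = 0;
        int sym = *input - '0';
        for (int s = 0; s < NSTATES; s++)     /* follow every live state */
            if (current & (1u << s))
                next |= trans[s][sym];
        current = next;
    }
    return (current & (1u << 2)) != 0;        /* accept if ANY path ends in state 2 */
}

int main(void) {
    printf("%d %d\n", nfa_accepts("1001"), nfa_accepts("10"));  /* prints 1 0 */
    return 0;
}

This mirrors the acceptance rule stated above: the string is accepted if at least one of the
simultaneously followed paths reaches a final state.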
A regular expression is a representation of tokens. But to recognize a token, we need a token recognizer,
which is nothing but a finite automaton (an NFA). So we convert the regular expression into an NFA.
For the regular expression a, the NFA is a single a-labelled transition from the start state to the
accepting state. For a + b (also written a | b), the NFA branches from the start state by ε-moves to
the sub-NFAs for a and b, whose accepting states are joined by ε-moves to a common accepting state.
[Diagram omitted: Thompson NFA for (a|b)*abb with states 0-10. Transitions: 0 -ε-> 1, 7; 1 -ε-> 2, 4;
2 -a-> 3; 4 -b-> 5; 3 -ε-> 6; 5 -ε-> 6; 6 -ε-> 1, 7; 7 -a-> 8; 8 -b-> 9; 9 -b-> 10 (accepting).]
The subset construction converts this NFA to a DFA. Its first steps (a-f, omitted here) give
A = ε-closure(0) = {0,1,2,4,7}, B = ε-closure(move(A,a)) = {1,2,3,4,6,7,8} and
C = ε-closure(move(A,b)) = {1,2,4,5,6,7}. Computing ε-closure(move(B,a)) again gives {1,2,3,4,6,7,8},
which is the same as the state B itself. In other words, we have a repeating edge to B:
A={0,1,2,4,7} B={1,2,3,4,6,7,8} C={1,2,4,5,6,7}
Find the state D that has an edge on b from B
g. Start with B = {1,2,3,4,6,7,8}. Find which states in B have states reachable by b transitions. This set is called
move(B,b). The set is {5,9}: move(B,b) = {5,9}
h. Now do an ε-closure on move(B,b). Find all the states in move(B,b) which are reachable with ε-transitions.
From 5 we can get to 5, 6, 7, 1, 2, 4. From 9 we get to 9 itself. So the complete set is {1,2,4,5,6,7,9}. So
ε-closure(move(B,b)) = D = {1,2,4,5,6,7,9}. This defines the new state D that has an edge on b from B.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state that has an edge on a from D
i. Start with D = {1,2,4,5,6,7,9}. Find which states in D have states reachable by a transitions. This set is called
move(D,a). The set is {3,8}: move(D,a) = {3,8}
j. Now do an ε-closure on move(D,a). Find all the states in move(D,a) which are reachable with ε-transitions. We
have 3 and 8 to consider. Starting with 3 we can get to 3 and 6, and from 6 to 1 and 7, and from 1 to 2 and 4.
Starting with 8 we can get to 8 only. So the complete set is {1,2,3,4,6,7,8}. So ε-closure(move(D,a)) =
{1,2,3,4,6,7,8} = B
This is a return edge to B:
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}
Find the state E that has an edge on b from D
k. Start with D = {1,2,4,5,6,7,9}. Find which states in D have states reachable by b transitions. This set is called
move(D,b). The set is {5,10}: move(D,b) = {5,10}
l. Now do an ε-closure on move(D,b). Find all the states in move(D,b) which are reachable with ε-transitions. From
5 we can get to 5, 6, 7, 1, 2, 4. From 10 we get to 10 itself. So the complete set is {1,2,4,5,6,7,10}. So
ε-closure(move(D,b)) = E = {1,2,4,5,6,7,10}
This defines the new state E that has an edge on b from D. Since it contains an accepting state, it is also an accepting
state.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
We should now examine state C
Find the state that has an edge on a from C
m. Start with C = {1,2,4,5,6,7}. Find which states in C have states reachable by a transitions. This set is called move(C,a).
The set is {3,8}:
move(C,a) = {3,8}, and ε-closure(move(C,a)) = {1,2,3,4,6,7,8} = B
We have seen this set before: it's the state B.
A={0,1,2,4,7}, B={1,2,3,4,6,7,8}, C={1,2,4,5,6,7}, D={1,2,4,5,6,7,9}, E={1,2,4,5,6,7,10}
Find the state that has an edge on b from C
n. Start with C = {1,2,4,5,6,7}. Find which states in C have states reachable by b transitions. This set is called move(C,b).
The set is {5}:
o. move(C,b) = {5}
p. Now do an ε-closure on move(C,b). Find all the states in move(C,b) which are reachable with ε-transitions. From 5
we can get to 5, 6, 7, 1, 2, 4, which is C itself. So
ε-closure(move(C,b)) = C
This defines a loop on C
Finally we need to look at E. Although this is an accepting state, the regular expression allows us to keep adding
more a's and b's, as long as we finally return to the accepting state E. So
Find the state that has an edge on a from E
q. Start with E = {1,2,4,5,6,7,10}. Find which states in E have states reachable by a transitions. This set is called
move(E,a). The set is {3,8}:
move(E,a) = {3,8}, so ε-closure(move(E,a)) = B. We saw this before: it's B. Similarly, move(E,b) = {5}, so
ε-closure(move(E,b)) = C.
[Diagram omitted: the resulting DFA. Start state A; accepting state E. Transitions:
A -a-> B, A -b-> C; B -a-> B, B -b-> D; C -a-> B, C -b-> C; D -a-> B, D -b-> E; E -a-> B, E -b-> C.]
That's it! There is only one edge from each state for a given input character, so it is a DFA. Disregard the fact that
each of these states is actually a group of NFA states; we can regard them as single states in the DFA. Note that the
DFA is not yet minimized (an equivalent DFA with fewer states exists).
Context free grammar is a formal grammar which is used to generate all possible strings in a given formal
language. A context-free grammar G can be defined by the four tuples:
G = (V, T, P, S)
where V is the finite set of variables (non-terminals), T is the finite set of terminals, P is the set of
production rules, and S ∈ V is the start symbol.
In a CFG, the start symbol is used to derive the string. You can derive a string by repeatedly replacing a
non-terminal by the right-hand side of a production, until all non-terminals have been replaced by terminal symbols.
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
By applying the production S → aSa, S → bSb recursively and finally applying the production S → c, we get
the string abbcbba.
Ambiguity:
Problem: Check whether the grammar G with productions X → X+X | X*X | a is ambiguous.
Solution: Let us find the derivation trees for the string "a+a*a". It has two leftmost derivations.
Derivation 1 − X → X+X → a+X → a+X*X → a+a*X → a+a*a
Derivation 2 − X → X*X → X+X*X → a+X*X → a+a*X → a+a*a
Parse trees 1 and 2 (omitted) correspond to these two derivations.
Since there are two parse trees for the single string "a+a*a", the grammar G is ambiguous.
Simplified Forms:
As we have seen, various languages can be represented efficiently by a context-free grammar. Not every
grammar is optimized: a grammar may contain extra symbols (non-terminals), and extra symbols unnecessarily
increase the length of the grammar. Simplification of a grammar means reducing the grammar by removing
useless symbols.
The properties of reduced grammar are given below:
1. Each variable (i.e. non-terminal) and each terminal of G appears in the derivation of some word in L.
A symbol is useless if it does not take part in the derivation of any string of the language: a terminal is
useless if it never appears in any derived string, and a variable is useless if it is unreachable from the start
symbol or cannot derive any terminal string. Such a variable is known as a useless variable.
For Example:
1. T → aaB | abA | aaT
2. A → aA
3. B → ab | b
4. C → ad
In the above example, the variable 'C' will never occur in the derivation of any string, so the production C → ad
is useless. So we will eliminate it, and the other productions are written in such a way that variable C can never
reach from the starting variable 'T'.
Production A → aA is also useless because there is no way to terminate it. If it never terminates, then it can
never produce a string. Hence this production can never take part in any derivation.
To remove this useless production A → aA, we will first find all the variables which will never lead to a
terminal string, such as the variable 'A'. Then we will remove all the productions in which the variable 'A' occurs.
Elimination of ε Production
The productions of the type S → ε are called ε productions. Such productions can be removed only from
those grammars whose language does not contain ε.
Step 1: First find out all nullable non-terminal variables which derive ε.
Step 2: For each production A → α, construct all productions A → x, where x is obtained from α by removing
one or more of the nullable non-terminals found in step 1.
Step 3: Now combine the result of step 2 with the original productions and remove the ε productions.
Step 3: Now combine the result of step 2 with the original production and remove ε productions.
Example:
Remove the ε productions from the following CFG while preserving the language it generates.
1. S → XYX
2. X → 0X | ε
3. Y → 1Y | ε
Solution:
Now, while removing the ε productions, we delete the rules X → ε and Y → ε. To preserve the meaning of the CFG,
we add a new production for every combination in which X or Y is replaced by ε on the right-hand side.
Let us take S → XYX.
If the first X = ε, then S → YX.
If the last X = ε, then S → XY.
If Y = ε, then S → XX.
If Y and one X are ε, then S → X.
If both X's are ε, then S → Y.
Now,
S → XYX | XY | YX | XX | X | Y
For X → 0X, removing X → ε gives X → 0, so
X → 0X | 0
Similarly, Y → 1Y | 1.
The resulting grammar is:
S → XYX | XY | YX | XX | X | Y
X → 0X | 0
Y → 1Y | 1
The unit productions are the productions in which one non-terminal gives another non-terminal. Use the
following steps to remove unit productions:
Step 1: To remove X → Y, add the production X → a to the grammar whenever Y → a occurs in the grammar.
Step 2: Delete X → Y from the grammar.
Step 3: Repeat step 1 and step 2 until all unit productions are removed.
For example:
S → 0A | 1B | C
A → 0S | 00
B→1|A
C → 01
Solution:
S → C is a unit production. While removing S → C we have to consider what C gives, so we add the rule S → 01:
S → 0A | 1B | 01
Similarly, B → A is a unit production, so we add the A-productions to B:
B → 1 | 0S | 00
The resulting grammar is:
S → 0A | 1B | 01
A → 0S | 00
B → 1 | 0S | 00
(C → 01 is now unreachable from S and can be removed as a useless production.)
Normal Forms
Chomsky's Normal Form (CNF):
CNF stands for Chomsky normal form. A CFG(context free grammar) is in CNF(Chomsky normal form) if all
production rules satisfy one of the following conditions:
o The start symbol generating ε. For example, S → ε, where S is the start symbol.
o A non-terminal generating two non-terminals. For example, S → AB.
o A non-terminal generating a terminal. For example, S → a.
For example:
G1 = {S → AB, S → c, A → a, B → b}
G2 = {S → aA, A → a, B → c}
The production rules of grammar G1 satisfy the rules specified for CNF, so grammar G1 is in CNF.
However, the production rule S → aA of grammar G2 does not satisfy the rules specified for CNF, as it contains
a terminal followed by a non-terminal. So grammar G2 is not in CNF.
Steps for converting CFG into CNF
Step 1: Eliminate the start symbol from the RHS. If the start symbol S appears on the right-hand side of any
production, create a new production:
S1 → S
Step 2: In the grammar, remove the null, unit and useless productions. You can refer to the Simplification of
CFG.
Step 3: Eliminate terminals from the RHS of the production if they exist with other non-terminals or terminals.
For example, production S → aA can be decomposed as:
S → RA
R→a
Step 4: Eliminate RHS with more than two non-terminals. For example, S → ASB can be decomposed as:
S → RB
R → AS
Example:
Convert the given CFG to CNF. Consider the given grammar G1:
S → a | aA | B
A → aBB | ε
B → Aa | b
Solution:
Step 1: We will create a new production S1 → S, as the start symbol S appears on the RHS. The grammar will
be:
S1 → S
S → a | aA | B
A → aBB | ε
B → Aa | b
Step 2: As grammar G1 contains A → ε null production, its removal from the grammar yields:
S1 → S
S → a | aA | B
A → aBB
B → Aa | b | a
Now, as grammar G1 contains Unit production S → B, its removal yield:
S1 → S
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Also remove the unit production S1 → S, its removal from the grammar yields:
S1 → a | aA | Aa | b
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Step 3: In the production rule S1 → aA | Aa, S → aA | Aa, A → aBB and B → Aa, terminal a exists on RHS
with non-terminals. So we will replace terminal a with X:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → XBB
B → AX | b | a
X→a
Step 4: In the production rule A → XBB, the RHS has more than two symbols; removing it from the grammar yields:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → RB
B → AX | b | a
X→a
R → XB
Hence, for the given grammar, this is the required CNF.
Greibach Normal Form (GNF):
GNF stands for Greibach normal form. A CFG (context-free grammar) is in GNF if all
the production rules satisfy one of the following conditions:
o The start symbol generating ε. For example, S → ε.
o A non-terminal generating a terminal. For example, A → a.
o A non-terminal generating a terminal followed by any number of non-terminals. For example, S → aAS.
For example:
G1 = {S → aA | bB, B → bB | b, A → aA | a}
G2 = {S → aA | bB, B → bB | ε, A → aA | ε}
The production rules of grammar G1 satisfy the rules specified for GNF, so grammar G1 is in GNF.
However, the production rules A → ε and B → ε of grammar G2 do not satisfy the rules specified for GNF
(only the start symbol can generate ε). So grammar G2 is not in GNF.
Steps for converting a CFG into GNF
Step 1: If the given grammar is not in CNF, convert it into CNF (see Chomsky normal form above).
Step 2: If the grammar contains left recursion, eliminate it (see Left Recursion).
Step 3: Convert the production rules into GNF form; if any production rule in the grammar is not in GNF form,
convert it.
Example:
S → XB | AA
A → a | SA
B→b
X→a
Solution:
As the given grammar G is already in CNF and there is no left recursion, so we can skip step 1 and step 2 and
directly go to step 3.
The production rule A → SA is not in GNF, so we substitute S → XB | AA in the production rule A → SA as:
S → XB | AA
A → a | XBA | AAA
B→b
X→a
The production rules S → XB and A → XBA are not in GNF, so we substitute X → a in the production rules S →
XB and A → XBA as:
S → aB | AA
A → a | aBA | AAA
B→b
X→a
The production rule A → AAA is left recursive, so we eliminate the left recursion (introducing a new
non-terminal C), which gives:
S → aB | AA
A → aC | aBAC
C → AAC | ε
B→b
X→a
Now we will remove null production C → ε, we get:
S → aB | AA
A → aC | aBAC | a | aBA
C → AAC | AA
B→b
X→a
The production rule S → AA is not in GNF, so we substitute A → aC | aBAC | a | aBA in production rule S →
AA as:
The production rule C → AAC is not in GNF, so we substitute A → aC | aBAC | a | aBA in production rule C →
AAC as:
As computers became an inevitable and integral part of human life, and as several languages
with different and more advanced features evolved to make communication with the machine easier,
the development of translator (mediator) software became essential to bridge the huge gap between
human and machine understanding.
This process is called Language Processing, to reflect the goal and intent of the process. To
understand it better, we have to be familiar with some key terms and concepts explained in the
following lines.
LANGUAGE TRANSLATORS
A language translator is a computer program which translates a program written in one (source)
language into its equivalent program in another (target) language. The source program is in a
high-level language, whereas the target language can be anything from the machine language of a
target machine (from microprocessor to supercomputer) to another high-level language.
Based on the input the translator takes and the output it produces, a language translator
can be called as any one of the following.
Preprocessor: A preprocessor takes the skeletal source program as input and produces an
extended version of it, which is the resultant of expanding the Macros, manifest constants if any,
and including header files etc in the source file. For example, the C preprocessor is a macro
processor that is used automatically by the C compiler to transform our source before actual
compilation. Over and above this, a preprocessor performs the following activities:
Collects all the modules and files, in case the source program is divided into different
modules stored in different files.
Compiler: A translator that takes as input a source program written in a high-level language and
converts it into its equivalent target program in machine language. In addition to the above, the
compiler also facilitates the user in rectifying errors and executing the code.
Assembler: A program that takes as input an assembly language program and converts it into
its equivalent machine language code.
Loader / Linker: This is a program that takes as input relocatable code, collects the library
functions and relocatable object files, and produces the equivalent absolute machine code.
Specifically,
Loading consists of taking the relocatable machine code, altering the relocatable
addresses, and placing the altered instructions and data in memory at the proper
locations.
Linking allows us to make a single program from several files of relocatable machine
code. These files may have been result of several different compilations, one or more
may be library routines provided by the system available to any program that
needs them.
In addition to these translators, programs like interpreters, text formatters etc. may be used
in a language processing system. To translate a high-level language program into an executable
one, the compiler performs by default the compile and linking functions.
Normally the steps in a language processing system include preprocessing the skeletal source
program, which produces an extended or expanded source program (a ready-to-compile unit of the
source program), followed by compiling the result, then linking / loading, until finally the
equivalent executable code is produced. As noted earlier, not all of these steps are mandatory;
in some cases the compiler performs the linking and loading functions implicitly.
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, the Compilers can be
classified into the following types;
Interpreters (LISP, SNOBOL, Java 1.0): These first convert the source code into
intermediate code, and then interpret (emulate) it in place of its equivalent machine code.
Cross-Compilers: These are compilers that run on one machine and produce code for
another machine.
Incremental Compilers: These compilers separate the source into user-defined steps,
compiling/recompiling step by step and interpreting steps in a given order.
Converters (e.g. COBOL to C++): These programs compile from one high-level language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an
intermediate language (byte code, MSIL) to executable code or native machine code. They
perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native
code for Java and .NET.
Binary Compilation: These compilers compile object code of one platform into object
code of another platform.
BOOTSTRAPPING
1. Create a compiler for a subset S of the desired language L, written in an existing language A; that compiler runs
on machine A.
PHASES OF A COMPILER:
Due to the complexity of compilation task, a Compiler typically proceeds in a Sequence of
compilation phases. The phases communicate with each other via clearly defined interfaces.
Generally an interface contains a Data structure (e.g., tree), Set of exported functions. Each
phase works on an abstract intermediate representation of the source program, not the
source program text itself (except the first phase)
Compiler Phases are the individual modules which are chronologically executed to perform
their respective Sub-activities, and finally integrate the solutions to give target code.
It is desirable to have relatively few phases, since it takes time to read and write intermediate
files. The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes
during compilation. Therefore a typical compiler has the following phases:
The Front-end of the compiler consists of phases that depend primarily on the Source
language and are largely independent on the target machine. For example, front-end of the
compiler includes Scanner, Parser, Creation of Symbol table, Semantic Analyzer, and the
Intermediate Code Generator.
The Back-end of the compiler consists of phases that depend on the target machine; those
portions do not depend on the source language, just on the intermediate language. Here we
have the different aspects of the code optimization phase and code generation, along with the
necessary error handling and symbol table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as interface
between the compiler and the Source language program and performs the following functions:
Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an
identifier, a keyword, a punctuation mark, or a multi-character operator like :=.
The character sequence forming a token is called a lexeme of the token.
The Scanner generates a token-id, and also enters that identifier's name in the
symbol table if it doesn't already exist.
It also removes comments and unnecessary spaces.
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner, and its
subsequent phase Semantic Analyzer and performs the following functions:
Groups the received and recorded token stream into syntactic structures,
usually into a structure called a parse tree, whose leaves are tokens.
The interior nodes of this tree represent strings of tokens that logically
belong together.
In short, it checks the syntax of the program elements.
SEMANTIC ANALYZER: This phase receives the syntax tree as input, and checks the
semantic correctness of the program. Though the tokens are valid and syntactically correct, it
may happen that they are not semantically correct. Therefore the semantic analyzer checks the
semantics (meaning) of the statements formed.
o The syntactically and semantically correct structures are produced here in the form of
a syntax tree or DAG or some other sequential representation such as a matrix.
INTERMEDIATE CODE GENERATOR (ICG): This phase takes the syntactically and
semantically correct structure as input, and produces its equivalent intermediate notation of the
source program. The Intermediate Code should have two important properties specified below:
o It should be easy to produce, and Easy to translate into the target program.
Example intermediate code forms are:
o Three address codes,
o Polish notations, etc.
CODE OPTIMIZER: This phase is optional in some compilers, but is very useful and beneficial in
terms of saving development time, effort, and cost. This phase performs the following specific
functions:
Attempts to improve the intermediate code so as to obtain faster machine code. Typical functions
include loop optimization, removal of redundant computations, strength reduction,
frequency reduction, etc.
Sometimes the data structures used to represent the intermediate forms may also be
changed.
CODE GENERATOR: This is the final phase of the compiler and generates the target code,
normally consisting of the relocatable machine code or Assembly code or absolute machine
code.
Memory locations are selected for each variable used, and assignment of variables
to registers is done.
Intermediate instructions are translated into a sequence of machine instructions.
The compiler also performs symbol table management and error handling throughout
the compilation process. The symbol table is nothing but a data structure that stores the different
source language constructs and tokens generated during compilation. These two components interact
with all phases of the compiler.
For example, if the source program is an assignment statement, the following figure shows how the
phases of a compiler process the program.
As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output tokens for
each lexeme in the source program. This stream of tokens is sent to the parser for syntax
analysis. It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter
that lexeme into the symbol table. This process is shown in the following figure.
When lexical analyzer identifies the first token it will send it to the parser, the parser
receives the token and calls the lexical analyzer to send next token by issuing the
getNextToken() command. This Process continues until the lexical analyzer identifies all the
tokens. During this process the lexical analyzer will neglect or discard the white spaces and
comment lines.
There are a number of reasons why the analysis portion of a compiler is normally separated
into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of Lexical and
Syntactic analysis often allows us to simplify at least one of these tasks. For example, a
parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have
already been removed by the lexical analyzer.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source program, specialized
buffering techniques have been developed to reduce the amount of overhead required to process
a single input character. An important scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters into a buffer, rather than using
one system call per character. If fewer than N characters remain in the input file, then a special
character, represented by eof, marks the end of the source file and is different from any possible
character of the source program. Two pointers to the input are maintained:
1. The Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this
chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin
is set to the character immediately after the lexeme just found. In Fig, we see forward has passed
the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted
one position to its left.
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
As long as we never need to look so far ahead of the actual lexeme that the sum of the
lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the
lexeme in its buffer before determining it.
If we use the above scheme as described, we must check, each time we advance
forward, that we have not moved off one of the buffers; if we do, then we must also reload the
other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and
one to determine what character is read (the latter may be a multi way branch).
We can combine the buffer-end test with the test for the current character if we extend
each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot
be part of the source program, and a natural choice is the character eof. Figure shows the same
arrangement as Figure above but with the sentinels added. Note that eof retains its use as a
marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end. Below
Figure summarizes the algorithm for advancing forward. Notice how the first test, which can be part
of a multiway branch based on the character pointed to by forward, is the only test we make,
except in the case where we actually are at the end of a buffer or the end of the input.

switch ( *forward++ ) {
case eof:
    if ( forward is at end of first buffer ) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if ( forward is at end of second buffer ) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
Let us understand how the language theory undertakes the following terms:
1. Alphabets
2. Strings
3. Special symbols
4. Language
5. Longest match rule
6. Operations
7. Notations
8. Representing valid tokens of a language in regular expression
9. Finite automata
1. Alphabets: Any finite set of symbols
o {0,1} is a set of binary alphabets,
o {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets,
o {a-z, A-Z} is a set of English language alphabets.
2. Strings: Any finite sequence of symbols from an alphabet is called a string.
3. Special symbols: typical special symbols used in programming languages include, for example:
Assignment: =
Preprocessor: #
4. Language: A language is a set of strings over some finite set of
alphabets (symbols); it may be finite or infinite.
5. Operations: The various operations on languages are:
Union of two languages L and M is written as, L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as, LM = {st | s is in L and tis in M}
o The Kleene Closure of a language L is written as, L* = Zero or more occurrence of
language L.
6. Notations: If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : L(r)UL(s)
Concatenation : L(r)L(s)
Kleene closure : (L(r))*
7.Representing valid tokens of a language in regular expression: If x is a regular expression,
o x* means zero or more occurrence of x.
o x+ means one or more occurrence of x.
8.Finite automata: Finite automata is a state machine that takes a string of symbols as input
and changes its state accordingly. If the input string is successfully processed and the automata
reaches its final state, it is accepted. The mathematical model of finite automata consists of:
o Finite set of states (Q)
o Finite set of input symbols (Σ)
o One Start state (q0)
o Set of final states (qf)
o Transition function (δ)
RECOGNITION OF TOKENS
id   -> letter ( letter | digit )*
if   -> if
then -> then
else -> else
o Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence
of tokens.
o It reads the input stream and produces C source code that implements
the lexical analyzer.
o Firstly, the lexical analyzer specification is written as a program lex.l in the Lex language. Then the Lex
compiler runs on the lex.l program and produces a C program lex.yy.c.
o Finally C compiler runs the lex.yy.c program and produces an object program a.out.
o a.out is lexical analyzer that transforms an input stream into a sequence of tokens.
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of variables, constants, and regular definitions.
Rules are statements of the form:
p1 {action1}
p2 {action2}
…
pn {actionn}
where each pi is a regular expression and actioni describes what action the lexical analyzer should
take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be
compiled separately and loaded with the lexical analyzer.
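As an illustration (this example is not part of the original notes, and the file name lexer.l is
hypothetical), a complete Lex specification that prints the identifiers and numbers it recognizes
could look like this:

%{
#include <stdio.h>   /* definitions section: C declarations copied verbatim */
%}
%%
[a-zA-Z][a-zA-Z0-9]*   { printf("id: %s\n", yytext); /* identifier */ }
[0-9]+                 { printf("number: %s\n", yytext); /* integer */ }
[ \t\n]+               { /* discard whitespace */ }
.                      { printf("other: %s\n", yytext); }
%%
/* user subroutines section */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

Running 'lex lexer.l' produces lex.yy.c, and compiling that with a C compiler yields a.out, the
lexical analyzer described above.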
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer
and verifies that the string can be generated by the grammar for the source language. It
reports any syntax errors in the program. It also recovers from commonly occurring errors so
that it can continue processing its input.
Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Variable initialization before use
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.
Error productions:
The parser is constructed using augmented grammar with error productions. If an
error production is used by the parser, appropriate error diagnostics can be generated to
indicate the erroneous construct recognized in the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens is
as small as possible. However, these methods are in general too costly in terms of time and
space.
CONTEXT-FREE GRAMMARS
A Context-Free Grammar is a quadruple that consists of
terminals,
Non-terminals,
start symbol and
productions.
Terminals: These are the basic symbols from which strings are formed.
Non-Terminals: These are the syntactic variables that denote a set of strings.
These help to define the language generated by the grammar.
Start Symbol: One non-terminal in the grammar is denoted as the “Start-symbol” and the
setof strings it denotes is the language defined by the grammar.
Productions: They specify the manner in which terminals and non-terminals can be
combined to form strings. Each production consists of a non-terminal,
followed by an arrow, followed by a string of non-terminals and terminals.
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
In this grammar,
id + - * / ↑ ( ) are terminals; expr and op are non-terminals; expr is the start symbol.
Derivation is a process that generates a valid string with the help of grammar by
replacing the non-terminals on the left with the string on the right side of the production.
E→E+E|E*E|(E)|-E| id
To generate a valid string - ( id+id ) from the grammar the steps are
E ⇒ - E
  ⇒ - ( E )
  ⇒ - ( E+E )
  ⇒ - ( id+E )
  ⇒ - ( id+id )
In the above derivation,
E is the start symbol
-(id+id) is the required sentence (consisting only of terminals).
Strings such as E, -E, -(E), . . . that appear in a derivation are called sentential forms.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
In leftmost derivations, the leftmost non-terminal in each sentential form is always chosen first for
replacement.
In rightmost derivations, the rightmost non-terminal in each sentential form is always chosen first for
replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | - E | id Sentence to be derived : - (id+id)
WRITING A GRAMMAR
Eliminating ambiguity:
Ambiguity of the grammar that produces more than one parse tree for leftmost or
rightmost derivation can be eliminated by re-writing the grammar.
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
Eliminating Left Recursion:
A grammar is said to be left recursive if it has a non-terminal A such that there is a
derivation A ⇒ Aα for some string α. Top-down parsing methods cannot
handle left-recursive grammars. Hence, left recursion can be eliminated as follows:
If there is a production A → Aα | β, it can be replaced with
A→βA’
A’→αA’ | ε
without changing the set of strings derivable from A.
Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar
suitable for predictive parsing. When it is not clear which of two alternative productions to
use to expand a non-terminal A, we can rewrite the A-productions to defer the decision until
we have seen enough of the input to make the right choice.
If there are productions A → αβ1 | αβ2, they can be rewritten as
A → αA'
A' → β1 | β2
Example: the dangling-else grammar S → iEtS | iEtSeS | a, E → b can be left-factored as:
S → iEtSS'
S' → eS | ε
E → b
PARSING
It is the process of analyzing a continuous stream of input in order to determine its
grammatical structure with respect to a given formal grammar.
Parse tree:
Types of parsing:
o Top-down parsing: A parser can start with the start symbol and try to transform it into the
input string. Example: LL parsers.
o Bottom-up parsing: A parser can start with the input and attempt to rewrite it
into the start symbol. Example: LR parsers.
TOP-DOWN PARSING
It can be viewed as an attempt to find a left-most derivation for an input string or an attempt
to construct a parse tree for the input starting from the root to the leaves.
Therefore, a parser using the single-symbol look-ahead method and top-down parsing
without backtracking is called an LL(1) parser. In the following sections, we will also
use an extended BNF notation in which some regular expression operators are
incorporated.
This parsing method may involve backtracking.
Example of Backtracking: Consider the grammar S → cAd, A → ab | a and the input string w = cad.
Step 1: Begin the parse tree with the root S and expand it using the production S → cAd.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.
Step3:
The second symbol 'a' of w also matches the second leaf of the tree. So advance the input
pointer to the third symbol of w, 'd'. But the third leaf of the tree is b, which does not match the
input symbol d. Hence discard the chosen production, reset the input pointer to the
second symbol of w, and backtrack.
Step 4: Now try the second alternative, A → a. The leaf a matches the second symbol of w and the
next leaf d matches the third symbol, so the parse of w = cad succeeds.
Predictive parsing: construction of the predictive parsing table.
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε
is in FIRST(α) and $ is in FOLLOW(A) , add A → α to M[A, $].
4. Make each undefined entry of M be error.
Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → ( E ) | id
After eliminating left recursion, the grammar becomes:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
First( ) :
FIRST(E) = { ( , id}
FIRST(E’) ={+ , ε }
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
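Applying steps 2 and 3 of the table-construction method to these FIRST and FOLLOW sets gives the
predictive parsing table below (the standard table for this grammar):

Non-terminal   id        +           *           (         )         $
E              E → TE'                           E → TE'
E'                       E' → +TE'                         E' → ε    E' → ε
T              T → FT'                           T → FT'
T'                       T' → ε      T' → *FT'             T' → ε    T' → ε
F              F → id                            F → (E)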
Stack Implementation
LL(1) grammar:
If each parsing table entry is a single entry, i.e. no location has more than one entry,
the grammar is called an LL(1) grammar.
As an example, consider the (dangling-else) grammar:
S → iEtSS' | a
S' → eS | ε
E → b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}
Parsing table (omitted): the entry M[S', e] contains two productions, S' → eS and S' → ε.
Since there is more than one production in a single entry, the grammar is not an LL(1) grammar.
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards
the root is called bottom-up parsing. A general type of bottom-up parser is a shift-reduce
parser.
SHIFT-REDUCE PARSING
Example:
Consider the grammar:
E→E+E
E→E*E
E→(E)
E→id
• shift - The next input symbol is shifted onto the top of the stack.
• reduce - The parser replaces the handle within a stack with a non-terminal.
• accept - The parser announces successful completion of parsing.
• error - The parser discovers that a syntax error has occurred and calls an error recovery
routine.
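For the input string id+id*id, one possible sequence of moves of a shift-reduce parser for this
grammar is sketched below (each conflict is resolved in favor of the usual operator precedence):

Stack          Input          Action
$              id+id*id $     shift
$ id           +id*id $       reduce by E → id
$ E            +id*id $       shift
$ E +          id*id $        shift
$ E + id       *id $          reduce by E → id
$ E + E        *id $          shift
$ E + E *      id $           shift
$ E + E * id   $              reduce by E → id
$ E + E * E    $              reduce by E → E * E
$ E + E        $              reduce by E → E + E
$ E            $              accept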
Conflicts in shift-reduce parsing:
There are two conflicts that occur in shift-reduce parsing:
1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.
1. Shift-reduce conflict:
Example:
Consider the grammar:
E→E+E | E*E | id and input id+id*id
2. Reduce-reduce conflict:
Consider the grammar: M→R+R|R+c|R
R→c
Viable prefixes:
The prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefixes.
The set of viable prefixes is a regular language.
LR PARSERS
An efficient bottom-up syntax analysis technique that can be used for a large class of CFGs is called
LR(k) parsing. The 'L' is for left-to-right scanning of the input, the 'R' for constructing a
rightmost derivation in reverse, and the 'k' for the number of input symbols of lookahead used in
making parsing decisions. When 'k' is omitted, it is assumed to be 1.
Advantages of LR parsing:
1. It recognizes virtually all programming language constructs for which CFG can be
written.
2. It is an efficient non-backtracking shift-reduce parsing method.
3. The class of grammars that can be parsed using the LR method is a proper superset of the class of
grammars that can be parsed with predictive parsers.
4. It detects a syntactic error as soon as possible on a left-to-right scan of the input.
Drawbacks of the LR method:
The main drawback is that it is too much work to construct an LR parser by hand for a typical
programming-language grammar; a specialized tool (an LR parser generator) is needed.
Action: The parsing program determines sm, the state currently on top of the stack, and ai,
the current input symbol. It then consults action[sm, ai] in the action table, which can have one
of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept,
4. error.
Goto : The function goto takes a state and grammar symbol as arguments and produces a
state.
LR Parser:
LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
In LR(k), 'k' is the number of input symbols of lookahead used in making parsing
decisions.
LR algorithm:
The LR algorithm requires a stack, input, output and a parsing table. In all types of LR parsing,
the input, output and stack are the same, but the parsing table is different.
Input buffer is used to indicate end of input and it contains the string to be parsed followed by
a $ Symbol.
A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the stack.
The parsing table is a two-dimensional array. It contains two parts: an Action part and a Goto part.
LR (1) Parsing
Augment Grammar
An augmented grammar G` is generated by adding one more production to the given
grammar G. It helps the parser to identify when to stop parsing and announce
acceptance of the input.
Example
Given grammar
1. S → AA
2. A → aA | b
The Augment grammar G` is represented by
1. S`→ S
2. S → AA
3. A → aA | b
SLR (1) Parsing
SLR (1) refers to simple LR parsing. It is the same as LR (0) parsing; the only difference is in the
parsing table. To construct the SLR (1) parsing table, we use the canonical collection of LR (0) items.
In SLR (1) parsing, we place a reduce move only in the FOLLOW of the left-hand side.
The steps used to construct the SLR (1) table are given below:
SLR ( 1 ) Grammar
E→E+T|T
T→T*F|F
F → (E) | id
Add Augment Production and insert '•' symbol at the first position for every production in G
E`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F→•(E)
F → •id
Add Augment production to the I0 State and Compute the Closure
I0:E`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F→•(E)
F → •id
Go to (I0, E) = closure (E` → E•, E → E• + T) =I1
Go to (I0, T) = closure (E → T•, T→ T• * F) =I2
Go to (I0, F) = Closure ( T → F• ) = T → F•= I3
Go to (I0, ( ) = Closure (F→(•E),E→•E+T, E→•T, T→•T*F, T→•F, F→•(E), F → •id ) = I4
Go to (I0, id) = closure ( F → id•) = F → id• = I5
Explanation:
First (F) = { ( , id }
First (T) = { ( , id }
First (E) = { ( , id }
Follow (E) = First (+T) ∪ {$} = { +, ), $ }
Follow (T) = First (*F) ∪ Follow (E) = { *, +, ), $ }
Follow (F) = { *, +, ), $ }
o I1 contains the final item which drives E` → E• and follow (E`) = {$}, so action
{I1, $} = Accept
o I2 contains the final item which drives E → T• and follow (E) = {+,), $}, so action
{I2, +} = R2, action {I2, $} = R2, action {I2, )} = R2
o I3 contains the final item which drives T → F• and follow (T) = {+, *,), $}, so
action {I3, +} = R4, action {I3, *} = R4, action {I3, $} = R4, action {I3, )} = R4
o I5 contains the final item which drives F → id• and follow (F) = {+, *, $,)}, so
action {I5, +} = R6, action {I5, *} = R6, action {I5, $} = R6, action {I5, )} = R6
o I9 contains the final item which drives E → E + T• and follow (E) = {+,), $}, so
action {I9, +} = R1, action {I9, $} = R1, action {I9, )} = R1
o I10 contains the final item which drives T → T * F• and follow (T) = {+, *,), $}, so
action {I10, +} = R3, action {I10, *} = R3, action {I10, $} = R3,action {I10, )} = R3
o I11 contains the final item which drives F → (E)• and follow (T) = {+, *,), $}, so
action {I11, +} = R5, action {I11, *} = R5, action {I11, $} = R5,action {I11, )} = R5.
Parsing:
CLR ( 1 ) Grammar
CLR refers to canonical lookahead. CLR parsing use the canonical collection of LR (1) items
to build the CLR (1) parsing table. CLR (1) parsing table produces the more number of states
as compare to the SLR (1) parsing.
In the CLR (1), we place the reduce node only in the lookahead symbols.
LR (1) item
An LR (1) item is an LR (0) item together with a lookahead symbol.
The lookahead is used to determine where we place the final (reduce) item.
The lookahead for the augment production is always $.
Example
CLR ( 1 ) Grammar
S → AA
A → aA
A→b
Add Augment Production, insert '•' symbol at the first position for every production in G and
also add the lookahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "." is followed by the non-terminal.
So, the I0 State becomes
I0= S`→•S,$
S → •AA, $
Add all productions starting with A in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•S,$
S→•AA,$
A→•aA,a/b
A → •b, a/b
I1 = Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2 = Go to (I0, A) = closure (S → A•A, $)
Add all productions starting with A in I2 State because "•" is followed by the non-terminal.
So, the I2 State becomes
I2 = S → A•A, $
     A → •aA, $
     A → •b, $
I3= Go to (I0, a) = Closure ( A → a•A, a/b )
Add all productions starting with A in I3 State because "." is followed by the non-terminal.
So, the I3 State becomes
I3= A→a•A,a/b
A→•aA,a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
I4 = Go to (I0, b) = closure (A → b•, a/b) = A → b•, a/b
I5 = Go to (I2, A) = Closure (S → AA•, $) = S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
Add all productions starting with A in I6 State because "." is followed by the non-terminal.
So, the I6 State becomes
I6= A→a•A,$
A→•aA,$
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
I7 = Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8 = Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9 = Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
              Action                Goto
States     a       b       $        S      A
0          S3      S4               1      2
1                          ACCEPT
2          S6      S7                      5
3          S3      S4                      8
4          r3      r3
5                          r1
6          S6      S7                      9
7                          r3
8          r2      r2
9                          r2
Productions are numbered as follows:
1. S → AA ... (1)
2. A → aA ....(2)
3. A → b ... (3)
The placement of the shift moves in the CLR (1) parsing table is the same as in the SLR (1) parsing table;
the only difference is in the placement of the reduce moves.
I4 contains the final item which drives ( A → b•, a/b), so action {I4, a} = R3, action {I4, b} =
R3.
I5 contains the final item which drives ( S → AA•, $), so action {I5, $} = R1.
I7 contains the final item which drives ( A → b•,$), so action {I7, $} = R3.
I8 contains the final item which drives ( A → aA•, a/b), so action {I8, a} = R2, action {I8, b}
=R2.
I9 contains the final item which drives ( A → aA•, $), so action {I9, $} = R2.
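The table above can be driven mechanically. The C sketch below (an illustration, not part of the
notes; the numeric encoding of actions is an assumption) hard-codes the CLR (1) table for S → AA,
A → aA | b and runs the standard LR parsing loop:

#include <stdio.h>

/* Action encoding: positive n = shift and go to state n, negative n = reduce
   by production n, 100 = accept, 0 = error. Columns: a, b, $. */
static const int action_tab[10][3] = {
    {  3,  4,   0 },   /* state 0 */
    {  0,  0, 100 },   /* state 1: accept on $ */
    {  6,  7,   0 },   /* state 2 */
    {  3,  4,   0 },   /* state 3 */
    { -3, -3,   0 },   /* state 4: reduce A -> b on a, b */
    {  0,  0,  -1 },   /* state 5: reduce S -> AA on $ */
    {  6,  7,   0 },   /* state 6 */
    {  0,  0,  -3 },   /* state 7: reduce A -> b on $ */
    { -2, -2,   0 },   /* state 8: reduce A -> aA on a, b */
    {  0,  0,  -2 },   /* state 9: reduce A -> aA on $ */
};
/* Goto table. Columns: S, A. */
static const int goto_tab[10][2] = {
    {1,2}, {0,0}, {0,5}, {0,8}, {0,0}, {0,0}, {0,9}, {0,0}, {0,0}, {0,0}
};
static const int lhs[4]    = { 0, 0, 1, 1 };  /* production 1 has LHS S; 2 and 3 have LHS A */
static const int rhslen[4] = { 0, 2, 2, 1 };  /* |AA| = 2, |aA| = 2, |b| = 1 */

static int col(char c) { return c == 'a' ? 0 : c == 'b' ? 1 : 2; }

int lr_parse(const char *w) {
    int stack[100], top = 0;
    stack[0] = 0;                               /* start in state 0 */
    char tok = *w ? *w : '$';
    for (;;) {
        int a = action_tab[stack[top]][col(tok)];
        if (a == 100) return 1;                 /* accept */
        if (a == 0)   return 0;                 /* error  */
        if (a > 0) {                            /* shift  */
            stack[++top] = a;
            w++;
            tok = *w ? *w : '$';
        } else {                                /* reduce by production -a */
            int p = -a;
            top -= rhslen[p];                   /* pop one state per RHS symbol */
            stack[top + 1] = goto_tab[stack[top]][lhs[p]];
            top++;
        }
    }
}

int main(void) {
    printf("%d %d\n", lr_parse("bab"), lr_parse("ab"));  /* prints 1 0 */
    return 0;
}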
LALR refers to lookahead LR. To construct the LALR (1) parsing table, we use the
canonical collection of LR (1) items.
In LALR (1) parsing, the LR (1) items which have the same productions but different
lookaheads are combined to form a single set of items.
LALR (1) parsing is the same as CLR (1) parsing; the only difference is in the parsing table.
Example
S → AA
A → aA
A→b
Add Augment Production, insert '•' symbol at the first position for every production in G and
also add the look ahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "•" is followed by the non-terminal.
So, the I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.
I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
Add all productions starting with A in I2 State because "•" is followed by the non-terminal.
So, the I2 State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
Add all productions starting with A in I3 State because "•" is followed by the non-terminal.
So, the I3 State becomes
Add all productions starting with A in I6 State because "•" is followed by the non-terminal.
So, the I6 State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
If we analyze the LR (0) items of I3 and I6, they are the same; they differ only in their lookaheads.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}
Clearly I3 and I6 are the same in their LR (0) items but differ in their lookaheads, so we can
combine them and call the combined state I36.
I4 and I7 are the same but differ only in their lookaheads, so we can combine them and
call the combined state I47.
I8 and I9 are the same but differ only in their lookaheads, so we can combine them and
call the combined state I89.
              Action                Goto
States     a       b       $        S      A
0          S36     S47              1      2
1                          ACCEPT
2          S36     S47                     5
36         S36     S47                     89
47         r3      r3      r3
5                          r1
89         r2      r2      r2
Difference between LL and LR parser:
LL Parser:
- The first L of LL is for left-to-right scanning; the second L is for leftmost derivation.
- The parse tree is constructed in a top-down manner.
- Ends when the stack used becomes empty.
- Pre-order traversal of the parse tree.
LR Parser:
- The L of LR is for left-to-right scanning; the R is for rightmost derivation (in reverse).
- The parse tree is constructed in a bottom-up manner.
- Starts with an empty stack.
- Post-order traversal of the parse tree.
Annotated Parse Tree – The parse tree containing the values of the attributes at each node for a
given input string is called an annotated or decorated parse tree.
Features –
High level specification
Hides implementation details
Explicit order of evaluation is not specified
1. Synthesized Attributes – These are the attributes which derive their values from their
children nodes, i.e. the value of a synthesized attribute at a node is computed from the values of
the attributes at the children of that node in the parse tree.
Example:
E --> E1 + T { E.val = E1.val + T.val }
In this, E.val derives its value from E1.val and T.val.
Computation of Synthesized Attributes –
Write the SDD using appropriate semantic rules for each production in given grammar.
The annotated parse tree is generated and attribute values are computed in bottom up
manner.
The value obtained at root node is the final output.
Example: Consider the following grammar
S --> E
E --> E1 + T
E --> T
T --> T1 * F
T --> F
F --> digit
The SDD for the above grammar can be written as follows:
S --> E        { print(E.val) }
E --> E1 + T   { E.val = E1.val + T.val }
E --> T        { E.val = T.val }
T --> T1 * F   { T.val = T1.val * F.val }
T --> F        { T.val = F.val }
F --> digit    { F.val = digit.lexval }
Let us assume the input string 4 * 5 + 6 for computing the synthesized attributes. The
annotated parse tree for this input string is built as follows.
For the computation of attributes we start from the leftmost bottom node.
The rule F --> digit is used to reduce digit to F, and the value of digit is obtained
from the lexical analyzer; it becomes the value of F through the semantic action
F.val = digit.lexval.
Hence F.val = 4, and since T is the parent node of F, we get T.val = 4 from the semantic
action T.val = F.val.
Then, for the production T --> T1 * F, the corresponding semantic action is
T.val = T1.val * F.val. Hence T.val = 4 * 5 = 20.
Similarly, the combination E1.val + T.val becomes E.val, i.e., E.val = E1.val + T.val =
20 + 6 = 26. Then the production S --> E is applied to reduce E.val = 26, and the semantic
action associated with it prints the result E.val. Hence, the output will be 26.
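The bottom-up evaluation just described can be mimicked directly in code. The following is a minimal C++ sketch, assuming single-digit operands, in which each parse function returns its node's val attribute, so values flow from children to parents exactly as in the annotated parse tree:

// Recursive-descent evaluation of the synthesized attribute val.
#include <iostream>
#include <string>
using namespace std;

string in = "4*5+6";   // the input string used in the walk-through above
size_t pos = 0;

int parseF() { return in[pos++] - '0'; }          // F -> digit  { F.val = digit.lexval }

int parseT() {                                    // T -> T1 * F { T.val = T1.val * F.val }
    int val = parseF();
    while (pos < in.size() && in[pos] == '*') { pos++; val *= parseF(); }
    return val;
}

int parseE() {                                    // E -> E1 + T { E.val = E1.val + T.val }
    int val = parseT();
    while (pos < in.size() && in[pos] == '+') { pos++; val += parseT(); }
    return val;
}

int main() {
    cout << parseE() << "\n";                     // S -> E { print(E.val) }, prints 26
}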
2. Inherited Attributes – These are attributes which derive their values from their
parent or sibling nodes, i.e., the value of an inherited attribute is computed from the
values of the attributes of the parent and/or sibling nodes.
Example:
D --> T L { L.in = T.type }
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The value of the L node is obtained from T.type (its sibling), which is basically the lexical
value obtained as int, float or double. The L node then gives the type of the identifiers
a and c. The computation of the type is done in a top-down manner, i.e., a preorder
traversal. Using the function Enter_type, the type of the identifiers a and c is inserted
into the symbol table at the corresponding id.entry.
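Here is a small C++ sketch of the same idea for a declaration such as "int a, c". The map standing in for the symbol table and the shape of Enter_type are simplifications for illustration, not the notes' actual implementation: T.type is computed once and handed down the identifier list as L.in.

// Inherited attribute L.in propagated down a declaration list.
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

map<string, string> symtab;   // stand-in for the symbol table

void Enter_type(const string& id, const string& type) { symtab[id] = type; }

// L -> L1 , id | id : every id on the list inherits the same L.in.
void parseL(const vector<string>& ids, const string& in) {
    for (const string& id : ids) Enter_type(id, in);   // id.entry receives L.in
}

int main() {
    string Ttype = "int";                 // T -> int  { T.type = int }
    parseL({"a", "c"}, Ttype);            // D -> T L  { L.in = T.type }
    for (auto& e : symtab) cout << e.first << " : " << e.second << "\n";
}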
Syntax Tree
A syntax tree is a tree in which each leaf node describes an operand and each interior node
an operator. The syntax tree is a shortened form of the parse tree.
Each node in a syntax tree can be implemented as a record with multiple fields.
In the node for an operator, one field identifies the operator and the remaining fields
contain pointers to the nodes for the operands.
The operator is called the label of the node. The following functions are used to create
the nodes of a syntax tree for expressions with binary operators.
mknode (op, left, right) − generates an operator node with label op and two fields
containing pointers to left and right.
mkleaf (id, entry) − generates an identifier node with label id and a field containing
entry, a pointer to the symbol table entry for the identifier.
mkleaf (num, val) − generates a number node with label num and a field containing val,
the value of the number.
For example, let us construct a syntax tree for the expression a − 4 + c. In this sequence,
p1, p2, …, p5 are the pointers returned by the node-construction calls, while entry-a and
entry-c are pointers to the symbol table entries for the identifiers 'a' and 'c'.
The function calls mkleaf (id, entry-a) and mkleaf (num, 4) construct the leaves for a
and 4. The pointers to these nodes are stored in p1 and p2. The call mknode ('−', p1, p2)
then makes the interior node with the leaves for a and 4 as children, and the construction
continues bottom-up until the complete syntax tree is built.
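A minimal C++ sketch of these constructor functions follows; the Node layout and the split of mkleaf into two helpers are illustrative choices, not prescribed by the notes. It builds the tree for a − 4 + c:

// mknode/mkleaf-style construction of the syntax tree for a - 4 + c.
#include <iostream>
#include <string>
using namespace std;

struct Node {
    string label;          // operator, "id", or "num"
    Node *left, *right;    // operand pointers (null for leaves)
    string entry;          // symbol-table entry (here just the name) for id leaves
    int val;               // value for num leaves
};

Node* mknode(const string& op, Node* l, Node* r) { return new Node{op, l, r, "", 0}; }
Node* mkleaf_id(const string& entry) { return new Node{"id", nullptr, nullptr, entry, 0}; }
Node* mkleaf_num(int val) { return new Node{"num", nullptr, nullptr, "", val}; }

int main() {
    Node* p1 = mkleaf_id("a");          // mkleaf(id, entry-a)
    Node* p2 = mkleaf_num(4);           // mkleaf(num, 4)
    Node* p3 = mknode("-", p1, p2);     // interior node for a - 4
    Node* p4 = mkleaf_id("c");          // mkleaf(id, entry-c)
    Node* p5 = mknode("+", p3, p4);     // root: (a - 4) + c
    cout << "root label: " << p5->label << "\n";
}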
The same construction can be written as semantic rules:
E → E(1) + E(2)  { E.VAL = Node (+, E(1).VAL, E(2).VAL) }
E → E(1) ∗ E(2)  { E.VAL = Node (∗, E(1).VAL, E(2).VAL) }
Node (+, E(1).VAL, E(2).VAL) will create a node labeled +, with E(1).VAL and E(2).VAL as
its left and right children.
Similarly, Node (∗, E(1).VAL, E(2).VAL) will create a node labeled ∗.
Function UNARY (−, E(1).VAL) will create a node labeled − (unary minus), with E(1).VAL as
its only child.
Function LEAF (id) will create a leaf node with label id.
Example: a = b ∗ −c + d
Solution: (syntax-tree figure)
Example: if a = b then b = 2 * c
Solution: (syntax-tree figure)
Intermediate code is used as a bridge for translating the source code into machine code.
Intermediate code lies between the high-level language and the machine language.
o If the compiler directly translates source code into machine code without generating
intermediate code, then a full native compiler is required for each new machine.
o The intermediate code keeps the analysis portion the same for all compilers; that is
why a full compiler is not needed for every unique machine.
o The intermediate code generator receives its input from the preceding phase, the
semantic analyzer, in the form of an annotated syntax tree.
o Using the intermediate code, only the synthesis phase of the compiler needs to be
changed according to the target machine.
Intermediate representation
High-level intermediate code stays close to the source code, so source-level code
modifications are easy to apply to it; however, it is less suited to optimizing for the
target machine.
Low-level intermediate code is close to the target machine, which makes it suitable for
register and memory allocation, etc. It is used for machine-dependent optimizations.
2. Three-Address Code: A statement involving no more than three references (two for
operands and one for the result) is known as a three-address statement. A sequence of
three-address statements is known as three-address code. A three-address statement is of
the form x = y op z, where x, y, and z have addresses (memory locations). Sometimes a
statement contains fewer than three references, but it is still called a three-address
statement.
Example: The three-address code for the expression a + b * c + d:
T1 = b * c
T2 = a + T1
T3 = T2 + d
Here T1, T2, T3 are compiler-generated temporary variables.
There are 3 ways to represent a Three-Address Code in compiler design:
i) Quadruples
ii) Triples
iii) Indirect Triples
3. Syntax Tree: A syntax tree is nothing more than a condensed form of a parse tree. The
operator and keyword nodes of the parse tree are moved up to their parents, and a chain of
single productions is replaced by a single link. In the syntax tree, the interior nodes are
operators and the leaf nodes are operands. To form a syntax tree, put parentheses in the
expression; this makes it easy to recognize which operand should be evaluated first.
Abstract Syntax Tree
Example 1
E → E + T | T
T → T x F | F
F → ( E ) | id
1. Parse tree
2. Syntax tree
3. Directed Acyclic Graph (DAG)
Parse Tree-
Syntax Tree-
Example 2
( a + b ) * ( c – d ) + ( ( e / f ) * ( a + b ))
Solution-
Step-01:
(a+b)*(c–d)+((e/f)*(a+b))
ab+ * ( c – d ) + ( ( e / f ) * ( a + b ) )
ab+ * cd- + ( ( e / f ) * ( a + b ) )
ab+cd-* + ef/ab+*
ab+cd-*ef/ab+*+
Step-02:
Steps Involved
Push the symbols of the postfix expression one by one: for an operand, push a leaf node
onto the stack; for an operator, pop two sub-trees, make them the children of a new
operator node, and push that node back. The single node left on the stack at the end is
the root of the syntax tree.
Example
Given Expression:
a := (-c * b) + (-c * d)
Its three-address code (using uminus for the unary minus) is:
t1 := uminus c
t2 := t1 * b
t3 := uminus c
t4 := t3 * d
t5 := t2 + t4
a := t5
The three-address code can be represented using quadruples, triples, or indirect triples.
Quadruples
A quadruple has four fields to implement a three-address statement. The fields contain the
name of the operator, the first source operand, the second source operand and the result,
respectively.
Example
a := -b * (c + d)
Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
These statements are represented by quadruples as follows:
        op        arg1    arg2    result
(0)     uminus    b               t1
(1)     +         c       d       t2
(2)     *         t1      t2      t3
(3)     :=        t3              a
Triples:
A triple has three fields to implement a three-address statement, containing the name of
the operator, the first source operand and the second source operand.
In triples, the result of a sub-expression is referred to by the position of the statement
that computes it. A triple representation is equivalent to a DAG while representing
expressions. For the same example, the triples are:
        op        arg1    arg2
(0)     uminus    b
(1)     +         c       d
(2)     *         (0)     (1)
(3)     :=        a       (2)
Indirect Triples:
Indirect triples use an extra list of pointers to the triples; this listing, rather than
the triples themselves, fixes the order of execution, which makes the code easy to reorder
during optimization.
a := -b * (c + d)
Three-address code is as follows:
t1 := -b
t2 := c + d
t3 := t1 * t2
a := t3
These statements are represented by indirect triples as follows:
Statement list:          Triples:
Address                          op        arg1    arg2
(50)  (0)                (0)     uminus    b
(51)  (1)                (1)     +         c       d
(52)  (2)                (2)     *         (0)     (1)
(53)  (3)                (3)     :=        a       (2)
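For comparison, here is a minimal C++ sketch of the three record layouts for the same example. The struct names and the string-based fields are illustrative simplifications, not a prescribed compiler data structure:

// Quadruple, triple, and indirect-triple records for a := -b * (c + d).
#include <iostream>
#include <string>
#include <vector>
using namespace std;

struct Quad { string op, arg1, arg2, result; };
struct Triple { string op, arg1, arg2; };   // an arg of the form "(i)" refers to position i

int main() {
    vector<Quad> quads = {
        {"uminus", "b",   "",   "t1"},
        {"+",      "c",   "d",  "t2"},
        {"*",      "t1",  "t2", "t3"},
        {":=",     "t3",  "",   "a"},
    };
    vector<Triple> triples = {
        {"uminus", "b",   ""},
        {"+",      "c",   "d"},
        {"*",      "(0)", "(1)"},
        {":=",     "a",   "(2)"},
    };
    // Indirect triples add a statement list of pointers into the triples,
    // corresponding to addresses (50)..(53) in the table above.
    vector<int> stmtList = {0, 1, 2, 3};
    for (size_t i = 0; i < quads.size(); i++)
        cout << "(" << i << ") " << quads[i].op << " " << quads[i].arg1
             << " " << quads[i].arg2 << " " << quads[i].result << "\n";
}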
In syntax directed translation, the assignment statement mainly deals with expressions. The
expressions can be of type real, integer, array or record. Consider the grammar:
1. S → id := E
2. E → E1 + E2
3. E → E1 * E2
4. E → (E1)
5. E → id
The corresponding translation scheme is:
S → id := E    { p = look_up(id.name);
                 if p ≠ nil then Emit (p ':=' E.place)
                 else error }
E → E1 + E2    { E.place = newtemp();
                 Emit (E.place ':=' E1.place '+' E2.place) }
E → E1 * E2    { E.place = newtemp();
                 Emit (E.place ':=' E1.place '*' E2.place) }
E → (E1)       { E.place = E1.place }
E → id         { E.place = id.place }
o The look_up function searches the symbol table for id.name and returns its entry if one
exists; otherwise it reports an error.
o The Emit function is used for appending the three-address code to the output file.
o newtemp() is a function used to generate new temporary variables.
o E.place holds the name of the location that contains the value of E.
Three address code
For example, the statement a := x + y * z is translated into:
t1 = y * z
t2 = x + t1
a = t2
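A minimal C++ sketch of newtemp() and Emit() follows, with a global counter standing in for the compiler's temporary-name generator; it produces exactly the three lines above for a := x + y * z:

// Emitting three-address code with fresh temporaries.
#include <iostream>
#include <string>
using namespace std;

int tempCount = 0;
string newtemp() { return "t" + to_string(++tempCount); }     // fresh temporary name
void Emit(const string& s) { cout << s << "\n"; }             // append to the output

int main() {
    // E -> E1 * E2 for y * z
    string t1 = newtemp(); Emit(t1 + " = y * z");
    // E -> E1 + E2 for x + t1
    string t2 = newtemp(); Emit(t2 + " = x + " + t1);
    // S -> id := E
    Emit("a = " + t2);
}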
Boolean expressions have two primary purposes. They are used for computing logical values,
and they are used as conditional expressions in statements such as if-then-else and
while-do. Consider the grammar:
1. E → E OR E
2. E → E AND E
3. E → NOT E
4. E → (E)
5. E → id relop id
6. E → TRUE
7. E → FALSE
NOT has the highest precedence, then AND, and lastly OR.
Production rule Semantic actions
E → E1 OR E2 {E.place=newtemp();
Emit (E.place ':=' E1.place 'OR' E2.place)
}
E → E1 AND E2 {E.place=newtemp();
Emit (E.place ':=' E1.place 'AND' E2.place)
}
E → NOT E1 {E.place=newtemp();
Emit (E.place ':=' 'NOT' E1.place)
}
The EMIT function is used to generate the three address code and the newtemp( ) function is
used to generate the temporary variables.
For E → id1 relop id2, the semantic action uses nextstat, which gives the index of the next
three-address statement in the output sequence.
Here is an example which generates three-address code using the above translation scheme.
For the expression a < b OR c < d AND e < f, the three-address code is:
100: if a < b goto 103
101: t1 := 0
102: goto 104
103: t1 := 1
104: if c < d goto 107
105: t2 := 0
106: goto 108
107: t2 := 1
108: if e < f goto 111
109: t3 := 0
110: goto 112
111: t3 := 1
112: t4 := t2 AND t3
113: t5 := t1 OR t4
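The four-line pattern that each relational operator expands to can be generated mechanically. Here is a minimal C++ sketch, where nextstat plays the role described above and relop() emits statements in the 100/101/102/103 shape of the listing:

// Numeric encoding of a relational test: t gets 1 or 0.
#include <iostream>
#include <string>
using namespace std;

int nextstat = 100;    // index of the next three-address statement
int tempCount = 0;

void emit(const string& s) { cout << nextstat++ << ": " << s << "\n"; }

string relop(const string& a, const string& op, const string& b) {
    string t = "t" + to_string(++tempCount);
    int base = nextstat;                       // index of the 'if' statement
    emit("if " + a + " " + op + " " + b + " goto " + to_string(base + 3));
    emit(t + " := 0");                         // false branch
    emit("goto " + to_string(base + 4));       // skip over "t := 1"
    emit(t + " := 1");                         // true branch
    return t;
}

int main() {
    relop("a", "<", "b");    // emits statements 100-103 exactly as above
}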
3. Translation of Boolean expressions in control-flow statements
Control statements are the statements that change the flow of execution of statements.
Consider the Grammar
S → if E then S1
|if E then S1 else S2
|while E do S1
In this grammar, E is the Boolean expression depending upon which S1 or S2 will be
executed.
The following representations show the order of execution of the instructions for if-then,
if-then-else, and while-do.
S → if E then S1
E.CODE and S1.CODE are sequences of statements which generate three-address code.
E.TRUE is the label to which control flows if E is true.
E.FALSE is the label to which control flows if E is false.
The code for E generates a jump to E.TRUE if E is true and a jump to S.NEXT if E is false;
therefore E.FALSE = S.NEXT in this case.
A new label is allocated to E.TRUE. When S1.CODE has been executed, control jumps to the
statement following S, i.e., to S1.NEXT.
∴ S1.NEXT = S.NEXT
S → if E then S1 else S2
Initially, both E.TRUE and E.FALSE are taken as new labels. If E is true, control goes to
E.TRUE and S1.CODE is executed; if E is false, control goes to E.FALSE and S2.CODE is
executed. In both cases, after the chosen branch, control jumps to the statement following
the complete statement S.
∴ S1.NEXT = S2.NEXT = S.NEXT
Another important control statement is while E do S1: statement S1 is executed as long as
expression E is true, and control leaves the loop when the expression E becomes false.
A label S.BEGIN is created which points to the first instruction of E.CODE. The label
E.TRUE is attached to the first instruction of S1.CODE. If E is true, control jumps to
E.TRUE and S1.CODE is executed. After S1.CODE, control jumps back to S.BEGIN, which
evaluates E.CODE again.
∴ S1.NEXT = S.BEGIN
If E is false, control jumps to E.FALSE, which causes the next statement after S to be
executed.
∴ E.FALSE = S.NEXT
Example:
while a < b do
    if c < d then
        x := y + z
    else
        x := y - z
Then the translation is
L1:    if a < b goto L2
       goto Lnext
L2:    if c < d goto L3
       goto L4
L3:    t1 := y + z
       x := t1
       goto L1
L4:    t2 := y - z
       x := t2
       goto L1
Lnext:
UNIT V
Basic Blocks:
A basic block is a straight-line sequence of statements. Apart from its entry and its exit,
a basic block has no branches into or out of it: the flow of control enters at the
beginning and always leaves at the end, without halting or branching in between. The
instructions of a basic block always execute as a sequence.
The first step is to divide a sequence of three-address code into basic blocks. A new basic
block always begins with a leader instruction, and instructions are added until a jump or a
label is reached. If no jumps or labels are encountered, control flows sequentially from
one instruction to the next.
The algorithm for the construction of basic blocks is described below step by step:
Algorithm: Partition the three-address code into basic blocks.
Input: A sequence of three-address code statements.
Output: A list of basic blocks, with each three-address statement in exactly one block.
Method: We first identify the leaders of the intermediate code. The rules for identifying
leaders are:
1. The first instruction of the code is a leader.
2. The target of any conditional or unconditional jump is a leader.
3. The instruction immediately following a conditional or unconditional jump is a leader.
Each basic block then consists of a leader and all instructions up to, but not including,
the next leader or the end of the program.
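A minimal C++ sketch of this partitioning follows; the five-instruction example and the target field are made up for illustration. Leaders are collected with the three rules above, and then each leader opens a new block:

// Leader-based partitioning of three-address code into basic blocks.
#include <iostream>
#include <set>
#include <string>
#include <vector>
using namespace std;

struct Instr { string text; int target; };   // target = -1 if not a jump

int main() {
    vector<Instr> code = {
        {"i = 1",              -1},   // 0
        {"t1 = 4 * i",         -1},   // 1
        {"i = i + 1",          -1},   // 2
        {"if i <= 20 goto 1",   1},   // 3: conditional jump back to 1
        {"halt",               -1},   // 4
    };
    set<int> leaders = {0};                        // rule 1: first instruction
    for (int i = 0; i < (int)code.size(); i++)
        if (code[i].target >= 0) {
            leaders.insert(code[i].target);        // rule 2: jump target
            if (i + 1 < (int)code.size())
                leaders.insert(i + 1);             // rule 3: instruction after a jump
        }
    int block = 0;
    for (int i = 0; i < (int)code.size(); i++) {
        if (leaders.count(i)) cout << "B" << ++block << ":\n";
        cout << "  " << code[i].text << "\n";
    }
}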
A flow graph is simply a directed graph. For the set of basic blocks, a flow graph shows
the flow of control information. A control-flow graph depicts how program control is
passed among the blocks. Once the intermediate code has been partitioned into basic
blocks, the flow graph illustrates the flow of control between them.
An edge flows from block X to block Y when the first instruction of Y can execute
immediately after the last instruction of X: either Y's leader immediately follows X in
the code, or X ends with a jump whose target is Y's leader.
The flow graph of the example used for basic-block formation can be drawn accordingly
(figure).
Optimization is applied to the basic blocks after the intermediate code generation phase of
the compiler. Optimization is the process of transforming a program so that it consumes
fewer resources and delivers higher speed: inefficient code sequences are replaced by
equivalent, more efficient ones. Optimization of basic blocks can be machine-dependent or
machine-independent. These transformations are useful for improving the quality of the code
that will ultimately be generated from a basic block.
There are two types of basic block optimizations:
1. Function-preserving / structure-preserving transformations
2. Algebraic transformations
Structure-Preserving Transformations:
Dead Code Elimination:
Dead code is the part of the code that never executes during the program run. For
optimization, such dead code is eliminated. Eliminating the dead code increases the speed
of the program, as the compiler does not have to translate code that can never execute.
Example:
// Program with dead code
#include <iostream>
using namespace std;
int main()
{
    int x = 2;
    if (x > 2)
        cout << "code";          // dead code: x > 2 is never true here
    else
        cout << "Optimization";
    return 0;
}

// Optimized program without dead code
#include <iostream>
using namespace std;
int main()
{
    int x = 2;
    cout << "Optimization";      // dead branch eliminated
    return 0;
}
Common Sub-expression Elimination:
In this technique, a sub-expression which appears more than once is computed only once and
the result is reused wherever it is needed.
Example:
Before:  a = (x + y) + z;   b = x + y;
After:   t = x + y;   a = t + z;   b = t;
Interchange of Statements:
If a block has two adjacent statements which are independent, they can be interchanged
without affecting the value of the basic block.
Example:
t1 = a + b
t2 = c + d
These two independent statements can be interchanged without affecting the value of the
block.
Algebraic Transformations:
2. Copy Propagation:
It is of two types, variable propagation and constant propagation.
Variable Propagation:
x = y
z = x + 2      ⇒      z = y + 2 (optimized code)
Constant Propagation:
x = 3
z = x + a      ⇒      z = 3 + a (optimized code)
3. Strength Reduction:
Replace an expensive statement/instruction with a cheaper one.
x = 2 * y (costly)    ⇒    x = y + y (cheaper)
x = 2 * y (costly)    ⇒    x = y << 1 (cheaper)
• There are a number of ways in which a compiler can improve a program without
changing the function it computes.
• Function preserving transformations examples:
– Common sub expression elimination
– Copy propagation,
– Dead-code elimination
– Constant folding
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
• The above code can be optimized using the common sub-expression elimination as
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t5: = n
• The common sub-expression t4 := 4*i is eliminated, as its computation is already
available in t1 and the value of i has not changed between that definition and this use.
Copy Propagation:
• Copy propagation means use of one variable instead of another. This may not appear
to be an improvement, but as we shall see it gives us an opportunity to eliminate x.
• For example:
x=Pi;
A=x*r*r;
Dead-Code Eliminations:
i=0;
if(i=1)
a=b+5;
• Here, ‘if’ statement is dead code because this condition will never get satisfied.
Constant folding:
• Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
• For example, the expression 2 * 3.14 can be evaluated at compile time and replaced by
the constant 6.28.
2. Loop Optimizations:
• Programs tend to spend the bulk of their time in loops, especially in the inner loops.
• The running time of a program may be improved if the number of instructions in an inner
loop is decreased, even if we increase the amount of code outside that loop.
Code Motion:
Code motion moves a loop-invariant computation outside the loop. For example, the test
while (i <= limit - 2) evaluates limit - 2 on every iteration; it can be rewritten as
t = limit - 2;
while (i <= t) { … }
Induction Variables:
An induction variable changes by a fixed amount on every iteration of the loop (for
example, a loop counter j, or a derived quantity such as t4 = 4 * j). When a loop contains
two or more related induction variables, it may be possible to keep only one of them.
Reduction In Strength:
Inside a loop, an expensive operation on an induction variable can be replaced by a
cheaper one; e.g., the multiplication t4 = 4 * j recomputed every iteration can be
replaced by one initialization followed by adding or subtracting 4 on each iteration.
A DAG for a basic block is a directed acyclic graph with the following labels on its nodes:
1. The leaves of the graph are labelled by unique identifiers: variable names or constants.
2. The interior nodes of the graph are labelled by an operator symbol.
3. Nodes may also be given a sequence of identifiers as attached labels, recording the
names that hold the computed value.
Method:
Each statement of the block has one of the forms (i) x = y OP z, (ii) x = OP y, or
(iii) x = y.
Step 1: If node(y) is undefined, create a leaf labelled y and let node(y) be this node. In
case (i), create a leaf node(z) in the same way if it is undefined.
Step 2: For case (i), check whether there is already a node(OP) whose left child is
node(y) and whose right child is node(z); if not, create such a node. For case (ii), check
whether there is a node(OP) with the single child node(y); if not, create it. For case
(iii), let n simply be node(y).
Output: Delete x from the list of attached identifiers of the old node(x). Append x to the
list of attached identifiers of the node n found in Step 2, and set node(x) to n.
Example:
1. S1 := 4 * i
2. S2 := a[S1]
3. S3 := 4 * i
4. S4 := b[S3]
5. S5 := S2 * S4
6. S6 := prod + S5
7. prod := S6
8. S7 := i + 1
9. i := S7
10. if i <= 20 goto (1)
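Node reuse is the heart of DAG construction: before creating a node (OP, left, right), we first look for an existing one. A minimal C++ sketch follows, using a map as the lookup table, which is an implementation choice and not part of the notes:

// DAG node reuse: identical (op, left, right) triples map to one node.
#include <iostream>
#include <map>
#include <string>
#include <tuple>
using namespace std;

map<tuple<string, int, int>, int> nodes;   // (op, left, right) -> node id
map<string, int> leaf;                     // identifier/constant -> node id
int nextId = 0;

int getLeaf(const string& name) {
    if (!leaf.count(name)) leaf[name] = nextId++;
    return leaf[name];
}
int getNode(const string& op, int l, int r) {
    auto key = make_tuple(op, l, r);
    if (!nodes.count(key)) nodes[key] = nextId++;   // create only if absent
    return nodes[key];
}

int main() {
    int s1 = getNode("*", getLeaf("4"), getLeaf("i"));   // S1 := 4 * i
    int s3 = getNode("*", getLeaf("4"), getLeaf("i"));   // S3 := 4 * i
    cout << "S1 node = " << s1 << ", S3 node = " << s3
         << (s1 == s3 ? "  (shared)\n" : "\n");
}

S1 and S3 map to the same node, which is exactly how the DAG exposes the common sub-expression 4 * i in the code above.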
Code generator is used to produce the target code for three-address statements. It uses
registers to store the operands of the three address statement.
Example:
Consider the three-address statement x := y + z. It can be translated into the following
sequence of target instructions:
MOV y, R0
ADD z, R0
MOV R0, x
Register and Address Descriptors:
o A register descriptor keeps track of what is currently held in each register. Initially,
the register descriptors show that all registers are empty.
o An address descriptor stores the location (register, stack location or memory address)
where the current value of a name can be found at run time.
A code-generation algorithm:
The algorithm takes a sequence of three-address statements as input. For each three-address
statement of the form x := y op z, it performs the following actions:
1. Invoke a function getreg to find the location L where the result of the computation
y op z should be stored.
2. Consult the address descriptor for y to determine y', the current location of y. If the
value of y is currently in both memory and a register, prefer the register as y'. If the
value of y is not already in L, generate the instruction MOV y', L to place a copy of y
in L.
3. Generate the instruction OP z', L, where z' is the current location of z; if z is in
both a register and memory, prefer the register. Update the address descriptor of x to
indicate that x is in location L. If L is a register, update its descriptor to indicate
that it contains the value of x, and remove x from all other descriptors.
4. If the current values of y and/or z have no next uses, are not live on exit from the
block, and are in registers, alter the register descriptors to indicate that, after the
execution of x := y op z, those registers will no longer contain y and/or z.
The assignment statement d:= (a-b) + (a-c) + (a-c) can be translated into the following
sequence of three address code:
t := a - b
u := a - c
v := t + u
d := v + u
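A minimal C++ sketch of the algorithm for this example follows, assuming two registers R0 and R1 and a deliberately simplified getreg (reuse the register of a dead operand, otherwise take a free one; a real getreg would also handle spilling). The descriptors are plain maps:

// Simple code generation with register and address descriptors.
#include <iostream>
#include <map>
#include <string>
using namespace std;

map<string, string> regDesc;   // register -> name it currently holds
map<string, string> addrDesc;  // name -> current location (register name)

string getreg(const string& y, bool yDead) {
    if (yDead && addrDesc[y][0] == 'R') return addrDesc[y];   // reuse y's register
    for (string r : {"R0", "R1"})
        if (!regDesc.count(r) || regDesc[r] == "") return r;  // free register
    return "R0";   // naive fallback (a real getreg would spill here)
}

void gen(const string& x, const string& y, char op, const string& z, bool yDead) {
    string L = getreg(y, yDead);
    if (addrDesc.count(y) == 0 || addrDesc[y] != L)
        cout << "MOV " << y << ", " << L << "\n";             // place y in L
    string zLoc = addrDesc.count(z) && addrDesc[z][0] == 'R' ? addrDesc[z] : z;
    cout << (op == '-' ? "SUB " : "ADD ") << zLoc << ", " << L << "\n";
    regDesc[L] = x; addrDesc[x] = L;                          // L now holds x
}

int main() {
    gen("t", "a", '-', "b", false);   // t := a - b
    gen("u", "a", '-', "c", false);   // u := a - c
    gen("v", "t", '+', "u", true);    // v := t + u   (t is dead afterwards)
    gen("d", "v", '+', "u", true);    // d := v + u   (v is dead afterwards)
    cout << "MOV " << addrDesc["d"] << ", d\n";               // store d on block exit
}

It prints MOV a, R0; SUB b, R0; MOV a, R1; SUB c, R1; ADD R1, R0; ADD R1, R0; MOV R0, d, with the final store needed because d must be in memory on exit from the block.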
Global Register Allocation by Usage Counts:
Count the uses of each variable and use the count as a priority function. The approximate
benefit of keeping variable x in a register over a loop L is
sum over blocks B in L of [ use(x, B) + 2 * live(x, B) ]
where use(x, B) counts the uses of x in B prior to any definition of x, and live(x, B) is
1 if x is assigned in B and live on exit from B, and 0 otherwise. For the flow-graph
example:
use(a, B2) + 2 * live(a, B2) = 1 + 2 * 0 = 1
use(b, B3) + 2 * live(b, B3) = 1 + 2 * 1 = 3
A possible resulting assignment of values to registers:
R0: b    R1: d    R2: a/c/f    R3: ……
Advantage
Heavily used values reside in registers
Disadvantage
Does not consider non-uniform distribution of uses
Local allocation does not take into account that some instructions (e.g., those in loops)
execute more frequently. It also forces us to store and load at basic-block boundaries,
since each block has no knowledge of the context of the others.
Global allocation is needed to find the live range(s) of each variable and the area(s)
where the variable is used or defined; the cost of spilling then depends on the frequencies
and locations of the uses.
Global register allocation by graph colouring proceeds in the following steps:
1. Compute the live range of each variable.
2. Build an interference graph that represents conflicts between live ranges (two nodes
are connected if the variables they represent are live at the same moment).
3. Try to assign as many colours to the nodes of the graph as there are registers, so
that any two neighbouring nodes have different colours; a node that cannot be coloured
is spilled to memory.
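A minimal C++ sketch of steps 2 and 3 follows, with a hypothetical three-variable interference graph and greedy colour selection; production allocators use smarter orderings, e.g. Chaitin-style simplification:

// Greedy graph colouring of an interference graph with k registers.
#include <iostream>
#include <map>
#include <set>
#include <string>
using namespace std;

int main() {
    int k = 2;                                        // number of registers
    map<string, set<string>> g = {                    // interference edges
        {"a", {"b"}}, {"b", {"a", "c"}}, {"c", {"b"}},
    };
    map<string, int> colour;
    for (auto& [v, nbrs] : g) {                       // greedy assignment
        set<int> used;
        for (auto& n : nbrs)
            if (colour.count(n)) used.insert(colour[n]);
        int c = 0;
        while (used.count(c)) c++;                    // lowest unused colour
        if (c >= k) { cout << v << ": spill\n"; continue; }
        colour[v] = c;
        cout << v << " -> R" << c << "\n";
    }
}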
PEEPHOLE OPTIMIZATION
Peephole optimization improves the target code by examining a short sliding window of
instructions (the peephole) and replacing the sequence with a shorter or faster one
whenever possible.
Redundant Loads and Stores:
Consider the instruction sequence
(1) MOV R0, a
(2) MOV a, R0
We can delete instruction (2), because whenever (2) is executed, (1) will ensure that the
value of a is already in register R0. If (2) had a label, we could not be sure that (1) was
always executed immediately before (2), and so we could not remove (2).
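A minimal C++ sketch of a peephole pass for exactly this pattern follows; the instruction strings are hypothetical, and a real pass would also check that the second instruction carries no label before deleting it:

// Peephole pass removing a load that follows the matching store.
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> code = {"MOV R0,a", "MOV a,R0", "ADD b,R0"};
    vector<string> out;
    for (size_t i = 0; i < code.size(); i++) {
        if (!out.empty() && out.back() == "MOV R0,a" && code[i] == "MOV a,R0")
            continue;                    // (2) is redundant: a is already in R0
        out.push_back(code[i]);
    }
    for (auto& s : out) cout << s << "\n";
}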
Unreachable Code:
An unlabeled instruction immediately following an unconditional jump may be removed.
Consider the fragment
#define debug 0
….
if ( debug ) {
print debugging information
}
In the intermediate representation the if-statement may be translated as:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2: …………………………… (a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter what
the value of debug, (a) can be replaced by:
If debug ≠ 1 goto L2
Print debugging information
L2: …………………………… (b)
Since debug is set to 0 at the beginning of the program, constant propagation turns (b)
into:
If 0 ≠ 1 goto L2
Print debugging information
L2: …………………………… (c)
As the argument of the first statement of (c) evaluates to a constant true, it can be
replaced by goto L2. Then all the statements that print debugging information are
manifestly unreachable and can be eliminated one at a time.
Flow-Of-Control Optimizations:
Unnecessary jumps can be eliminated in either the intermediate code or the target code by
the following types of peephole optimizations. We can replace the jump sequence
goto L1
….
L1: goto L2
by the sequence
goto L2
….
L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement
L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
….
L1: goto L2
can be replaced by
if a < b goto L2
….
L1: goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional goto.
Then the sequence
goto L1
….
L1: if a < b goto L2
L3: …………………………… (e)
may be replaced by
if a < b goto L2
goto L3
…….
L3: …………………………… (f)
While the number of instructions in (e) and (f) is the same, we sometimes skip the
unconditional jump in (f), but never in (e). Thus (f) is superior to (e) in execution time.
Algebraic Simplification:
There is no end to the amount of algebraic simplification that can be attempted through
peephole optimization; only a few algebraic identities occur frequently enough to be worth
considering. For example, statements such as
x := x + 0
x := x * 1
are often produced by straightforward intermediate code-generation algorithms, and they can
be eliminated easily through peephole optimization.
Reduction in Strength:
Reduction in strength replaces expensive operations by equivalent cheaper ones on the
target machine. For example, x² is invariably cheaper to implement as x * x than as a call
to an exponentiation routine:
x² → x * x
Use of Machine Idioms:
The target machine may have hardware instructions that implement certain specific
operations efficiently. For example, some machines have auto-increment and auto-decrement
addressing modes, which add or subtract one from an operand before or after its value is
used. The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like
i := i + 1:
i := i + 1 → i++
i := i - 1 → i--