
Compiler Design

CSE 353
Course Outcomes

CO1: Explain the concepts and different phases of compilation with compile-time error handling.
CO2: Represent language tokens using regular expressions, context-free grammars and finite automata, and design a lexical analyzer for a language.
CO3: Compare top-down with bottom-up parsers, and develop an appropriate parser to produce a parse-tree representation of the input.
CO4: Design syntax-directed translation schemes for a given context-free grammar.
CO5: Generate intermediate code for statements in a high-level language; explain the benefits and limitations of automatic memory management.
CO6: Apply optimization techniques to intermediate code and generate machine code for a high-level language program.
Introduction to the Language Processing System

(Diagram: HLL source program → pre-processor → pure HLL → compiler → assembly → assembler → relocatable machine code → loader/linker → executable.)
• Pre-Processor – The pre-processor handles all the #include directives by inserting the named files (file inclusion) and all the #define directives by macro expansion. It performs file inclusion, macro processing, augmentation, etc.

• Assembly Language – It is neither in binary form nor high-level. It is an intermediate form that combines machine instructions with other useful data needed for execution.

• Assembler – Every platform (hardware + OS) has its own assembler; assemblers are not universal. An assembler translates assembly language into machine code, and its output is called an object file.

• Interpreter – An interpreter, like a compiler, converts high-level language into low-level machine language, but the two differ in how they read the input. A compiler scans the entire program and translates it as a whole into machine code, whereas an interpreter translates and executes the program one statement at a time. Interpreted programs are therefore usually slower than compiled ones.
• Relocatable Machine Code – Code that can be loaded at any point in memory and run. Addresses within the program are kept in a form that accommodates moving the program in memory.

• Loader/Linker – The linker combines a number of object files into a single executable file. The loader then loads it into memory, converts the relocatable code into absolute code, and runs the program, resulting in a running program or an error message (or sometimes both).
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors in the source program to the programmer.

Structure of Compiler
Executing a program written in an HLL consists of two basic steps: the source program must first be compiled (translated) into an object program; the resulting object program is then loaded into memory and executed.

(Figure: execution process of a source program in a compiler.)


LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation.

Compilation has two broad parts:

a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.


Structure of Compiler

(Diagram: source program → lexical analyzer → stream of tokens → syntax analyzer → parse tree → semantic analyzer → refined parse tree → intermediate code generator → three-address code → code optimizer → optimized code → code generator → target program.)
Lexical Analyzer -

• It is the first phase of the compiler.
• It gets input from the source program and produces tokens as output.
• It reads the characters one by one, from left to right, and forms tokens.
• Token: a logically cohesive sequence of characters. In a + b = 20, the lexemes a, b, +, = and 20 each form a separate token. The group of characters forming a token is called the lexeme.
• The lexical analyzer not only generates tokens but also classifies them (keywords, operators, identifiers, special symbols, etc.) and performs bookkeeping, for example entering a lexeme into the symbol table if it is not already there.
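
To make this concrete, here is a minimal lexer sketch in Python (an illustration only, not the course's reference implementation; the token names and the TOKEN_SPEC table are assumptions made for this example). It scans left to right, matches a pattern at the current position, and enters identifier lexemes into a symbol table:

import re

# (token name, pattern); the matched text is the lexeme, and the pair
# (name, lexeme) is the token. Rule order matters: first match wins here.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),            # whitespace is discarded, not emitted
]

def tokenize(code):
    pos = 0
    symbol_table = {}              # lexeme -> attributes, filled as IDs appear
    while pos < len(code):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, code[pos:])
            if m:
                lexeme = m.group(0)
                pos += len(lexeme)
                if name == "ID" and lexeme not in symbol_table:
                    symbol_table[lexeme] = {}   # enter lexeme if not present
                if name != "SKIP":
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")

print(list(tokenize("a + b = 20")))
# [('ID', 'a'), ('OP', '+'), ('ID', 'b'), ('OP', '='), ('NUMBER', '20')]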

Syntax Analyzer –

• It is the second phase of the compiler. It is also known as the parser.
• It gets the token stream as input from the lexical analyzer and generates a syntax tree as output.
• Syntax tree: a tree in which interior nodes are operators and leaf nodes are operands.
• Example: for a = b + c * 2, the syntax tree is

        =
       / \
      a   +
         / \
        b   *
           / \
          c   2
Semantic Analyzer –
•It is the third phase of the compiler.
•It gets the parse tree from the syntax analyzer as input and checks whether the given construct is semantically correct.
•It performs type checking and implicit type conversions where required (e.g., converting integer operands to real).

Intermediate Code Generator –


•It is the fourth phase of the compiler.
•It gets input from the semantic analyzer and converts it into an intermediate representation such as three-address code.
•The three-address code consists of a sequence of instructions, each of which has at most three operands.
Example: for a = b + c * 2:
t1 = c * 2
t2 = b + t1
a = t2
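
As an illustration, the following Python sketch generates three-address code from a small tuple-encoded expression tree for a = b + c * 2 (the tree encoding and the helper new_temp are assumptions made for this example, not a fixed compiler API):

temp_count = 0
def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node, code):
    """Emit three-address instructions for a tuple tree; return the
    name (temporary, identifier or constant) holding the result."""
    if isinstance(node, str):              # leaf: identifier or constant
        return node
    op, left, right = node                 # interior node: (operator, l, r)
    l = gen(left, code)
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")     # each instruction: <= 3 operands
    return t

tree = ("+", "b", ("*", "c", "2"))         # a = b + c * 2
code = []
code.append(f"a = {gen(tree, code)}")
print("\n".join(code))
# t1 = c * 2
# t2 = b + t1
# a = t2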
CODE OPTIMIZATION:
• It is the fifth phase of the compiler.
• It gets the intermediate code as input and produces optimized intermediate code as
output.
•This phase reduces the redundant code and attempts to improve the intermediate
code so that faster-running machine code will result.
•Code optimization does not change the result of the program.
•To improve the generated code, optimization involves:
  - detection and removal of dead (unreachable) code
  - calculation of constants in expressions and terms (constant folding)
  - collapsing of repeated expressions into a temporary variable (common sub-expression elimination)
  - loop unrolling
  - moving loop-invariant code outside the loop
  - removal of unwanted temporary variables
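
As a small illustration of one item above, the calculation of constants (constant folding), this Python sketch folds constant sub-expressions in three-address instructions; the instruction format "target = operand op operand" is an assumption made for this example:

def fold_constants(instructions):
    optimized = []
    for inst in instructions:
        target, expr = inst.split(" = ")
        parts = expr.split()
        # fold only when both operands are integer literals
        if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
            value = eval(expr)      # safe here: two digits and one operator
            optimized.append(f"{target} = {value}")
        else:
            optimized.append(inst)
    return optimized

print(fold_constants(["t1 = 4 * 2", "t2 = b + t1"]))
# ['t1 = 8', 't2 = b + t1']   -- result of the program is unchanged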

CODE GENERATION:
•It is the final phase of the compiler.
•It gets input from code optimization phase and produces the target code or object
code as result.
•Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
•Code generation involves: allocation of registers and memory, generation of correct references, generation of correct data types, and generation of missing code.
SYMBOL TABLE MANAGEMENT:
•Symbol table is used to store all the information about identifiers used in the program.
•It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
•It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
•Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
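
A minimal symbol table sketch in Python, assuming a hash table keyed by the identifier's lexeme (the attribute fields shown are illustrative, not a prescribed layout):

class SymbolTable:
    def __init__(self):
        self.records = {}                  # lexeme -> record of attributes

    def insert(self, name, **attributes):
        # store a record the first time an identifier is detected,
        # and merge in attributes discovered by later phases
        self.records.setdefault(name, {}).update(attributes)

    def lookup(self, name):
        # quick retrieval of the record for an identifier
        return self.records.get(name)

table = SymbolTable()
table.insert("a", type="int", scope="local")
print(table.lookup("a"))                   # {'type': 'int', 'scope': 'local'}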

ERROR HANDLING:
•Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
•In lexical analysis, errors occur in separation of tokens.
•In syntax analysis, errors occur during construction of syntax tree.
•In semantic analysis, errors occur when the compiler detects constructs with the right syntactic structure but no meaning, e.g. during type checking.
•In code optimization, errors occur when the result is affected by the optimization. In
code generation, it shows error when code is missing etc.
Lexical Analyzer
Lexical analysis is the first phase of the compiler; the lexical analyzer is also known as a scanner. It converts the high-level input program into a sequence of tokens.
•Lexical analysis can be implemented with a deterministic finite automaton (DFA).
•Its task is to read the input characters and produce as output the sequence of tokens that the parser uses for syntax analysis.
•Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA).
The process of forming tokens from an input stream of characters is called
tokenization.
Consider this expression in the C programming language: sum = 3 + 2;

Lexeme   Token Type
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement

LEXEME: A group of characters forming a token is called a lexeme.


PATTERN: A pattern is a description of the form that the lexemes of a token may
take. In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many strings.
How Lexical Analyzer functions

1. Tokenization i.e. Dividing the program into valid tokens.


2. Remove white space characters.
3. Remove comments.

The function of lexical analysis is to tokenize the program or statement and separate the tokens out.

How are the tokens separated from the program?

We design regular expressions to represent the tokens and construct an NFA/DFA to act as the token recognizer.

Transition diagram for an identifier:

start → 0 --letter--> 1 --delimiter--> 2 (accept)
                      ↺ letter/digit
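
The diagram can be encoded directly as a DFA. The following Python sketch mirrors the three states above (letter first, then letters/digits, then a delimiter to accept); treating '_' as a letter is an assumption of this example:

def is_identifier(s):
    state = 0
    for ch in s + " ":            # append a delimiter so the token can end
        if state == 0:
            if ch.isalpha() or ch == "_":
                state = 1         # 0 --letter--> 1
            else:
                return False      # no move from the start state: reject
        elif state == 1:
            if ch.isalnum() or ch == "_":
                state = 1         # 1 --letter/digit--> 1
            else:
                state = 2         # 1 --delimiter--> 2 (accepting)
        else:
            return False          # input continues after the token ended
    return state == 2

print(is_identifier("count1"))    # True
print(is_identifier("1count"))    # False (digit cannot start an identifier)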


Consider the program below and find the number of tokens.

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
Solution

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'

18 tokens in total; the comment is discarded by the lexical analyzer.

Question 2:
int max(x, y)
int x, y;
/* find max of x and y */
{
return (x > y ? x : y);
}
Solution:

'int' 'max' '(' 'x' ',' 'y' ')' 'int' 'x' ',' 'y' ';' '{' 'return' '(' 'x' '>' 'y' '?' 'x' ':' 'y' ')' ';' '}'

25 tokens in total; the comment is discarded.

Question 3:

printf("%d, hi", &x);

Solution: 'printf' '(' '"%d, hi"' ',' '&' 'x' ')' ';' — 8 tokens (a string literal counts as a single token).


Passes

A pass refers to the number of times the compiler goes through the source
code.

Two types of Passes in compiler


Single-pass compiler goes through the program only once. In other words,
the single pass compiler allows the source code to pass through each
compilation unit only once. It immediately translates each code section into
its final machine code.

Multi-pass compiler goes through the source code several times. In other
words, it allows the source code to pass through each compilation unit several
times. Each pass takes the result of the previous pass as input and creates
intermediate outputs. Therefore, the code improves in each pass. The final
code is generated after the final pass.

The main difference between phases and passes of a compiler is that phases are the steps in the compilation process, while passes are the number of times the compiler traverses the source code.
Bootstrapping and Cross Compiler
Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle far more complicated programs, and so on.

A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is running. For example, a compiler that runs on a Windows 7 PC but generates code that runs on an Android smartphone is a cross compiler.
Continued..

▪ Suppose we want to write a cross compiler for a new language X.
▪ The implementation language of this compiler is, say, Y and the target code being generated is in language Z. That is, we create XYZ.
▪ Now, if an existing compiler for Y runs on machine M and generates code for M, it is denoted YMM.
▪ If we run XYZ through YMM, we get a compiler XMZ:
▪ that is, a compiler for source language X that runs on machine M and generates target code in language Z.
Difference between Native Compiler and
Cross Compiler
Native Compiler                                        | Cross Compiler
Translates programs for the same                       | Translates programs for a different
hardware/platform/machine on which it is running.      | hardware/platform/machine from the one on which it is running.
Used to build programs for the same system/machine     | Used to build programs for other systems/machines,
and OS on which it is installed.                       | such as AVR/ARM.
Dependent on the system/machine and OS.                | Independent of the target system/machine and OS.
Generates executable files, e.g. .exe.                 | Can generate raw code, e.g. .hex.
TurboC and GCC are native compilers.                   | Keil is a cross compiler.


Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program.
It has the following issues:

•Lookahead

•Ambiguities
Lookahead
•Lookahead is required to decide where one token ends and the next token begins. Simple examples of lookahead issues are i vs. if, and = vs. ==.
•Therefore, a way to describe the lexemes of each token is required, together with a way to resolve ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two assignment operators = =, or one equality operator ==?
•Hence, the number of lookahead characters to consider must be fixed, and a way to describe the lexemes of each token is also needed.
Ambiguities

Lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. When more than one expression can match the current input, lex chooses as follows:

•The longest match is preferred.
•Among rules that match the same number of characters, the rule given first is preferred.
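
A sketch of these two rules in Python (the rule table is an assumption made for illustration; real lex compiles the rules into a single DFA rather than trying each pattern separately):

import re

RULES = [                           # earlier rules win ties
    ("IF",     r"if"),              # rule 1: keyword
    ("ID",     r"[a-z]+"),          # rule 2: identifier
    ("EQ",     r"=="),              # equality
    ("ASSIGN", r"="),               # assignment
]

def next_token(code, pos):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, code[pos:])
        # keep strictly longer matches; on a tie, the earlier rule stays
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

print(next_token("ifx == 1", 0))    # ('ID', 'ifx')  longest match beats 'if'
print(next_token("if x", 0))        # ('IF', 'if')   tie on length: rule 1 wins
print(next_token("== 1", 0))        # ('EQ', '==')   '==' beats '='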
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:

▪ Lexical errors are not very common, but they should be managed by the scanner.

▪ Misspellings of identifiers, operators and keywords are considered lexical errors.

▪ Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
Error Recovery in Lexical Analyzer

The most common error recovery techniques:

▶ Remove one character from the remaining input.

▶ In panic mode, successive characters are ignored until a well-formed token is reached.

▶ Insert a missing character into the remaining input.

▶ Replace a character with another character.

▶ Transpose two adjacent characters.
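
A sketch of panic-mode recovery in Python, assuming a hypothetical predicate can_start_token that says which characters may begin a well-formed token:

def recover(code, pos, can_start_token):
    """Skip characters from pos until one can start a token again."""
    while pos < len(code) and not can_start_token(code[pos]):
        pos += 1                    # successive characters are ignored
    return pos

code = "@#x = 1"
pos = recover(code, 0, lambda ch: ch.isalnum() or ch in "=+-*/ ")
print(pos, code[pos:])              # 2 x = 1  (the illegal '@#' is skipped)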


Lexical Analyzer vs. Parser

Lexical Analyzer                        | Parser
Scans the input program.                | Performs syntax analysis.
Identifies tokens.                      | Creates an abstract representation of the code.
Inserts tokens into the symbol table.   | Updates symbol table entries.
Generates lexical errors.               | Generates a parse tree of the source code.
Parser
• The parser is the component of the compiler that structures the smaller elements (tokens) coming from the lexical analysis phase.
• A parser takes input in the form of a sequence of tokens and produces output in the form of a parse tree.
• Parsing is of two types: top-down parsing and bottom-up parsing.

(Diagram: stream of tokens + context-free grammar → parser → parse tree.)
Context-free grammar
Parsing Techniques

Parsing
├── Top-Down Parsing
│    ├── Backtracking Parsing
│    └── Non-Backtracking (Predictive) Parsing
│         ├── Recursive Descent Parsing
│         └── Table-Driven Predictive Parsing (LL(1))
└── Bottom-Up Parsing
     ├── Operator Precedence Parsing (works on operator grammars)
     └── Table-Driven LR Parsing
          ├── SLR Parsing
          ├── Canonical LR Parsing
          └── LALR Parsing
There are several types of parsing algorithms used in syntax
analysis, including:

• LL parsing: This is a top-down parsing algorithm that starts with the


root of the parse tree and constructs the tree by successively
expanding non-terminals. LL parsing is known for its simplicity and
ease of implementation.
• LR parsing: This is a bottom-up parsing algorithm that starts with the leaves of the parse tree and constructs the tree by successively reducing handles (substrings matching a production's right-hand side) to non-terminals. LR parsing is more powerful than LL parsing and can handle a larger class of grammars.
• LR(1) parsing: This is a variant of LR parsing that uses lookahead to
disambiguate the grammar.
• LALR parsing: This is a variant of LR parsing that uses a reduced set
of lookahead symbols to reduce the number of states in the LR
parser.
• Once the parse tree is constructed, the compiler can perform
semantic analysis to check if the source code makes sense and
follows the semantics of the programming language.
• The parse tree or AST can also be used in the code generation phase
of the compiler design to generate intermediate code or machine
code.
Parsing Techniques
Top-down parsers (LL(1), recursive descent)
• Start at the root of the parse tree and grow toward leaves
• Pick a production & try to match the input
• A bad “pick” may require backtracking
• Some grammars are backtrack-free (predictive parsing)

Bottom-up parsers (shift-reduce parser, LR(1), operator precedence)


• Start at the leaves and grow toward root
• As input is consumed, encode possibilities in an internal state
• Start in a state valid for legal first tokens
• Bottom-up parsers handle a large class of grammars
Bottom-Up Parsing

• Bottom-up parsing is also known as shift-reduce parsing.
• Bottom-up parsing is used to construct a parse tree for an input string.
• In bottom-up parsing, parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
Top–Down Parsing
• A parse tree is created from root to leaves.
• The traversal of the parse tree is a preorder traversal.
• Traces a leftmost derivation.
• Two types: backtracking parser and predictive parser.
• Guesses the structure of the parse tree from the next input; tries different structures and backtracks if they do not match.

Bottom–Up Parsing
• A parse tree is created from leaves to root.
• The traversal of the parse tree is the reverse of a postorder traversal.
• Traces a rightmost derivation (in reverse).
• More powerful than top-down parsing, e.g. shift-reduce parser, operator precedence parser, LR parser.
Basic Idea in Top-Down Parsing
• Top-Down Parsing is an attempt to find a left-most derivation
for an input string
• Example:
S -> c A d        Find a derivation for w = c a d
A -> a b | a

     S                    S                      S
   / | \       🡪        / | \     backtrack    / | \
  c  A  d              c  A  d       🡪        c  A  d
                         / \                     |
                        a   b                    a
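
The example can be coded directly as a backtracking top-down parser. This Python sketch tries A's alternatives in order and falls back from A -> a b to A -> a, mirroring the trees above (a sketch for this one grammar, not a general parser):

def parse_A(s, pos):
    # try the alternatives of A in order, backtracking on failure
    if s[pos:pos+2] == "ab":
        return pos + 2               # A -> a b
    if s[pos:pos+1] == "a":
        return pos + 1               # A -> a  (after backtracking)
    return None

def parse_S(s):
    if not s.startswith("c"):
        return False
    pos = parse_A(s, 1)
    if pos is None:
        return False
    return s[pos:] == "d"            # remaining input must be exactly 'd'

print(parse_S("cad"))    # True  (A -> ab fails, backtrack, A -> a succeeds)
print(parse_S("cabd"))   # True  (A -> ab succeeds directly)
print(parse_S("cbd"))    # False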
Bottom-Up Parsing
A bottom-up parser builds a derivation by working from the input
sentence back toward the start symbol S

S ⇒ γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn–1 ⇒ γn = sentence

To reduce γi to γi–1, match some RHS β against γi, then replace β with its corresponding LHS A (using the production A → β).

In terms of the parse tree, this works from the leaves to the root.
Finding Reductions

Consider the simple grammar

1  S → a A B e
2  A → A b c
3  A → b
4  B → d

and the input string abbcde.

Sentential Form    Next Reduction (Prod'n)
abbcde             3  (A → b)
a A bcde           2  (A → A b c)
a A de             4  (B → d)
a A B e            1  (S → a A B e)
S                  —

The trick is scanning the input and finding the next reduction; the mechanism for doing this must be efficient.
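
The table above can be reproduced with a deliberately naive Python sketch that rescans the sentential form for a production's right-hand side (longest RHS first). The greedy scan happens to find the correct handle for this grammar; real LR parsers track handles efficiently with a state machine instead of rescanning:

PRODUCTIONS = [            # (LHS, RHS), longer right-hand sides first
    ("S", "aABe"),
    ("A", "Abc"),
    ("A", "b"),
    ("B", "d"),
]

def reduce_to_start(sentence):
    form = sentence
    while form != "S":
        for lhs, rhs in PRODUCTIONS:
            i = form.find(rhs)
            if i != -1:
                print(f"{form:10} reduce by {lhs} -> {rhs}")
                form = form[:i] + lhs + form[i + len(rhs):]
                break
        else:
            return False           # no RHS matches: reject the input
    return True

print(reduce_to_start("abbcde"))   # prints the four reductions, then True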
Example

Grammar:
list → list + digit
     | list – digit
     | digit
digit → 0 | 1 | … | 9

Parse tree for 9-5+2:

            list
          /  |   \
      list   +   digit
     /  |  \       |
  list  –  digit   2
    |        |
  digit      5
    |
    9
Ambiguity
• A grammar can have more than one parse tree for a string.
• Consider the grammar

list 🡪 list + list
     | list – list
     | 0 | 1 | … | 9

• The string 9-5+2 has two parse trees:

        list                         list
      /  |   \                     /  |   \
   list  +   list               list  –   list
  /  |  \      |                  |      /  |  \
list –  list   2                  9   list  +  list
  |       |                             |       |
  9       5                             5       2

The left tree groups the string as (9-5)+2; the right tree groups it as 9-(5+2).
Ambiguity

• Ambiguity is problematic because the meaning of a program can become incorrect.
• Ambiguity can be handled in several ways:
  – Enforce associativity and precedence.
  – Rewrite the grammar (the cleanest way).
• There is no algorithm to automatically convert an arbitrary ambiguous grammar into an unambiguous grammar accepting the same language.
• Worse, there are inherently ambiguous languages!
Ambiguity in Programming Languages
• Dangling else problem:

stmt → if expr stmt
     | if expr stmt else stmt

• For this grammar, the string

if e1 if e2 s1 else s2

has two parse trees:

Parse tree 1 (else matched with the inner if):
stmt
├── if
├── expr: e1
└── stmt
    ├── if
    ├── expr: e2
    ├── stmt: s1
    ├── else
    └── stmt: s2

Parse tree 2 (else matched with the outer if):
stmt
├── if
├── expr: e1
├── stmt
│   ├── if
│   ├── expr: e2
│   └── stmt: s1
├── else
└── stmt: s2
Resolving the dangling else problem
• General rule: match each else with the closest previous unmatched if. The grammar can be rewritten as

stmt → matched-stmt
     | unmatched-stmt
matched-stmt → if expr matched-stmt else matched-stmt
     | others
unmatched-stmt → if expr stmt
     | if expr matched-stmt else unmatched-stmt
Associativity
• If an operand has an operator on both sides, the side whose operator takes the operand determines the associativity of that operator.
• In a+b+c, b is taken by the left +.
• +, -, *, / are left associative.
• ^ and = are right associative.
• Grammar to generate strings with a right-associative operator:

right 🡪 letter = right | letter
letter 🡪 a | b | … | z
Precedence
• The string a+5*2 has two possible interpretations because of two different parse trees, corresponding to (a+5)*2 and a+(5*2).
• Precedence determines the correct interpretation.
• Next, an example of how precedence rules are encoded in a grammar.
Precedence/Associativity in the Grammar for Arithmetic Expressions

Ambiguous:
E 🡪 E + E | E * E | (E) | num | id

Unambiguous, with precedence and associativity rules honored:
E 🡪 E + T | T
T 🡪 T * F | F
F 🡪 (E) | num | id

Try deriving 3+2+5 and 3+2*5 with each grammar.
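
To see the unambiguous grammar in action, here is a Python sketch that parses token lists with it, using the equivalent iterative form E -> T { + T }, T -> F { * F } to sidestep left recursion (a common implementation transformation, not part of the slides). Because T sits below E, * binds tighter than +, and the loops make both operators left associative:

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():                     # E -> T { + T }
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value = ("+", value, term())    # left-associative grouping
        return value
    def term():                     # T -> F { * F }
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value = ("*", value, factor())
        return value
    def factor():                   # F -> ( E ) | num | id
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = expr()
            assert peek() == ")", "missing closing parenthesis"
            pos += 1
            return value
        tok = peek()
        pos += 1
        return tok
    return expr()

print(parse(["3", "+", "2", "*", "5"]))   # ('+', '3', ('*', '2', '5'))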
Suppose the production rules for the grammar of a language are:
S -> cAd
A -> bc | a
and the input string is “cad”.

Backtracking was needed to get the correct syntax tree, which is a complex process to implement.
There is an easier way to solve this, using the concepts of FIRST and FOLLOW sets in compiler design.
Why FIRST?
• We saw the need for backtracking in syntax analysis, which is a complex process to implement.

• There is an easier way to solve this problem: if the compiler knew in advance the first character of the string produced when a production rule is applied, it could compare it with the current character (token) in the input and decide wisely which production rule to apply.

• Example: S -> cAd A -> bc|a and the input string is “cad”.

• Thus, if we knew that after reading character ‘c’ in the input string and applying S -> cAd, the next character in the input string is ‘a’, then we would ignore the production rule A -> bc and directly use A -> a.

• Hence if parser knows first character of the string that can be obtained by
applying a production rule, then it can wisely apply the correct production rule
to get the correct syntax tree for the given input string.
Why FOLLOW?
• The parser faces one more problem.

• Consider the grammar A -> aBb, B -> c | ε, and the input string “ab”. As the first character in the input is a, the parser applies the rule A -> aBb.

• Now the parser checks for the second character of the input string which is b,
and Non-Terminal to derive is B

• The parser can’t get any string derivable from B that has b as its first character.

• But the Grammar does contain a production rule B -> ε, if that is applied then
B will vanish, and the parser gets the input “ab”.
Why FOLLOW?
• But the parser can apply B -> ε only when it knows that the character that follows B in the production rule is the same as the current character in the input. In the RHS of A -> aBb, b follows the non-terminal B, i.e. FOLLOW(B) = {b}, and the current input character read is also b.

• Hence the parser applies this rule. And it is able to get the string “ab” from the
given grammar.

FOLLOW lets a non-terminal vanish (derive ε) when that is needed to generate the string from the parse tree. The conclusion is that we need to find the FIRST and FOLLOW sets for a given grammar, so that the parser can properly apply the needed rule at the correct position.
FIRST
FIRST(X) for a grammar symbol X is the set of terminals that begin the strings
derivable from X.

Example:
E  -> T E’
E’ -> + T E’ | ε
T  -> F T’
T’ -> * F T’ | ε
F  -> (E) | id

FIRST(E) = { ( , id }
Rules for computing FIRST(X):
1. If X is a terminal, then FIRST(X) = { X }.
2. If X -> ε is a production, then add ε to FIRST(X).
3. If X -> Y1 Y2 … Yk is a production, add FIRST(Y1) \ {ε} to FIRST(X); if ε ∈ FIRST(Y1), also add FIRST(Y2) \ {ε}, and so on; if all of Y1 … Yk can derive ε, add ε to FIRST(X).

For the example grammar: FIRST(E) = FIRST(T) = FIRST(F) = { ( , id }, FIRST(E’) = { + , ε }, FIRST(T’) = { * , ε }.

FOLLOW
FOLLOW(A) for a non-terminal A is the set of terminals that can appear immediately to the right of A in some sentential form. Rules for computing FOLLOW(A):
1. Place $ (end of input) in FOLLOW(S), where S is the start symbol.
2. For a production A -> αBβ, add FIRST(β) \ {ε} to FOLLOW(B).
3. For a production A -> αB, or A -> αBβ where ε ∈ FIRST(β), add FOLLOW(A) to FOLLOW(B).

For the example grammar: FOLLOW(E) = FOLLOW(E’) = { ) , $ }, FOLLOW(T) = FOLLOW(T’) = { + , ) , $ }, FOLLOW(F) = { + , * , ) , $ }.
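
The FIRST sets above can be computed mechanically with a fixed-point iteration. A Python sketch for the example grammar follows (the grammar encoding and the EPS marker are assumptions of this example; FOLLOW sets can be computed with a similar loop using the three FOLLOW rules):

EPS = "eps"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                for sym in prod:
                    if sym == EPS:
                        add = {EPS}
                    elif sym in grammar:                 # non-terminal
                        add = first[sym] - {EPS}
                    else:                                # terminal
                        add = {sym}
                    if not add <= first[nt]:
                        first[nt] |= add
                        changed = True
                    # look at the next symbol only if this one can vanish
                    if sym in grammar and EPS in first[sym]:
                        continue
                    break
                else:
                    # every symbol can vanish: the production derives eps
                    if EPS not in first[nt]:
                        first[nt].add(EPS)
                        changed = True
    return first

for nt, s in first_sets(GRAMMAR).items():
    print(nt, "=", sorted(s))
# E = ['(', 'id']   E' = ['+', 'eps']   T = ['(', 'id']
# T' = ['*', 'eps']  F = ['(', 'id']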
Thank You
