CSE353 Slides
CSE 353
Course Outcomes
CO3: Compare top-down with bottom-up parsers, and develop an appropriate parser to produce a parse tree representation of the input.
CO4: Design syntax-directed translation schemes for a given context-free grammar.
CO5: Generate intermediate code for statements in a high-level language; benefits and limitations of automatic memory management.
HLL
A pure HLL (high-level language) program passes through the following components before it can be executed:
• Pre-Processor – The pre-processor handles all the #include directives by inserting the named files (file inclusion) and all the #define directives by macro expansion. It performs file inclusion, augmentation, macro-processing, etc.
• Assembly Language – It is neither in binary form nor high level. It is an intermediate form that is a combination of machine instructions and some other useful data needed for execution.
• Assembler – For every platform (hardware + OS) there is a separate assembler; assemblers are not universal, since each platform has its own. An assembler translates assembly language into machine code, and its output is called an object file.
• Interpreter – An interpreter converts high-level language into low-level machine language, just like a compiler, but the two differ in the way they read the input. The compiler reads the input in one go, does the processing, and executes the source code, whereas the interpreter does the same line by line. A compiler scans the entire program and translates it as a whole into machine code, whereas an interpreter translates the program one statement at a time. Interpreted programs are usually slower than compiled ones.
• Relocatable Machine Code – Code that can be loaded at any point and run. Addresses within the program are kept in a form that allows the program to be moved (relocated) in memory.
• Loader/Linker – It converts the relocatable code into absolute code and tries to run the program, resulting in a running program or an error message (or sometimes both). The linker combines a number of object files into a single file to make it executable; the loader then loads it into memory and executes it.
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors in the source program to the programmer.
Structure of Compiler
Executing a program written in an HLL basically consists of two parts: the source program must first be compiled (translated) into an object program; the resulting object program is then loaded into memory and executed.
Phases of a compiler (diagram): Lexical Analyzer → stream of tokens → Syntax Analyzer → parse tree → refined parse tree → 3-address code → optimized code → target code.
CODE GENERATION:
•It is the final phase of the compiler.
•It gets input from the code optimization phase and produces the target code or object code as a result.
•Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
•Code generation involves allocation of registers and memory, generation of correct references, generation of correct data types, and generation of missing code.
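As a rough illustration, the sketch below prints a plausible target sequence for the single three-address instruction t1 = a + b. It is a minimal sketch only; the mnemonics LD/ADD/ST, the register name R1, and the emit_add helper are assumptions made for illustration, not the back end of any particular compiler.

#include <stdio.h>

/* Minimal code-generation sketch for one three-address instruction,
   "dst = src1 + src2". Mnemonics and the register are illustrative only. */
static void emit_add(const char *dst, const char *src1, const char *src2) {
    printf("LD  R1, %s\n", src1);  /* load the first operand into register R1 */
    printf("ADD R1, %s\n", src2);  /* add the second operand to R1 */
    printf("ST  %s, R1\n", dst);   /* store the result into the destination */
}

int main(void) {
    emit_add("t1", "a", "b");      /* three-address instruction: t1 = a + b */
    return 0;
}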
SYMBOL TABLE MANAGEMENT:
•Symbol table is used to store all the information about identifiers used in the program.
•It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
•It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
•Whenever an identifier is detected in any of the phases, it is stored in the symbol table (a minimal sketch of such a table appears below).
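A minimal sketch of such a table, assuming a fixed-size array of records with simple insert and lookup helpers; the Symbol fields shown (name, type, line) are just example attributes, not a prescribed layout.

#include <stdio.h>
#include <string.h>

/* One record per identifier, with a couple of example attribute fields. */
struct Symbol {
    char name[32];   /* the identifier's lexeme */
    char type[16];   /* e.g. "int" */
    int  line;       /* line where it was first seen */
};

static struct Symbol table[100];
static int count = 0;

/* Return the index of an identifier, or -1 if it is not in the table. */
static int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert an identifier if it is not already present. */
static void insert(const char *name, const char *type, int line) {
    if (lookup(name) != -1 || count >= 100)
        return;
    strncpy(table[count].name, name, sizeof table[count].name - 1);
    strncpy(table[count].type, type, sizeof table[count].type - 1);
    table[count].line = line;
    count++;
}

int main(void) {
    insert("a", "int", 4);
    insert("b", "int", 4);
    printf("b is at index %d\n", lookup("b"));
    return 0;
}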
ERROR HANDLING:
•Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
•In lexical analysis, errors occur in separation of tokens.
•In syntax analysis, errors occur during construction of syntax tree.
•In semantic analysis, errors occur when the compiler detects constructs that have the right syntactic structure but no meaning, for example during type conversion.
•In code optimization, errors occur when the result is affected by the optimization; in code generation, errors such as missing code are reported.
Lexical Analyzer
Lexical analysis is the first phase of the compiler; the lexical analyzer is also known as the scanner. It converts the high-level input program into a sequence of tokens.
•Lexical analysis can be implemented with deterministic finite automata.
•The output is a sequence of tokens that is sent to the parser for syntax analysis.
•Its task is to read the input characters and produce as output a sequence of tokens
that the parser uses for syntax analysis
•Upon receiving a “get next token” command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA).
The process of forming tokens from an input stream of characters is called
tokenization.
Consider this expression in the C programming language: sum=3+2;
Lexeme Token Type
sum Identifier
= Assignment Operator
3 Number
+ Addition Operator
2 Number
; End of statement
The function of lexical analysis is to tokenize the program or statement, i.e. to separate these lexemes out of it.
Transition diagram for identifiers (states 0, 1, 2): from the start state 0, a letter moves to state 1; letters and digits loop on state 1; a delimiter moves to the accepting state 2.
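As a minimal sketch of this phase, the program below tokenizes the hard-coded statement sum=3+2; and prints one lexeme/token-type pair per line. The token names and the fixed input are assumptions made for illustration; it is not a complete lexical analyzer.

#include <stdio.h>
#include <ctype.h>

/* Minimal scanner sketch for the input "sum=3+2;". Identifiers and numbers
   follow the usual DFA idea: keep reading while the character class matches,
   stop at a delimiter. */
int main(void) {
    const char *p = "sum=3+2;";
    while (*p) {
        if (isalpha((unsigned char)*p)) {            /* identifier: letter (letter|digit)* */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("%.*s : Identifier\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {     /* number: digit+ */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("%.*s : Number\n", (int)(p - start), start);
        } else if (*p == '=') { printf("= : Assignment Operator\n"); p++; }
        else if (*p == '+')   { printf("+ : Addition Operator\n");   p++; }
        else if (*p == ';')   { printf("; : End of statement\n");    p++; }
        else p++;                                    /* skip anything else */
    }
    return 0;
}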
Question 1:
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
Solution:
int main ( ) { int a , b ; a = 10 ; return 0 ; }
(the comment // 2 variables is discarded by the lexical analyzer)
Question 2:
int max(x,y)
int x, y;
/* find max of x and y*/
{
return (x > y ? x : y);
}
Solution:
int max ( x , y )
int x , y ; { return
( x > y ? x :
y ) ; }
Question 3:
A pass refers to the number of times the compiler goes through the source code.
A multi-pass compiler goes through the source code several times; in other words, it allows the source code to pass through each compilation unit several times. Each pass takes the result of the previous pass as input and creates intermediate outputs, so the code improves in each pass. The final code is generated after the final pass.
The main difference between the phases and passes of a compiler is that phases are the steps in the compilation process, while passes are the number of times the compiler traverses the source code.
Bootstrapping and Cross Compiler
Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle a far more complicated program, and so on.
Native compiler vs. cross compiler:
• A native compiler is used to build programs for the same system/machine and OS it is installed on, while a cross compiler is used to build programs for another system/machine, such as AVR/ARM.
• A native compiler can generate an executable file such as .exe, while a cross compiler generates raw code such as .hex.
Lexical analysis is the process of producing tokens from the source program.
It has the following issues:
•Lookahead
•Ambiguities
Lookahead
•Lookahead is required to decide where one token ends and the next token begins. Simple examples of lookahead issues are i vs. if and = vs. ==. Therefore, a way to describe the lexemes of each token is required.
•A way is also needed to resolve ambiguities:
 • Is if two variables i and f, or the keyword if?
 • Is == two separate = signs, or a single ==?
•Hence, the number of lookahead characters to consider, and a way to describe the lexemes of each token, are both needed (a small lookahead sketch follows).
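A minimal sketch of one-character lookahead, assuming a hard-coded input buffer: after reading '=', the scanner peeks at the next character before deciding between an assignment token and an equality token. The input string and token names are assumptions for illustration.

#include <stdio.h>

/* One-character lookahead sketch: decide between "=" and "==". */
int main(void) {
    const char *p = "a == b = c";
    while (*p) {
        if (*p == '=') {
            if (*(p + 1) == '=') {         /* lookahead: next char is also '=' */
                printf("token: EQ (==)\n");
                p += 2;
            } else {
                printf("token: ASSIGN (=)\n");
                p += 1;
            }
        } else {
            p++;                            /* ignore everything else in this sketch */
        }
    }
    return 0;
}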
Ambiguities
• Among rules which matched the same number of characters, the rule given
first is preferred.
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:
▪ Lexical errors are not very common, but they should be managed by the scanner.
Parse tree
Parser
Context-free grammar
Parsing Techniques
Parsing
Top-down parsing with backtracking (diagram): S is expanded to c A d; A is first expanded using the alternative whose subtree is a b, which fails to match the remaining input, so the parser backtracks and expands A → a.
Bottom-up Parsing
A bottom-up parser builds a derivation by working from the input
sentence back toward the start symbol S
S ⇒ γ0 ⇒ γ1 ⇒ γ2 ⇒ … ⇒ γn–1 ⇒ γn ⇒ sentence
To reduce γi to γi–1, match some RHS β against γi, then replace β with its corresponding LHS A (for the production A → β).
Parse tree for 9-5+2 (diagram), using productions list → list + digit | list - digit | digit: the root uses list + digit with digit = 2, the inner list uses list - digit with digit = 5, and the innermost list derives the digit 9.
Ambiguity
• A grammar can have more than one parse tree for a string
• Consider the grammar
list → list + list
     | list - list
     | 0 | 1 | … | 9
Two parse trees for 9-5+2 (diagram): one corresponding to (9-5)+2 and one to 9-(5+2).
Ambiguity …
• Ambiguity is problematic because the meaning of programs can be incorrect
• Ambiguity can be handled in several ways
 – Enforce associativity and precedence
 – Rewrite the grammar (cleanest way)
• There is no algorithm to automatically convert an arbitrary ambiguous grammar into an unambiguous grammar accepting the same language
• Worse, there are inherently ambiguous languages!
Ambiguity in Programming Lang.
• Dangling else problem
stmt → if expr stmt
     | if expr stmt else stmt
• For this grammar, the string
if e1 if e2 s1 else s2
has two parse trees
The two parse trees (diagram): in one, else s2 is attached to the inner if (if e2 s1 else s2) nested under if e1; in the other, else s2 is attached to the outer if, i.e. if e1 (if e2 s1) else s2.
Resolving dangling else problem
• General rule: match each else with the closest previous unmatched if. The grammar can be rewritten as
stmt → matched-stmt
     | unmatched-stmt
matched-stmt → if expr matched-stmt else matched-stmt
             | others
unmatched-stmt → if expr stmt
               | if expr matched-stmt else unmatched-stmt
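C resolves the dangling else in exactly this way, so a short example can show the rule in action. The function below is a hypothetical illustration: the else binds to the inner if, so f(1, 0) returns 2, while f(0, 0) skips the outer if entirely and returns 0.

#include <stdio.h>

/* The else below is matched with the closest unmatched if (the inner one),
   just as the rewritten grammar specifies. */
static int f(int e1, int e2) {
    int r = 0;
    if (e1)
        if (e2)
            r = 1;      /* s1 */
        else
            r = 2;      /* s2: attaches to the inner if */
    return r;
}

int main(void) {
    printf("%d %d %d\n", f(1, 1), f(1, 0), f(0, 0));  /* prints: 1 2 0 */
    return 0;
}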
Associativity
• If an operand has an operator on both sides, the side whose operator takes the operand determines the associativity of that operator.
• In a+b+c, b is taken by the left +.
• +, -, *, / are left associative.
• ^, = are right associative.
• Grammar to generate strings with a right-associative operator:
right → letter = right | letter
letter → a | b | … | z
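A minimal sketch of how the right recursion in this grammar yields right associativity, assuming a hard-coded input: parsing a=b=c with right → letter = right | letter groups it as a=(b=c). The parenthesized printout is an illustrative choice, not part of the grammar.

#include <stdio.h>

/* Recursive-descent sketch for:  right -> letter = right | letter
   Prints a fully parenthesized form showing right associativity. */
static const char *p = "a=b=c";

static void parse_right(void) {
    char letter = *p++;                 /* letter -> a | b | ... | z */
    if (*p == '=') {
        p++;                            /* consume '=' */
        printf("(%c=", letter);
        parse_right();                  /* right recursion: the rest binds first */
        printf(")");
    } else {
        printf("%c", letter);
    }
}

int main(void) {
    parse_right();                      /* prints: (a=(b=c)) */
    printf("\n");
    return 0;
}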
Precedence
• The string a+5*2 has two possible interpretations because of two different parse trees, corresponding to (a+5)*2 and a+(5*2).
• Precedence determines the correct interpretation.
• Next, an example of how precedence rules are encoded in a grammar.
Precedence/Associativity in the Grammar for Arithmetic Expressions
Ambiguous grammar:
E → E + E | E * E | (E) | num | id
Unambiguous grammar, with precedence and associativity rules honored:
E → E + T | T
T → T * F | F
F → (E) | num | id
Example strings: 3+2+5, 3+2*5
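The left-recursive rules E → E + T and T → T * F can be realized as loops in a recursive-descent evaluator. The sketch below, assuming single-digit numbers and a hard-coded input, evaluates 3+2*5 as 3+(2*5) = 13, showing both the higher precedence of * and the left-to-right grouping of +; it is an illustrative sketch, not a full expression parser.

#include <stdio.h>

/* Recursive-descent evaluator for the unambiguous grammar
       E -> E + T | T      (left recursion realized as a loop)
       T -> T * F | F
       F -> ( E ) | num
   Only single-digit numbers are handled. */
static const char *p;

static int parse_E(void);

static int parse_F(void) {
    if (*p == '(') {            /* F -> ( E ) */
        p++;
        int v = parse_E();
        p++;                    /* consume ')' */
        return v;
    }
    return *p++ - '0';          /* F -> num (single digit) */
}

static int parse_T(void) {      /* T -> F { * F } */
    int v = parse_F();
    while (*p == '*') {
        p++;
        v = v * parse_F();
    }
    return v;
}

static int parse_E(void) {      /* E -> T { + T } */
    int v = parse_T();
    while (*p == '+') {
        p++;
        v = v + parse_T();      /* left associativity: accumulate left to right */
    }
    return v;
}

int main(void) {
    p = "3+2*5";
    printf("%d\n", parse_E());  /* prints 13, i.e. 3+(2*5) */
    return 0;
}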
Suppose the production rules for the grammar of a language are:
S -> cAd    A -> bc | a
and the input string is "cad".
Backtracking was needed to get the correct syntax tree, which is a really complex process to implement.
There is an easier way to solve this, using the concepts of FIRST and FOLLOW sets in compiler design; a sketch of the backtracking approach is shown below.
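A minimal sketch of this backtracking for S → cAd, A → bc | a on the input "cad", assuming a global input position that is saved before trying an alternative of A and restored when that alternative fails.

#include <stdio.h>

/* Backtracking recursive-descent sketch for
       S -> c A d      A -> b c | a
   on the input "cad". */
static const char *input = "cad";
static int pos = 0;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int parse_A(void) {
    int save = pos;
    if (match('b') && match('c')) return 1;   /* try A -> b c */
    pos = save;                               /* backtrack */
    return match('a');                        /* try A -> a */
}

static int parse_S(void) {
    return match('c') && parse_A() && match('d');   /* S -> c A d */
}

int main(void) {
    printf("%s\n", parse_S() && input[pos] == '\0' ? "accepted" : "rejected");
    return 0;
}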
Why FIRST?
• We saw the need for backtracking in syntax analysis, which is a really complex process to implement.
• There is an easier way to sort out this problem: if the compiler knew in advance the "first character of the string produced when a production rule is applied", then by comparing it to the current character or token in the input it could wisely decide which production rule to apply.
• Example: S -> cAd, A -> bc | a, and the input string is "cad".
• Thus, if we knew that after reading character 'c' in the input string and applying S -> cAd, the next character in the input string is 'a', then we would have ignored the production rule A -> bc and directly used the production rule A -> a.
• Hence, if the parser knows the first character of the string that can be obtained by applying a production rule, it can wisely apply the correct production rule and get the correct syntax tree for the given input string.
Why FOLLOW?
• The parser faces one more problem.
• Consider the grammar A -> aBb, B -> c | ε, and the input string "ab". As the first character in the input is a, the parser applies the rule A -> aBb.
• Now the parser checks the second character of the input string, which is b, and the non-terminal to derive is B.
• The parser can't get any string derivable from B that contains b as its first character.
• But the grammar does contain the production rule B -> ε; if that is applied, B vanishes, and the parser derives the input "ab".
Why FOLLOW?
• But the parser can apply B -> ε only when it knows that the character that follows B in the production rule is the same as the current character in the input. In the RHS of A -> aBb, b follows the non-terminal B, i.e. FOLLOW(B) = {b}, and the current input character read is also b.
• Hence the parser applies this rule, and it is able to derive the string "ab" from the given grammar.
FOLLOW can make a non-terminal vanish when that is needed to generate the string from the parse tree. The conclusion is that we need to find the FIRST and FOLLOW sets for a given grammar so that the parser can apply the needed rule at the correct position.
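A minimal predictive-parsing sketch for A → aBb, B → c | ε, assuming hard-coded inputs: since FIRST(B) = {c} and FOLLOW(B) = {b}, the parser applies B → c when the lookahead is c and B → ε when the lookahead is b, with no backtracking.

#include <stdio.h>

/* Predictive (no backtracking) sketch for
       A -> a B b      B -> c | epsilon
   using FIRST(B) = { c } and FOLLOW(B) = { b } to choose B's production. */
static const char *input;
static int pos;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int parse_B(void) {
    if (input[pos] == 'c')       /* lookahead in FIRST(B): apply B -> c */
        return match('c');
    if (input[pos] == 'b')       /* lookahead in FOLLOW(B): apply B -> epsilon */
        return 1;
    return 0;                    /* otherwise: error */
}

static int parse_A(void) {
    return match('a') && parse_B() && match('b');   /* A -> a B b */
}

int main(void) {
    input = "ab";  pos = 0;
    printf("ab  : %s\n", parse_A() && input[pos] == '\0' ? "accepted" : "rejected");
    input = "acb"; pos = 0;
    printf("acb : %s\n", parse_A() && input[pos] == '\0' ? "accepted" : "rejected");
    return 0;
}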
FIRST
FIRST(X) for a grammar symbol X is the set of terminals that begin the strings
derivable from X.
Example:
E -> T E’
E’ -> + T E’ | ε
T -> F T’
T’ -> * F T’ | ε
F -> (E) | id
FIRST(E) = { ( , id }
FIRST (for the same grammar):
FIRST(T) = FIRST(F) = { ( , id }
FIRST(E’) = { + , ε }
FIRST(T’) = { * , ε }
FOLLOW
FOLLOW(X) for a non-terminal X is the set of terminals that can appear immediately to the right of X in some sentential form ($ marks the end of input).
FOLLOW(E) = FOLLOW(E’) = { ) , $ }
FOLLOW(T) = FOLLOW(T’) = { + , ) , $ }
FOLLOW(F) = { + , * , ) , $ }
Thank You