Chapter 1 - Introduction
A cross compiler runs on a machine ‘A’ and produces code for another machine ‘B’. It is capable of
creating code for a platform other than the one on which the compiler is running.
A source-to-source compiler, also called a transcompiler or transpiler, is a compiler that translates source code
written in one programming language into source code in another programming language.
Compiler design principles provide an in-depth view of the translation and optimization process.
Compiler design covers lexical, syntax, and semantic analysis as the front end, and code generation and
optimization as the back end.
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
i) Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
ii) File inclusion: A preprocessor may include header files into the program text.
iii) Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control
and data structuring facilities.
iv) Language extensions: These preprocessors attempt to add capabilities to the language by what amounts to
built-in macros.
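Macro processing and file inclusion can be illustrated with a minimal sketch in Python. It assumes a toy directive
syntax (#define NAME value and #include "file") and keeps header files in an in-memory dictionary; the names
preprocess and headers are hypothetical and do not belong to any real preprocessor.

```python
import re

def preprocess(source, headers):
    """Toy preprocessor: expands object-like macros and includes headers.

    `headers` is a hypothetical in-memory stand-in for header files on disk.
    """
    macros = {}
    pending = source.splitlines()
    output = []
    while pending:
        line = pending.pop(0)
        parts = line.split()
        if parts[:1] == ["#include"]:
            # File inclusion: splice the header's lines in front of the rest.
            name = parts[1].strip('"')
            pending = headers[name].splitlines() + pending
        elif parts[:1] == ["#define"]:
            # Macro definition: remember the replacement text.
            macros[parts[1]] = " ".join(parts[2:])
        else:
            # Macro expansion: replace whole identifiers only.
            expanded = re.sub(
                r"[A-Za-z_]\w*",
                lambda m: macros.get(m.group(0), m.group(0)),
                line,
            )
            output.append(expanded)
    return "\n".join(output)

headers = {"limits.h": "#define MAX 100"}
program = '#include "limits.h"\n#define SIZE 10\nint a[SIZE]; int b = MAX;'
print(preprocess(program, headers))  # int a[10]; int b = 100;
```

The compiler proper never sees SIZE or MAX; it receives only the expanded program text.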
3. Assembly Language – It is neither in binary form nor high level. It is an intermediate state that is a
combination of machine instructions and some other useful data needed for execution.
4. Assembler
Programmers found it difficult to write or read programs in machine language. They began to use a mnemonic
(symbol) for each machine instruction, which they would subsequently translate into machine language. Such a
mnemonic machine language is now called an assembly language. An assembler translates assembly language
programs into machine code. The output of an assembler is called an object file, which contains a combination of
machine instructions as well as the data required to place these instructions in memory.
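The translation from mnemonics to machine code can be sketched as follows. This is only an illustration: the
three-instruction machine, its opcode values, and the one-byte operand format are invented for this example and do
not correspond to any real processor.

```python
# Hypothetical machine: every instruction is one opcode byte followed by
# one operand byte. The opcode values below are invented for illustration.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03}

def assemble(text):
    """Translate mnemonic assembly text into a sequence of machine-code bytes."""
    code = bytearray()
    for line in text.splitlines():
        line = line.split(";")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mnemonic, operand = line.split()
        code.append(OPCODES[mnemonic])     # mnemonic -> opcode
        code.append(int(operand))          # operand address
    return bytes(code)

program = """
LOAD 10   ; load the value stored at address 10
ADD 11    ; add the value stored at address 11
STORE 12  ; store the result at address 12
"""
print(assemble(program).hex())  # 010a020b030c
```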
5. Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in
the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks
semantics, generates intermediate code, translates the whole program, and may involve many passes. In contrast, an
interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next
statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads
the whole program even if it encounters several errors.
6. Relocatable Machine Code – It can be loaded at any point in memory and run. The addresses within the program
are expressed in such a way that they remain valid when the program is moved.
7. Linker: A linker is a computer program that links and merges various object files together in order to make an
executable file. These files might have been produced by separate compilers or assemblers.
8. Loader: The loader is a part of the operating system and is responsible for loading executable files into memory
and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It
initializes various registers to initiate execution.
The loader converts the relocatable code into absolute code and tries to run the program, resulting in a running
program or an error message (or sometimes both). The linker combines a variety of object files into a
single file to make it executable. The loader then loads it into memory and executes it.
9. Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform (B) is called a cross-
compiler.
Comparison of compiler and interpreter:
Memory: A compiler requires more memory due to the creation of object code. An interpreter requires less memory
as it does not create intermediate object code.
Errors: A compiler displays all errors after compilation, all at the same time. An interpreter displays the errors of
each line one by one.
Error detection: Difficult with a compiler. Comparatively easier with an interpreter.
Pertaining programming languages: C, C++, C#, Scala, and TypeScript use a compiler. PHP, Perl, Python, and Ruby
use an interpreter.
COMPILER DESIGN ISSUES
The compilation process is a sequence of various phases. Each phase takes input from its previous stage, has its
own representation of source program, and feeds its output to the next phase of the compiler. Let us understand the
phases of a compiler.
Lexical Analysis
The first phase of a compiler works as a text scanner. This phase scans the source code as a stream of characters
and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
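The conversion of a character stream into <token-name, attribute-value> pairs can be sketched with regular
expressions. This is a minimal sketch, assuming a tiny token set chosen for the statement c = a + b * 5; the
token names and the tokenize function are illustrative, not a fixed standard.

```python
import re

# Each token class is described by a regular expression (illustrative set).
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assignment", r"="),
    ("operator", r"[+\-*/]"),
    ("symbol", r";"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Return a list of (token-name, attribute-value) pairs.

    Note: characters that match no pattern are silently skipped in this
    sketch; a real scanner would report them as lexical errors.
    """
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "skip":      # white space produces no token
            tokens.append((match.lastgroup, match.group()))
    return tokens

for token in tokenize("c = a + b * 5;"):
    print(token)
```

For the statement above, the scanner yields three identifiers (c, a, b), one assignment, two operators, one number,
and one symbol.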
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14.
A string having no symbols, i.e. a string of zero length, is known as an empty string and is denoted by ε (epsilon).
Examples of special symbols recognized as tokens: assignment (=) and preprocessor (#).
Syntax Analysis:- The second stage of translation is called syntax analysis or parsing. In this phase expressions,
statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by
techniques based on the formal grammar of the programming language.
The parser converts the tokens produced by the lexical analyzer into a tree-like representation called a parse tree.
A syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes
and the operands of an operator are the children of the node for that operator.
Input: Tokens of the statement c = a + b * 5;
c, a, b (identifiers), = (assignment), +, * (operators), 5 (constant), ; (symbol)
Output: Syntax tree
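How a parser turns this token sequence into a syntax tree can be sketched with a small recursive-descent parser.
The sketch assumes a toy grammar (an assignment whose right-hand side uses only + and *), takes the tokens as
plain strings, and represents each tree node as a nested tuple (operator, left, right); the function name
parse_assignment is hypothetical.

```python
def parse_assignment(tokens):
    """Parse `identifier = expression ;` into a nested-tuple syntax tree.

    Assumed toy grammar:
        expression -> term { + term }
        term       -> factor { * factor }
        factor     -> identifier | number
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(value):
        nonlocal pos
        if peek() != value:
            raise SyntaxError(f"expected {value!r}, found {peek()!r}")
        pos += 1

    def factor():
        nonlocal pos
        token = peek()
        if token is None or not token.isalnum():
            raise SyntaxError(f"expected operand, found {token!r}")
        pos += 1
        return token

    def term():
        node = factor()
        while peek() == "*":            # * binds tighter than +
            expect("*")
            node = ("*", node, factor())
        return node

    def expression():
        node = term()
        while peek() == "+":
            expect("+")
            node = ("+", node, term())
        return node

    target = factor()
    expect("=")
    tree = ("=", target, expression())
    expect(";")
    return tree

print(parse_assignment(["c", "=", "a", "+", "b", "*", "5", ";"]))
# ('=', 'c', ('+', 'a', ('*', 'b', '5')))
```

The operators =, + and * are interior nodes and the operands are leaves; b * 5 sits below a + because
multiplication has higher precedence.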
Intermediate Code Generation:- An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
The most commonly used form is the three address code.
a). Three address code
t1 = inttofloat (5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Example – The three address code for the expression a + b * c + d, following operator precedence (BODMAS):
T1 = b * c
T2 = a + T1
T3 = T2 + d
T1, T2, T3 are temporary variables.
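The three address code for this expression can be produced mechanically by walking the syntax tree bottom-up and
introducing one temporary per operator. The following is a minimal sketch, assuming Python's standard ast module
is used to parse the expression (its precedence for + and * agrees with the rule used above); the function name
three_address_code is hypothetical.

```python
import ast

def three_address_code(expression):
    """Generate three address code for an arithmetic expression."""
    lines = []
    symbols = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def emit(node):
        # Leaves: identifiers and constants are used directly as operands.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        # Interior node: evaluate both children first, then emit one
        # instruction that stores the result in a new temporary.
        if isinstance(node, ast.BinOp):
            left = emit(node.left)
            right = emit(node.right)
            temp = f"T{len(lines) + 1}"
            lines.append(f"{temp} = {left} {symbols[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    emit(ast.parse(expression, mode="eval").body)
    return lines

for line in three_address_code("a + b * c + d"):
    print(line)
# T1 = b * c
# T2 = a + T1
# T3 = T2 + d
```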
c). Syntax tree: In a syntax tree, the internal nodes are operators and the child nodes are operands.
Example: x = (a + b * c) / (a - b * c)
Code Optimization:- This is an optional phase designed to improve the intermediate code so that the output runs
faster and takes less space.
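One common optimization of this kind is constant folding: operations whose operands are all known at compile time
are evaluated by the compiler instead of at run time. The sketch below is only one possible illustration. It
assumes the intermediate code is a list of (destination, left, operator, right) tuples, and that a folded
destination is a temporary used only by later instructions in the same list.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(code):
    """Constant folding with propagation over three address code tuples."""
    known = {}       # temporaries whose value is known at compile time
    optimized = []
    for dest, left, op, right in code:
        # Substitute operands that were already folded to constants.
        left = known.get(left, left)
        right = known.get(right, right)
        if isinstance(left, int) and isinstance(right, int):
            # Both operands are constants: compute now, emit no instruction.
            known[dest] = OPS[op](left, right)
        else:
            optimized.append((dest, left, op, right))
    return optimized

code = [("t1", 4, "*", 5), ("t2", "a", "+", "t1")]
print(fold_constants(code))  # [('t2', 'a', '+', 20)]
```

Two instructions become one, so the output is both shorter and faster.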
Code Generation:- The last phase of translation is code generation. A number of optimizations to reduce the length
of the machine language program are carried out during this phase. The output of the code generator is the machine
language program for the specified computer.
Table Management (or) Book-keeping:- The symbol table is used to store all the information about identifiers
used in the program. The symbol table is an important data structure created and maintained by the compiler in
order to keep track of the semantics of variables, i.e. it stores scope and binding information about names, and
information about instances of various entities such as variable and function names, classes, objects, etc.
• It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
• It allows finding the record for each identifier quickly and to store or retrieve data from that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
1. It is built in the lexical and syntax analysis phases.
2. The information is collected by the analysis phases of the compiler and is used by the synthesis phases of the
compiler to generate code.
3. It is used by the compiler to achieve compile-time efficiency.
4. It is used by the various phases of the compiler as follows:
  1. Lexical Analysis: Creates new entries in the table, for example entries about tokens.
  2. Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of reference, use, etc. to
  the table.
  3. Semantic Analysis: Uses the available information in the table to check semantics, i.e. to verify that
  expressions and assignments are semantically correct (type checking), and updates the table accordingly.
  4. Intermediate Code Generation: Refers to the symbol table to know how much and what type of run-time storage
  is allocated; the table also helps in adding temporary variable information.
  5. Code Optimization: Uses information present in the symbol table for machine-dependent optimization.
  6. Target Code Generation: Generates code by using the address information of the identifiers present in the table.
Symbol Table entries – Each entry in symbol table is associated with attributes that support compiler in different
phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
For a structure or record, a pointer to the structure table
For parameters, whether the parameter is passed by value or by reference
Number and type of arguments passed to a function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include:
Example
int a, b; float c; char z, x;
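The declarations in this example can be recorded in a symbol table as sketched below. This is a minimal sketch
under stated assumptions: the table is backed by a hash table (a Python dict) so that records can be found
quickly, only the data type is stored as an attribute, and the class SymbolTable with its insert and lookup
methods is a hypothetical interface chosen for illustration.

```python
class SymbolTable:
    """One record per identifier, with fields for its attributes."""

    def __init__(self):
        self._entries = {}   # hash table: identifier name -> record

    def insert(self, name, data_type):
        """Add a record for a newly declared identifier."""
        if name in self._entries:
            raise KeyError(f"multiple declaration of {name!r}")
        self._entries[name] = {"type": data_type}

    def lookup(self, name):
        """Return the record for `name`, or None if it is not declared."""
        return self._entries.get(name)

# Declarations from the example: int a, b; float c; char z, x;
table = SymbolTable()
for data_type, names in [("int", ["a", "b"]), ("float", ["c"]), ("char", ["z", "x"])]:
    for name in names:
        table.insert(name, data_type)

print(table.lookup("c"))  # {'type': 'float'}
print(table.lookup("y"))  # None
```

Later phases would extend each record with further attributes such as scope and storage offset.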
Example of Compilation Process
Error detection and Recovery in Compiler
In this phase of compilation, all possible errors made by the user are detected and reported to the user in the form
of error messages. This process of locating errors and reporting them to the user is called the error handling process.
Functions of Error handler
Detection
Reporting
Recovery
Classification of Errors
• Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can
proceed.
i. In lexical analysis, errors occur in the separation of tokens.
ii. In syntax analysis, errors occur during the construction of the syntax tree.
iii. In semantic analysis, errors may occur in the following cases:
(i) When the compiler detects constructs that have the right syntactic structure but no meaning.
(ii) During type conversion.
• In code optimization, errors occur when the result is affected by the optimization. In code generation, errors
arise when, for example, code is missing.
The figure illustrates the translation of source code through each phase, considering the statement
c = a + b * 5.
Lexical Errors
These include incorrect or misspelled names of identifiers, i.e., identifiers typed incorrectly. Example: INT integer
Lexical phase errors
These errors are detected during the lexical analysis phase. Typical lexical errors are
Exceeding length of identifier or numeric constants.
Appearance of illegal characters
Unmatched string
Example 1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $ appears at the end of statement.
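Detection of an illegal character can be sketched as follows. The scanner tries to match a valid token at each
position and reports any character where no token pattern applies. The set of valid tokens below is a small
assumed subset of C chosen for this example, and find_lexical_errors is a hypothetical helper.

```python
import re

# Assumed subset of valid lexemes: white space, identifiers, numbers,
# string literals closed on the same line, and a few punctuation symbols.
VALID_TOKEN = re.compile(r'\s+|[A-Za-z_]\w*|\d+|"[^"\n]*"|[()=+\-*/;,]')

def find_lexical_errors(source):
    """Return (position, character) for every character that starts no valid token."""
    errors = []
    pos = 0
    while pos < len(source):
        match = VALID_TOKEN.match(source, pos)
        if match:
            pos = match.end()                    # valid token: move past it
        else:
            errors.append((pos, source[pos]))    # illegal character
            pos += 1                             # skip it and keep scanning
    return errors

print(find_lexical_errors('printf("Geeksforgeeks");$'))  # [(24, '$')]
```

An unmatched string such as "abc is reported in the same way: its opening quote does not start any valid token.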
Syntactical Errors
These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer
(parser).
When an error is detected, it must be handled by the parser to enable the parsing of the rest of the input. In general,
errors may be expected at various stages of compilation, but most errors are syntactic errors, and hence the
parser should be able to detect and report those errors in the program.
There are four common error-recovery strategies that can be implemented in the parser to deal with errors in
the code.
o Panic mode.
o Statement level.
o Error productions.
o Global correction.
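Of the four strategies, panic mode is the simplest to sketch: on detecting an error, the parser discards input
tokens until it reaches a synchronizing token (here the semicolon) and then resumes with the next statement. The
example below assumes a toy language in which every statement has the form identifier = operand ; and the
function check_statements is hypothetical.

```python
def check_statements(tokens):
    """Check statements of the form `identifier = operand ;` with panic-mode recovery."""
    errors = []
    pos = 0
    while pos < len(tokens):
        statement = tokens[pos:pos + 4]
        well_formed = (
            len(statement) == 4
            and statement[0].isidentifier()
            and statement[1] == "="
            and statement[2].isalnum()
            and statement[3] == ";"
        )
        if well_formed:
            pos += 4
            continue
        errors.append(f"syntax error in statement starting at token {pos}: {tokens[pos]!r}")
        # Panic mode: discard tokens until the synchronizing token ';'.
        while pos < len(tokens) and tokens[pos] != ";":
            pos += 1
        pos += 1   # skip the ';' itself and resume with the next statement
    return errors

# The second statement (b = = 2;) is malformed; the third is still checked.
tokens = ["a", "=", "1", ";", "b", "=", "=", "2", ";", "c", "=", "3", ";"]
print(check_statements(tokens))
```

Only one error is reported for the malformed statement, and parsing continues with c = 3; instead of stopping at
the first error.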
Semantical Errors
These errors are detected during semantic analysis phase. These errors are a result of incompatible value
assignment. The semantic errors that the semantic analyzer is expected to recognize are:
Incompatible type of operands
Undeclared variables
Type mismatch.
Reserved identifier misuse.
Multiple declaration of variable in a scope.
Accessing an out of scope variable.
Mismatch between actual arguments and formal parameters
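Three of these checks (undeclared variables, multiple declaration, and type mismatch) can be sketched as below.
The sketch assumes the program has already been parsed into simple tuples, ("declare", type, name) and
("assign", name, value), and checks only assignments to int variables; this statement format and the function
check_semantics are assumptions made for illustration.

```python
def check_semantics(statements):
    """Report semantic errors in a list of declare/assign statements."""
    declared = {}    # plays the role of the symbol table: name -> type
    errors = []
    for statement in statements:
        if statement[0] == "declare":
            _, data_type, name = statement
            if name in declared:
                errors.append(f"multiple declaration of {name}")
            else:
                declared[name] = data_type
        elif statement[0] == "assign":
            _, name, value = statement
            if name not in declared:
                errors.append(f"undeclared variable {name}")
            elif declared[name] == "int" and not isinstance(value, int):
                errors.append(f"type mismatch in assignment to {name}")
    return errors

program = [
    ("declare", "int", "a"),
    ("declare", "int", "a"),     # a is declared twice in the same scope
    ("assign", "a", "hello"),    # a string assigned to an int variable
    ("assign", "b", 1),          # b was never declared
]
for error in check_semantics(program):
    print(error)
# multiple declaration of a
# type mismatch in assignment to a
# undeclared variable b
```

Each statement here is syntactically valid; the errors appear only when the symbol table information is consulted.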
Error Handling in Compiler Design
The tasks of the error handling process are to detect each error, report it to the user, and then devise and
implement a recovery strategy to handle the error. During this whole process, the processing time of the program
should not be slowed down. Errors appear as blank entries in the symbol table.
Types or Sources of Error – There are three types of error: run-time, logic, and compile-time errors:
1. A run-time error is an error that takes place during the execution of a program, and usually happens because of
adverse system parameters or invalid input data. The lack of sufficient memory to run an application and a memory
conflict with another program are examples of this.
2. Logic errors occur when executed code does not produce the expected result. Logic errors are best handled by
meticulous program debugging.
3. Compile-time errors arise at compile time, before execution of the program. A syntax error or a missing file
reference that prevents the program from successfully compiling is an example of this.
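The difference between compile-time and run-time errors can be observed with Python's built-in compile and exec
functions, which separate translation from execution. This is only an illustration in Python, not a description of
how a C compiler reports errors; the function classify is hypothetical.

```python
def classify(source):
    """Report whether a piece of source code fails at compile time or at run time."""
    try:
        # Translation step: syntax errors are detected before execution.
        code = compile(source, "<example>", "exec")
    except SyntaxError:
        return "compile-time error"
    try:
        # Execution step: errors here depend on the values at run time.
        exec(code, {})
    except Exception:
        return "run-time error"
    return "no error"

print(classify("x = (1 + 2"))   # compile-time error (unbalanced parenthesis)
print(classify("x = 1 / 0"))    # run-time error (division by zero)
print(classify("x = 1 + 2"))    # no error
```

A logic error would fall into the third case: the code compiles and runs without complaint but computes the wrong
result, which is why it can only be found by testing and debugging.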