Day 1 - Intro to Compiler Design
Yared Y
Outline
• Intro to Compiling
• Lexical Analysis
• Syntax Analysis
• Syntax Directed Translation
• Symbol Tables & Type Checking
• Intermediate Code Generation (ICG)
• Run-Time Environment
• Code Generation (CG)
• Code Optimization
Textbooks
• Compiler Design: Syntactic and Semantic Analysis
Reinhard Wilhelm, Helmut Seidl, Sebastian Hack
• Introduction to Compiler Design
Torben Ægidius Mogensen
• Compiler Design: Analysis and Transformation
Reinhard Wilhelm, Helmut Seidl, Sebastian Hack
• Design and Implementation of Compiler
Singh R., Sharma V., Varshney M.
First things first…
• Compiler design is the process of creating a program
called a compiler, which translates code written in a
high-level programming language into machine
language that a computer can execute.
• Example:
You write a program in a language like C.
The compiler takes that code and translates it into
machine code (ones and zeros) so your computer can
execute it.
Compilers
Features of Compilers
• Ensuring Correctness of code
• Speed of compilation
• Preserve the correct meaning of the code
• Recognize legal and illegal program constructs
• Good error reporting/handling
• Code debugging help
Why use compilers?
• Efficient Code Execution
  • Compilers optimize code for faster execution and lower resource consumption.
  • Example: A compiler can rearrange code to make loops run more quickly.
• Error Detection
  • Compilers detect syntax and semantic errors and check that data types are used consistently.
  • Example: If you forget a semicolon, the compiler will point it out.
• Ease of Programming
  • Compilers allow the use of high-level programming languages that are easier to write, read, and maintain than low-level machine code.
• Cross-platform Development
  • Portability: Compilers can generate code for multiple platforms from a single source code base, enabling cross-platform software development.
  • Platform independence: High-level languages combined with compilers allow developers to write code once and run it on many platforms.
  • Example: Java source is compiled once to bytecode, which runs unchanged on Windows, Mac, and Linux wherever a Java Virtual Machine is available.
• Security
  • Static analysis: Compilers perform static analysis that can identify some potential security vulnerabilities in the code.
  • Type safety: Type checking ensures that operations are performed on compatible data types, reducing the risk of type-related errors and vulnerabilities.
Examples of compilers
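• javac – used for Java (compiles source to JVM bytecode)
• Microsoft Visual C++ Compiler (MSVC) – used for C/C++
• GCC (GNU Compiler Collection) – used for C/C++, among other languages
• CPython – used for Python (compiles source to bytecode, then interprets it)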
Phases of Compiler Design
• A compiler operates in phases. Each phase transforms the source program from one representation to another: it takes its input from the previous phase and feeds its output to the next phase.
• The six phases:
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generator
Code Optimizer
Code Generator
Summary of steps
• Lexical Analysis: Breaks the code down into tokens (keywords, identifiers, operators).
  • Tokenization, pattern matching, error handling
• Syntax Analysis: Checks the grammatical structure of the code.
• Semantic Analysis: Ensures the code is meaningful (for example, that types are used consistently).
• Intermediate Code Generation (ICG): Converts the code into an intermediate form.
• Optimization: Improves the efficiency of the code.
• Code Generation: Translates the code into machine language.
Phases of Compiler
Lexical Analysis (Scanning / Lexer)
• Reads the character string presented to it and divides it into tokens that are legal members of the vocabulary of the language in which the program is written. It may produce error messages if the character string cannot be divided into a sequence of legal tokens.
• Lexemes and their tokens for the assignment statement position = initial + rate * 60:
position   <id, 1>
=          <=>
initial    <id, 2>
+          <+>
rate       <id, 3>
*          <*>
60         <60>
NB: id is the symbol for an identifier, and the number is the identifier's entry in the symbol table.
• After lexical analysis, the assignment statement is represented as the sequence of tokens
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
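The token sequence above can be reproduced with a few lines of code. The following is a minimal sketch, not a production scanner: the function names `tokenize` and `format_tokens` and the token pattern are assumptions made for illustration, and only identifiers, integer literals, and the operators `= + - * /` are recognized.

```python
import re

# Hypothetical token pattern: an identifier, an integer literal,
# or a single-character operator, each preceded by optional whitespace.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+\-*/]))")

def tokenize(source):
    """Return (tokens, symbol_table) for a simple assignment statement."""
    symbol_table = []          # entry k (1-based) holds the k-th distinct identifier
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"cannot form a legal token at offset {pos}")
        pos = match.end()
        if match.group("id"):
            name = match.group("id")
            if name not in symbol_table:
                symbol_table.append(name)
            tokens.append(("id", symbol_table.index(name) + 1))
        elif match.group("num"):
            tokens.append((match.group("num"),))
        else:
            tokens.append((match.group("op"),))
    return tokens, symbol_table

def format_tokens(tokens):
    """Render tokens in the <name, attribute> notation used on the slide."""
    return " ".join("<" + ", ".join(str(part) for part in token) + ">" for token in tokens)

tokens, table = tokenize("position = initial + rate * 60")
print(format_tokens(tokens))   # <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
print(table)                   # ['position', 'initial', 'rate']
```

Each identifier is entered into the symbol table the first time it is seen, so a repeated identifier would reuse its existing entry number.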
More notes
• This component reads the source program, represented as a sequence of characters, usually from a file. It decomposes this sequence of characters into a sequence of lexical units of the programming language. These lexical units are called symbols (or tokens).
• Typical lexical units are keywords such as if, else, while or switch, and special characters and character combinations such as =, ==, !=, >, >=, (, ), {, } or comma and semicolon. These need to be recognized and converted into corresponding internal representations.
Lexeme identification
int → Recognized as a keyword.
x → Recognized as an identifier.
= → Recognized as an assignment operator.
10 → Recognized as a numeric literal.
; → Recognized as punctuation (semicolon).
Token generation
int → Token: `<keyword, int>`
x → Token: `<identifier, x>`
= → Token: `<operator, =>`
10 → Token: `<literal, 10>`
; → Token: `<punctuation, ;>`
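The two steps above (identify the lexeme, then attach a category) are often driven by a table of regular expressions. The sketch below assumes a tiny, made-up lexical specification: the category names match the slide, but the keyword set, the patterns, and the function name `lex` are illustrative choices, not a complete C lexer.

```python
import re

KEYWORDS = {"int", "return", "if", "else", "while"}   # small illustrative subset

# Order matters: the first alternative that matches at a position wins.
TOKEN_SPEC = [
    ("literal",     r"\d+"),
    ("identifier",  r"[A-Za-z_]\w*"),
    ("operator",    r"[=+\-*/]"),
    ("punctuation", r"[;(),{}]"),
    ("skip",        r"\s+"),
    ("error",       r"."),          # any other single character is illegal
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Return a list of (category, lexeme) pairs."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue                 # whitespace separates tokens but is not one
        if kind == "error":
            raise SyntaxError(f"unexpected character {lexeme!r}")
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"         # reserved words look like identifiers
        tokens.append((kind, lexeme))
    return tokens

for kind, lexeme in lex("int x = 10;"):
    print(f"<{kind}, {lexeme}>")
```

Keywords are matched with the identifier pattern and reclassified afterwards, which keeps the regular expressions short; a different lexical specification could equally list each keyword as its own pattern.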
Token and lexeme
• A token is a categorized unit of the input language. It represents a meaningful element of the language, such as a keyword, identifier, literal, operator, or punctuation mark. Tokens are the output of the lexical analyzer (or scanner) and are passed on to the parser.
Lexeme
• A lexeme is the actual sequence of characters that makes up a token. It is the specific textual representation of the token in the input program.
Identify the tokens and lexemes in the following C program
#include <stdio.h>
int main() {
    int x = 10;
    printf("The value of x is: %d\n", x);
    return 0;
}
Answers may vary depending on the lexical specification.
[Figure: the lexical analysis process for the input int x = 10;]
• Example: a = a * 2 * b * c * d
This statement might be syntactically well-formed, but if b and d are character strings, it is invalid. Compilers check for consistency of types.
Intermediate Representations (IRs)
• A compiler uses some set of data structures to represent the code that it processes. That form is called an intermediate representation, or IR.
• The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation.
• The tree in the previous slide shows the order in which the operations in the assignment
position = initial + rate * 60
are to be performed: rate * 60 is computed first, because * binds more tightly than +, and the product is then added to initial.
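A small recursive-descent parser is one way to build such a tree. The sketch below is an illustration under simplifying assumptions: the tree is a nested tuple `(operator, left, right)`, only `=`, `+` and `*` are supported, the function name `parse_assignment` is made up, and the regular expression silently skips characters it does not recognize (a real lexer would report them).

```python
import re

def parse_assignment(source):
    """Parse `name = expr` into a nested-tuple syntax tree."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():                      # factor -> identifier | number
        token = peek()
        if token is None or token in {"=", "+", "*"}:
            raise SyntaxError(f"expected an operand, got {token!r}")
        return advance()

    def term():                        # term -> factor ('*' factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():                        # expr -> term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = factor()
    if target.isdigit():
        raise SyntaxError("cannot assign to a number")
    if peek() != "=":
        raise SyntaxError("expected '='")
    advance()
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse_assignment("position = initial + rate * 60"))
# ('=', 'position', ('+', 'initial', ('*', 'rate', '60')))
```

Because `term` handles `*` below `expr`, the multiplication ends up deeper in the tree than the addition, which encodes the evaluation order without any explicit precedence table.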
Semantic Checking
• Semantic checking tests the program for static-semantic validity. It takes the intermediate code and the symbol table as input and determines whether the program satisfies the static-semantic properties required by the source language, e.g., whether identifiers are consistently declared and used. It may produce error messages if the program is semantically inconsistent or otherwise fails to satisfy the requirements of the language's definition.
• It uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.
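The array-index rule can be expressed as a short recursive check over the syntax tree. This is a sketch under stated assumptions: the node shapes (`("num", value)`, `("var", name)`, `("index", array, expr)`, `(op, left, right)`), the type names, and the function name `check_type` are invented for the example, and the symbol table is a plain dictionary. The checker is deliberately strict; many real compilers would insert an implicit conversion instead of rejecting mixed int/float operands.

```python
def check_type(node, symbols):
    """Return the type of an expression node, or raise TypeError."""
    kind = node[0]
    if kind == "num":                      # ("num", 3) or ("num", 2.5)
        return "int" if isinstance(node[1], int) else "float"
    if kind == "var":                      # ("var", "x")
        if node[1] not in symbols:
            raise TypeError(f"undeclared identifier {node[1]!r}")
        return symbols[node[1]]
    if kind == "index":                    # ("index", array_name, index_expr)
        array_type = symbols.get(node[1])
        if array_type is None or not array_type.startswith("array of "):
            raise TypeError(f"{node[1]!r} is not an array")
        if check_type(node[2], symbols) != "int":
            raise TypeError("array index must be an integer")
        return array_type[len("array of "):]
    if kind in ("+", "*"):                 # ("+", left, right)
        left = check_type(node[1], symbols)
        right = check_type(node[2], symbols)
        if left != right or left not in ("int", "float"):
            raise TypeError(f"operands of {kind!r} have types {left} and {right}")
        return left
    raise ValueError(f"unknown node kind {kind!r}")

symbols = {"a": "array of float", "i": "int", "x": "float"}
print(check_type(("index", "a", ("var", "i")), symbols))   # float
try:
    check_type(("index", "a", ("var", "x")), symbols)      # float used as index
except TypeError as error:
    print("error:", error)                                 # error: array index must be an integer
```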
Intermediate Code Generation
• Produces a low-level, machine-like intermediate representation: a simplified form of assembly code suitable for detailed analysis.
• Example in three-address form (each instruction has at most one operator on the right-hand side):
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
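Three-address code like the listing above can be produced by a post-order walk of the syntax tree, creating one temporary per interior node. The following is a minimal sketch: the function name `generate_tac`, the nested-tuple tree shape, and the explicit `("inttofloat", operand)` node (assumed to have been inserted by semantic analysis) are illustrative assumptions. Operands are emitted in source order.

```python
def generate_tac(target, tree):
    """Flatten an expression tree into a list of three-address instructions."""
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"t{counter}"

    def visit(node):
        if isinstance(node, str):          # leaf: identifier or constant
            return node
        if node[0] == "inttofloat":        # unary conversion node
            operand = visit(node[1])
            temp = new_temp()
            code.append(f"{temp} = inttofloat({operand})")
            return temp
        op, left, right = node             # binary operator node
        left_place = visit(left)
        right_place = visit(right)
        temp = new_temp()
        code.append(f"{temp} = {left_place} {op} {right_place}")
        return temp

    result = visit(tree)
    code.append(f"{target} = {result}")
    return code

# id1 = id2 + id3 * inttofloat(60)
tree = ("+", "id2", ("*", "id3", ("inttofloat", "60")))
for instruction in generate_tac("id1", tree):
    print(instruction)
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3
```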
Code Generation
• Transforms the intermediate code into equivalent machine code, either as a relocatable object module or as directly runnable object code.
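As a rough illustration of this step, the sketch below maps three-address instructions onto instructions for an imaginary register machine. Everything machine-specific here is hypothetical: the mnemonics `LOAD`, `STORE`, `ADD`, `MUL`, the registers `R1` and `R2`, and the function name `generate_assembly` do not belong to any real instruction set. The input is assumed to be a list of `(dest, op, arg1, arg2)` tuples with binary operators only, and no attempt is made to keep values in registers between instructions.

```python
OPCODES = {"+": "ADD", "*": "MUL"}     # hypothetical mnemonics

def generate_assembly(code):
    """Naively translate (dest, op, arg1, arg2) tuples, one at a time."""
    asm = []
    for dest, op, arg1, arg2 in code:
        asm.append(f"LOAD  R1, {arg1}")            # first operand into R1
        asm.append(f"LOAD  R2, {arg2}")            # second operand into R2
        asm.append(f"{OPCODES[op]:<5} R1, R1, R2") # R1 = R1 op R2
        asm.append(f"STORE {dest}, R1")            # write the result back
    return asm

# Hypothetical two-instruction input: t1 = id3 * 60.0 ; id1 = id2 + t1
for line in generate_assembly([("t1", "*", "id3", "60.0"), ("id1", "+", "id2", "t1")]):
    print(line)
```

A real code generator would also perform register allocation, for example keeping t1 in a register instead of storing and reloading it.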
Code Optimization
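• Produces better code:
  • Faster code with a shorter running time
  • Shorter code
  • More efficient code
• Example: eliminate the inttofloat conversion by replacing the integer 60 with the floating-point constant 60.0, and eliminate t3 by assigning the sum directly to id1:
t1 = id3 * 60.0
id1 = id2 + t1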
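The two improvements in this example can be written as a single pass over the three-address code. The sketch below is illustrative only: instructions are assumed to be `(dest, op, arg1, arg2)` tuples, the operation names `"inttofloat"` and `"copy"` and the function name `optimize` are made up, temporaries are assumed to be named `t1`, `t2`, ... and used exactly once, and the final renumbering step exists only to keep temporary names consecutive.

```python
def optimize(code):
    """code: list of (dest, op, arg1, arg2) tuples; returns an optimized list."""
    constants = {}                     # temporary name -> folded constant text
    result = []
    for dest, op, arg1, arg2 in code:
        # Substitute operands already known to be compile-time constants.
        arg1 = constants.get(arg1, arg1)
        arg2 = constants.get(arg2, arg2)
        if op == "inttofloat" and arg1.isdigit():
            constants[dest] = str(float(int(arg1)))   # 60 -> "60.0" at compile time
            continue
        # A copy of the temporary defined by the previous instruction:
        # let that instruction write directly into the copy's destination.
        # (Safe only because each temporary is assumed to be used once.)
        if op == "copy" and result and result[-1][0] == arg1:
            _, prev_op, prev1, prev2 = result.pop()
            result.append((dest, prev_op, prev1, prev2))
            continue
        result.append((dest, op, arg1, arg2))

    # Renumber the surviving temporaries so they run t1, t2, ... again.
    rename, renamed = {}, []
    for dest, op, arg1, arg2 in result:
        arg1, arg2 = rename.get(arg1, arg1), rename.get(arg2, arg2)
        if dest.startswith("t") and dest[1:].isdigit():
            rename[dest] = f"t{len(rename) + 1}"
            dest = rename[dest]
        renamed.append((dest, op, arg1, arg2))
    return renamed

unoptimized = [
    ("t1", "inttofloat", "60", None),
    ("t2", "*", "id3", "t1"),
    ("t3", "+", "id2", "t2"),
    ("id1", "copy", "t3", None),
]
for dest, op, arg1, arg2 in optimize(unoptimized):
    print(f"{dest} = {arg1} {op} {arg2}")
# t1 = id3 * 60.0
# id1 = id2 + t1
```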
Fundamental principles of compilation
• Correctness: The compiler must preserve the meaning
of the program being compiled.