Day - 1 Intro To Compilers

Principles of Compiler

Design
Yared Y
Outlines
• Intro to Compiling
• Lexical Analysis
• Syntax Analysis
• Syntax Directed Translation
• Symbol Tables & Type Checking
• Intermediate Code Generation(ICG)
• Run-Time Environment (CG)
• Code Generation
• Code Optimization
Text books
• Compiler Design: Syntactic and Semantic Analysis
Reinhard Wilhelm, Helmut Seidl, Sebastian Hack
• Introduction to Compiler Design
Torben Ægidius Mogensen
• Compiler Design: Analysis and Transformation
Reinhard Wilhelm, Helmut Seidl, Sebastian Hack
• Design and Implementation of Compiler
Singh R., Sharma V., Varshney M.
First things first….
• Compiler design is the process of creating a program
called a compiler, which translates code written in a
high-level programming language into machine
language that a computer can execute.

• The main goal of a compiler is to bridge the gap
between human-friendly programming languages and
the machine's binary code, making software
development easier and more efficient.
Why don’t we just write in
machine code?
• Complexity
• Binary code is complex for humans and error-prone
• Development speed
• Writing machine code is slow and inefficient
• Maintainability
• Difficult to understand; hard to debug existing code; lack of abstraction
• Portability
• Machine code is specific to a particular type of processor
• Productivity tools
• Limited tooling support
• Security and safety
• Vulnerability, as machine code requires detailed management of hardware resources
So….
• Because writing directly in machine code
is impractical due to its complexity and
lack of readability, we need a translator
like a compiler.

• A compiler serves several important functions to
bridge the gap between human-friendly high-level
programming languages and the machine code that
computers can execute.
Introduction to Compilers
• A compiler translates (or compiles) a program written in a high-
level programming language, that is suitable for human
programmers, into the low-level machine language that is
required by computers.

• Simply stated, a compiler is a program that can read a program
in one language (the source language) and translate it into an
equivalent program in another language (the target language)

• A compiler is a translator that converts a high-level language
(HLL) into machine language.
• A compiler is a special program that turns the code you
write in a programming language (like C or Java) into
machine language, which a computer can understand
and run.

• Example:
You write a program in a language like Java
The compiler takes that code and translates it into
machine code (ones and zeros) so your computer can
execute it.
Compilers
Features of Compilers
• Ensuring Correctness of code
• Speed of compilation
• Preserve the correct meaning of the code
• Recognize legal and illegal program constructs
• Good error reporting/handling
• Code debugging help
Why use compilers?
• Efficient Code Execution
• Compilers optimize code for faster execution & lower resource consumption.
• Example: A compiler can rearrange code to make loops run more quickly.
• Error Detection
• Compilers detect syntax & semantic errors & check for data types
• Example: If you forget a semicolon, the compiler will point it out.
• Ease of programming
• Compilers allow the use of high-level programming languages that are easier to write, read, and maintain
compared to low-level machine code.
• Cross-platform Dev’t
• Portability: Compilers can generate code for multiple platforms from a single source code base, enabling
cross-platform software development.
• Platform Independence: High-level languages combined with compilers allow developers to write code
once and run it anywhere.
• Example: Java code can run on Windows, Mac, and Linux without changes.
• Security
• Static Analysis: Compilers perform static analysis to identify potential security vulnerabilities in the code.
• Type Safety: Ensures that operations are performed on compatible data types, reducing the risk of type-
related errors and vulnerabilities.
Examples of compilers
• javac – used by Java
• Microsoft Visual C++ Compiler (MSVC) - used by C/C++
• GCC (GNU Compiler Collection) – used by C/C++
• CPython – Python's reference implementation (compiles source to bytecode, which is then interpreted)
Phases of Compiler Design
• A compiler operates in various phases; each phase transforms
the source program from one representation to another.
Every phase takes input from the previous phase and feeds
its output to the next phase of the compiler.
• 6 phases
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generator
Code Optimizer
Code Generator
Summary of steps
• Lexical Analysis: Breaks down code into tokens
(keywords, identifiers, operators).
• Tokenization, Pattern Matching, Error Handling
• Syntax Analysis: Checks the structure of the code.
• Semantic Analysis: Ensures the code makes sense.
• ICG: Converts code into an intermediate form.
• Optimization: Improves the efficiency of the code.
• Code Generation: Translates the code into machine
language
Phases of Compiler
Lexical Analysis
(Scanning/lexer)
• Analyzes the character string presented to it and divides it up
into tokens that are legal members of the vocabulary of the
language in which the program is written (and may produce
error messages if the character string is not parseable into a
string of legal tokens)

• Reads the stream of characters (making up the source
program) and groups the characters into meaningful sequences
called lexemes.
For each lexeme, the lexical analyzer produces as output a
token of the form:
< token-name, attribute-value>
• For example, suppose a source program contains the assignment
statement:
position = initial + rate * 60

lexemes and their tokens:
position → <id, 1>   (NB: id is an abstract symbol for identifier, and 1 points to this lexeme's entry in the symbol table)
= → <=>
initial → <id, 2>
+ → <+>
rate → <id, 3>
* → <*>
60 → <60>
• representation of the assignment statement after lexical
analysis as the sequence of tokens

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
More notes
• This component reads the source program represented as a
sequence of characters mostly from a file. It decomposes this
sequence of characters into a sequence of lexical units of the
programming language. These lexical units are called symbols.
• Typical lexical units are keywords such as if, else, while or
switch and special characters and character combinations such
as =, ==, !=, >, >=, (, ), {, } or comma and semicolon. These
need to be recognized and converted into corresponding internal
representations.

• The same holds for reserved identifiers such as the names of basic
types: int, float, double, char, bool or string, etc.
• Further symbols are identifiers and constants.
• Examples of identifiers are value42, abc, Myclass, x,
while the character sequences 42, 3.14159 and
"HalloWorld!" represent constants.
Example
int x = 10;

Lexeme identification
int → Recognized as a keyword.
x → Recognized as an identifier.
= → Recognized as an assignment operator.
10 → Recognized as a numeric literal.
; → Recognized as a punctuation (semicolon).
Token generation
int → Token: `<keyword, int>`
x → Token: `<identifier, x>`
= → Token: `<operator, =>`
10 → Token: `<literal, 10>`
; → Token: `<punctuation, ;>`
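The two steps above (lexeme identification, then token generation) can be sketched as a small regex-based scanner. This is a minimal illustration, not a production lexer (real scanners are typically generated from regular expressions by tools such as lex/flex); the token names follow the slide:

```python
import re

# Ordered token patterns: the keyword pattern comes before the identifier
# pattern so that "int" is not mis-classified as an identifier.
TOKEN_SPEC = [
    ("keyword",     r"\bint\b"),
    ("identifier",  r"[A-Za-z_]\w*"),
    ("literal",     r"\d+"),
    ("operator",    r"="),
    ("punctuation", r";"),
    ("skip",        r"\s+"),          # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-name, lexeme> pairs for the input character stream."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())

print(list(tokenize("int x = 10;")))
# → [('keyword', 'int'), ('identifier', 'x'), ('operator', '='),
#    ('literal', '10'), ('punctuation', ';')]
```

The ordering of the patterns implements the "keywords before identifiers" rule that any lexical specification must resolve one way or another.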
Token and lexeme
• A token is a categorized unit of the input language. It represents
a meaningful element of the language, such as a keyword, identifier,
literal, operator, or punctuation mark. Tokens are the output of the
lexical analyzer (or scanner) and are passed on to the parser.
• A lexeme is the actual sequence of characters that makes up a token.
It is the specific textual representation of the token in the input
program.
Identify the tokens and lexemes in
the following C program
#include <stdio.h>
int main() {
int x = 10;
printf("The value of x is: %d\n", x);
return 0;
}
Alternatives might vary based on
lexical specification
The lexical analysis process for the input int x = 10;:

Character Stream: The lexical analyzer reads the input
character by character, starting from the first character 'i' and
ending with the last character ';'.
Pattern Matching: As the lexical analyzer reads the characters,
it will try to match the characters against the predefined token
patterns. The following tokens will be recognized:
• The sequence "int" will be recognized as the KEYWORD_INT token.
• The sequence "x" will be recognized as the IDENTIFIER token.
• The character "=" will be recognized as the ASSIGN token.
• The sequence "10" will be recognized as the INTEGER_LITERAL token.
• The character ";" will be recognized as the SEMICOLON token.
• Token Stream: After the lexical analysis phase, the
input int x = 10; will be represented as the following
sequence of tokens:

KEYWORD_INT, IDENTIFIER(x), ASSIGN, INTEGER_LITERAL(10), SEMICOLON

• Here's how the lexical analysis process works step-by-step:
• The lexical analyzer starts by reading the first character 'i'.
• It compares the sequence "int" against the predefined patterns
and recognizes it as the KEYWORD_INT token.
• It then reads the next character 'x' and recognizes it as the
IDENTIFIER token.
• The character '=' is recognized as the ASSIGN token.
• The sequence "10" is recognized as the INTEGER_LITERAL
token.
• Finally, the character ';' is recognized as the SEMICOLON token.
Example: English Sentence
Sentence → Subject Verb Object endmark
E.g. Compilers are engineered objects.
Verb and endmark are parts of speech (p); sentence, subject
and object are syntactic variables.
• The first step in understanding the syntax of this sentence is to
identify distinct words in the input program and to classify each
word with a part of speech.
• In a compiler, this task falls to a pass called the scanner. The
scanner takes a stream of characters and converts it to a stream
of classified words, that is, pairs of the form (p, s), where p is
the word's part of speech and s is its spelling.
• Scanner: the compiler pass that converts a string of characters
into a stream of words
• A scanner would convert the example sentence into the following
stream of classified words:
(noun,“Compilers”), (verb,“are”), (adjective,“engineered”), (noun,“objects”), (endmark,“.”)
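A scanner for this toy example can be sketched with a hypothetical lexicon mapping each spelling to its part of speech (the classification below matches the slide's output):

```python
# Hypothetical lexicon: spelling -> part of speech.
LEXICON = {
    "Compilers": "noun",
    "are": "verb",
    "engineered": "adjective",
    "objects": "noun",
}

def scan(sentence):
    """Convert a sentence into a stream of (p, s) pairs, where p is
    the part of speech and s is the spelling."""
    words = sentence.rstrip(".").split()
    classified = [(LEXICON[w], w) for w in words]
    classified.append(("endmark", "."))
    return classified

print(scan("Compilers are engineered objects."))
# → [('noun', 'Compilers'), ('verb', 'are'), ('adjective', 'engineered'),
#    ('noun', 'objects'), ('endmark', '.')]
```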

• In the next step (parsing), the compiler tries to match the
stream of categorized words against the rules (the
grammar of the language) that specify syntax for the
input language

• Parser: the compiler pass that determines if the input
stream is a sentence in the source language
• However, sometimes, a grammatically (syntactically) correct
sentence can be meaningless. This is where semantic analysis
comes into play.
Example: Rocks are green vegetables.
• Solution: Semantic Analysis and type checking
• Type checking: the compiler pass that checks for type-
consistent uses of names in the input program

• Example: a ← a * 2 * b * c * d
This expression might be syntactically well-formed, but if b and d
are character strings, it will be invalid. Compilers check for
consistency of type.
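This type-consistency rule can be sketched in Python, assuming a hypothetical symbol table in which b and d are declared as character strings (all names here are illustrative, taken from the slide's expression):

```python
# Hypothetical symbol table: name -> declared type.
SYMBOLS = {"a": "int", "b": "string", "c": "int", "d": "string"}

def check_product(operands):
    """Check one rule: every operand of '*' must be numeric.
    Returns a list of error messages (empty if the expression
    type-checks)."""
    errors = []
    for name in operands:
        # Integer literals are numeric; other names are looked up.
        op_type = "int" if name.isdigit() else SYMBOLS.get(name)
        if op_type != "int":
            errors.append(f"type error: '{name}' has type {op_type}, "
                          f"but '*' requires a numeric operand")
    return errors

for err in check_product(["a", "2", "b", "c", "d"]):
    print(err)
# → reports errors for 'b' and 'd' (both are character strings)
```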
Intermediate Representations (IRs)
• A compiler uses some set of data structures to
represent the code that it processes. That form is called
an intermediate representation, or IR

• The front end focuses on understanding the source-language
program.
• The back end focuses on mapping programs to the
target machine
• The front end must encode its knowledge of the source
program in some structure for later use by the back end.
• This intermediate representation (IR) becomes the
compiler’s definitive representation for the code it is
translating.
• At each point in compilation, the compiler will have a
definitive representation.
• The front end must ensure that the source program is well
formed, and it must map that code into the IR.
• The back end must map the IR program into the instruction
set and the finite resources of the target machine.
• The intermediate code uses temporary variables (t1, t2,
t3) to hold the intermediate results of various
operations, such as assignment, function call, and
parameter passing.
IRs Example
• Example: a ← a * 2 * b * c * d
t0 ← a × 2
t1 ← t0 × b
t2 ← t1 × c
t3 ← t2 × d
a ← t3
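The lowering shown above can be sketched as a small helper that introduces a fresh temporary for each multiplication; this is a hypothetical illustration of how a front end might emit such an IR, not a real code generator:

```python
import itertools

def lower_assignment(target, operands):
    """Lower `target <- operand0 * operand1 * ...` into a list of
    three-address instructions, one fresh temporary per operation."""
    temps = (f"t{i}" for i in itertools.count())
    code = []
    acc = operands[0]
    for operand in operands[1:]:
        temp = next(temps)
        code.append(f"{temp} <- {acc} * {operand}")
        acc = temp
    code.append(f"{target} <- {acc}")
    return code

for line in lower_assignment("a", ["a", "2", "b", "c", "d"]):
    print(line)
# t0 <- a * 2
# t1 <- t0 * b
# t2 <- t1 * c
# t3 <- t2 * d
# a <- t3
```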
Syntactic analysis (Parsing)
• Aka parsing, which processes the sequence of tokens and produces
an intermediate-level representation, such as a parse tree or a
sequential intermediate code, and a symbol table that records the
identifiers used in the program and their attributes (and may produce
error messages if the token string contains syntax errors);

• The parser uses the first components of the tokens produced by the
lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token
stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation
• The tree in the previous slide shows the order in which
the operations in the assignment
position = initial + rate * 60
are to be performed
Semantic Checking
• checking of the program for static-semantic validity (or
semantic checking), which takes as input the
intermediate code and symbol table and determines
whether the program satisfies the static-semantic
properties required by the source language, e.g.,
whether identifiers are consistently declared and used
(and may produce error messages if the program is
semantically inconsistent or fails in some other way to
satisfy the requirements of the language’s definition)
• Uses the syntax tree and the information in the symbol
table to check the source program for semantic
consistency with the language definition
• An important part of semantic analysis is type
checking, where the compiler checks that each
operator has matching operands. For example, many
programming language definitions require an array
index to be an integer; the compiler must report an
error if a floating-point number is used to index an
array.
Intermediate code Generation
• Low-level or machine-like intermediate representation
• a simplified form of assembly code suitable for detailed
analysis.

t1 = inttofloat(60)
t2 = t1 * id3
t3 = t2 + id2
id1 = t3

• These are three-address instructions
Code Generation
• transforms the intermediate code into equivalent
machine code in the form of a relocatable object
module or directly runnable object code.
Code Optimization
• Better code
• Faster code, shorter running time
• Shorter code
• Efficient code
• Example: eliminate the inttofloat conversion by
replacing 60 by 60.0
eliminate t3 by replacing with id1 = id2 + t1
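The first rewrite (folding inttofloat(60) into the constant 60.0) can be sketched as one pass over a tuple-based three-address code. This is a minimal, hypothetical illustration of constant folding; a real optimizer applies many more such rules:

```python
def fold_inttofloat(code):
    """Eliminate `t = inttofloat(c)` for an integer constant c by
    folding the conversion at compile time and substituting the
    float constant into later instructions."""
    out, env = [], {}
    for dest, op, *args in code:
        if op == "inttofloat" and args[0].isdigit():
            env[dest] = args[0] + ".0"   # fold: 60 -> 60.0
            continue                     # the instruction is eliminated
        out.append((dest, op, *[env.get(a, a) for a in args]))
    return out

tac = [("t1", "inttofloat", "60"),
       ("t2", "*", "t1", "id3"),
       ("t3", "+", "t2", "id2"),
       ("id1", "=", "t3")]
for instr in fold_inttofloat(tac):
    print(instr)
# ('t2', '*', '60.0', 'id3')
# ('t3', '+', 't2', 'id2')
# ('id1', '=', 't3')
```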
Fundamental principles of
compilation
• Correctness: The compiler must preserve the meaning
of the program being compiled.

• The compiler must improve the input program in some
discernible way.
Why learn compilers?
• Deep understanding of programming languages
• Well-Cultured in Computer Science
• Understanding compilers is considered essential for being
well-rounded in computer science.
• Tool mastery: A Good Craftsman Knows His Tools
• Need for Writing Compilers or Interpreters
• Customization
• Specialized needs
• Innovation
Reading Assignment 1
• What is the difference between a compiler and
an interpreter?
• What are the advantages of
(a) a compiler over an interpreter
(b) an interpreter over a compiler?
• Read about Assemblers
Question
• What is a self-hosting compiler? What is self-hosting?
• Why would we want a self-hosting compiler?
• What is a bootstrap compiler?
