
CS-411: Compiler Construction

1. Introduction
Why Compilers?
• Compiler
– A program that translates from one language to another
– It must preserve the semantics of the source
– It should produce an efficient program in the target
language
• In the beginning, there was machine language
– Writing and debugging code was ugly work
– Then came textual assembly
– Then high-level languages: Fortran, Pascal, C, C++
– Machine structures became too complex and software
management too difficult to continue with low-level
languages
Compiler learning
• Isn’t it an old discipline?
– Yes, it is a well-established discipline
– Its algorithms, methods and techniques were researched and
developed in the early stages of computer science
– There are many compilers around and many tools to
generate them automatically
• So, why do we need to learn it?
– You may never write a full compiler
– But the techniques we learn are useful in many tasks, like
writing an interpreter for a scripting language,
validation checking for forms, and so on
Why Study Compilers? (1)
• Become a better programmer(!)
– Insight into interaction between languages,
compilers, and hardware
– Understanding of implementation techniques
– Better intuition about what your code does
Why Study Compilers? (2)
• Compiler techniques are everywhere
– Parsing (little languages, interpreters, XML)
– Database engines, query languages
– AI: domain-specific languages
– Text processing
• TeX/LaTeX -> DVI -> PostScript -> PDF
– Hardware: VHDL; model-checking tools
– Mathematics (Mathematica, Matlab)
Why Study Compilers? (3)
• Fascinating blend of theory and engineering
– Direct applications of theory to practice
• Parsing, scanning, static analysis
– Some very difficult problems (NP-hard or
worse)
• Resource allocation, “optimization”, etc.
• Need to come up with good-enough
approximations/heuristics
Why Study Compilers? (4)
• Ideas from many parts of CS
– AI: Greedy algorithms, heuristic search
– Algorithms: graph algorithms, dynamic programming,
approximation algorithms
– Theory: Grammars, DFAs and PDAs, pattern matching,
fixed-point algorithms
– Systems: Allocation & naming, synchronization,
locality
– Architecture: pipelines, instruction set use, memory
hierarchy management
Why Study Compilers? (5)
• You might even write a compiler some day!
– You’ll almost certainly write parsers and
interpreters in some context if you haven’t
already
Course scope
• Aim:
– To learn techniques of a modern compiler
• Books:
– Des Watson, A Practical Approach to Compiler
Construction, Springer, 2017.
– Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D.
Ullman, Compilers: Principles, Techniques, and Tools,
Second Edition.
Prerequisites
• Problem Solving & Programming (C++,
Python)
• Data Structures
• Algorithm Design
• Theory of Automata
• Computer Architecture
• Assembly Programming
• Operating Systems
Topics
• High Level Languages
• Lexical analysis (Scanning)
• Syntax Analysis (Parsing)
• Syntax Directed Translation
• Intermediate Code Generation
• Run-time environments
• Code Generation
• Machine Independent Optimization
Grading policy
• Midterm (25 Marks)
• Final exam (50 Marks)
• Assignments (20 Marks)
• Quizzes (5 Marks)
Terminology
• Compiler:
– a program that translates an executable program in one
language into an executable program in another
language
– we expect the program produced by the compiler to be
better, in some way, than the original
• Interpreter:
– a program that reads an executable program and
produces the results of running that program
– usually, this involves executing the source program in
some fashion
• Our course is mainly about compilers but many of
the same issues arise in interpreters
Common Issues
• Compilers and interpreters both must read
the input – a stream of characters – and
“understand” it; this is analysis

w h i l e ( k < l e n g t h ) { <nl>
<tab> i f ( a [ k ] > 0 ) <nl>
<tab> <tab> { n P o s + + ; } <nl>
<tab> }
Interpreter
• Interpreter
– Execution engine
– Program execution interleaved with analysis
running = true;
while (running) {
    analyze next statement;
    execute that statement;
}
– Usually need repeated analysis of statements
(particularly in loops, functions)
– But: immediate execution, good debugging &
interaction
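
To make this loop concrete, here is a minimal Python sketch of the analyze-then-execute cycle. The tiny set/print/halt command language is invented purely for illustration and is not part of the course material:

    # Minimal sketch of an interpreter's analyze/execute loop.
    # The set/print/halt command language is invented for illustration.
    def interpret(program):
        env = {}                        # variable store
        for line in program:            # the "running" loop
            parts = line.split()        # analyze next statement
            if parts[0] == "set":       # e.g. "set x 5"
                env[parts[1]] = int(parts[2])
            elif parts[0] == "print":   # e.g. "print x"
                print(env[parts[1]])
            elif parts[0] == "halt":
                break                   # stop execution

    interpret(["set x 5", "print x", "halt"])   # prints 5

Note that the analysis (the split and the dispatch) is repeated every time a statement runs, which is exactly the overhead a compiler avoids.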
Compiler
• Read and analyze entire program
• Translate to semantically equivalent program in
another language
– Easier to execute or more efficient
– Should “improve” the program in some fashion
Typical Implementations
• Compilers
– FORTRAN, C, C++, Java, COBOL, etc. etc.
– Strong need for optimization in many cases
• Interpreters
– Perl, Python, Ruby, awk, sed, shells,
Scheme/Lisp/ML, PostScript/PDF, Java VM
– Particularly effective if interpreter overhead is
low relative to execution cost of individual
statements
Hybrid approaches
• Well-known example: Java
– Compile Java source to byte codes – Java Virtual
Machine language (.class files)
– Execution
• Interpret byte codes directly, or
• Compile some or all byte codes to native code
– Just-In-Time compiler (JIT) – detect hot spots & compile on the
fly to native code – standard these days
• Variation: .NET
– Compilers generate MSIL
– All IL compiled to native code before execution
High-Level Languages
• In the 1940s it was clear that there was a
need for software tools to support the
programming process.
• Programming was done in machine code
– required considerable skill and was hard work
• Assembly language
– relieving the programmer from having to deal
with much of the low-level detail
– but require an assembler
High-Level Languages (Cont..)
• Development of high-level languages
gathered speed in the 1950s.
• Need for compilers and other tools for the
implementation of these languages.
• The importance of formal language
specifications was recognized
• The correspondence between particular
grammar types and implementation was
understood
High-Level Languages (Cont..)
• The extensive use of high-level languages
prompted the rapid development of a wide
range of new languages.
– COBOL for business applications
– FORTRAN for numerical computation
– PL/I for general-purpose programming
Advantages of High-Level Languages

• Problem solving is significantly faster
• High-level language programs are
– Easier to read
– Easier to understand
– Easier to maintain
• High-level languages are easier to learn
• Programs are structured more easily
• Data structuring features
Advantages of High-Level Languages (Cont..)

• Object orientation
• Support for asynchronous processes and
parallelism
• Software portability
– Machine independence
– Java
Advantages of High-Level Languages (Cont..)

• Compile-time checking can remove many
bugs at an early stage
– variable declarations
– type checking
– variable initialization
– compatibility of function arguments
• The compiler can insert runtime code such
as array bound checking
Disadvantages of High-Level Languages

• The program may need to perform some
low-level, hardware-specific operations
which do not correspond to a high-level
language feature
• In most high-level languages there is no
way to express direct machine addressing
• Less efficiency in terms of execution speed
Abstract view

Source code --> [ Compiler ] --> Machine code
                      |
                    errors
• Recognizes legal (and illegal) programs
• Generates correct code
• Manages storage of all variables and code
• Agrees on a format for object (or
assembly) code
Structure of a Compiler
• First approximation
– Front end: analysis
• Read source program and understand its structure
and meaning
– Back end: synthesis
• Generate equivalent target language program

Source --> [ Front End ] --> [ Back End ] --> Target
Implications
• Must recognize legal programs (& complain about
illegal ones)
• Must generate correct code
• Must manage storage of all variables/data
• Must agree with OS & linker on target format

Source --> [ Front End ] --> [ Back End ] --> Target
More Implications
• Need some sort of Intermediate Representation(s)
(IR)
• Front end maps source into IR
• Back end maps IR to target machine code
• Often multiple IRs – higher level at first, lower
level in later phases

Source --> [ Front End ] --IR--> [ Back End ] --> Target
Front End
• Split into two parts
– Scanner: Responsible for converting character stream to
token stream
• Also strips out white space, comments
– Parser: Reads token stream; generates IR
• Both of these can be generated automatically
– Source language specified by a formal grammar
– Tools read the grammar and generate scanner & parser
(either table-driven or hard-coded)

source --> [ Scanner ] --tokens--> [ Parser ] --> IR
Front end
Source code --> [ Scanner ] --tokens--> [ Parser ] --> IR
                      |
                    errors

• Scanner:
– Maps characters into tokens – the basic unit of syntax
• x = x + y becomes <id, x> = <id, x> + <id, y>
– Typical tokens: number, id, +, -, *, /, do, end
– Eliminate white space (tabs, blanks, comments)
• A key issue is speed, so instead of using a tool like
LEX it is sometimes necessary to write your own
scanner
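
As a rough sketch of what a hand-written scanner can look like (Python is used here for brevity; the token kinds are invented for this example):

    # Minimal hand-written scanner sketch: characters -> (kind, value) tokens.
    def scan(source):
        tokens, i = [], 0
        while i < len(source):
            c = source[i]
            if c.isspace():                       # eliminate white space
                i += 1
            elif c.isalpha():                     # identifier: letter (letter|digit)*
                j = i
                while j < len(source) and source[j].isalnum():
                    j += 1
                tokens.append(("id", source[i:j]))
                i = j
            elif c.isdigit():                     # number: digit+
                j = i
                while j < len(source) and source[j].isdigit():
                    j += 1
                tokens.append(("number", source[i:j]))
                i = j
            elif c in "=+-*/":                    # single-character operators
                tokens.append((c, c))
                i += 1
            else:
                raise SyntaxError("unexpected character: " + c)
        return tokens

    print(scan("x = x + y"))   # [('id','x'), ('=','='), ('id','x'), ('+','+'), ('id','y')]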
Front end
Source code --> [ Scanner ] --tokens--> [ Parser ] --> IR
                      |
                    errors
• Parser:
– Recognize context-free syntax
– Guide context-sensitive analysis
– Construct IR
– Produce meaningful error messages
– Attempt error correction
• There are parser generators like YACC which
automate much of the work
Front end
• Context free grammars are used to represent
programming language syntaxes:

<expr> ::= <expr> <op> <term> | <term>
<term> ::= <number> | <id>
<op> ::= + | -
Front end
• A parser tries to map a
program to the syntactic
elements defined in the
grammar
• A parse can be
represented by a tree
called a parse or syntax
tree
• Example: x + 2 - y
Front end
• A parse tree can be
represented more
compactly in a form referred to as an
Abstract Syntax Tree
(AST)
• The AST is often used as the IR
between the front end and
back end
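
As a small illustration (a sketch, not a prescribed representation), the AST for x+2-y can be built from a tiny node class, with operators as interior nodes and identifiers or numbers as leaves:

    # Sketch: an AST for x + 2 - y, with operators as interior nodes.
    class Node:
        def __init__(self, label, left=None, right=None):
            self.label, self.left, self.right = label, left, right

    # (x + 2) - y: "-" is the root because + and - associate to the left
    ast = Node("-", Node("+", Node("x"), Node("2")), Node("y"))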
Back end
IR --> [ Instruction Selection ] --> [ Register Allocation ] --> Machine code
                          |
                        errors

• Translate IR into target machine code
• Choose instructions for each IR operation
• Decide what to keep in registers at each
point
• Ensure conformance with system interfaces
Back end
IR --> [ Instruction Selection ] --> [ Register Allocation ] --> Machine code
                          |
                        errors

• Produce compact, fast code
• Use available addressing modes
Back end
IR --> [ Instruction Selection ] --> [ Register Allocation ] --> Machine code
                          |
                        errors

• Have a value in a register when used
• Limited resources
• Optimal allocation is difficult
Traditional three pass compiler

Source code --> [ Front End ] --IR--> [ Middle End ] --IR--> [ Back End ] --> Machine code
                                            |
                                          errors

• Code improvement analyzes and changes the IR
• The goal is to reduce runtime
Middle end (optimizer)
• Modern optimizers are usually built as a set
of passes
• Typical passes
– Constant propagation
– Common sub-expression elimination
– Redundant store elimination
– Dead code elimination
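
As a hedged sketch of one such pass, here is constant propagation (with some folding) written as a single walk over straight-line code. The (dest, op, a, b) tuple IR is invented for this illustration; real compilers use richer IRs and handle control flow:

    # Sketch: constant propagation over a toy IR of (dest, op, a, b) tuples.
    def propagate_constants(ir):
        known = {}                              # variables holding known constants
        out = []
        for dest, op, a, b in ir:
            a = known.get(a, a)                 # replace known variables by constants
            b = known.get(b, b)
            if op == "const":
                known[dest] = a                 # dest now holds a constant
            elif op == "+" and isinstance(a, int) and isinstance(b, int):
                op, a, b = "const", a + b, None # fold the addition as well
                known[dest] = a
            else:
                known.pop(dest, None)           # dest is no longer a known constant
            out.append((dest, op, a, b))
        return out

    # x = 4; y = x + 1  becomes  x = 4; y = 5
    print(propagate_constants([("x", "const", 4, None), ("y", "+", "x", 1)]))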
Defining a Programming Language

• A language definition is important to users of the
programming language as well as to the compiler writer
• It tells programmers how to write programs in the language
• Syntax
– Defines the sequences of characters that can
form a valid program
• Semantics
– Defines the meaning of these programs
Defining a Programming Language

• The language definition should be
– Clear
– Precise
– Complete
– Unambiguous
– Understandable
• Programming languages have often been defined
using natural language, which introduces ambiguities
• Syntax is therefore usually defined using a more
formal approach
Backus–Naur Form (BNF)
• A metalanguage
• Became well known through its use in the definition
of the syntax of ALGOL 60
• It is a very simple yet powerful metalanguage
• Used extensively to support the formal
definitions of a huge range of languages
Backus–Naur Form (BNF)
• A BNF specification consists of a set of rules
• Each rule defines a symbol
– Non-terminal symbols
• written in angle brackets in BNF
– Terminal symbols
• cannot be expanded further
• the tokens of the language
– A start symbol
• the non-terminal from which derivations begin
A trivial language
• <sentence> ::= <subject> <verb> <object>
• <subject> ::= <article> <noun>
• <object> ::= <article> <noun> | <article>
<adjective> <noun>
• <verb> ::= watches | hears | likes
• <article> ::= a | the
• <noun> ::= man | woman | bicycle | book
• <adjective> ::= red | big | beautiful
Example
• <sentence>
• <subject><verb><object>
• <article><noun><verb><object>
• the <noun><verb><object>
• the woman <verb><object>
• the woman watches <object>
• the woman watches <article><adjective><noun>
• the woman watches a <adjective><noun>
• the woman watches a beautiful <noun>
• the woman watches a beautiful bicycle
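
To show that this derivation process is mechanical, here is a small sketch (in Python, invented for illustration) that repeatedly expands non-terminals using the rules of the trivial language above:

    # Sketch: deriving sentences from the trivial BNF language above.
    import random

    rules = {
        "<sentence>":  [["<subject>", "<verb>", "<object>"]],
        "<subject>":   [["<article>", "<noun>"]],
        "<object>":    [["<article>", "<noun>"],
                        ["<article>", "<adjective>", "<noun>"]],
        "<verb>":      [["watches"], ["hears"], ["likes"]],
        "<article>":   [["a"], ["the"]],
        "<noun>":      [["man"], ["woman"], ["bicycle"], ["book"]],
        "<adjective>": [["red"], ["big"], ["beautiful"]],
    }

    def derive(symbol):
        if symbol not in rules:                    # terminal: emit as-is
            return [symbol]
        alternative = random.choice(rules[symbol]) # pick one right-hand side
        return [w for s in alternative for w in derive(s)]

    print(" ".join(derive("<sentence>")))  # e.g. "the woman watches a beautiful bicycle"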
BNF for Expression
• Slightly more complicated example
– <expr> ::= <term> | <expr> + <term> |
<expr> - <term>
– <term> ::= <factor> | <term> * <factor> |
<term> / <factor>
– <factor> ::= <integer> | (<expr>)
– <integer> ::= <digit> | <integer> <digit>
– <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
– These rules generate expressions such as 2+3+4/5 and 2+3*(44-567)
BNF
• There is no upper limit to the length of
expressions that can be generated by these
particular rules.
• This is because these rules make use of
recursion
• Consider the generation or derivation of the
expression 1+2*3
BNF
• <expr>
• <expr> + <term>
• <term> + <term>
• <factor> + <term>
• 1 + <term>
• 1 + <term> * <factor>
• 1 + <factor> * <factor>
• 1 + 2 * <factor>
• 1+2*3
BNF
• Note particularly that in the expansion of
1+2*3, the expression was first expanded as
1 + <term>, with 2*3 grouped as a single entity
called a <term>
• BNF rules can thus be used to support the idea of
operator precedence
• The precedence of the * operator is higher than
the precedence of the + operator
• Careful construction of the BNF rules allows the
specification of the precedence of operators
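
To see the precedence fall out of the rules, here is a sketch of a recursive descent evaluator for this grammar (in Python; the left-recursive rules are rewritten as loops, a standard implementation choice, and a pre-tokenized input is assumed):

    # Sketch: evaluating expressions with the precedence the BNF encodes.
    def parse(tokens):
        pos = 0
        def peek():
            return tokens[pos] if pos < len(tokens) else None
        def expr():                        # <expr> ::= <term> { (+|-) <term> }
            nonlocal pos
            value = term()
            while peek() in ("+", "-"):
                op = tokens[pos]; pos += 1
                value = value + term() if op == "+" else value - term()
            return value
        def term():                        # <term> ::= <factor> { (*|/) <factor> }
            nonlocal pos
            value = factor()
            while peek() in ("*", "/"):
                op = tokens[pos]; pos += 1
                value = value * factor() if op == "*" else value / factor()
            return value
        def factor():                      # <factor> ::= <integer> | ( <expr> )
            nonlocal pos
            if peek() == "(":
                pos += 1                   # consume "("
                value = expr()
                pos += 1                   # consume ")"
                return value
            value = int(tokens[pos]); pos += 1
            return value
        return expr()

    print(parse(["1", "+", "2", "*", "3"]))   # 7: * binds tighter than +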
BNF
• BNF rules can be used to express the
associativity of operators.
• Consider the generation of 1+2+3
• Here, the 1+2 part of the expression is
grouped as a single entity (derived from the
<expr> in the rule <expr> ::= <expr> + <term>)
• Therefore, the expression 1+2+3 is
interpreted as (1+2)+3
• The + operator is left-associative.
BNF
• If different associativity or precedence rules
are required, then the BNF could be
modified to express these different rules.
• It is perhaps surprising that such a simple
metalanguage can do all this.
Extended Backus–Naur Form (EBNF)

• There are several different variants of the
low-level syntax of BNF-like
metalanguages, but one variant became
popular after its use in the ISO Pascal
Standard.
• This variant is called Extended Backus–
Naur Form (EBNF)
• It retains the basic principles of BNF but the
syntactic detail is a little different.
Extended Backus–Naur Form (EBNF)

• expr = term | expr "+" term | expr "-" term.
• term = factor | term "*" factor | term "/" factor.
• factor = integer | "(" expr ")".
• integer = digit | integer digit.
• digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
Extended Backus–Naur Form (EBNF)

• Some other key features:
– Parentheses can be used to indicate grouping in
the rule.
– There is a specific feature to indicate
optionality in a rule: [X] specifies zero or one
instance of X, in other words specifying that X
is optional.
– Repetition (not by recursion) is supported too:
{X} implies zero or more instances of X.
Extended Backus–Naur Form (EBNF)

• We can therefore write an essentially
equivalent set of rules:
• expr = term | expr ("+" | "-") term.
• term = factor | term ("*" | "/") factor.
• factor = integer | "(" expr ")".
• integer = digit {digit}.
• digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
Syntax Diagram
• A graphical representation of syntax
• A complete syntax definition is made up of a
set of syntax diagrams
Grammars
• The term “grammar” has a wide range of
definitions
• We need a tight and formal definition for it
when used in the context of programming
languages.
• The idea of a set of BNF rules to represent
the grammar of a language has already been
discussed, but formally a little more is
required.
Grammars
• The grammar (G) of a language is defined
as a 4-tuple G = (N, T, S, P) where:
– N is the finite set of non-terminal symbols.
– T is the finite set of terminal symbols (N and T
are disjoint.)
– S is the starting symbol, S ∈ N. The starting
symbol is the unique non-terminal symbol that
is used as the starting point for the generation
of all the strings of the language.
Grammars
• P is the finite set of production rules
– A production rule defines a string transformation and it has the general
form
• α → β.
– This rule specifies that any occurrence of the string α in the string to
be transformed is replaced by the string β.
• There are constraints on the constitution of the strings α
and β.
– If U is defined by U = N ∪ T
• U is the set of all non-terminal and terminal symbols
• then α has to be a member of the set of all non-empty strings that
can be formed by the concatenation of members of U, and it has to
contain at least one member of N.
• β has to be a member of the set of all strings that can be formed by
the concatenation of members of U, including the empty string
– β ∈ U∗
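
As an illustrative sketch (the Python encoding is invented here), a fragment of the earlier expression grammar can be written out directly as such a 4-tuple:

    # Sketch: G = (N, T, S, P) for a fragment of the expression grammar.
    N = {"expr", "term", "op"}                 # non-terminal symbols
    T = {"+", "-", "number", "id"}             # terminal symbols, disjoint from N
    S = "expr"                                 # start symbol, a member of N
    P = [                                      # production rules, alpha -> beta
        (("expr",), ("expr", "op", "term")),
        (("expr",), ("term",)),
        (("term",), ("number",)),
        (("term",), ("id",)),
        (("op",),   ("+",)),
        (("op",),   ("-",)),
    ]
    G = (N, T, S, P)

Because every left-hand side here is a single non-terminal, this grammar is of Chomsky type 2 (context-free), as described next.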
Chomsky Hierarchy

• The grammar types form a nested hierarchy:
Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0
Chomsky Hierarchy
• Chomsky type 0
– a free grammar or an unrestricted grammar
– α → β
• Chomsky type 1
– a context-sensitive grammar
– αAβ → αγβ
Chomsky Hierarchy
• Chomsky type 2
– a context-free grammar
– A→γ
– where A is a single non-terminal symbol.
– These productions correspond directly to BNF
rules.
• Chomsky type 3
– a regular grammar or a finite-state grammar
– A → a or A → aB
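
Type 3 grammars correspond directly to finite automata. As a sketch (the grammar S → aS | b is invented for illustration), recognizing the language of a regular grammar needs nothing more than a state variable:

    # Sketch: recognizing the regular grammar S -> aS | b (strings a...ab).
    def accepts(s):
        state = "S"
        for c in s:
            if state == "S" and c == "a":
                state = "S"              # apply S -> aS
            elif state == "S" and c == "b":
                state = "ACCEPT"         # apply S -> b
            else:
                return False             # no applicable rule: reject
        return state == "ACCEPT"

    print(accepts("aaab"))   # True
    print(accepts("abb"))    # False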
