1908401-PRINCIPLES OF COMPILER DESIGN

UNIT 1
INTRODUCTION TO COMPILERS
COMPILER:
A compiler is software that converts source code into object code. In other
words, it converts a high-level language into machine (binary) language.

PHASES OF COMPILER/STRUCTURE OF COMPILER:


The compilation process is a sequence of phases. Each phase takes the source
program in one representation and produces output in another representation,
and each phase takes its input from the previous one.
The various phases of a compiler are:
Lexical Analysis:
Lexical analysis is the first phase of the compilation process. It takes source
code as input, reads the source program one character at a time, and converts it
into meaningful lexemes. The lexical analyzer represents these lexemes in the
form of tokens.
Syntax Analysis
Syntax analysis is the second phase of the compilation process. It takes tokens
as input and generates a parse tree as output. In this phase, the parser checks
whether the expression formed by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis is the third phase of the compilation process. It checks whether
the parse tree follows the rules of the language. The semantic analyzer keeps
track of identifiers, their types and expressions. The output of this phase is
the annotated syntax tree.
Intermediate Code Generation
In intermediate code generation, the compiler translates the source program into
an intermediate code. The intermediate code lies between the high-level
language and the machine language, and it should be generated in such a way
that it can easily be translated into the target machine code.
Code Optimization
Code optimization is an optional phase. It improves the intermediate code so
that the program runs faster and takes less space. It removes unnecessary lines
of code and rearranges the sequence of statements in order to speed up program
execution.
Code Generation
Code generation is the final stage of the compilation process. It takes the
optimized intermediate code as input and maps it to the target machine
language. Code generator translates the intermediate code into the machine code
of the specified computer.
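As an illustration (this worked example is the classic one used in compiler
textbooks, not part of the notes above), consider how the assignment statement
position = initial + rate * 60 moves through the phases:
Lexical analysis produces the token stream id1 = id2 + id3 * 60.
Syntax analysis builds a parse tree for id1 = id2 + (id3 * 60).
Semantic analysis inserts a type conversion, giving id1 = id2 + id3 * inttofloat(60).
Intermediate code generation emits three-address code such as
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code optimization reduces this to
t1 = id3 * 60.0
id1 = id2 + t1
Code generation maps the result to target instructions, for example on a
hypothetical register machine:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1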
Symbol Table Management:
A symbol table is a data structure that contains a record for each identifier,
with fields for the attributes of the identifier. The identifier is detected
during the lexical analysis phase and entered into the symbol table; its
attributes, such as its type, are filled in by later phases.
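A minimal sketch of such a table in C is shown below (the field names, sizes,
and the helper functions lookup/insert are illustrative assumptions, not a
prescribed layout):

#include <string.h>

#define MAX_SYMBOLS 100

/* One record per identifier: the lexeme plus a few attribute fields. */
struct symbol {
    char name[32];   /* lexeme of the identifier                        */
    char type[16];   /* e.g. "int", "float"; filled in by later phases  */
    int  scope;      /* nesting level at which it was declared          */
};

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Return the index of an identifier, or -1 if it is not in the table. */
int lookup(const char *name) {
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert an identifier if it is not already present; return its index. */
int insert(const char *name, const char *type, int scope) {
    int i = lookup(name);
    if (i >= 0) return i;                       /* already entered */
    if (n_symbols >= MAX_SYMBOLS) return -1;    /* table full      */
    strncpy(table[n_symbols].name, name, sizeof table[n_symbols].name - 1);
    strncpy(table[n_symbols].type, type, sizeof table[n_symbols].type - 1);
    table[n_symbols].scope = scope;
    return n_symbols++;
}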
ERROR HANDLING:
The tasks of the error handling process are to detect each error, report it to
the user, and then devise and implement a recovery strategy to handle it.
During this whole process, the processing time of the program should not slow
down noticeably.
Functions of Error Handler:
 Error Detection
 Error Report
 Error Recovery
Error handler=Error Detection+Error Report+Error Recovery.
ROLE OF LEXICAL ANALYZER:
Lexical analysis is the first phase of the compiler; the lexical analyzer is also
known as a scanner. It converts the high-level input program into a sequence of
tokens.
 Lexical Analysis can be implemented with the Deterministic finite
Automata.
 The output is a sequence of tokens that is sent to the parser for syntax
analysis
Token
It is basically a sequence of characters that are treated as a unit as it cannot be
further broken down. There are different types of tokens:

o Identifiers (user-defined)
o Delimiters/ punctuations (;, ,, {}, etc.)
o Operators (+, -, *, /, etc.)
o Special symbols
o Keywords
o Numbers

Lexemes: A lexeme is a sequence of characters in the source program that matches
the pattern of a token. For example, ( and ) are lexemes of the token type
punctuation.

Patterns: A pattern is the set of rules a scanner follows to match a lexeme in
the input program and identify a valid token. It is, in effect, the lexical
analyzer's description of a token used to validate a lexeme. For example, for a
keyword the characters of the keyword itself form the pattern, while for an
identifier the predefined rules for forming an identifier are the pattern.

Everything that a lexical analyzer has to do:

1. Strip out comments and white space from the program
2. Read the input program and divide it into valid tokens
3. Find lexical errors
4. Return the sequence of valid tokens to the syntax analyzer
5. When it finds an identifier, make an entry for it in the symbol table
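For instance, in the statement count = count + 1; the lexemes are count, =,
count, +, 1 and ;, and the corresponding tokens are identifier, operator,
identifier, operator, number and punctuation.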

Lexical Errors
A character sequence which cannot be scanned into any valid token is a
lexical error. Important facts about lexical errors:
 Lexical errors are not very common, but they should be managed by the
scanner
 Misspellings of identifiers, operators and keywords are considered lexical
errors
 Generally, a lexical error is caused by the appearance of some illegal
character, mostly at the beginning of a token.

Error recovery for lexical errors:

Panic Mode Recovery
 In this method, successive characters from the input are removed one at a
time until a designated set of synchronizing tokens is found. Synchronizing
tokens are delimiters such as ; or }
 The advantage is that it is easy to implement and it guarantees not to go into
an infinite loop
 The disadvantage is that a considerable amount of input is skipped without
checking it for additional errors
Error-recovery actions are:
1. Transpose two adjacent characters.
2. Insert a missing character into the remaining input.
3. Replace a character with another character.
4. Delete one character from the remaining input.
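For example (an illustration, not part of the original notes), the misspelled
keyword whle can be repaired by inserting the missing character i to obtain
while, and fi written in place of if can be repaired by transposing the two
adjacent characters.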

ISSUES IN LEXICAL ANALYZER:


Complexity: Lexical analysis can be complex and require a lot of
computational power. This can make it difficult to implement in some
programming languages.
Limited Error Detection: While lexical analysis can detect certain types of
errors, it cannot detect all errors. For example, it may not be able to detect
logic errors or type errors.
Increased Code Size: The addition of keywords and reserved words can
increase the size of the code, making it more difficult to read and understand.
Reduced Flexibility: The use of keywords and reserved words can also reduce
the flexibility of a programming language. It may not be possible to use certain
words or phrases in a way that is intuitive to the programmer.

There are several reasons for separating the analysis phase of compiling into
lexical analysis and parsing:

1) Simpler design is the most important consideration. The separation of lexical
analysis from syntax analysis often allows us to simplify one or the other of
these phases.
2) Compiler efficiency is improved.
3) Compiler portability is enhanced.

RECOGNITION OF TOKENS:
 Tokens obtained during lexical analysis are recognized by Finite
Automata.
 Finite Automata (FA) is a simple idealized machine that can be used to
recognize patterns within input taken from a character set or alphabet.
 The primary task of an FA is to accept or reject an input based on
whether the defined pattern occurs within the input.
 There are two notations for representing Finite Automata. They are:
 Transition Table
 Transition Diagram

TRANSITION TABLE
It is a tabular representation that lists all possible transitions for each state
and input symbol combination.
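As a small illustration (a hand-built sketch in C, assuming identifiers of the
form letter(letter|digit)*; the state numbering and input classes are assumptions
for this example), a transition table can drive a recognizer directly:

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = inside an identifier (accepting), 2 = dead.
   Input classes: 0 = letter, 1 = digit, 2 = anything else.           */
static const int move[3][3] = {
    /* letter digit other */
    {  1,     2,    2 },   /* state 0 */
    {  1,     1,    2 },   /* state 1 */
    {  2,     2,    2 }    /* state 2 */
};

static int input_class(char c) {
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Returns 1 if s is a valid identifier, 0 otherwise. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++)
        state = move[state][input_class(*s)];
    return state == 1;
}

int main(void) {
    printf("%d\n", is_identifier("rate1"));   /* 1: accepted */
    printf("%d\n", is_identifier("1rate"));   /* 0: rejected */
    return 0;
}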
EXAMPLE
Assume the following grammar fragment is used to generate a specific language,
where the terminals if, then, else, relop, id and num generate sets of strings
given by the regular definitions below, and where letter and digit are defined
as letter → [A-Za-z] and digit → [0-9].
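(The grammar fragment and regular definitions themselves did not survive in
these notes; the standard versions from the Dragon Book, which the surrounding
text appears to describe, are reproduced here as a reconstruction.)

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?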
For this language, the lexical analyzer will recognize the keywords if,
then, and else, as well as lexemes that match the patterns for relop, id, and
number.
To simplify matters, we make the common assumption that keywords are
also reserved words: that is, they cannot be used as identifiers.
The token num represents unsigned integer and real numbers, as in Pascal.
In addition, we assume lexemes are separated by white space, consisting
of nonnull sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space. It will do so by comparing
a string against the regular definition ws below:
ws → ( blank | tab | newline )+
If a match for ws is found, the lexical analyzer does not return a token to
the parser.
TRANSITION DIAGRAM
 It is a directed labeled graph consisting of nodes and edges. Nodes
represent states, while edges represent state transitions.
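In a lexical analyzer, a transition diagram is usually implemented directly as
code, with one branch per state. A minimal sketch in C for the relop diagram is
given below (the state numbers and the token names LT, LE, NE, EQ, GE, GT are
illustrative assumptions, not a fixed interface):

#include <stdio.h>

enum relop { LT, LE, NE, EQ, GE, GT, NOT_RELOP };

/* Scans a relational operator starting at s[*pos].
   On success, advances *pos past the lexeme and returns the token attribute;
   when the character after '<' or '>' is not part of the operator, the
   function simply does not consume it (the "retract" action of the diagram). */
enum relop scan_relop(const char *s, int *pos) {
    int state = 0;
    for (;;) {
        char c = s[*pos];
        switch (state) {
        case 0:                                /* start state */
            if (c == '<')      { state = 1; (*pos)++; }
            else if (c == '=') { (*pos)++; return EQ; }
            else if (c == '>') { state = 2; (*pos)++; }
            else return NOT_RELOP;
            break;
        case 1:                                /* just saw '<' */
            if (c == '=')      { (*pos)++; return LE; }
            if (c == '>')      { (*pos)++; return NE; }
            return LT;                         /* '<' alone, retract */
        case 2:                                /* just saw '>' */
            if (c == '=')      { (*pos)++; return GE; }
            return GT;                         /* '>' alone, retract */
        }
    }
}

int main(void) {
    int pos = 0;
    printf("%d\n", scan_relop("<=x", &pos));   /* prints 1 (LE); pos is now 2 */
    return 0;
}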
SPECIFICATION OF TOKENS:
TOKENS:
A token is the smallest individual element of a program that is meaningful to the
compiler. It cannot be broken down further. Identifiers, strings, keywords, etc.,
are examples of tokens. In the lexical analysis phase of the compiler, the
program is converted into a stream of tokens.
Different Types of Tokens
There can be multiple types of tokens. Some of them are-

Keywords
Keywords are words reserved for particular purposes and imply a special
meaning to the compilers. The keywords must not be used for naming a
variable, function, etc.
Identifier
The names given to various components in the program, like the function's
name or variable's name, etc., are called identifiers. Keywords cannot be
identifiers.
Operators
Operators are different symbols used to perform different operations in a
programming language.
Punctuations
Punctuations are special symbols that separate different code elements in a
programming language.
Consider the following line of code:
int x = 45;
The above statement has multiple tokens, which are:
Keyword: int
Identifier: x
Constant (number): 45
Operator: =
Punctuator: ;

SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Languages
3) Regular expressions

Strings and Languages


 An alphabet or character class is a finite set of symbols.
 A string over an alphabet is a finite sequence of symbols drawn from
that alphabet.
 A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as
synonyms for "string." The length of a string s, usually written |s|, is the number
of occurrences of symbols in s. For example, banana is a string of length six.
The empty string, denoted ε, is the string of length zero.

Operations on strings
The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban is a prefix of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana is a suffix of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s.
For example, nan is a substring of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are neither ε nor
s itself.

5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive symbols of s. For example, baan is a subsequence of
banana.

Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows the operations on languages: let L = {0, 1} and
S = {a, b, c}.
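The table that originally followed is missing here; for these sets the results
would be: the union L ∪ S = {0, 1, a, b, c}; the concatenation
LS = {0a, 0b, 0c, 1a, 1b, 1c}; the Kleene closure L* = the set of all strings
of 0's and 1's, including ε; and the positive closure L+ = the set of all
strings of 0's and 1's, excluding ε.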

Regular Expressions
 The language accepted by finite automata can be easily described by
simple expressions called regular expressions. They are the most effective
way to represent any such language.
 The languages accepted by some regular expression are referred to as
regular languages.
 A regular expression can also be described as a sequence of patterns that
defines a string.
 Regular expressions are used to match character combinations in strings.
String-searching algorithms use these patterns to locate matches in a
string.

· Each regular expression r denotes a language L(r).

· Here are the rules that define the regular expressions over some
alphabet Σ and the languages that those expressions denote:
1.ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole
member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is,
the language with one string, of length one, with ‘a’ in its one position.
3.Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4.The unary operator * has highest precedence and is left associative.
5.Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
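Under these conventions, for example, the expression (a)|((b)*(c)) may be
written simply as a|b*c; both denote the language containing the string a
together with all strings of zero or more b's followed by a single c.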

Regular set

A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.

There are a number of algebraic laws for regular expressions that can be used to
manipulate them into equivalent forms.
For instance, r|s = s|r (union is commutative) and r|(s|t) = (r|s)|t (union is
associative).

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is
an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form
d1 → r1
d2 → r2
………
dn → rn
where:
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.

Example: Identifiers are the set of strings of letters and digits beginning with a
letter. The regular definition for this set is:

letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *

Shorthands

Certain constructs occur so frequently in regular expressions that it is
convenient to introduce notational shorthands for them.
1. One or more instances (+):
- The unary postfix operator + means “one or more instances of”.
- If r is a regular expression that denotes the language L(r), then (r)+ is a
regular expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.

2. Zero or one instance (?):
- The unary postfix operator ? means “zero or one instance of”.
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the
language L(r) ∪ { ε }.

3. Character Classes:
- The notation [abc] where a, b and c are alphabet symbols denotes the regular
expression a | b | c.
- Character class such as [a – z] denotes the regular expression a | b | c | d | ….|z.
- We can describe identifiers as being strings generated by the regular
expression, [A–Za–z][A– Za–z0–9]*

Non-regular Set

A language which cannot be described by any regular expression is a
non-regular set. Example: the set of all strings of balanced parentheses and the
set of repeating strings cannot be described by a regular expression; such sets
can be specified by a context-free grammar.

LEX IN COMPILER DESIGN


Lex is a program used to generate scanners or lexical analyzers, also called
tokenizers. These tokenizers identify the lexical patterns in the input program
and convert the input text into a sequence of tokens. Lex is typically used
together with the YACC parser generator.
Eric Schmidt and Mike Lesk initially developed the code for Lex, which was
intended for Unix-based systems.

The function of Lex is as follows:

1. Lexical Analyzer Creation: The process begins by creating a program,
conventionally named lex.l, in the Lex language. This program defines the rules
and patterns for recognizing tokens in the source code.

2. Lex Compiler Execution: The lex.l program is then run through the Lex
compiler. This step generates a C program named lex.yy.c.

3. C Compiler Execution: The C compiler is then used to compile the generated
lex.yy.c program. The result is an object program referred to as a.out.

4. Lexical Analysis: The a.out object program is essentially a lexical analyzer.
When this program is run, it takes an input stream of source code and transforms
it into a sequence of tokens based on the rules defined in the original lex.l
program.
LEX Source Program
A LEX source program is a collection of instructions and patterns. These
instructions and patterns are written in the LEX programming language. LEX is
a tool used for generating lexical analyzers. It tokenizes input source code into
meaningful units called tokens.
The LEX source program defines how these tokens are recognized and
processed. It consists of regular expressions that describe the patterns of tokens
and corresponding actions to be taken when those patterns are encountered in
the source code. This program serves as a set of rules that instructs LEX on how
to break down a stream of characters from the input source code into tokens.
These tokens can represent keywords, identifiers, operators, and other language
constructs. There are two main components of this Lex source program:
1. Auxiliary Definitions: These are often located in the "Definitions
Section" of the LEX source program. They are user-defined macros or
variables that simplify the expression of regular expressions and actions in
the program.

2. Translation Rules: These are commonly found in the "Rules Section" of
the LEX source program. They establish the mapping between patterns and
actions. Each translation rule consists of a regular expression pattern
followed by the action to be executed when that pattern is matched in the
input source code.
Lex file format

A Lex program is separated into three sections by %% delimiters. The format of
the Lex source is as follows:

{ definitions }
%%
{ rules }
%%
{ user subroutines }

Definitions include declarations of constants, variables and regular definitions.

Rules are statements of the form p1 {action1} p2 {action2} .... pn {actionn},
where each pi is a regular expression and each actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. These
subroutines can be compiled separately and loaded with the lexical analyzer.

Working of Lex

Lex generates a lexical analyzer in multiple steps. First we create a file that
describes the lexical analyzer to be generated. This file is written in the Lex
language and has a .l extension. The Lex compiler converts this program into a C
file called lex.yy.c. The C compiler then compiles this C file into an a.out
executable. This a.out file is our working lexical analyzer, which will produce
the stream of tokens based on the input text.

Regular Expressions
As we have discussed, we use regular expressions to define the rules in Lex and
to check whether the input string matches a pattern. A regular expression
represents a finite pattern over a finite set of symbols. Regular expressions
define a regular grammar, and a regular language is defined by a regular
grammar.
Conflicts in Lex
Conflicts arise in Lex if a given string matches two or more different rules in
Lex. Or when an input string has a common prefix with two or more rules. This
causes uncertainty for the lexical analyzer, so it needs to be resolved.
Conflict resolution
Conflicts in Lex can be resolved by following two rules-
 The longer prefix should be preferred over a shorter prefix.

 Pick the pattern listed first in the Lex program if the longest possible prefix
corresponds to two or more patterns.
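For example (with illustrative token names), given the rules if {return IF;}
and [a-z]+ {return ID;}, the input ifelse is returned as a single identifier
because the second rule matches the longer prefix, whereas the input if matches
both patterns with the same length and is resolved in favour of the rule listed
first, i.e. the keyword.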
The Architecture of Lexical Analyzer
The task of the lexical analyzer is to read the input characters in the source
code and produce tokens one by one. The scanner produces tokens when they are
requested by the parser. The lexical analyzer also skips whitespace and comments
while creating tokens. If any error occurs, the analyzer correlates these errors
with the source file and the line number.
EXAMPLE:

%{
#include<stdio.h>
#include<string.h>
int count = 0;
%}

/* Rules Section*/
%%
([a-zA-Z0-9])+ {count++;} /* Rule for counting number of words; + (one or more) avoids matching the empty string */

"\n" {printf("Total Number of Words : %d\n", count); count = 0;}


%%

int yywrap(void){ return 1; }

int main()
{
// The function that starts the analysis
yylex();

return 0;
}

How to execute
Type lex lexfile.l

Type gcc lex.yy.c

Type ./a.out
Explanation
In the definition section of the program, we have included the standard libraries for input-
output operations and string operations, and a variable count to keep count of the number of
words.
In the rules section, we have specified the regular expression ([a-zA-Z0-9])+, which matches
any string formed from letters and numeric digits, such as “AYUSHI28”, “BOND007”, etc.
There is a rule for a newline too: when a newline character is encountered, the current count
of words is printed, and the counter is set to zero.

FINITE AUTOMATA:
 Finite automata are used to recognize patterns.
 A finite automaton takes a string of symbols as input and changes its state
accordingly. When a desired symbol is found, a transition occurs.
 At the time of transition, the automaton can either move to the next state or
stay in the same state.
 A finite automaton either accepts or rejects an input string. When the input
string is processed successfully and the automaton reaches a final state, the
string is accepted.

Formal Definition of FA
A finite automaton is a collection of 5 tuples (Q, ∑, δ, q0, F), where:

1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: transition function
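For example, a DFA that accepts all binary strings ending in 1 can be written as
Q = {q0, q1}, ∑ = {0, 1}, initial state q0, F = {q1}, and δ given by
δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 0) = q0, δ(q1, 1) = q1.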
Finite Automata Model:
Finite automata can be represented by input tape and finite control.

Input tape: It is a linear tape having some number of cells. Each input symbol
is placed in each cell.
Finite control: The finite control decides the next state on receiving particular
input from input tape. The tape reader reads the cells one by one from left to
right, and at a time only one input symbol is read.

Types of Automata:
There are two types of finite automata:

1. DFA (deterministic finite automata)
2. NFA (non-deterministic finite automata)

1. DFA

DFA refers to deterministic finite automata. Deterministic refers to the
uniqueness of the computation: in a DFA, the machine goes to exactly one state
for a particular input character. A DFA does not accept the null move.
2. NFA

NFA stands for non-deterministic finite automata. For a particular input, an NFA
can move to any number of states. It can accept the null (ε) move.
 Every DFA is an NFA, but not every NFA is a DFA.
 There can be multiple final states in both an NFA and a DFA.
 DFA is used in lexical analysis in a compiler.
 NFA is more of a theoretical concept.

DFA vs NFA:
1. DFA stands for Deterministic Finite Automata; NFA stands for Nondeterministic
Finite Automata.
2. In a DFA there is exactly one state transition for each symbol of the alphabet
from a given state; in an NFA there is no need to specify how the automaton
reacts to every symbol.
3. DFA cannot use empty-string (ε) transitions; NFA can use empty-string
transitions.
4. A DFA can be understood as one machine; an NFA can be understood as multiple
little machines computing at the same time.
5. In a DFA the next possible state is distinctly set; in an NFA each pair of
state and input symbol can have many possible next states.
6. DFA is more difficult to construct; NFA is easier to construct.
7. A DFA rejects a string if it terminates in a state that is not an accepting
state; an NFA rejects a string only if all branches die or refuse the string.
8. The time needed for executing an input string is less in a DFA and more in an
NFA.
9. All DFA are NFA, but not all NFA are DFA.
10. DFA requires more space; NFA requires less space than DFA.
11. A dead configuration is not allowed in a DFA but is allowed in an NFA. For
example, if input 0 is given in state q0, a DFA must also define a transition for
input 1 from q0 (e.g. a self loop), whereas an NFA may simply move to another
state on input 1 or leave the move undefined.
12. For a DFA the transition function is δ: Q × Σ → Q, i.e. the next possible
state belongs to Q; for an NFA it is δ: Q × (Σ ∪ {ε}) → 2^Q, i.e. the next
possible states belong to the power set of Q.
13. Backtracking is allowed in a DFA; backtracking is not always possible in an
NFA.
14. Conversion of a regular expression to a DFA is difficult; conversion to an
NFA is simpler.
15. Epsilon moves are not allowed in a DFA but are allowed in an NFA.
16. A DFA allows only one move for a single input symbol; in an NFA there can be
a choice of more than one move for a single input symbol.
