UNIT 1
INTRODUCTION TO COMPILERS
COMPILER:
A compiler is software that converts source code into object code. In other words, it converts a high-level language into machine/binary language.
During lexical analysis, the source program is broken up into tokens such as:
o Identifiers (user-defined)
o Delimiters/punctuations (;, ,, {}, etc.)
o Operators (+, -, *, /, etc.)
o Special symbols
o Keywords
o Numbers
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error. Important facts about lexical errors:
Lexical errors are not very common, but they should be managed by the scanner.
Misspellings of identifiers, operators, or keywords are considered lexical errors.
Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing:
Simpler design: separating the two phases simplifies each of them.
Improved compiler efficiency: a specialized scanner can apply buffering techniques to speed up reading the input.
Enhanced compiler portability: input-device-specific peculiarities can be restricted to the lexical analyzer.
RECOGNITION OF TOKENS:
Tokens obtained during lexical analysis are recognized by Finite
Automata.
Finite Automata (FA) is a simple idealized machine that can be used to
recognize patterns within input taken from a character set or alphabet.
The primary task of an FA is to accept or reject an input based on
whether the defined pattern occurs within the input.
There are two notations for representing Finite Automata. They are:
Transition Table
Transition Diagram
TRANSITION TABLE
It is a tabular representation that lists all possible transitions for each state
and input symbol combination.
EXAMPLE
Assume the following grammar fragment generates a specific language, where the terminals if, then, else, relop, id and num generate the sets of strings given by the following regular definitions (with letter → [A-Za-z] and digit → [0-9]):
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
For this language, the lexical analyzer will recognize the keywords if,
then, and else, as well as lexemes that match the patterns for relop, id, and
number.
To simplify matters, we make the common assumption that keywords are also reserved words: that is, they cannot be used as identifiers.
num represents the unsigned integers and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space. It will do so by comparing a string against the regular definition ws below:
ws → ( blank | tab | newline )+
If a match for ws is found, the lexical analyzer does not return a token to the parser; instead, it proceeds to find the token following the white space.
TRANSITION DIAGRAM
It is a directed labeled graph consisting of nodes and edges. Nodes
represent states, while edges represent state transitions.
SPECIFICATION OF TOKENS:
TOKENS:
A token is the smallest individual element of a program that is meaningful to the compiler. It cannot be broken down further. Identifiers, strings, keywords, etc., are examples of tokens. In the lexical analysis phase of the compiler, the program is converted into a stream of tokens.
Different Types of Tokens
There can be multiple types of tokens. Some of them are-
Keywords
Keywords are words reserved for particular purposes and imply a special
meaning to the compilers. The keywords must not be used for naming a
variable, function, etc.
Identifier
The names given to various components in the program, like the function's
name or variable's name, etc., are called identifiers. Keywords cannot be
identifiers.
Operators
Operators are different symbols used to perform different operations in a
programming language.
Punctuations
Punctuations are special symbols that separate different code elements in a
programming language.
Consider the following line of code:
int x = 45;
The above statement has multiple tokens, which are:
Keywords: int
Identifiers: x
Numbers: 45
Operators: =
Punctuators: ;
SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression
Operations on strings
The following string-related terms are commonly used:
A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s.
A substring of s is obtained by deleting any prefix and any suffix from s.
A proper prefix, suffix, or substring of s is one that is not ε and not equal to s itself.
A subsequence of s is any string formed by deleting zero or more (not necessarily consecutive) positions of s.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union: L ∪ M = { s | s is in L or s is in M }
2. Concatenation: LM = { st | s is in L and t is in M }
3. Kleene closure: L* denotes zero or more concatenations of L
4. Positive closure: L+ denotes one or more concatenations of L
The following example shows these operations. Let L = {0, 1} and S = {a, b, c}. Then:
L ∪ S = {0, 1, a, b, c}
LS = {0a, 0b, 0c, 1a, 1b, 1c}
L* = {ε, 0, 1, 00, 01, 10, 11, ...} (all binary strings, including the empty string)
L+ = {0, 1, 00, 01, 10, 11, ...} (all nonempty binary strings)
Regular Expressions
The language accepted by finite automata can be easily described by simple expressions called regular expressions. They are the most effective way to represent a regular language.
The languages accepted by regular expressions are referred to as regular languages.
A regular expression can also be described as a sequence of patterns that defines a string.
Regular expressions are used to match character combinations in strings. String-searching algorithms use these patterns to find and operate on matches within a string.
Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence and is left associative.
5. Concatenation has the second highest precedence and is left associative.
6. | has the lowest precedence and is left associative.
Regular set
A language denoted by a regular expression is said to be a regular set.
There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms.
For instance, r|s = s|r (| is commutative), and r|(s|t) = (r|s)|t (| is associative).
Regular Definitions
Giving names to regular expressions is referred to as a Regular definition. If Σ is
an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. The regular definition for this set is:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit )*
Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce shorthand notations for them.
1. One or more instances (+):
- The unary postfix operator + means "one or more instances of".
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?):
- The unary postfix operator ? means "zero or one instance of"; r? is shorthand for r | ε.
3. Character Classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | d | ….| z.
- We can describe identifiers as strings generated by the regular expression [A–Za–z][A–Za–z0–9]*
Non-regular Set
A language that cannot be described by any regular expression is called a non-regular set. For example, the set of all strings of balanced parentheses cannot be described by a regular expression.
LEX:
Lex is a tool for automatically generating a lexical analyzer. Its use involves the following steps:
1. Writing the Lex Program: The lexical analyzer is specified, as patterns and actions, in a file conventionally named lex.l.
2. Lex Compiler Execution: The lex.l program is then executed using the Lex compiler. This step generates a C program named lex.yy.c.
3. C Compiler Execution: lex.yy.c is compiled with a C compiler to produce an executable that is the working lexical analyzer.
A Lex program has the following general form:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
In the rules section, each rule has the form pi { actioni }, where pi is a regular expression and actioni describes the action the lexical analyzer should take when pattern pi matches a lexeme.
Working of Lex
Regular Expressions
As discussed, Lex uses regular expressions to define the rules that decide whether the input string matches a pattern. A regular expression represents a finite pattern over a finite set of symbols. Regular expressions define a regular grammar, and a regular grammar defines a regular language.
Conflicts in Lex
Conflicts arise in Lex when a given string matches two or more different rules, or when an input string shares a common prefix with two or more rules. This causes uncertainty for the lexical analyzer, so it must be resolved.
Conflict resolution
Conflicts in Lex can be resolved by following two rules-
The longer prefix should be preferred over a shorter prefix.
Pick the pattern listed first in the Lex program if the longest possible prefix
corresponds to two or more patterns.
The Architecture of Lexical Analyzer
The task of the lexical analyzer is to read the input characters in the source code and produce tokens one by one. The scanner produces a token when it is requested to by the parser. The lexical analyzer also skips whitespace and comments while creating tokens. If any error occurs, the analyzer correlates these errors with the source file and line number.
EXAMPLE:
%{
#include<stdio.h>
#include<string.h>
int count = 0;
%}
/* Rules Section */
%%
([a-zA-Z0-9])+ {count++;} /* Rule for counting the number of words */
\n {printf("Number of words: %d\n", count); count = 0;} /* Rule for newline */
. {} /* Skip any other character */
%%
int yywrap(void){ return 1; }
int main()
{
// The function that starts the analysis
yylex();
return 0;
}
How to execute
Type lex lexfile.l to generate lex.yy.c
Type gcc lex.yy.c to compile the generated scanner
Type ./a.out (./a.exe on Windows) to run it
Explanation
In the definition section of the program, we have included the standard library for input-
output operations and string operations and a variable count to keep count of the number of
words.
In the rules section, we have specified a regular expression that matches any string formed by letters and numeric digits, such as “AYUSHI28”, “BOND007”, etc.
There is a rule for a newline too. When a newline character is encountered, the current count
of words is printed, and the counter is set to zero.
FINITE AUTOMATA:
Finite automata are used to recognize patterns.
A finite automaton takes a string of symbols as input and changes its state accordingly.
When a desired symbol is found, a transition occurs.
At the time of transition, the automaton can either move to the next state or stay in the same state.
A finite automaton has two kinds of states: accepting and non-accepting (rejecting). If, after the input string is processed completely, the automaton has reached a final (accepting) state, the string is accepted; otherwise it is rejected.
Formal Definition of FA
A finite automaton is a 5-tuple (Q, ∑, δ, q0, F), where:
Q is a finite set of states,
∑ is a finite set of input symbols (the alphabet),
δ is the transition function,
q0 ∈ Q is the initial state, and
F ⊆ Q is the set of final (accepting) states.
Input tape: It is a linear tape having some number of cells. Each input symbol is placed in one cell.
Finite control: The finite control decides the next state on receiving a particular input from the input tape. The tape reader reads the cells one by one from left to right, and at a time only one input symbol is read.
Types of Automata:
There are two types of finite automata:
1. DFA
2. NFA
DFA stands for Deterministic Finite Automata; NFA stands for Nondeterministic Finite Automata. The differences between them are:
1. In a DFA, for each state and input symbol there is exactly one state transition. An NFA need not specify exactly how it reacts to each symbol.
2. A DFA cannot use empty-string (ε) transitions; an NFA can.
3. A DFA can be understood as one machine; an NFA can be understood as multiple little machines computing at the same time.
4. In a DFA, the next possible state is distinctly set; in an NFA, each pair of state and input symbol can have many possible next states.
5. A DFA is more difficult to construct; an NFA is easier to construct.
6. A DFA rejects a string if it terminates in a state other than an accepting state; an NFA rejects a string only if all branches die or refuse the string.
7. The time needed to execute an input string is less for a DFA and more for an NFA.
8. All DFAs are NFAs, but not all NFAs are DFAs.
9. A DFA requires more space; an NFA requires less space than a DFA.
10. Dead configurations are not allowed in a DFA but are allowed in an NFA. For example, if we give input 0 in state q0 of a DFA, we must also define input 1 on q0 (e.g., as a self loop); in an NFA, input 1 on q0 can simply lead to another state such as q1.
11. For a DFA, the transition function is δ: Q × Σ → Q, i.e., the next state belongs to Q. For an NFA, it is δ: Q × (Σ ∪ {ε}) → 2^Q, i.e., the set of possible next states belongs to the power set of Q.
12. Backtracking is allowed in a DFA but is not always possible in an NFA.
13. Conversion of a regular expression to a DFA is difficult; conversion to an NFA is simpler.
14. An ε-move is not allowed in a DFA but is allowed in an NFA.
15. A DFA allows only one move per input symbol; an NFA can have a choice of more than one move per input symbol.