Unit II
Example:
L = {wcwᴿ | w ∈ {a, b}*}
Production rules:
S → aSa | bSb | c
Now check that the string abbcbba can be derived from the given CFG.
S ⇒ aSa
⇒ abSba
⇒ abbSbba
⇒ abbcbba
By applying the production S → aSa, and S → bSb recursively and finally applying the production
S → c, we get the string abbcbba.
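Membership in this language is also easy to check mechanically. The following is a small sketch of such a check (our own illustration, not part of the notes):

```python
# Check membership in L = {w c w^R | w in {a,b}*}: the string must contain
# exactly one 'c', and the part after it must be the reverse of the part before.
def in_language(s):
    if s.count("c") != 1:
        return False
    w, _, tail = s.partition("c")
    return set(w) <= {"a", "b"} and tail == w[::-1]

print(in_language("abbcbba"))  # True
print(in_language("abcab"))    # False
```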
Department of Computer Science and Engineering
Capabilities of CFG
Context free grammar is useful to describe most of the programming languages.
If the grammar is properly designed, then an efficient parser can be constructed
automatically.
Using associativity and precedence information, suitable grammars for
expressions can be constructed.
Context free grammar is capable of describing nested structures like: balanced
parentheses, matching begin-end, corresponding if-then-else's and so on.
1. Construct a CFG for the language of palindrome strings over {a,b}.
G=({S},{a,b},P, S) ,
where P = {S → aSa | bSb | ε | a | b}.
2. Write the CFG for the language L = {aⁿbⁿ | n ≥ 1}.
G=({S},{a,b},P, S) ,
where P = {S → aSb | ab}.
3. Find out the context free language
S->aSb/aAb
A->bAa
A->ba
The CFL generated by the given grammar is L = {aⁿbᵐaᵐbⁿ | n ≥ 1, m ≥ 1}
4. Construct a CFG over a and b that contains equal number of a and b.
G=({S},{a,b},P, S) ,
where P is
S → aSbS | bSaS | ε
5. Write down the context free grammar for the language L = {aⁿbⁿ | n ≥ 1}.
G=({S},{a,b},P, S)
where P is S → aSb | ab
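As a quick sanity check (our own sketch, not part of the notes): applying S → aSb repeatedly and finishing with S → ab yields exactly the strings aⁿbⁿ.

```python
# S -> aSb | ab: applying aSb (n-1) times and then ab yields a^n b^n.
def derive(n):
    return "a" * (n - 1) + "ab" + "b" * (n - 1)

def in_anbn(s):
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "a" * n + "b" * n

print(all(in_anbn(derive(n)) for n in range(1, 8)))  # True
```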
Derivation
Derivation is a sequence of production rules. It is used to get the input string through these
production rules. During parsing we have to take two decisions. These are as follows:
We have to decide the non-terminal which is to be replaced.
We have to decide the production rule by which the non-terminal will be replaced.
We have two options to decide which non-terminal to be replaced with production rule.
Left-most Derivation
In the leftmost derivation, at each step the leftmost non-terminal of the sentential form is
replaced by one of its productions. So in a leftmost derivation the input string is, in effect, built up from left to right.
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The left-most derivation is:
S⇒S+S
⇒S-S+S S->S-S
⇒a-S+S S->a
⇒a-b+S S->b
⇒a-b+c S->c
S⇒a-b+c
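The mechanics of a leftmost derivation can be sketched in a few lines of code (our own illustration: the single non-terminal S is always replaced at its leftmost occurrence):

```python
# Replay a leftmost derivation: at every step, expand the leftmost
# occurrence of the non-terminal "S" with the chosen production body.
def leftmost_derive(rules):
    form = "S"
    steps = [form]
    for rhs in rules:
        i = form.index("S")              # position of the leftmost non-terminal
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

# Derive a-b+c: S => S+S => S-S+S => a-S+S => a-b+S => a-b+c
print(leftmost_derive(["S+S", "S-S", "a", "b", "c"]))
```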
Right-most Derivation
In the rightmost derivation, at each step the rightmost non-terminal of the sentential form is
replaced by one of its productions. So in a rightmost derivation the input string is, in effect, built up from right to left.
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The right-most derivation is:
S⇒S-S
⇒S - S + S S->S+S
⇒S - S + c S->c
⇒S - b + c S->b
⇒a - b + c S->a
S⇒a-b+c
Parse tree
A parse tree is the graphical representation of a derivation; its nodes are labelled with
terminal or non-terminal symbols.
In parsing, the string is derived from the start symbol, and the root of the parse tree is that
start symbol.
The parse tree follows the precedence of operators: the deepest sub-tree is traversed first, so
the operator in a parent node has lower precedence than the operator in its sub-tree.
The parse tree follows these points:
All leaf nodes have to be terminals.
(The parse tree is built up step by step; the figures for Steps 1–5 are omitted here.)
Example
Derive the string "00101" by leftmost derivation (LMD) and rightmost derivation (RMD) using the
context free grammar (CFG).
S->A1B
A->0A| ε
B->0B| 1B| ε
The left-most derivation is:
S⇒ A1B
⇒ 0A1B A->0A
⇒ 00A1B A->0A
⇒ 001B A->ε
⇒ 0010B B->0B
⇒ 00101B B->1B
⇒ 00101 B->ε
Derived the string 00101 using LMD
The right-most derivation is:
S⇒A1B
⇒A10B B->0B
⇒A101B B->1B
⇒A101 B->ε
⇒0A101 A->0A
⇒00A101 A->0A
⇒00101 A->ε
Derived the string 00101 using RMD
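Since A generates 0* and B generates (0 | 1)*, the grammar S → A1B describes the regular language 0*1(0|1)*. A quick check with Python's re module (our own sketch, not part of the notes):

```python
import re

# The grammar S -> A1B, A -> 0A | ε, B -> 0B | 1B | ε generates 0*1(0|1)*.
pattern = re.compile(r"0*1[01]*")

print(bool(pattern.fullmatch("00101")))  # True
print(bool(pattern.fullmatch("000")))    # False: no '1' after the leading zeros
```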
Example
Derive the string "aabbabba" using leftmost derivation for the CFG given by,
S → aB | bA
A → a | aS | bAA
B → b | bS | aBB
The left-most derivation is:
S⇒ aB S → aB
⇒ aaBB B → aBB
⇒ aabB B→b
⇒ aabbS B → bS
⇒ aabbaB S → aB
⇒ aabbabS B → bS
⇒ aabbabbA S → bA
⇒ aabbabba A→a
Construct the leftmost derivation, rightmost derivation and parse tree for the string 3*2+5 from
the given grammar over the alphabet ∑ = {0,…,9, +, *, (, )}
E → I
E → E + E
E → E * E
E → (E)
I → ε | 0 | 1 | … | 9
Solution: (The derivation and parse-tree figures are omitted here; the two leftmost derivations of 3*2+5 for this grammar are worked out under Ambiguous Grammar below.)
Ambiguous Grammar
A CFG is said to be ambiguous if there exists more than one derivation for the given input string
i.e., more than one Left Most Derivation (LMD) or Right Most Derivation (RMD).
G = (V, T, P, S) is a CFG that is said to be ambiguous if and only if there exists a string in T*
that has more than one parse tree.
Example:
Consider the following grammar-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Let us consider a string w = aaabbabbba, Derive the string w using leftmost derivation.
Leftmost Derivation:
S → aB
→ aaBB B → aBB
→ aaaBBB B → aBB
→ aaabBB B→b
→ aaabbB B→b
→ aaabbaBB B → aBB
→ aaabbabB B→b
→ aaabbabbS B → bS
→ aaabbabbbA S → bA
→ aaabbabbba A→a
Rightmost Derivation
The process of deriving a string by expanding the rightmost non-terminal at each step is called
rightmost derivation. The geometrical representation of a rightmost derivation is called
a rightmost derivation tree.
S → aB
→ aaBB B → aBB
→ aaBaBB B → aBB
→ aaBaBbS B → bS
→ aaBaBbbA S → bA
→ aaBaBbba A→a
→ aaBabbba B→b
→ aaaBBabbba B → aBB
→ aaaBbabbba B→b
→ aaabbabbba B→b
Here, the string aaabbabbba has only one LMD and one RMD, so this string does not exhibit any ambiguity in the given grammar.
Show that the given grammar is ambiguous for the input string 3*2+5.
The alphabet is ∑ = {0,…,9, +, *, (, )}
E → I
E → E + E
E → E * E
E → (E)
I → ε | 0 | 1 | … | 9
Solution:
First Left Most Derivation:
E ⇒ E * E E → E * E
⇒ I * E E → I
⇒ 3 * E I → 3
⇒ 3 * E + E E → E + E
⇒ 3 * I + E E → I
⇒ 3 * 2 + E I → 2
⇒ 3 * 2 + I E → I
⇒ 3 * 2 + 5 I → 5
Second Left Most Derivation:
E ⇒ E + E E → E + E
⇒ E * E + E E → E * E
⇒ I * E + E E → I
⇒ 3 * E + E I → 3
⇒ 3 * I + E E → I
⇒ 3 * 2 + E I → 2
⇒ 3 * 2 + I E → I
⇒ 3 * 2 + 5 I → 5
From the given grammar, the string 3*2+5 can be derived by two different Left Most Derivations, so
the given grammar is ambiguous.
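Ambiguity can also be confirmed by counting parse trees mechanically. A small sketch (ours, not part of the notes; the parenthesis rule is ignored for brevity) counting the distinct ways E → E + E | E * E | I can parse a string of single digits:

```python
from functools import lru_cache

# Count distinct parse trees under the ambiguous grammar
#   E -> E + E | E * E | I,   I -> digit
@lru_cache(maxsize=None)
def parses(s):
    count = 1 if len(s) == 1 and s.isdigit() else 0  # E -> I -> digit
    for i, ch in enumerate(s):
        if ch in "+*":
            # E -> E op E, splitting at this operator occurrence
            count += parses(s[:i]) * parses(s[i + 1:])
    return count

print(parses("3*2+5"))  # 2 parse trees, so the grammar is ambiguous
```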
Example: Check whether the grammar G with productions E → E + E | E - E | id is ambiguous for
the string id + id - id.
First Left Most Derivation:
E ⇒ E + E
⇒ id + E
⇒ id + E - E
⇒ id + id - E
⇒ id + id - id
Second Left Most Derivation:
E ⇒ E - E
⇒ E + E - E
⇒ id + E - E
⇒ id + id - E
⇒ id + id - id
Since there are two leftmost derivations for a single string "id + id - id", the grammar G is ambiguous.
Example: Check whether the following grammar is ambiguous for the string aabb.
S → aSb | SS
S → ε
Solution:
First Left Most Derivation:
S ⇒ SS S → SS
⇒ S S → ε
⇒ aSb S → aSb
⇒ aaSbb S → aSb
⇒ aabb S → ε
Second Left Most Derivation:
S ⇒ aSb S → aSb
⇒ aaSbb S → aSb
⇒ aabb S → ε
Since the string aabb has two different leftmost derivations, the grammar is ambiguous.
Example: The grammar A → AA | (A) | a is ambiguous. One leftmost derivation of the string a(a)a is:
A ⇒ AA A → AA
⇒ AAA A → AA
⇒ aAA A → a
⇒ a(A)A A → (A)
⇒ a(a)A A → a
⇒ a(a)a A → a
LEFT RECURSION:
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’
itself as the left-most symbol. Left-recursive grammar is considered to be a problematic situation
for top-down parsers. Top-down parsers start parsing from the Start symbol, which in itself is
non-terminal. So, when the parser encounters the same non-terminal in its derivation, it becomes
hard for it to judge when to stop parsing the left non-terminal and it goes into an infinite loop.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left
recursion.
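The A → Aα | β transformation can be expressed directly in code. A minimal sketch (our own illustration; the name A′ for the fresh non-terminal is just a convention):

```python
# Remove immediate left recursion:  A -> Aα | β   becomes
#   A  -> β A'
#   A' -> α A' | ε        ([] represents ε below)
def remove_left_recursion(nt, productions):
    rec    = [p[1:] for p in productions if p and p[0] == nt]      # the α parts
    nonrec = [p for p in productions if not p or p[0] != nt]       # the β parts
    if not rec:
        return {nt: productions}   # nothing to do
    new = nt + "'"
    return {
        nt:  [beta + [new] for beta in nonrec],
        new: [alpha + [new] for alpha in rec] + [[]],
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | ε
print(remove_left_recursion("E", [["E", "+", "T"], ["T"]]))
```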
Example: Eliminate the left recursion from the grammar
E → E + T | T
T → T * F | F
F → (E) | id
Final answer:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
Solution:
S → a|^|(T) does not have any left recursion
E → TE′
E′ → (T)E′ | ε
T → FT′
T′ → (F)T′|ε
F → id
LEFT FACTORING:
If more than one production rule has a common prefix string, then a top-down
parser cannot make a choice as to which of the productions it should take to parse the string in
hand.
Left factoring is a process by which the grammar with common prefixes is transformed to make
it useful for Top down parsers.
Example
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique,
we make one production for each common prefix, and the rest of the derivation is added by
new productions.
Consider the grammar with common prefixes: A → aAB / aBc / aAc.
Step 1
A → aA’
A’ → AB / Bc / Ac
Again, this is a grammar with common prefixes.
Step 2
A → aA’
A’ → AD / Bc
D→B/c
Another example: consider the grammar S → bSSaaS / bSSaSb / bSb / a.
Step 1:
S → bSS’ / a
S’ → SaaS / SaSb / b
Again, S’ has the common prefix Sa.
Step 2:
S → bSS’ / a
S’ → SaA / b
A → aS / Sb
This is a left factored grammar.
Example: Left factor the grammar S → aAd / aB, A → ab / a, B → ccd / ddc.
Solution:
S → aS’
S’ → Ad / B
A → aA’
A’ → b / ε
B → ccd / ddc
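One pass of left factoring can be sketched as follows (our own code, not part of the notes; alternatives are grouped by their first symbol, and the fresh name A′ is our convention):

```python
from collections import defaultdict

# One left-factoring pass: group alternatives by their first symbol and pull
# each common prefix symbol out into a fresh non-terminal.
def left_factor_once(nt, alts):
    groups = defaultdict(list)
    for alt in alts:
        groups[alt[0] if alt else ""].append(alt)
    result = {nt: []}
    for head, group in groups.items():
        if len(group) == 1:
            result[nt].append(group[0])        # no common prefix, keep as is
        else:
            fresh = nt + "'"
            result[nt].append([head, fresh])
            result[fresh] = [alt[1:] for alt in group]  # [] acts as ε
    return result

# A -> aAB | aBc | aAc   becomes   A -> aA',  A' -> AB | Bc | Ac
print(left_factor_once("A", [["a", "A", "B"], ["a", "B", "c"], ["a", "A", "c"]]))
```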
SIMPLIFICATION OF CFG:
As we have seen, various languages can be efficiently represented by a context-free grammar. Not
all grammars are optimized: a grammar may contain some extra symbols
(non-terminals), and having extra symbols unnecessarily increases the length of the grammar.
Simplification of a grammar means reduction of the grammar by removing useless symbols. The
properties of a reduced grammar are given below:
1. Each variable (i.e. non-terminal) and each terminal of G appears in the derivation of some
word in L.
2. There should not be any production of the form X → Y where X and Y are non-terminals.
3. If ε is not in the language L, then there need not be any production X → ε.
A production such as A → aA is also useless because there is no way to terminate it. If it never
terminates, it can never produce a terminal string, so it can never take part in any derivation.
To remove such a useless production, we first find all the variables which can never
lead to a terminal string, such as the variable 'A' here. Then we remove all the productions in which
such a variable occurs.
Elimination of ε Production
The productions of type S → ε are called ε productions. Such productions can only be
removed from grammars that do not generate ε.
Step 1: First find out all nullable non-terminals, i.e. those which derive ε.
Step 2: For each production A → α, construct all productions A → x, where x is obtained from α
by removing one or more of the nullable non-terminals found in step 1.
Step 3: Now combine the result of step 2 with the original productions and remove the ε productions.
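Step 1 (finding the nullable non-terminals) is a simple fixed-point computation. A sketch (ours, not part of the notes), using the grammar of the next example:

```python
# Find all nullable non-terminals: start with those having a direct A -> ε
# rule, then keep adding A whenever some RHS consists only of nullable symbols.
def nullable(grammar):
    null = {a for a, rhss in grammar.items() if [] in rhss}  # direct A -> ε
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            if a not in null and any(all(x in null for x in rhs) for rhs in rhss):
                null.add(a)
                changed = True
    return null

# S -> XYX,  X -> 0X | ε,  Y -> 1Y | ε   ([] represents ε)
g = {"S": [["X", "Y", "X"]], "X": [["0", "X"], []], "Y": [["1", "Y"], []]}
print(sorted(nullable(g)))  # ['S', 'X', 'Y']
```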
Example:
Remove the ε productions from the following CFG while preserving its meaning.
S → XYX
X → 0X | ε
Y → 1Y | ε
Solution:
Now, while removing the ε productions, we are deleting the rules X → ε and Y → ε. To preserve the
meaning of the CFG, we consider, for every occurrence of X and Y on a right-hand side, the case in
which it derives ε.
Let us take
S → XYX
If the first X on the right-hand side derives ε, then
S → YX
Similarly, if the last X on the R.H.S. derives ε, then
S → XY
If Y derives ε, then
S → XX
If Y and one X derive ε, then
S→X
If both X derive ε, then
S→Y
So, together with the original production,
S → XYX | XY | YX | XX | X | Y
Now let us consider
X → 0X
If we place ε at right-hand side for X then,
X→0
X → 0X | 0
Similarly Y → 1Y | 1
Collectively we can rewrite the CFG with the ε productions removed as
S → XYX | XY | YX | XX | X | Y
X → 0X | 0
Y → 1Y | 1
Elimination of Unit Productions
Example:
S → 0A | 1B | C
A → 0S | 00
B→1|A
C → 01
Solution:
S → C is a unit production. But while removing S → C we have to consider what C gives. So,
we can add a rule to S.
S → 0A | 1B | 01
Similarly, B → A is also a unit production so we can modify it as
B → 1 | 0S | 00
Chomsky Normal Form (CNF)
A CFG is in CNF if every production rule is of the form A → BC, or A → a, or S → ε, where A, B
and C are non-terminals, a is a terminal, and S is the start symbol. Example:
G1 = {S → AB, S → c, A → a, B → b}
G2 = {S → aA, A → a, B → c}
The production rules of Grammar G1 satisfy the rules specified for CNF, so the grammar G1 is
in CNF. However, the production rule of Grammar G2 does not satisfy the rules specified for
CNF as S → aA contains terminal followed by non-terminal. So the grammar G2 is not in CNF.
Steps for converting CFG into CNF
Step 1: Eliminate the start symbol from the RHS. If the start symbol S appears on the right-hand side of
any production, create a new production:
S1 → S
where S1 is the new start symbol.
Step 2: In the grammar, remove the null, unit and useless productions. You can refer to
the Simplification of CFG.
Step 3: Eliminate terminals from the RHS of the production if they exist with other
non-terminals or terminals. For example, production S → aA can be decomposed as:
S → RA
R→a
Step 4: Eliminate RHS with more than two non-terminals. For example, S → ASB can be
decomposed as:
S → RB
R → AS
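Step 4 can be mechanized: repeatedly split the RHS into a binary rule using a fresh non-terminal. This sketch (ours, not part of the notes) splits from the left, so S → ASB becomes S → A S_1 and S_1 → S B; grouping from the right, as in the example above, is equally valid. The fresh names like S_1 are our own convention:

```python
# Break a rule  lhs -> X1 X2 ... Xn (n > 2)  into binary rules.
def binarize(lhs, rhs):
    rules, count = [], 0
    while len(rhs) > 2:
        count += 1
        fresh = f"{lhs}_{count}"                 # fresh helper non-terminal
        rules.append((lhs, [rhs[0], fresh]))     # lhs -> X1 fresh
        lhs, rhs = fresh, list(rhs[1:])          # continue with the tail
    rules.append((lhs, list(rhs)))
    return rules

# S -> A S B   becomes   S -> A S_1,  S_1 -> S B
print(binarize("S", ["A", "S", "B"]))
```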
Convert the given CFG to CNF. Consider the given grammar G1:
S → a | aA | B
A → aBB | ε
B → Aa | b
Solution:
Step 1: We will create a new production S1 → S, as the start symbol S appears on the RHS. The
grammar will be:
S1 → S
S → a | aA | B
A → aBB | ε
B → Aa | b
Step 2: As grammar G1 contains A → ε null production, its removal from the grammar yields:
S1 → S
S → a | aA | B
A → aBB
B → Aa | b | a
Now, as grammar G1 contains the unit production S → B, its removal yields:
S1 → S
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Also removing the unit production S1 → S yields:
S1 → a | aA | Aa | b
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Step 3: In the production rules S1 → aA | Aa, S → aA | Aa, A → aBB and B → Aa, the terminal a
appears on the RHS together with non-terminals. So we will replace terminal a with X:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → XBB
B → AX | b | a
X→a
Step 4: In the production rule A → XBB, the RHS has more than two symbols; decomposing it
yields:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → RB
B → AX | b | a
X→a
R → XB
R → XB
Hence, for the given grammar, this is the required CNF.
Greibach Normal Form (GNF)
GNF stands for Greibach normal form. A CFG(context free grammar) is in GNF(Greibach
normal form) if all the production rules satisfy one of the following conditions:
A start symbol generating ε. For example, S → ε.
A non-terminal generating a terminal. For example, A → a.
A non-terminal generating a terminal which is followed by any number of non-terminals.
For example, S → aASB.
Example:
G1 = {S → aAB | aB, A → aA| a, B → bB | b}
G2 = {S → aAB | aB, A → aA | ε, B → bB | ε}
The production rules of Grammar G1 satisfy the rules specified for GNF, so the grammar G1 is
in GNF. However, the production rules of Grammar G2 do not satisfy the rules specified for
GNF, as A → ε and B → ε contain ε (only the start symbol may generate ε). So the grammar G2 is
not in GNF.
Steps for converting CFG into GNF
Step 1. If the given grammar is not in CNF, convert it to CNF (see the steps for converting a
CFG into CNF above).
Step 2. Rename the non-terminals as A1, A2, …, An, in the same sequence as they are used.
Step 3. Check every production rule: if the RHS has its first symbol as a non-terminal, say Aj, in a
production of Ai, it is mandatory that i should be less than j — not greater and not equal.
If i > j, then replace Aj in that production by Aj's own production rules.
If i = j, it is left recursion. Create a new non-terminal Z that carries the recursive tail of the left-recursive
production: remove that production, write every other alternative once followed by Z and once
without Z, and let Z derive the recursive tail once followed by Z and once without Z.
Step 4. Replace the leading non-terminal symbol in any remaining production rule with its
productions until every production rule satisfies the above conditions.
For converting a CNF to GNF always move left to right for renaming the variables.
Convert the given CFG into GNF:
S → XA|BB
B → b|SB
X→b
A→a
Solution:
For converting a CNF to GNF first rename the non terminal symbols to A1,A2 till
AN in same sequence as they are used.
A1 = S
A2 = X
A3 = A
A4 = B
Therefore, now the new production rule is,
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | A1A4
Now, check every production of the form Ai → AjX, where X can be any string of symbols.
If i < j, the production can stay as it is; if i > j, replace Aj with Aj's production rules; if i = j,
it is a left recursion and the left recursion has to be removed.
Here, for A4 → A1A4 we have i = 4 > j = 1, so we replace A1 with A1's production rules.
1.
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | A2A3A4 | A4A4A4
----------------------
2.
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | bA3A4 | A4A4A4
A4 → A4A4A4 is still left-recursive; introducing a new non-terminal Z for it gives
A4 → b | bA3A4 | bZ | bA3A4Z and Z → A4A4 | A4A4Z. Substituting the productions of A2 and A4
for the leading non-terminals in A1 and Z then brings every rule into GNF.
COMPILER:
A programmer writes the source code in a code editor or an integrated development environment
(IDE) that includes an editor, saving the source code to one or more text files. A compiler that
supports the source programming language reads the files, analyzes the code, and translates it
into a format suitable for the target platform.
A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language). An important
role of the compiler is to report any errors in the source program that it detects during the
translation process.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Interpreter:
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user, as shown in Fig.
An interpreter, however, can usually give better error diagnostics than a compiler, because it
executes the source program statement by statement.
A source program may first be handled by a separate program called a preprocessor. The
modified source program is then fed to a compiler. The compiler may produce an assembly-
language program as its output, because assembly language is easier to produce as output and is
easier to debug. The assembly language is then processed by a program called an assembler that
produces relocatable machine code as its output.
Lexical analysis. The compiler splits the source code into lexemes, which are individual
code fragments that represent specific patterns in the code. The lexemes are then tokenized in
preparation for syntax and semantic analyses.
Syntax analysis. The compiler verifies that the code's syntax is correct, based on the rules
for the source language. This process is also referred to as parsing. During this step, the
compiler typically creates abstract syntax trees that represent the logical structures of specific
code elements.
Semantic analysis. The compiler verifies the validity of the code's logic. This step goes
beyond syntax analysis by validating the code's accuracy. For example, the semantic analysis
might check whether variables have been assigned the right types or have been properly
declared.
Intermediate Code Generation. After the code passes through all three analysis phases, the
compiler generates an intermediate representation (IR) of the source code. The IR code
makes it easier to translate the source code into a different format. However, it must
accurately represent the source code in every respect, without omitting any functionality.
Code Optimization. The compiler optimizes the IR code in preparation for the final code
generation. The type and extent of optimization depends on the compiler. Some compilers let
users configure the degree of optimization.
Output Code Generation. The compiler generates the final output code, using the optimized
IR code.
Compilers are sometimes confused with programs called interpreters. Although the two are
similar, they differ in important ways. Compilers analyze and convert source code written in
languages such as Java, C++, C# or Swift. They're commonly used to generate machine code or
bytecode that can be executed by the target host system.
Interpreters do not generate IR code or save generated machine code. They process the code one
statement at a time at runtime, without pre-converting the code or preparing it in advance for a
particular platform. Interpreters are used for code written in scripting languages such as Perl,
PHP, Ruby or Python.
PHASES OF A COMPILER
A compiler can be viewed as a single box that maps a source program into a semantically equivalent
target program. There are two parts in a compiler:
Analysis
Synthesis
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program. If the analysis part detects that the source program is either syntactically
ill formed or semantically unsound, then it must provide informative messages, so the user can
take corrective action. The analysis part also collects information about the source program and
stores it in a data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table. The analysis part is often called the front end of the
compiler; the synthesis part is the back end.
Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis.
Consider the assignment statement
position = initial + rate * 60 ……………….. (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its name and
type.
Automata Theory & Compiler Design Mr. P.Krishnamoorthy Page 27
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token
needs no attribute-value, we have omitted the second component. We could have used any
abstract symbol such as assign for the token-name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table
entry for rate.
Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the
representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60) ……………….. (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators, respectively.
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
children of the node represent the arguments of the operation.
Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition. An important
part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index
to be an integer; the compiler must report an error if a floating-point number is used to index an
array.
Code Optimization:
The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result. Usually better means faster, but other objectives may
be desired, such as shorter code, or target code that consumes less power.
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation:
The code generator takes as input an intermediate representation of the source program
and maps it into the target language. If the target language is machine code, registers or memory
locations are selected for each of the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine instructions that perform the same task.
For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into
the machine code
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating-point numbers. The code loads the contents of address id3 into
register R2, and then multiplies it with the floating-point constant 60.0. The # signifies that 60.0 is to
be treated as an immediate constant. The third instruction moves id2 into register R1 and the
fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is
stored into the address of id1, so the code correctly implements the assignment statement (1.1).
Symbol-Table Management:
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler
to find the record for each name quickly and to store or retrieve data from that record quickly.
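A minimal sketch of such a structure (ours, not part of the text), using a hash table for constant-time insertion and lookup; a real compiler would add scope handling on top:

```python
# A minimal symbol table: one record (dict of attributes) per name,
# backed by a Python dict for fast store and retrieve.
class SymbolTable:
    def __init__(self):
        self._table = {}

    def insert(self, name, **attrs):
        """Create the record for `name` if needed and merge in attributes."""
        self._table.setdefault(name, {"name": name}).update(attrs)
        return self._table[name]

    def lookup(self, name):
        """Return the record for `name`, or None if it is undeclared."""
        return self._table.get(name)

st = SymbolTable()
st.insert("position", type="float")
st.insert("rate", type="float")
print(st.lookup("position"))  # {'name': 'position', 'type': 'float'}
```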
FINITE AUTOMATA IN LEXICAL ANALYSIS:
Token Recognition: Finite Automata can be used to recognize tokens in the source
code. Each token type (e.g., keywords, identifiers, operators) can be modeled as a
separate finite automaton. The automaton transitions between states based on the
characters it reads, ultimately reaching an accepting state when a valid token is
identified.
Lexical Pattern Matching: Regular expressions, which are commonly used to define
lexical patterns, can be translated into finite automata. These automata can efficiently
recognize and match patterns in the input stream, aiding in the identification of
various lexical elements.
Scanner Implementation: The lexical analyzer, often referred to as a scanner or
lexer, can be implemented using finite automata. The scanner reads the source code
character by character and uses finite automata to determine the token boundaries and
types.
Efficient Tokenization: Finite Automata help in creating efficient tokenizers by
minimizing the number of comparisons required to recognize tokens. By structuring
the automata based on the language's syntax, the lexical analysis process becomes
more streamlined and optimized.
Error Detection: Finite Automata can be extended to include error-handling states. If
the scanner encounters an unexpected character or an invalid token, the automaton
can transition to an error state, allowing for graceful error detection and recovery in
the lexical analysis phase.
Handling Regular Languages: The lexical structure of most programming
languages can be described using regular languages. Finite Automata, being suitable
for recognizing regular languages, align well with the lexical analysis requirements of
programming languages.
In summary, Finite Automata provide a formal and efficient way to model and implement
lexical analyzers, contributing significantly to the accurate and efficient processing of
programming language source code during compilation.