
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

AUTOMATA THEORY & COMPILER DESIGN


UNIT II
CONTEXT FREE GRAMMARS & INTRODUCTION TO COMPILERS
CONTEXT FREE GRAMMARS AND PARSING:
A context-free grammar (CFG) is a type of formal grammar used to describe the syntax or structure of a formal language. A context-free grammar can generate all possible strings of a given formal language.
The grammar is a four-tuple (V, T, P, S):
V - the collection of variables, or non-terminal symbols.
T - the set of terminals.
P - the set of production rules, whose right-hand sides may consist of both terminals and non-terminals.
S - the start symbol.
A grammar is a context-free grammar if every production is of the form:
A → (V ∪ T)*, where A ∈ V.
That is, the left-hand side of a production can only be a single variable; it cannot be a terminal. The right-hand side can be any combination of variables and terminals.
For example, consider the grammar G = ({S, P}, {a, b}, R, S). Here S is the starting symbol, {a, b} are the terminals (generally written as lowercase letters), and P is a variable along with S. Productions such as
S → aS
S → bSa
are allowed, but
a → bSa, or
a → ba
is not a CFG, because the left-hand side is a terminal rather than a variable, which violates the CFG rule.
In computer science, context-free grammars are used frequently, especially in formal language theory, compiler development, and natural language processing. They are also used to describe the syntax of programming languages and other formal languages.

Example:
L = {wcwᴿ | w ∈ {a, b}*}
Production rules:
S → aSa | bSb | c
Now check that the string abbcbba can be derived from the given CFG.
S ⇒ aSa
⇒ abSba
⇒ abbSbba
⇒ abbcbba
By applying the production S → aSa, and S → bSb recursively and finally applying the production
S → c, we get the string abbcbba.
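The derivation above can be automated with a small Python sketch (a hypothetical helper, not part of the text): the rule choice is read off the target string from both ends.

```python
# A minimal sketch (hypothetical helper): derive a string of the form
# w c w^R using the grammar S -> aSa | bSb | c.
def derive(target):
    """Return the derivation steps S => ... => target, or None."""
    steps, sentential = ["S"], "S"
    i, j = 0, len(target) - 1              # match symbols from both ends
    while i < j:
        ch = target[i]
        if ch != target[j] or ch not in "ab":
            return None                     # not of the form w c w^R
        # apply S -> aSa or S -> bSb
        sentential = sentential.replace("S", ch + "S" + ch)
        steps.append(sentential)
        i, j = i + 1, j - 1
    if i == j and target[i] == "c":
        steps.append(sentential.replace("S", "c"))   # apply S -> c
        return steps
    return None

print(derive("abbcbba"))
# ['S', 'aSa', 'abSba', 'abbSbba', 'abbcbba']
```

Each list element is one sentential form of the derivation shown above.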

Capabilities of CFG
 Context-free grammars can describe most programming languages.
 If the grammar is properly designed, an efficient parser can be constructed from it
automatically.
 Using associativity and precedence information, suitable grammars for
expressions can be constructed.
 Context-free grammars can describe nested structures such as balanced
parentheses, matching begin-end blocks, corresponding if-then-else's, and so on.
1. Construct a CFG for the language of palindrome strings over {a,b}.
G = ({S}, {a, b}, P, S),
where P = {S → aSa | bSb | ε | a | b}.
2. Write the CFG for the language L = {aⁿbⁿ | n ≥ 1}.
G = ({S}, {a, b}, P, S),
where P = {S → aSb | ab}.
3. Find the context-free language generated by the grammar
S → aSb | aAb
A → bAa
A → ba
The CFL for the given grammar is L = {aⁿbᵐaᵐbⁿ | n ≥ 1, m ≥ 1}.
4. Construct a CFG over {a, b} for strings that contain an equal number of a's and b's.
G = ({S}, {a, b}, P, S),
where P is
S → aSb | bSa | SS | ε
5. Write down the context-free grammar for the language L = {aⁿbⁿ | n ≥ 1}.
G = ({S}, {a, b}, P, S)
where P is S → aSb | ab

Derivation
A derivation is a sequence of production rule applications used to obtain the input string. During parsing we have to make two decisions:
 We have to decide which non-terminal is to be replaced.
 We have to decide which production rule will replace that non-terminal.
There are two options for deciding which non-terminal to replace:
Left-most Derivation
In a leftmost derivation, the leftmost non-terminal in the sentential form is replaced at each step, so the input string is effectively produced from left to right.
S → S + S

Automata Theory & Compiler Design Mr. P.Krishnamoorthy Page 2



S → S - S
S → a | b | c
Input:
a-b+c
The left-most derivation is:
S⇒S+S
⇒S-S+S S->S-S
⇒a-S+S S->a
⇒a-b+S S->b
⇒a-b+c S->c
S⇒a-b+c
Right-most Derivation
In a rightmost derivation, the rightmost non-terminal in the sentential form is replaced at each step, so the input string is effectively produced from right to left.
S → S + S
S → S - S
S → a | b | c
Input:
a-b+c
The right-most derivation is:
S⇒S-S S->S-S
⇒S - S + S S->S+S
⇒S - S + c S->c
⇒S - b + c S->b
⇒a - b + c S->a
S⇒a-b+c
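The two derivation orders can be contrasted with a small Python sketch (the helper `apply` is hypothetical, not from the text): each step rewrites either the first or the last occurrence of the non-terminal S.

```python
# A sketch contrasting leftmost vs. rightmost derivation order for the
# grammar S -> S+S | S-S | a | b | c. Each step replaces the leftmost
# (or rightmost) occurrence of 'S' with a chosen production body.
def apply(sentential, body, leftmost=True):
    idx = sentential.find("S") if leftmost else sentential.rfind("S")
    return sentential[:idx] + body + sentential[idx + 1:]

# Leftmost derivation of a-b+c: S => S+S => S-S+S => a-S+S => a-b+S => a-b+c
form = "S"
for body in ["S+S", "S-S", "a", "b", "c"]:
    form = apply(form, body, leftmost=True)
print(form)  # a-b+c

# Rightmost derivation of the same string: S => S-S => S-S+S => ... => a-b+c
form = "S"
for body in ["S-S", "S+S", "c", "b", "a"]:
    form = apply(form, body, leftmost=False)
print(form)  # a-b+c
```

Note that both derivations yield the same string but choose different rule orders.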

Parse tree
 Parse tree is the graphical representation of a derivation. Its node labels can be
terminal or non-terminal symbols.
 In parsing, the string is derived from the start symbol, and that start symbol is the
root of the parse tree.
 Parse trees follow the precedence of operators: the deepest sub-tree is traversed first, so
the operator in a parent node has lower precedence than the operator in its sub-tree.
The parse tree follows these points:
 All leaf nodes have to be terminals.


 All interior nodes have to be non-terminals.


 An in-order traversal of the leaves (read left to right) gives the original input string.
Example 1
T → T + T | T * T
T → a | b | c
Input:
a*b+c

(The five parse-tree construction steps for a*b+c were shown as figures, which are not reproduced here.)


Example
Derive the string "00101" by leftmost derivation (LMD) and rightmost derivation (RMD) using the context-free grammar (CFG):
S->A1B
A->0A| ε
B->0B| 1B| ε
The left-most derivation is:
S⇒ A1B
⇒ 0A1B A->0A
⇒ 00A1B A->0A
⇒ 001B A->ε
⇒ 0010B B->0B
⇒ 00101B B->1B
⇒ 00101 B->ε
Derived the string 00101 using LMD
The right-most derivation is:
S⇒A1B
⇒A10B B->0B
⇒A101B B->1B
⇒A101 B->ε
⇒0A101 A->0A
⇒00A101 A->0A
⇒00101 A->ε
Derived the string 00101 using RMD
Example
Derive the string "aabbabba" for leftmost derivation and rightmost derivation using a CFG given by,
S → aB | bA
A → a | aS | bAA
B → b | bS | aBB
The left-most derivation is:
S⇒ aB S → aB
⇒ aaBB B → aBB
⇒ aabB B→b
⇒ aabbS B → bS
⇒ aabbaB S → aB
⇒ aabbabS B → bS
⇒ aabbabbA S → bA
⇒ aabbabba A→a


The right-most derivation is:


S⇒ aB S → aB
⇒ aaBB B → aBB
⇒ aaBbS B → bS
⇒ aaBbbA S → bA
⇒ aaBbba A→a
⇒ aabSbba B → bS
⇒ aabbAbba S → bA
⇒ aabbabba A→a
Let G be the grammar S → aB | bA, A → a | aS | bAA, B → b | bS | aBB. For the string aaabbabbba,
find (a) LMD and (b) RMD. DEC 2014

Let S → aB | bA, A → aS | bAA | a, B → bS | aBB | b. Derive the strings “aabbbbaa” and
“aaabbabbba” as leftmost derivations.


Construct Leftmost and Rightmost derivation and parse tree for the string 3*2+5 from
the given grammar. Set of alphabets ∑ = {0,…,9, +, *, (, )}
EI
E E + E
E E * E
E (E)
I ε | 0 | 1 | … | 9

Solution: (The derivations and parse tree were shown as figures, which are not reproduced here.)


Ambiguous Grammar
A CFG is said to be ambiguous if there exists more than one derivation for a given input string,
i.e., more than one Left Most Derivation (LMD) or more than one Right Most Derivation (RMD).
Equivalently, a CFG G = (V, T, P, S) is ambiguous if and only if there exists a string in T*
that has more than one parse tree.
Example:
Consider the following grammar-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Let us consider a string w = aaabbabbba, Derive the string w using leftmost derivation.
Leftmost Derivation:
S → aB
→ aaBB B → aBB
→ aaaBBB B → aBB
→ aaabBB B→b
→ aaabbB B→b
→ aaabbaBB B → aBB
→ aaabbabB B→b
→ aaabbabbS B → bS
→ aaabbabbbA S → bA
→ aaabbabbba A→a


Rightmost Derivation
The process of deriving a string by expanding the rightmost non-terminal at each step is called
as rightmost derivation. The geometrical representation of rightmost derivation is called as
a rightmost derivation tree
S → aB

→ aaBB B → aBB

→ aaBaBB B → aBB

→ aaBaBbS B → bS

→ aaBaBbbA S → bA

→ aaBaBbba A→a

→ aaBabbba B→b

→ aaaBBabbba B → aBB

→ aaaBbabbba B→b

→ aaabbabbba B→b

Here, this string has only one LMD and one RMD, so these derivations exhibit no ambiguity.


Show that the given grammar is ambiguous for the input string 3*2+5.
Set of alphabets ∑ = {0,…,9, +, *, (, )}
EI
E E + E
E E * E
E (E)
I ε | 0 | 1 | … | 9
Solution:
First Left Most Derivation:
E ⇒ E*E E → E*E
⇒ I*E E → I
⇒ 3*E I → 3
⇒ 3*E+E E → E+E
⇒ 3*I+E E → I
⇒ 3*2+E I → 2
⇒ 3*2+I E → I
⇒ 3*2+5 I → 5

Second Left Most Derivation:
E ⇒ E+E E → E+E
⇒ E*E+E E → E*E
⇒ I*E+E E → I
⇒ 3*E+E I → 3
⇒ 3*I+E E → I
⇒ 3*2+E I → 2
⇒ 3*2+I E → I
⇒ 3*2+5 I → 5

(Derivation trees 1 and 2 for the two derivations were shown as figures.)

From the given grammar, the string 3*2+5 can be derived by two different leftmost derivations, so
the given grammar is ambiguous.


Check whether the given grammar G is ambiguous or not.


E→E+E
E→E-E
E → id
Solution:
First Left Most Derivation:
E→E+E

→ id + E

→ id + E - E

→ id + id - E

→ id + id- id

Second Leftmost derivation


E→E-E

→E+E-E

→ id + E - E

→ id + id - E

→ id + id - id

Since there are two leftmost derivations for the single string "id + id - id", the grammar G is ambiguous.
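One way to see this ambiguity mechanically is to count parse trees. The sketch below (a hypothetical helper, not from the text) counts distinct trees for a token string under E → E+E | E-E | id; any count above one demonstrates ambiguity.

```python
from functools import lru_cache

# A sketch: count distinct parse trees for a token string under the
# grammar E -> E+E | E-E | id. A count greater than one shows the
# grammar is ambiguous for that string.
def count_trees(tokens):
    @lru_cache(None)
    def count(i, j):                     # trees deriving tokens[i:j] from E
        n = 1 if tokens[i:j] == ("id",) else 0
        for k in range(i + 1, j - 1):    # try each '+' or '-' as the root
            if tokens[k] in ("+", "-"):
                n += count(i, k) * count(k + 1, j)
        return n
    return count(0, len(tokens))

print(count_trees(("id", "+", "id", "-", "id")))  # 2
```

The two trees correspond to the groupings (id+id)-id and id+(id-id).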

Check whether the given grammar G is ambiguous or not.

S → aSb | SS
S→ε
Solution:
First Left Most Derivation:
S→SS S→ SS
→S S→ ε
→aSb S→ aSb
→aaSbb S→ aSb
→aabb S→ ε


Second Left Most Derivation:

S→aSb S→ aSb

→aaSbb S→ aSb

→aabb S→ ε

Check whether the given grammar G is ambiguous or not.


A → AA
A → (A)
A→a
Solution:
First Left Most Derivation:

A→AA A→AA
→AAA A→AA
→aAA A→a
→a(A)A A→(A)
→a(a)A A→a
→a(a)a A→a

Second Left Most Derivation:


A→ AA A→AA
→ aA A→a
→ aAA A→AA
→ a(A)A A→(A)
→a(a)A A→a
→a(a)a A→a

For the string "a(a)a" the above grammar can generate two parse trees, so the given grammar is ambiguous.


LEFT RECURSION:
A grammar is left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’
itself as the left-most symbol. Left-recursive grammar is considered a problematic situation
for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a
non-terminal. So, when the parser repeatedly encounters the same non-terminal at the left end of a
derivation, it cannot judge when to stop expanding that non-terminal and goes into an infinite loop.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left
recursion.
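The A → Aα | β transformation can be sketched in Python (a hypothetical helper; production bodies are plain strings and "eps" stands for ε):

```python
# A minimal sketch of immediate-left-recursion removal for one
# nonterminal:  A -> A a1 | ... | A am | b1 | ... | bn   becomes
#               A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | eps
def remove_left_recursion(nt, bodies):
    alphas = [b[len(nt):] for b in bodies if b.startswith(nt)]
    betas = [b for b in bodies if not b.startswith(nt)]
    if not alphas:
        return {nt: bodies}            # no immediate left recursion
    new = nt + "'"
    return {
        nt: [beta + new for beta in betas],
        new: [alpha + new for alpha in alphas] + ["eps"],
    }

# E -> E+T | T   becomes   E -> TE',  E' -> +TE' | eps
print(remove_left_recursion("E", ["E+T", "T"]))
```

This reproduces the E → TE′, E′ → +TE′ | ε result used in the worked example below.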

Consider the Left Recursion from the Grammar.


E → E + T|T
T → T * F|F
F → (E)|id
Eliminate immediate left recursion from the Grammar.
Solution:

Comparing E → E + T|T with A → A α |β


E → TE′
E′ → +TE′| ε
Comparing T → T * F | F with A → Aα | β
T → FT′
T′ → *FT′ | ε
Production F → (E)|id does not have any left recursion

Final answer:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id


Eliminate the left recursion for the following Grammar.


S → a|^|(T)
T → T, S|S

Solution:
S → a|^|(T) does not have any left recursion

Comparing T → T, S|S With A → A α | β


T→ ST′
T′ →,ST′| ε

Complete Grammar will be


S → a | ^ | (T)
T→ ST′
T′ →,ST′| ε

Remove the left recursion from the grammar


E → E(T)|T
T → T(F)|F
F → id
Solution:
Eliminating immediate left recursion, we obtain

E → TE′
E′ → (T)E′ | ε
T → FT′
T′ → (F)T′ | ε
F → id

LEFT FACTORING:
If more than one production for a non-terminal shares a common prefix string, then the top-down
parser cannot make a choice as to which of the productions it should take to parse the string in
hand.
Left factoring is a process by which a grammar with common prefixes is transformed to make
it useful for top-down parsers.


Example

If a top-down parser encounters a production like


A ⟹ αβ | α𝜸 | …
Then it cannot determine which production to follow to parse the string, as both productions
start with the same prefix. To remove this confusion, we use a technique
called left factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique,
we make one production for each common prefix, and the rest of the derivation is added by
new productions.

The above productions can be written as


A => αA'
A'=> β | 𝜸 | …
Now the parser has only one production per prefix which makes it easier to take decisions.
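One left-factoring step can be sketched in Python (a hypothetical helper; bodies are plain strings and "eps" stands for ε; it only handles the case where all alternatives share the prefix):

```python
import os

# A rough sketch of one left-factoring step: pull out the longest
# common prefix shared by ALL alternatives of a nonterminal.
def left_factor(nt, bodies):
    prefix = os.path.commonprefix(bodies)   # works on any string list
    if not prefix:
        return {nt: bodies}                  # nothing to factor
    new = nt + "'"
    rest = [b[len(prefix):] or "eps" for b in bodies]
    return {nt: [prefix + new], new: rest}

# S -> iEtS | iEtSeS   becomes   S -> iEtSS',  S' -> eps | eS
print(left_factor("S", ["iEtS", "iEtSeS"]))
```

This reproduces the dangling-else factoring used in the example below.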

Do left factoring in the following grammar


S → iEtS / iEtSeS / a
E→b
Solution:
The left factored grammar is-
S → iEtSS’ / a
S’ → eS / ε
E → b

Do left factoring in the following grammar-


A → aAB / aBc / aAc
Solution:

Step 1
A → aA’
A’ → AB / Bc / Ac
Again, this is a grammar with common prefixes.
Step 2
A → aA’
A’ → AD / Bc
D→B/c


Do left factoring in the following grammar-


S → bSSaaS / bSSaSb / bSb / a
Solution:
Step 1:
S → bSS’ / a
S’ → SaaS / SaSb / b
Again, this is a grammar with common prefixes.

Step 2:
S → bSS’ / a
S’ → SaA / b
A → aS / Sb
This is a left factored grammar.

Do left factoring in the following grammar-


S → aAd / aB
A → a / ab
B → ccd / ddc

Solution:
S → aS’
S’ → Ad / B
A → aA’
A’ → b / ε
B → ccd / ddc
SIMPLIFICATION OF CFG:
As we have seen, various languages can efficiently be represented by a context-free grammar. Not
every grammar is optimized: a grammar may contain extra symbols (non-terminals) that
unnecessarily increase the length of the grammar.
Simplification of grammar means reduction of the grammar by removing useless symbols. The
properties of a reduced grammar are given below:

1. Each variable (i.e. non-terminal) and each terminal of G appears in the derivation of some
word in L.
2. There should not be any production of the form X → Y where X and Y are non-terminals.
3. If ε is not in the language L, then there need not be any production X → ε.


Removal of Useless Symbols


A symbol is useless if it can never be reached from the start symbol or never takes part in the
derivation of any string; such a symbol is known as a useless symbol.
Similarly, a variable is useless if it does not take part in the derivation of any terminal string; such a
variable is known as a useless variable.
Example:
T → aaB | abA | aaT
A → aA
B → ab | b
C → ad
In the above example, the variable 'C' will never occur in the derivation of any string, so the
production C → ad is useless. So we will eliminate it, and the other productions are written in
such a way that variable C can never reach from the starting variable 'T'.

Production A → aA is also useless because there is no way to terminate it. If it never terminates,
then it can never produce a string. Hence this production can never take part in any derivation.

To remove the useless production A → aA, we first find all the variables that can never
lead to a terminal string, such as the variable 'A'. Then we remove all the productions in which
the variable 'A' occurs.
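The "can it reach a terminal string?" test can be sketched in Python (a hypothetical helper; bodies are plain strings of one-character symbols):

```python
# A sketch of finding "generating" nonterminals: those that can
# eventually derive a terminal string. Nonterminals outside this set
# (like A in A -> aA) are useless and their productions can be dropped.
def generating(grammar, terminals):
    gen = set()
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for nt, bodies in grammar.items():
            if nt not in gen and any(
                    all(s in terminals or s in gen for s in body)
                    for body in bodies):
                gen.add(nt)
                changed = True
    return gen

g = {"T": ["aaB", "abA", "aaT"], "A": ["aA"], "B": ["ab", "b"], "C": ["ad"]}
print(sorted(generating(g, set("abd"))))  # ['B', 'C', 'T']
```

A is correctly excluded: every A-production keeps an A on the right, so it never terminates. (Reachability from the start symbol, which rules out C, would be a separate pass.)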

Elimination of ε Production
The productions of type S → ε are called ε productions. These type of productions can only be
removed from those grammars that do not generate ε.

Step 1: First find all nullable non-terminals, i.e., those that can derive ε.

Step 2: For each production A → α, construct all productions A → x, where x is obtained from α
by removing one or more of the nullable non-terminals found in step 1.

Step 3: Now combine the result of step 2 with the original productions and remove the ε productions.

Example:
Remove the ε productions from the following CFG while preserving its meaning.
S → XYX
X → 0X | ε
Y → 1Y | ε
Solution:
Now, while removing the ε productions, we delete the rules X → ε and Y → ε. To preserve the
meaning of the CFG, we consider every way in which an X or Y on a right-hand side could have
derived ε.


Let us take
S → XYX
If the first X at right-hand side is ε. Then
S → YX
Similarly if the last X in R.H.S. = ε. Then
S → XY
If Y = ε then

S → XX
If Y and X are ε then,
S→X
If both X are replaced by ε
S → Y
Keeping the original production as well, we now have
S → XYX | XY | YX | XX | X | Y
Now let us consider
X → 0X
If we place ε at right-hand side for X then,
X→0
X → 0X | 0

Similarly Y → 1Y | 1
Collectively we can rewrite the CFG with removed ε production as

S → XYX | XY | YX | XX | X | Y
X → 0X | 0
Y → 1Y | 1
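The three steps can be sketched in Python (a hypothetical helper; bodies are strings of one-character symbols, "eps" marks an ε-body, and ε itself is dropped from the result — if ε is in the language, the start symbol's ε-production is handled separately):

```python
from itertools import combinations

# A sketch of epsilon-production removal: find nullable nonterminals,
# then emit every body variant with nullable occurrences dropped.
def remove_epsilon(grammar):
    nullable = {nt for nt, bodies in grammar.items() if "eps" in bodies}
    changed = True
    while changed:                       # propagate nullability
        changed = False
        for nt, bodies in grammar.items():
            if nt not in nullable and any(
                    body != "eps" and all(s in nullable for s in body)
                    for body in bodies):
                nullable.add(nt)
                changed = True
    new = {}
    for nt, bodies in grammar.items():
        out = set()
        for body in bodies:
            if body == "eps":
                continue                 # drop the epsilon production
            null_pos = [i for i, s in enumerate(body) if s in nullable]
            for r in range(len(null_pos) + 1):
                for drop in combinations(null_pos, r):
                    variant = "".join(s for i, s in enumerate(body)
                                      if i not in drop)
                    if variant:          # never emit an empty body
                        out.add(variant)
        new[nt] = sorted(out)
    return new

g = {"S": ["XYX"], "X": ["0X", "eps"], "Y": ["1Y", "eps"]}
print(remove_epsilon(g))
```

The output matches the worked example: X → 0X | 0, Y → 1Y | 1, and all non-empty variants of XYX for S.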

Removing Unit Productions


The unit productions are the productions in which one non-terminal gives another non-terminal.
Use the following steps to remove unit production:
Step 1: To remove X → Y, add production X → a to the grammar rule whenever Y → a occurs
in the grammar.
Step 2: Now delete X → Y from the grammar.
Step 3: Repeat step 1 and step 2 until all unit productions are removed.

Example:
S → 0A | 1B | C
A → 0S | 00
B→1|A
C → 01


Solution:
S → C is a unit production. But while removing S → C we have to consider what C gives. So,
we can add a rule to S.
S → 0A | 1B | 01
Similarly, B → A is also a unit production so we can modify it as
B → 1 | 0S | 00

Thus finally we can write the CFG without unit productions as

S → 0A | 1B | 01
A → 0S | 00
B → 1 | 0S | 00
C → 01
(The production C → 01 is now unreachable from S and could also be dropped as a useless production.)
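The three steps can be sketched in Python (a hypothetical helper; bodies are strings, and a body that is exactly one non-terminal counts as a unit production):

```python
# A sketch of unit-production removal: for every unit pair (X, Y)
# reachable through chains of unit productions, copy Y's non-unit
# bodies to X, then drop the unit productions themselves.
def remove_units(grammar):
    nts = set(grammar)
    # unit_for[X] = set of Y such that X =>* Y via unit productions
    unit_for = {x: {x} for x in nts}
    changed = True
    while changed:
        changed = False
        for x in nts:
            for y in list(unit_for[x]):
                for body in grammar[y]:
                    if body in nts and body not in unit_for[x]:
                        unit_for[x].add(body)
                        changed = True
    return {x: sorted({body for y in unit_for[x] for body in grammar[y]
                       if body not in nts})
            for x in nts}

g = {"S": ["0A", "1B", "C"], "A": ["0S", "00"],
     "B": ["1", "A"], "C": ["01"]}
print(remove_units(g))
```

This reproduces the result above: S inherits 01 from C, and B inherits 0S and 00 from A.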
Normal Forms

1. Chomsky's Normal Form (CNF)


2. Greibach Normal Form (GNF)
CHOMSKY'S NORMAL FORM (CNF)
CNF stands for Chomsky normal form. A CFG(context free grammar) is in CNF(Chomsky
normal form) if all production rules satisfy one of the following conditions:
 The start symbol generating ε. For example, S → ε.
 A non-terminal generating two non-terminals. For example, S → AB.
 A non-terminal generating a terminal. For example, S → a.

G1 = {S → AB, S → c, A → a, B → b}
G2 = {S → aA, A → a, B → c}

The production rules of Grammar G1 satisfy the rules specified for CNF, so the grammar G1 is
in CNF. However, the production rule of Grammar G2 does not satisfy the rules specified for
CNF as S → aA contains terminal followed by non-terminal. So the grammar G2 is not in CNF.
Steps for converting CFG into CNF
Step 1: Eliminate the start symbol from the RHS. If the start symbol S appears on the right-hand side of
any production, create a new production:
S1 → S
where S1 is the new start symbol.
Step 2: In the grammar, remove the null, unit and useless productions. You can refer to
the Simplification of CFG.


Step 3: Eliminate terminals from the RHS of the production if they exist with other
non-terminals or terminals. For example, production S → aA can be decomposed as:
S → RA
R→a
Step 4: Eliminate RHS with more than two non-terminals. For example, S → ASB can be
decomposed as:
S → RB
R → AS
Convert the given CFG to CNF. Consider the given grammar G1:
S → a | aA | B
A → aBB | ε
B → Aa | b

Solution:
Step 1: We will create a new production S1 → S, as the start symbol S appears on the RHS. The
grammar will be:
S1 → S
S → a | aA | B
A → aBB | ε
B → Aa | b
Step 2: As grammar G1 contains A → ε null production, its removal from the grammar yields:
S1 → S
S → a | aA | B
A → aBB
B → Aa | b | a
Now, as grammar G1 contains Unit production S → B, its removal yield:
S1 → S
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Also removing the unit production S1 → S yields:
S1 → a | aA | Aa | b
S → a | aA | Aa | b
A → aBB
B → Aa | b | a
Step 3: In the production rules S1 → aA | Aa, S → aA | Aa, A → aBB and B → Aa, the terminal a
appears on the RHS together with non-terminals. So we replace terminal a there with a new non-terminal X:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → XBB
B → AX | b | a
X → a


Step 4: In the production rule A → XBB, the RHS has more than two symbols. Replacing it
yields:
S1 → a | XA | AX | b
S → a | XA | AX | b
A → RB
B → AX | b | a
X → a
R → XB
Hence, for the given grammar, this is the required CNF.
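A small (hypothetical) predicate can verify the CNF shape, treating each grammar symbol as one character; here "S" stands in for the renamed start symbol S1 of the example above:

```python
# A quick checker (sketch): is every production in Chomsky Normal Form,
# i.e. either a single terminal, or exactly two nonterminals?
def is_cnf(grammar, nonterminals):
    for bodies in grammar.values():
        for body in bodies:
            two_nts = len(body) == 2 and all(s in nonterminals for s in body)
            one_term = len(body) == 1 and body not in nonterminals
            if not (two_nts or one_term):
                return False
    return True

nts = {"S", "A", "B", "X", "R"}
g = {"S": ["a", "XA", "AX", "b"], "A": ["RB"],
     "B": ["AX", "b", "a"], "X": ["a"], "R": ["XB"]}
print(is_cnf(g, nts))  # True
```

A production such as A → aBB would fail the check, since its body has three symbols.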
Greibach Normal Form (GNF)
GNF stands for Greibach normal form. A CFG(context free grammar) is in GNF(Greibach
normal form) if all the production rules satisfy one of the following conditions:
 A start symbol generating ε. For example, S → ε.
 A non-terminal generating a terminal. For example, A → a.
 A non-terminal generating a terminal which is followed by any number of non-terminals.
For example, S → aASB.

Example:
G1 = {S → aAB | aB, A → aA| a, B → bB | b}
G2 = {S → aAB | aB, A → aA | ε, B → bB | ε}
The production rules of Grammar G1 satisfy the rules specified for GNF, so the grammar G1 is
in GNF. However, the production rule of Grammar G2 does not satisfy the rules specified for
GNF as A → ε and B → ε contains ε(only start symbol can generate ε). So the grammar G2 is
not in GNF.
Steps for converting CFG into GNF
Step 1. If the given grammar is not in CNF, convert it to CNF. You can refer following article to
convert CFG to CNF: Converting Context Free Grammar to Chomsky Normal Form

Step 2. Rename the non-terminal symbols A1 to AN in the same sequence.

Step 3. Check every production rule: if the RHS of a production for Ai begins with a non-terminal Aj,
it is mandatory that i be less than j — not greater, and not equal.
If i > j, then substitute the productions of Aj in its place in Ai.
If i = j, it is left recursion. Create a new non-terminal Z whose productions are the trailing symbols
of the left-recursive production, once followed by Z and once without Z; then remove that
left-recursive production and add every other production once more, followed by Z.
Step 4. Replace the first non-terminal symbol in any production rule with its productions until every
production rule satisfies the above conditions.


For converting a CNF to GNF always move left to right for renaming the variables.
Convert the given CFG into GNF:
S → XA|BB
B → b|SB
X→b
A→a
Solution:
For converting a CNF to GNF, first rename the non-terminal symbols A1, A2, …, AN in the same
sequence as they are used.
A1 = S
A2 = X
A3 = A
A4 = B
Therefore, now the new production rule is,
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | A1A4
Now, check every production Ai → AjX, where X can be any number of symbols.
If i < j, the production is fine; if i > j, replace Aj with its productions; if i = j, it is
left recursion, which must be removed.
Here, for A4 → A1A4, we have 4 > 1, so replace A1 with its production rules.
1.
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | A2A3A4 | A4A4A4
----------------------
2.
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | bA3A4 | A4A4A4


Here A4A4A4 in production rule of A4 is the example of left recursion.


To replace the left most recursion take a new Non terminal symbol Z, which has the X part or the
trailing part of the left most recursive production once followed by Z and once without Z. Here
in A4A4A4, the part after the first A4 is A4A4, therefore
Z → A4A4 | A4A4Z
Now change the above production rule by putting Z after every previous production of that Ai,
and remove the left recursive production.
A1 → A2A3 | A4A4
A2 → b
A3 → a
A4 → b | bA3A4 | bZ | bA3A4Z
Z → A4A4 | A4A4Z
The Last step is to replace the production to the form of either
Ai → x (any single terminal symbol)
OR
Ai → xX (any single non terminal followed by any number of non terminals)
So here we need to replace A2 in production rule of A1 and so on.
A1 → bA3 | bA4 | bA3A4A4 | bZA4 | bA3A4ZA4
A2 → b
A3 → a
A4 → b | bA3A4 | bZ | bA3A4Z
Z → bA4 | bA3A4A4 | bZA4 | bA3A4ZA4 | bA4Z | bA3A4A4Z | bZA4Z | bA3A4ZA4Z
The resulting grammar is now in GNF.


COMPILER:
A programmer writes the source code in a code editor or an integrated development environment
(IDE) that includes an editor, saving the source code to one or more text files. A compiler that
supports the source programming language reads the files, analyzes the code, and translates it
into a format suitable for the target platform.

A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language). An important
role of the compiler is to report any errors in the source program that it detects during the
translation process.

If the target program is an executable machine-language program, it can then be called by the user
to process inputs and produce outputs.

Interpreter:
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user, as shown in Fig.


An interpreter, however, can usually give better error diagnostics than a compiler, because it
executes the source program statement by statement.

In a complete language-processing system, the source program may first be modified by a
preprocessor. The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce as output and is
easier to debug. The assembly language is then processed by a program called an assembler, which
produces relocatable machine code as its output.

How does a compiler work?


Compilers carry out the following steps:

 Lexical analysis. The compiler splits the source code into lexemes, which are individual
code fragments that represent specific patterns in the code. The lexemes are then tokenized in
preparation for syntax and semantic analyses.
 Syntax analysis. The compiler verifies that the code's syntax is correct, based on the rules
for the source language. This process is also referred to as parsing. During this step, the
compiler typically creates abstract syntax trees that represent the logical structures of specific
code elements.
 Semantic analysis. The compiler verifies the validity of the code's logic. This step goes
beyond syntax analysis by validating the code's accuracy. For example, the semantic analysis
might check whether variables have been assigned the right types or have been properly
declared.
 Intermediate Code Generation. After the code passes through all three analysis phases, the
compiler generates an intermediate representation (IR) of the source code. The IR code
makes it easier to translate the source code into a different format. However, it must
accurately represent the source code in every respect, without omitting any functionality.
 Code Optimization. The compiler optimizes the IR code in preparation for the final code
generation. The type and extent of optimization depends on the compiler. Some compilers let
users configure the degree of optimization.
 Output Code Generation. The compiler generates the final output code, using the optimized
IR code.
Compilers are sometimes confused with programs called interpreters. Although the two are
similar, they differ in important ways. Compilers analyze and convert source code written in
languages such as Java, C++, C# or Swift. They're commonly used to generate machine code or
bytecode that can be executed by the target host system.


Interpreters do not generate IR code or save generated machine code. They process the code one
statement at a time at runtime, without pre-converting the code or preparing it in advance for a
particular platform. Interpreters are used for code written in scripting languages such as Perl,
PHP, Ruby or Python.
PHASES OF A COMPILER
A compiler can be viewed as a single box that maps a source program into a semantically equivalent
target program. There are two parts in a compiler:
 Analysis
 Synthesis.
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program. If the analysis part detects that the source program is either syntactically
ill formed or semantically unsound, then it must provide informative messages, so the user can
take corrective action. The analysis part also collects information about the source program and
stores it in a data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.

The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table. The analysis part is often called the front end of the
compiler; the synthesis part is the back end.

The Compiler has Six Phases:


1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate code generation
5. Code Optimization
6. Code Generation


Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis.

For example, suppose a source program contains the assignment statement.


position = initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its name and
type.

2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token
needs no attribute-value, we have omitted the second component. We could have used any
abstract symbol such as assign for the token-name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.

3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. + is a lexeme that is mapped into the token (+).

5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table
entry for rate.

6. * is a lexeme that is mapped into the token (*).

7. 60 is a lexeme that is mapped into the token (60).

Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the
representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens

(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)          .......... (1.2)

In this representation, the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators, respectively.
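The grouping of characters into lexemes and the mapping to (token-name, attribute-value) pairs can be sketched in Python. This is only an illustration: the token patterns, names, and symbol-table handling below are assumptions for the running example, and a real scanner handles many more token classes (keywords, comments, string literals, and so on).

```python
import re

# Illustrative token patterns for the example statement.
TOKEN_SPEC = [
    ("id",  r"[A-Za-z_]\w*"),   # identifiers
    ("num", r"\d+"),            # integer literals
    ("op",  r"[=+*]"),          # operators appearing in this example
    ("ws",  r"\s+"),            # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Group characters into lexemes and map them to token pairs."""
    symtab = {}          # identifier -> symbol-table index
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                       # blanks are discarded
        if kind == "id":
            index = symtab.setdefault(lexeme, len(symtab) + 1)
            tokens.append(("id", index))   # attribute points into the symbol table
        else:
            tokens.append((lexeme,))       # lexeme itself names the token
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
# tokens -> [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('60',)]
```

This reproduces the token sequence (1.2): identifiers carry a pointer into the symbol table, while operators and the literal need no attribute.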

Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
children of the node represent the arguments of the operation.
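For the running example, the syntax tree the parser builds can be sketched with nested Python tuples. The representation (operator, left child, right child) is an illustrative choice, not a prescribed one; note how the tree reflects that * binds more tightly than +.

```python
# Each interior node is (operator, left-child, right-child); the leaves
# are the tokens produced by the lexical analyzer.  For the statement
# position = initial + rate * 60 the parser conceptually builds:
syntax_tree = (
    "=",
    ("id", 1),                       # position
    ("+",
        ("id", 2),                   # initial
        ("*",
            ("id", 3),               # rate
            ("num", 60))),
)
```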

Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition. An important
part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index
to be an integer; the compiler must report an error if a floating-point number is used to index an
array.
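The checks described above can be sketched as follows; the function names and type encodings are hypothetical, chosen only to illustrate the idea of type checking and operand coercion.

```python
# Illustrative type checks (names are assumptions, not a real API).
def check_index(index_type):
    """Reject a non-integer array index, as the language definition requires."""
    if index_type != "int":
        raise TypeError("array index must be an integer, got " + index_type)

def result_type(op, left, right):
    """If either operand of an arithmetic operator is a float, the other
    operand is coerced (cf. the inttofloat operation below) and the
    result is a float; otherwise the result is an int."""
    if "float" in (left, right):
        return "float"
    return "int"

# rate * 60: rate is float, 60 is int -> the result is float,
# so a coercion of 60 must be inserted.
assert result_type("*", "float", "int") == "float"
```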

Intermediate Code Generation:


In the process of translating a source program into target code, a compiler may construct one or
more intermediate representations, which can have a variety of forms. This intermediate
representation should have two important properties: it should be easy to produce and it should
be easy to translate into the target machine. Here we consider an intermediate form called three-
address code, which consists of a sequence of assembly-like instructions with three operands per
instruction. Each operand can act like a register. For the assignment statement (1.1), the output
of the intermediate code generator is the three-address code sequence

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3          .......... (1.3)

Code Optimization:
The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result. Usually better means faster, but other objectives may
be desired, such as shorter code, or target code that consumes less power.

A simple intermediate code generation algorithm followed by code optimization is a
reasonable way to generate good target code. The optimizer can deduce that the conversion of 60
from integer to floating point can be done once and for all at compile time, so the inttofloat
operation can be eliminated by replacing the integer 60 by the floating-point number 60.0.
Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform (1.3)
into the shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1          .......... (1.4)
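The two optimizations described above can be sketched in Python. The tuple representation of three-address instructions and the single-use assumption in the copy-elimination step are simplifications for illustration; a real optimizer would compute use counts and data-flow information first.

```python
# Each three-address instruction is a tuple (destination, operator, operands...).
code = [
    ("t1", "inttofloat", "60"),
    ("t2", "*", "id3", "t1"),
    ("t3", "+", "id2", "t2"),
    ("id1", "=", "t3"),
]

def fold_inttofloat(code):
    """Do integer-to-float conversions of literals once, at compile time."""
    consts, out = {}, []
    for instr in code:
        if instr[1] == "inttofloat" and instr[2].isdigit():
            consts[instr[0]] = instr[2] + ".0"      # t1 -> "60.0"
        else:
            # Substitute the folded constant for the eliminated temporary.
            out.append(tuple(consts.get(a, a) for a in instr))
    return out

def eliminate_copies(code):
    """Remove 'x = t' when t is computed by the immediately preceding
    instruction (assumed, for this sketch, to be t's only use)."""
    out = []
    for instr in code:
        if len(instr) == 3 and instr[1] == "=" and out and out[-1][0] == instr[2]:
            dest, (_, *rhs) = instr[0], out.pop()
            out.append((dest, *rhs))                # assign directly into x
        else:
            out.append(instr)
    return out

optimized = eliminate_copies(fold_inttofloat(code))
# optimized -> [('t2', '*', 'id3', '60.0'), ('id1', '+', 'id2', 't2')]
```

Up to the renaming of the remaining temporary, this is exactly the shorter sequence (1.4).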


Code Generation:
The code generator takes as input an intermediate representation of the source program
and maps it into the target language. If the target language is machine code, registers or memory
locations are selected for each of the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine instructions that perform the same task.

For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into
the machine code

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating-point numbers. The code loads the contents of address id3 into
register R2, and then multiplies it by the floating-point constant 60.0. The # signifies that 60.0 is to
be treated as an immediate constant. The third instruction moves id2 into register R1 and the
fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is
stored into the address of id1, so the code correctly implements the assignment statement (1.1).

Symbol-Table Management:
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler
to find the record for each name quickly and to store or retrieve data from that record quickly.
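A minimal sketch of such a symbol table, using a hash table (a Python dictionary) for fast lookup; the class interface and attribute names are illustrative, not part of the text.

```python
class SymbolTable:
    """One record per name, keyed by name for fast store and retrieval."""

    def __init__(self):
        self._records = {}            # name -> attribute record

    def insert(self, name, **attrs):
        # Create the record on first sight of the name, then store or
        # update attributes (type, scope, offset, ...) as phases learn them.
        self._records.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        # Return the record, or None if the name has no entry yet.
        return self._records.get(name)

table = SymbolTable()
table.insert("position", type="float")
table.insert("rate", type="float", offset=8)
# table.lookup("rate") -> {'type': 'float', 'offset': 8}
```

A dictionary gives expected constant-time insert and lookup, which is why hash tables are the usual choice for symbol tables.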

The Grouping of Phases into Passes:


The phases describe the logical organization of a compiler. In an implementation,
activities from several phases may be grouped together into a pass that reads an input file and
writes an output file. For example, the front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be grouped together into one pass.
Code optimization might be an optional pass. Then there could be a back-end pass consisting of
code generation for a particular target machine.

A compiler can have many phases and passes.


 Pass: A pass refers to one complete traversal of the compiler over the program, reading an
input file and writing an output file.
 Phase: A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes it, and yields output that can be used as input for the next stage.
A pass can have more than one phase.


APPLICATIONS OF COMPILER TECHNOLOGY

 Implementation of High-Level Programming Languages
 Optimizations for Computer Architectures
 Design of New Computer Architectures
 Program Translations
 Software Productivity Tools

APPLICATIONS OF FINITE AUTOMATA TO LEXICAL ANALYSIS.
Finite Automata (FA) play a crucial role in lexical analysis, which is the first phase in the
compilation process of programming languages. Lexical analysis involves scanning the source
code to identify and categorize lexemes (basic units such as keywords, identifiers, operators, and
literals). Finite Automata are particularly well-suited for this task, and their applications in
lexical analysis include:

 Token Recognition: Finite Automata can be used to recognize tokens in the source
code. Each token type (e.g., keywords, identifiers, operators) can be modeled as a
separate finite automaton. The automaton transitions between states based on the
characters it reads, ultimately reaching an accepting state when a valid token is
identified.
 Lexical Pattern Matching: Regular expressions, which are commonly used to define
lexical patterns, can be translated into finite automata. These automata can efficiently
recognize and match patterns in the input stream, aiding in the identification of
various lexical elements.
 Scanner Implementation: The lexical analyzer, often referred to as a scanner or
lexer, can be implemented using finite automata. The scanner reads the source code
character by character and uses finite automata to determine the token boundaries and
types.
 Efficient Tokenization: Finite Automata help in creating efficient tokenizers by
minimizing the number of comparisons required to recognize tokens. By structuring
the automata based on the language's syntax, the lexical analysis process becomes
more streamlined and optimized.
 Error Detection: Finite Automata can be extended to include error-handling states. If
the scanner encounters an unexpected character or an invalid token, the automaton
can transition to an error state, allowing for graceful error detection and recovery in
the lexical analysis phase.
 Handling Regular Languages: The lexical structure of most programming
languages can be described using regular languages. Finite Automata, being suitable
for recognizing regular languages, align well with the lexical analysis requirements of
programming languages.


 DFA Minimization: Minimizing the deterministic finite automata (DFAs) used in
lexical analysis helps optimize the scanner. A smaller DFA results in faster and more
memory-efficient lexical analysis.
 Integration with Parser: The output of the lexical analysis phase, often a stream of
tokens, is essential for the subsequent parsing phase. Finite Automata ensure that the
identified tokens are correctly categorized and passed on to the parser for syntactic
analysis.
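As a small illustration of the ideas above, the DFA below recognizes identifiers (a letter or underscore followed by letters, digits, or underscores). The state names and transition function are illustrative; scanner generators build such automata automatically from regular expressions.

```python
# A minimal DFA for identifier tokens.  START is the initial state,
# IN_ID is the single accepting state, ERROR is a dead state.
START, IN_ID, ERROR = 0, 1, 2
ACCEPTING = {IN_ID}

def step(state, ch):
    """The DFA's transition function: one move per input character."""
    if state == START:
        return IN_ID if ch.isalpha() or ch == "_" else ERROR
    if state == IN_ID:
        return IN_ID if ch.isalnum() or ch == "_" else ERROR
    return ERROR          # the dead state absorbs all further input

def is_identifier(lexeme):
    """Run the DFA over the lexeme and accept iff it halts in IN_ID."""
    state = START
    for ch in lexeme:
        state = step(state, ch)
    return state in ACCEPTING

assert is_identifier("rate")       # letters only: accepted
assert not is_identifier("60")     # a digit cannot start an identifier
```

A full scanner runs one such automaton per token class (or a single combined automaton) and uses the longest match to find token boundaries.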

In summary, Finite Automata provide a formal and efficient way to model and implement
lexical analyzers, contributing significantly to the accurate and efficient processing of
programming language source code during compilation.

