Grammar
Introduction
When I took Latin, starting in 1954, I felt intuitively that the grammar of the language was somehow
mathematical. But the highest math I knew was algebra, and I could not come up with any mathematical
formulation of Latin grammar.
In 1957, I finally read something about the theory of grammars. Noam Chomsky was one of the pioneers
of this field. Chomsky was trying to analyze natural languages, a (possibly) hopeless task. Fortunately,
we do not look at natural languages in this course. Instead, we deal exclusively with formal languages and
formal grammars. Henceforth, grammar means formal grammar.
Each grammar generates a language. A grammar G can be represented by a string, which we call hGi. A
grammar has a finite description, but might generate an infinite language.
There are classes of grammars, each of which generates a class of languages. Not all languages are generated
by grammars, but many important languages are.
Definition of a Grammar
A grammar G = (V, Σ, S, P) consists of a finite set V of variables, a finite alphabet Σ of terminals (disjoint from V), a start symbol S ∈ V, and a finite set P of productions. Each production has the form α → β, where α and β are strings over the set of grammar symbols Γ = V ∪ Σ; we call α the left-hand side (lhs) and β the right-hand side (rhs) of the production.
Classes of Grammars
These classes are defined by properties of the left-hand sides and right-hand sides of their productions. For each class, there must be at least one production whose lhs is the start symbol, which, by convention, is usually called S. (There is no rule that the start symbol be called "S"; it could have any name.)
1. For each production of a left-linear grammar, the lhs must be a single member of V . The right-hand side must be one of the following:
(a) A terminal
(b) A terminal followed by a variable
(c) A variable
(d) The empty string
2. For each production of a right-linear grammar, the lhs must be a single member of V . The right-hand-
side must be one of the following:
(a) A terminal
(b) A variable followed by a terminal
(c) A variable
(d) The empty string
Left-linear and right-linear grammars are together called regular grammars. Note: the (b) rules cannot be mixed. For
example, a regular grammar could not have both the productions A → aA and A → Ab.
Our definition of linear grammars differs from Definition 3.3 in our textbook. In that definition, only (a) and (b) are given. We have several justifications for this change.
(i) If we do not have (d), we cannot generate the empty string. But the empty string is a member of
many regular languages.
(ii) If we do not have (c), we have to go through unnecessary contortions to define a grammar that
generates the language accepted by an NFA which has a λ-transition.
(iii) With (a), (b), (c), and (d), it is easier to understand the construction of regular grammars
equivalent to finite automata.
(iv) Allowing (c) and (d) does no harm: we still get only regular languages.
3. For each production of a context-free grammar, the lhs must be a single member of V . The right-hand
side can be any string of grammar symbols.
4. For each production of a context-sensitive grammar, the lhs and rhs must be non-empty strings of
grammar symbols, and the length of the rhs must be at least as great as the length of the lhs.
5. For each production of an unrestricted grammar, the lhs must be a non-empty string of grammar symbols,
while the rhs may be any string of grammar symbols.
Derivations
Each step of a derivation makes use of just one production. The lhs of that production is replaced by the
rhs of that same production. That is, if u and v are consecutive sentential forms, i.e., u ⇒ v, there must
be a production α → β such that α is replaced by β at that step. That is, there are strings x, y ∈ Γ∗ such
that
1. u = xαy
2. v = xβy
(Note on the context-sensitive rule above: it does not permit a context-sensitive language to contain the empty string. However, we usually want to allow the empty string. We can achieve that by permitting the production S → λ, as long as S is not on the rhs of any production.)
If productions are labeled, we sometimes place the label of the production above the “derives” symbol ⇒
for clarity.
The language generated by G, called L(G), is defined to be the language of all w ∈ Σ∗ which can be derived
from the start symbol.
Example. Let L = {a^n b^m : n, m ≥ 0}, which is described by the regular expression a*b*. Then L is
generated by the regular grammar:
1. S → aS
2. S → B
3. B → bB
4. B → λ
We now give a G-derivation of w = aabbb, with the production used written after each ⇒:
S ⇒(1) aS ⇒(1) aaS ⇒(2) aaB ⇒(3) aabB ⇒(3) aabbB ⇒(3) aabbbB ⇒(4) aabbb
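A derivation like the one above can be checked mechanically. The following Python sketch (the function name and pruning bounds are my own) does a breadth-first search over sentential forms of this grammar:

```python
from collections import deque

# Productions of the regular grammar above: S -> aS | B, B -> bB | λ.
PRODS = {"S": ["aS", "B"], "B": ["bB", ""]}

def derivable(w, prods, start="S"):
    """Breadth-first search over sentential forms. Forms longer than
    len(w) + 1 are pruned: at most one variable is present, and it can
    shrink the form by at most one symbol (via a λ-production)."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for i, sym in enumerate(form):
            if sym in prods:
                for rhs in prods[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    # keep only forms that could still become w
                    if len(new) <= len(w) + 1 and w.startswith(new[:i]) and new not in seen:
                        seen.add(new)
                        queue.append(new)
                break  # expand only the leftmost variable
    return False

print(derivable("aabbb", PRODS))  # True, as the derivation above shows
```

Run on a string like ba that is not in a*b*, the same search exhausts all candidate forms and returns False.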
Given any DFA M , there is a straightforward way to find a regular grammar G which generates the language
accepted by M . Suppose Σ = {a1, . . . , an} and Q = {q0, q1, . . . , qm}. We let Σ be the terminal alphabet of G, V = {A0, A1, . . . , Am} the alphabet of variables, and A0 the start symbol of G. If δ(qi, aj) = qk, then Ai → aj Ak is a production, and if qi is a final state, then Ai → λ is a production.
[Figure 1: state diagram of a DFA M over Σ = {a, b} with states 0, 1, and 2. Its transitions, read off the productions below, are δ(0, a) = 0, δ(0, b) = 1, δ(1, a) = 2, δ(1, b) = 1, δ(2, a) = 0, δ(2, b) = 1, and 2 is the only final state. Again, to avoid clutter, we write merely "i" to denote "qi" in state diagrams.]
M has one final state and its state diagram has six arrows, thus G has seven productions:
1. A0 → aA0
2. A0 → bA1
3. A1 → bA1
4. A1 → aA2
5. A2 → aA0
6. A2 → bA1
7. A2 → λ
Here is a G-derivation of abbaba:
A0 ⇒(1) aA0 ⇒(2) abA1 ⇒(3) abbA1 ⇒(4) abbaA2 ⇒(6) abbabA1 ⇒(4) abbabaA2 ⇒(7) abbaba
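This construction is short enough to write out in code. A Python sketch (the names and dictionary layout are my own; the transition table is read off Figure 1 and its productions):

```python
# DFA M of Figure 1: delta maps (state, symbol) to a state; 2 is final.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (1, "b"): 1,
         (2, "a"): 0, (2, "b"): 1}
finals = {2}

def dfa_to_grammar(delta, finals):
    """One production Ai -> a Ak per arrow, plus Ai -> λ per final state."""
    prods = [f"A{i} -> {a}A{k}" for (i, a), k in delta.items()]
    prods += [f"A{i} -> λ" for i in sorted(finals)]
    return prods

for p in dfa_to_grammar(delta, finals):
    print(p)  # seven productions, matching the list above
```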
Similarly, given an NFA M which accepts a language L, we can find a left-linear grammar which
generates L. Again, let Σ = {a1, . . . , an} and Q = {q0, q1, . . . , qm}.
As in the case of a DFA, we let Σ be the terminal alphabet of G, V = {A0, A1, . . . , Am} the alphabet
of variables, and A0 the start symbol of G. As in the case of a DFA, each final state and each arrow in the
state diagram defines a production.
If qk ∈ δ(qi, aj), then Ai → aj Ak is a production.
If qk ∈ δ(qi, λ), then Ai → Ak is a production.
If qi is a final state, then Ai → λ is a production.
[Figure 2: state diagram of an NFA M over Σ = {a, b} with states 0, 1, and 2. Its transitions, read off the productions below, are δ(0, a) = {0, 1}, δ(1, b) = {2}, δ(2, b) = {0}, together with a λ-transition from state 2 to state 1; 2 is the only final state.]
M has one final state and its state diagram has five arrows, thus G has six productions:
1. A0 → aA0
2. A0 → aA1
3. A1 → bA2
4. A2 → bA0
5. A2 → A1
6. A2 → λ
Here is a G-derivation of aabbbab:
A0 ⇒(1) aA0 ⇒(2) aaA1 ⇒(3) aabA2 ⇒(5) aabA1 ⇒(3) aabbA2 ⇒(4) aabbbA0 ⇒(2) aabbbaA1 ⇒(3) aabbbabA2 ⇒(6) aabbbab
The above derivation proves that aabbbab is generated by G, that is, aabbbab ∈ L(G).
Thus, every language accepted by an NFA is generated by some left-linear grammar: we just use the above construction. The formal proof that it yields a left-linear grammar for the
same language is more detailed, but not too hard to understand.
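The NFA construction differs from the DFA one only in the unit productions contributed by λ-transitions. A Python sketch (the names are my own; the transition table is read off Figure 2 and its productions, with "" standing for λ):

```python
# NFA M of Figure 2: moves maps (state, symbol) to a SET of states,
# and the key (2, "") records the λ-transition from state 2 to state 1.
moves = {(0, "a"): {0, 1}, (1, "b"): {2}, (2, "b"): {0}, (2, ""): {1}}
finals = {2}

def nfa_to_grammar(moves, finals):
    prods = []
    for (i, a), targets in moves.items():
        for k in sorted(targets):
            # a λ-transition contributes a unit production Ai -> Ak
            prods.append(f"A{i} -> {a}A{k}" if a else f"A{i} -> A{k}")
    prods += [f"A{i} -> λ" for i in sorted(finals)]
    return prods

for p in nfa_to_grammar(moves, finals):
    print(p)  # six productions, matching the list above
```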
Ans: M has five states, and the minimal DFA which accepts L(M ) has 2^5 = 32 states. The following
regular grammar generates L(M ), where A0 is the start symbol.
1. A0 → aA0
2. A0 → bA0
3. A0 → aA1
4. A0 → bA1
5. A1 → aA2
6. A1 → bA2
7. A2 → aA3
8. A2 → bA3
9. A3 → aA4
10. A3 → bA4
11. A4 → λ
Context-Free Grammars
A grammar G is context-free if, for every production, the left-hand side is a single variable. We write CFG to
mean context-free grammar. The right-hand side of a production of a CFG can be any string of grammar
symbols. The class of context-free grammars is, arguably, the most important class of grammars we study.
A language L is called context-free if it is generated by some context-free grammar. Two grammars are said
to be equivalent if they generate the same language. Every context-free language is generated by infinitely
many different equivalent context-free grammars. We write CFL to mean context-free language.
Simple Examples
Simplest Example. Let G be the grammar with V = {S}, Σ = {a, b}, start symbol S, and
productions:
1. S → aSb
2. S → λ
L(G) = {a^n b^n : n ≥ 0}, arguably the simplest non-regular context-free language.
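Membership in this language is easy to test directly. A minimal Python sketch (the function name is my own) mirroring the grammar, where each use of rule 1 contributes one a on the left and one b on the right:

```python
def in_L(w):
    """w ∈ {a^n b^n : n ≥ 0} iff w is n copies of 'a' followed by
    n copies of 'b' -- exactly the strings rules 1 and 2 can build."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n

print([w for w in ["", "ab", "aabb", "aab", "ba"] if in_L(w)])  # ['', 'ab', 'aabb']
```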
Dyck Language. The Dyck language is the language of all balanced strings of left and right parentheses,
over the alphabet Σ = {(, )}. Here are three grammars for the Dyck language. In each case, S is
the start symbol and V = {S}.
G1
1. S → (S)
2. S → SS
3. S → λ
G2
1. S → S(S)
2. S → λ
G3
1. S → (S)S
2. S → λ
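All three grammars generate the same language, which also has a well-known direct characterization: a string of parentheses is balanced iff its running nesting depth never goes negative and ends at zero. A Python sketch (the function name is my own):

```python
def balanced(w):
    """True iff w is in the Dyck language: the running nesting depth
    never dips below zero and returns to zero at the end."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:          # a ')' with no matching '('
            return False
    return depth == 0

print([w for w in ["", "()", "(())()", ")(", "(()"] if balanced(w)])  # ['', '()', '(())()']
```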
Palindromes. A palindrome is a word which is its own reverse, such as "level" or "noon." Let L be the
language of all palindromes over the alphabet {a, b}. L is generated by the CFG:
1. S → aSa
2. S → bSb
3. S → a
4. S → b
5. S → λ
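A direct membership test mirrors the grammar: rules 1 and 2 strip a matching outer pair, and rules 3-5 accept whatever remains once at most one symbol is left. A Python sketch (the function name is my own):

```python
def is_palindrome(w):
    """Peel matching outer symbols (rules 1-2) until a single symbol
    (rules 3-4) or the empty string (rule 5) remains."""
    while len(w) > 1:
        if w[0] != w[-1]:
            return False
        w = w[1:-1]
    return True

print(is_palindrome("abba"), is_palindrome("ab"))  # True False
```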
If G is a CFG and L = L(G), then each w ∈ L has at least one derivation, and frequently more than one.
For example, let L = {a^n c^n b^m d^m : n, m ≥ 0}. Then L is generated by a grammar with variables S, A, B
and start symbol S:
1. S → AB
2. A → aAc
3. A → λ
4. B → bBd
5. B → λ
Consider two derivations of w = acbd:
S ⇒ AB ⇒ aAcB ⇒ acB ⇒ acbBd ⇒ acbd
S ⇒ AB ⇒ AbBd ⇒ Abd ⇒ aAcbd ⇒ acbd
The first of these is a left-most derivation, since at each step of the derivation, the left-most variable of the
sentential form is replaced by the rhs of a production. For example, in the second step, A is replaced by
aAc and B is not replaced. Similarly, the second derivation is a right-most derivation.
A CFG G is called unambiguous if every string w ∈ L(G) has exactly one left-most derivation.
Theorem 2
(a) A CFG G is unambiguous if and only if every string w ∈ L(G) has exactly one right-most derivation.
(b) A CFG G is unambiguous if and only if every string w ∈ L(G) has exactly one parse tree.
A CFG is ambiguous if it is not unambiguous. A CFL is called inherently ambiguous if it has no unam-
biguous CFG.
The grammar G1 for the Dyck language is ambiguous, while G2 and G3 are both unambiguous.
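This contrast can be checked by brute force. The sketch below is my own code, not from the text; it counts leftmost derivations of ()(). A step cap is needed for G1, since its productions S → SS and S → λ allow arbitrarily long derivations of every string, hence infinitely many in total:

```python
def count_leftmost(prods, w, max_steps):
    """Count leftmost derivations of w using at most max_steps steps,
    for a grammar whose single variable is S, with the given rhs list."""
    def count(form, steps):
        if "S" not in form:
            return 1 if form == w else 0
        if steps == 0 or len(form.replace("S", "")) > len(w):
            return 0
        i = form.index("S")
        if form[:i] != w[:i]:          # emitted prefix must match w
            return 0
        return sum(count(form[:i] + rhs + form[i + 1:], steps - 1)
                   for rhs in prods)
    return count("S", max_steps)

G1, G2 = ["(S)", "SS", ""], ["S(S)", ""]
print(count_leftmost(G1, "()()", 7) >= 2)  # True: G1 is ambiguous
print(count_leftmost(G2, "()()", 7))       # 1: unique leftmost derivation
```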
Dangling Else
Let G be the following context-free grammar, with start symbol S and terminal alphabet {a, i, e}:
1. S → a
2. S → iS
3. S → iSeS
Exercise 2 Show that G is ambiguous by giving two parse trees for the string iiaea.
L(G) does have an unambiguous grammar, but it is more complex than the ambiguous grammar.
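The ambiguity can also be confirmed by brute force (this does not replace drawing the two parse trees asked for in Exercise 2). The sketch below is my own code; since every production of G emits at least one terminal, derivations of w take at most |w| steps, so the search terminates without a step cap:

```python
def count_leftmost(prods, w):
    """Count leftmost derivations of w in a grammar whose only
    variable is S and whose productions all emit >= 1 terminal."""
    def count(form):
        if "S" not in form:
            return 1 if form == w else 0
        if len(form.replace("S", "")) >= len(w):
            return 0                    # every S still owes a terminal
        i = form.index("S")
        if form[:i] != w[:i]:           # emitted prefix must match w
            return 0
        return sum(count(form[:i] + rhs + form[i + 1:]) for rhs in prods)
    return count("S")

G = ["a", "iS", "iSeS"]
print(count_leftmost(G, "iiaea"))  # 2: two leftmost derivations, so G is ambiguous
```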
G models the “dangling else” problem for programming languages. In the following C++ fragment, what
value will be output?
int x = 0;
int y = 0;
int z = 3;
if(x == 1)
if(y == 0)
z = 2;
else z = 4;
cout << z << endl;