CFG-TOC-2
A Context-Free Grammar (CFG) is a formal grammar used to define the syntax of programming
languages, formal languages, and mathematical logic. It consists of a set of rules that describe how strings
in a language can be generated. In CFG, the production rules are applied to symbols, and the left-hand
side of each rule contains a single non-terminal symbol, which is replaced by a sequence of terminal
and/or non-terminal symbols.
1. Variables (Non-terminal symbols): These are symbols used to represent patterns or structures
that can be expanded into other symbols. They are typically written in uppercase letters (e.g., S,
A, B).
2. Terminals: These are the basic symbols from which strings are formed. They cannot be replaced
or further expanded. In a programming language, terminals could be characters like a, b, 0, 1, or
keywords like if, while, etc.
3. Start symbol: This is a special non-terminal symbol that represents the start of the production
process. Typically, it is denoted as S.
4. Production rules: These are the rules that define how non-terminal symbols can be replaced by
other non-terminal or terminal symbols. A production rule is generally in the form:
o A → α, where A is a non-terminal and α is a string of terminals and/or non-terminals (e.g., A → aB | b).
Let's consider a simple CFG for a language that generates strings of balanced parentheses:
Non-terminals: S
Terminals: (, )
Start symbol: S
Production rules:
1. S → (S)
2. S → SS
3. S → ε (where ε denotes the empty string)
This CFG generates strings such as (), ()(), (()), and (()()), which represent balanced
parentheses.
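As a quick sanity check of what this grammar generates, here is a small Python sketch (written for these notes; the depth bound of 6 is an arbitrary illustrative choice) that enumerates the strings derivable from S up to a bounded derivation depth:

# Enumerate strings generated by the balanced-parentheses grammar
# S -> (S) | SS | ε, up to a bounded derivation depth.
def expand(form, depth):
    i = form.find("S")
    if i == -1:              # no non-terminals left: a terminal string of the language
        yield form
        return
    if depth == 0:
        return               # depth bound reached: abandon this branch
    for rhs in ("(S)", "SS", ""):      # the three productions for S
        yield from expand(form[:i] + rhs + form[i + 1:], depth - 1)

strings = sorted(set(expand("S", 6)), key=len)
print(strings[:6])   # e.g. ['', '()', '(())', '()()', ...]

Every string the sketch prints contains only balanced parentheses, because each expansion step applies one of the three productions.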
Context-freeness: The term "context-free" means that the rules are applied regardless of the
context in which the non-terminal appears. For example, in the rule A → α, A can be replaced by
α no matter where it appears in the string.
Generative power: CFGs are powerful enough to describe a wide variety of languages, but they
are not capable of expressing all possible language constructs (e.g., context-sensitive languages,
which require a different class of grammars).
Programming languages: CFGs are widely used to describe the syntax of programming
languages. The syntax of most modern programming languages (like Python, Java, etc.) can be
defined using CFGs.
Compilers: CFGs play a crucial role in the design of compilers, which use them to parse the
source code into a structure that can be processed further (often using techniques like syntax trees
or abstract syntax trees).
Natural language processing (NLP): CFGs are used to describe the syntax of natural languages
in computational linguistics.
A CFG for a simple arithmetic expression involving addition and multiplication is given in the derivation example below.
While context-free grammars are powerful, they have limitations, particularly in expressing some
complex language structures, such as:
Context-sensitive languages: CFGs cannot handle languages where the rules depend on the
context in which a symbol appears (e.g., ensuring that variables are declared before they are
used).
Ambiguity: Some CFGs can be ambiguous, meaning that a string can have multiple parse trees
(derivations). This can be problematic for compilers and parsers, where a unique interpretation is
necessary.
1. Derivations
A derivation is a sequence of rule applications starting from the start symbol of the grammar, eventually
leading to a string of terminals (a valid string in the language). In a derivation, you replace non-terminal
symbols with either other non-terminals or terminals according to the production rules of the grammar.
Example of a Derivation:
Non-terminals: E, T, F
Terminals: +, *, (, ), 0, 1, 2, ..., 9
Start symbol: E
Production rules:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → number
We derive the string 3 + 4 * 5 from this grammar:
1. Start with the start symbol: E.
2. Apply rule E → E + T:
E + T
3. Apply E → T to the left-hand E:
T + T
4. Apply T → F and then F → number to the left-hand T:
number + T
5. Apply T → T * F to the remaining T, then T → F to each T:
number + F * F
6. Apply F → number to both F symbols and replace the number tokens with the actual numbers 3, 4, and 5:
3 + 4 * 5
This is the final derived string, which is a valid string in the language generated by the CFG.
A derivation tree, also known as a parse tree, is a tree-like structure that represents the syntactic
structure of a string derived from a grammar. It shows how the derivation progresses from the start
symbol, breaking down into non-terminals and eventually to terminals.
Each internal node of the tree represents a non-terminal, and each leaf node represents a terminal symbol
in the string. The tree captures the hierarchical structure of the derivation, displaying which production
rules were applied at each step.
Using the same CFG as above, let's visualize the derivation tree for the string 3 + 4 * 5:
              E
            / | \
           E  +  T
           |    /|\
           T   T * F
           |   |   |
           F   F   number
           |   |   |
        number number  5
           |   |
           3   4
Start symbol: The root of the tree is the start symbol of the grammar.
Non-terminals: Each internal node represents a non-terminal symbol in the grammar.
Terminals: Each leaf node represents a terminal symbol (an actual symbol in the string).
Production rules: The branches of the tree represent applications of production rules.
Uniqueness: A derivation tree is unique for any given string in a grammar, assuming the
grammar is unambiguous.
Important Points:
Derivations are a sequence of rule applications starting from the start symbol that eventually produce a string of terminals. The sequence of derivation steps can be represented as a tree structure, where each step in the derivation corresponds to a subtree of the tree.
Derivation Trees are the graphical representation of the entire process of derivation. Every string
generated by a CFG has a corresponding derivation tree, where the structure of the tree reflects
the hierarchical relationships between non-terminals and terminals as defined by the grammar's
production rules.
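To make the relationship between a derivation tree and the string it yields concrete, here is a small Python sketch (written for these notes) that encodes the tree above as nested tuples and reads its leaves from left to right:

# The parse tree for 3 + 4 * 5 under the E/T/F grammar, written as nested tuples
# of the form (node_label, children...). Leaves are terminal symbols.
tree = ("E",
        ("E", ("T", ("F", "3"))),
        "+",
        ("T", ("T", ("F", "4")), "*", ("F", "5")))

def leaves(t):
    """Read the terminal symbols off the tree from left to right (its yield)."""
    if isinstance(t, str):
        return [t]
    label, *children = t
    return [sym for child in children for sym in leaves(child)]

print(" ".join(leaves(tree)))   # 3 + 4 * 5

Reading the leaves in left-to-right order recovers exactly the derived string, which is what it means for the tree to represent the derivation.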
In the context of Context-Free Grammars (CFGs), leftmost and rightmost derivations refer to specific
strategies for replacing non-terminal symbols in a string during the derivation process. These strategies
define the order in which non-terminal symbols are replaced with their corresponding production rules.
1. Leftmost Derivation
In a leftmost derivation, at each step, the leftmost non-terminal is replaced first. That is, you start with
the leftmost non-terminal in the string and apply a production rule to it, then continue this process
iteratively.
Using the same CFG, we derive the string 3 + 4 * 5 with a leftmost derivation.
1. Start with the start symbol: E.
2. Apply E → E + T:
E + T
3. Apply E → T to the leftmost non-terminal:
T + T
4. Apply T → F to the leftmost non-terminal:
F + T
5. Apply F → number to the leftmost non-terminal:
number + T
6. Apply T → T * F to the remaining T:
number + T * F
7. Apply T → F and then F → number, working left to right:
number + number * number
8. Replace the number symbols with 3, 4, and 5 to obtain the final string:
3 + 4 * 5
2. Rightmost Derivation
In a rightmost derivation, at each step, the rightmost non-terminal is replaced first. This means that
you focus on the rightmost non-terminal in the string and apply a production rule to it, and repeat this
process until the string consists only of terminal symbols.
Using the same CFG, we will derive the string 3 + 4 * 5 using a rightmost derivation.
1. Start with the start symbol: E.
2. Apply E → E + T:
E + T
3. Apply T → T * F to the rightmost non-terminal:
E + T * F
4. Apply F → number to the rightmost non-terminal:
E + T * number
5. Apply T → F and then F → number to the rightmost non-terminal:
E + number * number
6. Apply E → T and then T → F:
F + number * number
7. Apply F → number:
number + number * number
8. Replace the number symbols with 3, 4, and 5 to obtain the final string:
3 + 4 * 5
Example: Consider a simpler grammar with the production rules:
1. S → AB
2. A → a
3. B → b
Leftmost Derivation:
1. S
2. A B (apply S → AB)
3. a B (apply A → a)
4. a b (apply B → b)
Rightmost Derivation:
1. S
2. A B (apply S → AB)
3. A b (apply B → b)
4. a b (apply A → a)
Both derivations result in the same final string, ab, but the order in which the rules are applied differs.
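To make the two strategies concrete, here is a small Python sketch (an illustration written for these notes, not part of any standard library) that replays both derivations for this grammar by always expanding either the leftmost or the rightmost non-terminal:

# Leftmost vs. rightmost derivation for the grammar S -> AB, A -> a, B -> b.
# Each non-terminal here has exactly one production, so only the order differs.
RULES = {"S": "AB", "A": "a", "B": "b"}

def derive(start, leftmost=True):
    form = start
    steps = [form]
    while any(sym in RULES for sym in form):
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        i = positions[0] if leftmost else positions[-1]   # pick which non-terminal to expand
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(form)
    return steps

print(derive("S", leftmost=True))    # ['S', 'AB', 'aB', 'ab']
print(derive("S", leftmost=False))   # ['S', 'AB', 'Ab', 'ab']

The two printed lists show the intermediate sentential forms; they differ only in the middle step.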
A sentential form in formal language theory is a string of symbols (both terminals and non-terminals)
that can be derived from the start symbol of a context-free grammar (CFG). It is an intermediate step in
the derivation process, where the string can contain non-terminal symbols, and these non-terminals can
still be replaced using the production rules of the grammar.
In simple terms, a sentential form is any string that can be produced during the derivation process, before reaching a final string made entirely of terminals (a valid string in the language).
Example:
Non-terminals: S, A, B
Terminals: a, b
Start symbol: S
Production rules:
1. S → AB
2. A → a
3. B → b
We can now illustrate sentential forms and their derivations.
1. Start symbol: S
2. Apply S → AB: AB
3. Apply A → a: aB
4. Apply B → b: ab
Here S, AB, aB, and ab are all sentential forms; ab is also a sentence, since it contains only terminals.
In this process, each step where non-terminal symbols are replaced by terminal or non-terminal symbols
results in a new sentential form. The last step, when all symbols are terminals, results in a string that
belongs to the language defined by the grammar.
Parsing is the process of analyzing a string of symbols (often called an input string) based on a
formal grammar.
Types of Parsing:
Top-down Parsing: Starts from the start symbol and tries to match the input string by expanding
non-terminals using production rules. Examples: Recursive Descent Parsing.
Bottom-up Parsing: Starts with the input string and works its way back up to the start symbol,
trying to reduce the string using the grammar's production rules. Examples: Shift-Reduce
Parsing, LR Parsing.
Example of Parsing:
Non-terminals: S
Terminals: a, b
Start symbol: S
Production rules:
1. S → aSb
2. S → ε (where ε represents the empty string)
To parse the string aabb using this grammar, you would proceed as follows:
1. S ⇒ aSb (apply S → aSb)
2. aSb ⇒ aaSbb (apply S → aSb again)
3. aaSbb ⇒ aabb (apply S → ε)
Thus, the string aabb is derived, and a parse tree would represent this process.
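As a concrete illustration of top-down parsing, here is a minimal recursive-descent recognizer for this grammar, written in Python for these notes (function names such as parse_S are just illustrative choices):

# Recursive-descent recognizer for the grammar S -> aSb | ε.
# parse_S tries to match one S starting at position i and returns the
# position just after the matched part; S -> ε matches nothing.
def parse_S(w, i=0):
    if i < len(w) and w[i] == "a":        # try the production S -> aSb
        j = parse_S(w, i + 1)
        if j < len(w) and w[j] == "b":
            return j + 1
    return i                               # fall back to S -> ε

def accepts(w):
    return parse_S(w) == len(w)            # the whole input must be consumed

print(accepts("aabb"))   # True
print(accepts("aab"))    # False

Each non-terminal of the grammar becomes one function, which is exactly the structure a recursive descent parser mirrors.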
Ambiguity in Parsing
Ambiguity in the context of parsing refers to a situation where a given string can be parsed in multiple
ways, leading to different parse trees. This occurs when a grammar allows more than one valid way to
generate a string, meaning the grammar does not define a unique structure for every string in the
language.
Example of Ambiguity:
Non-terminals: E (Expression)
Terminals: +, *, (, ), and a (representing operands)
Start symbol: E
Production rules:
1. E → E + E
2. E → E * E
3. E → a
Consider the string a + a * a. The grammar allows two different parse trees for it.
Parse tree 1: apply E → E + E first, then expand the right-hand E using E → E * E:
            E
          / | \
         E  +  E
         |   / | \
         a  E  *  E
            |     |
            a     a
Parse tree 2: apply E → E * E first, then expand the left-hand E using E → E + E:
              E
            / | \
           E  *  E
         / | \   |
        E  +  E  a
        |     |
        a     a
In this case:
The first parse tree represents the expression a + (a * a) (the multiplication is nested deeper, so it is evaluated first).
The second parse tree represents the expression (a + a) * a (the addition is nested deeper, so it is evaluated first).
Thus, ambiguity arises from the fact that the grammar allows different interpretations of the same string.
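One way to see the ambiguity mechanically is to enumerate leftmost derivations by brute force. The following Python sketch (written for these notes; the pruning bounds are simple illustrative choices) counts how many distinct leftmost derivations of a + a * a this grammar admits; each leftmost derivation corresponds to one parse tree:

# Count distinct leftmost derivations of a target string under the ambiguous
# grammar E -> E+E | E*E | a. Two derivations means two parse trees.
RULES = ["E+E", "E*E", "a"]

def count_derivations(form, target):
    i = form.find("E")                      # position of the leftmost non-terminal
    if i == -1:
        return 1 if form == target else 0   # all terminals: compare with the target
    # prune branches that can no longer lead to the target
    if form[:i] != target[:i] or len(form) > len(target):
        return 0
    return sum(count_derivations(form[:i] + rhs + form[i + 1:], target)
               for rhs in RULES)

print(count_derivations("E", "a+a*a"))   # 2, so the grammar is ambiguous for this string

The pruning is safe because every remaining symbol must still produce at least one terminal, so a sentential form longer than the target can never reach it.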
Consequences of Ambiguity:
Uncertainty in Interpretation: Ambiguous grammars can lead to multiple possible meanings for
the same string, which is undesirable in both programming languages and natural languages.
Parsing Complexity: Ambiguity makes it difficult for parsers to determine a unique syntactic
structure, leading to complexities in parsing and requiring more sophisticated algorithms to
resolve ambiguity.
Compilation Errors: In programming languages, ambiguity in the grammar could result in
incorrect or inconsistent interpretation of code, leading to compilation errors.
Eliminating Ambiguity:
To make a grammar unambiguous (i.e., to ensure that each string has a unique parse tree), there are
several techniques:
Rewrite the Grammar: Modify the grammar to ensure that it generates a unique parse tree for
every string. For example, you could introduce precedence and associativity rules for operators to
resolve ambiguity.
o E → E + T
o E → T
o T → T * F
o T → F
o F → a
(A small sketch after this list checks that this rewritten grammar yields exactly one parse tree for a + a * a.)
Use Operator Precedence: In arithmetic expressions, operator precedence can be enforced,
ensuring that multiplication is performed before addition, without requiring ambiguous rules.
Left-Factoring: This technique involves reorganizing production rules to avoid ambiguity caused
by common prefixes.
Example: the ambiguous rules E → E + E | E * E | a from above can be rewritten as
E → E + T | T
T → T * F | F
F → a
which fixes the precedence of * over + and gives every string a single parse tree.
Parentheses for Grouping: Use parentheses to explicitly define the order of operations in expressions,
eliminating ambiguity from the grammar.
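The same brute-force counting idea from the ambiguity example can confirm that the rewritten grammar is unambiguous for this string. This Python sketch (again an illustration written for these notes) expands the leftmost non-terminal under the precedence grammar E → E + T | T, T → T * F | F, F → a:

# Count leftmost derivations of a + a * a under the precedence grammar.
# Exactly one derivation means exactly one parse tree for this string.
RULES = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["a"]}

def count(form, target):
    i = next((k for k, s in enumerate(form) if s in RULES), -1)
    if i == -1:
        return 1 if form == target else 0
    if form[:i] != target[:i] or len(form) > len(target):
        return 0
    return sum(count(form[:i] + rhs + form[i + 1:], target)
               for rhs in RULES[form[i]])

print(count("E", "a+a*a"))   # 1: a unique parse tree under the rewritten grammar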
Simplifying a CFG means removing productions and symbols that do not contribute to the language: epsilon (ε) productions, unit productions, and useless symbols.
2. Eliminate Epsilon (ε) Productions: An ε-production is a production of the form A → ε.
Procedure:
o Identify non-terminals that can generate the empty string (i.e., A → ε).
o For every production of the form A → X1 X2 ... Xn in which some Xi can derive ε, add new productions obtained by deleting the nullable symbols from the right-hand side, in every possible combination.
o Remove the ε-productions (except S → ε when the empty string belongs to the language) after the new rules have been added. (A small sketch after this list computes the nullable non-terminals for the example grammar below.)
3. Eliminate Unit Productions: A unit production is a production of the form A → B, where A is
a non-terminal that produces another non-terminal B directly.
Procedure:
o For every unit production A → B, find all the productions for B, and replace A → B with
those productions. This can eliminate indirect non-terminal dependencies.
o Remove the unit production after the substitution.
4. Eliminate Useless Non-Terminals: These are non-terminals that cannot be derived from the start
symbol or do not contribute to generating any valid strings.
Procedure:
o After eliminating unit productions and epsilon productions, review the grammar and
ensure all non-terminals are either reachable from the start symbol or can generate
terminal strings.
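As a small illustration of the ε-elimination step above, the following Python sketch (written for these notes) computes the set of nullable non-terminals, i.e., the symbols that can derive ε, for the grammar used in the example below:

# Compute the nullable non-terminals of a grammar: the symbols that can derive ε.
# productions maps each non-terminal to a list of right-hand sides ("" denotes ε).
def nullable_set(productions):
    nullable = set()
    changed = True
    while changed:                      # iterate until no new nullable symbol is found
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

# Grammar from the example below: S -> AB, A -> aA | ε, B -> bB | C, C -> ε
g = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "C"], "C": [""]}
print(nullable_set(g))   # {'S', 'A', 'B', 'C'} (order may vary)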
Example: Simplification of a CFG:
Production rules:
1. S → AB
2. A → aA | ε
3. B → bB | C
4. C → ε
Step 1: Eliminate ε-Productions
The nullable non-terminals are A (A → ε), C (C → ε), B (because B → C and C → ε), and S (because S → AB and both A and B are nullable). Since S is nullable, the empty string belongs to the language, so the start symbol keeps an ε-production.
For S → AB: add S → A and S → B (dropping a nullable symbol), and keep S → ε.
For A → aA: add A → a.
For B → bB: add B → b.
Remove A → ε and C → ε.
The grammar becomes:
S → AB | A | B | ε
A → aA | a
B → bB | b | C
C now has no productions at all.
Step 2: Eliminate Unit Productions
The unit productions are S → A, S → B, and B → C.
Replace S → A with A's productions (S → aA | a) and S → B with B's productions (S → bB | b). The production B → C is removed, because C no longer has any productions.
The grammar becomes:
S → AB | aA | a | bB | b | ε
A → aA | a
B → bB | b
Step 3: Eliminate Useless Symbols
C is a useless non-terminal: its only production, C → ε, has been removed, so it can no longer derive any terminal string.
So, C is removed from the grammar.
Final simplified grammar:
Non-terminals: S, A, B
Terminals: a, b
Start symbol: S
Production rules:
1. S → AB | aA | a | bB | b | ε
2. A → aA | a
3. B → bB | b
This grammar contains no unit productions and no useless symbols; the only remaining ε-production is S → ε, which is kept because the empty string belongs to the language.
The two most common normal forms are Chomsky Normal Form (CNF) and Greibach Normal Form
(GNF).
A CFG is in Chomsky Normal Form (CNF) if all of its production rules are of the following forms:
A → BC where A, B, and C are non-terminal symbols, and B and C are not the start symbol.
A → a where A is a non-terminal and a is a terminal symbol.
A → ε is allowed only if the language includes the empty string ε, and A is the start symbol.
Properties of CNF:
All productions are either of the form A → BC or A → a (except for the production for the empty
string).
The grammar is highly restrictive, but it helps in algorithms like the CYK (Cocke-Younger-
Kasami) parsing algorithm.
Conversion to CNF:
Converting a grammar to CNF involves eliminating ε-productions and unit productions, and then ensuring every right-hand side is either a single terminal or exactly two non-terminals.
Example:
Original grammar:
S → AB | a
A → a | ε
B → b
Step 1: Eliminate the ε-production A → ε. Since A is nullable, S → AB also yields S → B:
S → AB | a | B
A → a
B → b
Step 2: Eliminate the unit production S → B by replacing it with B's productions, i.e., replace S → B with S → b.
Final Grammar:
o S → AB | a | b
o A → a
o B → b
Every production is now of the form A → BC or A → a, so the grammar is in CNF.
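To make the CNF condition concrete, here is a small Python check (an illustrative helper written for these notes; it ignores the special S → ε case) that verifies every production is either two non-terminals or a single terminal:

# Check whether a grammar is in CNF: every right-hand side is either two
# non-terminals (A -> BC) or a single terminal (A -> a).
def is_cnf(productions, nonterminals):
    for head, bodies in productions.items():
        for body in bodies:
            two_nonterminals = len(body) == 2 and all(s in nonterminals for s in body)
            one_terminal = len(body) == 1 and body not in nonterminals
            if not (two_nonterminals or one_terminal):
                return False
    return True

# Final grammar from the example: S -> AB | a | b, A -> a, B -> b
g = {"S": ["AB", "a", "b"], "A": ["a"], "B": ["b"]}
print(is_cnf(g, {"S", "A", "B"}))   # True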
A CFG is in Greibach Normal Form (GNF) if all its production rules are of the following form:
A → aα, where a is a terminal symbol and α is a (possibly empty) string of non-terminal symbols.
Properties of GNF:
Every production starts with a terminal symbol, followed by zero or more non-terminal symbols.
GNF is useful in certain parsing algorithms and helps in generating deterministic top-down
parsers.
Conversion to GNF:
Converting a grammar to GNF typically involves eliminating ε-productions and unit productions, removing left recursion, and then substituting so that every production begins with a terminal symbol.
Example:
Original grammar:
S → AB
A → a | ε
B → b
Step 1: Eliminate the ε-production A → ε. Since A is nullable, S → AB also yields S → B:
S → AB | B
A → a
B → b
Step 2: Make every production begin with a terminal: substitute A → a into S → AB, and replace the unit production S → B with S → b:
S → aB | b
A → a
B → b
Every production now starts with a terminal symbol followed by non-terminals, so the grammar is in GNF.
Comparison Between CNF and GNF:
CNF is more restrictive and is commonly used for algorithms like CYK parsing because of its
binary nature (each production has at most two non-terminals on the right-hand side).
GNF is more suitable for top-down parsers, as it ensures that every production starts with a
terminal symbol, allowing for easier construction of recursive descent parsers.
Problems Related to Chomsky Normal Form (CNF) and Greibach Normal Form (GNF)
The Membership Problem is a fundamental problem that asks whether a given string belongs to
the language generated by a context-free grammar (CFG). This problem can be addressed using
different techniques depending on whether the grammar is in CNF, GNF, or general form.
The Membership Problem is the problem of determining if a given string w is generated by a context-
free grammar G, i.e., whether w ∈ L(G), where L(G) is the language generated by the CFG G.
The problem can be solved using different algorithms depending on the form of the grammar.
In the case of Chomsky Normal Form (CNF), the grammar is structured in a way that makes it easy to implement an efficient parsing algorithm for the Membership Problem. The standard technique is the CYK (Cocke-Younger-Kasami) algorithm, a dynamic-programming method.
CYK Algorithm
1. Convert the Grammar to CNF: Ensure the grammar is in Chomsky Normal Form.
2. Create a Parsing Table:
o Let w = w₁w₂...wₖ be the input string with length k.
o Create a table of size k × k where the entry at [i, j] represents the non-terminals that can
generate the substring wᵢ...wⱼ.
3. Fill the Table:
o For each substring length from 1 to k, and for each substring of that length, check which
non-terminals can derive that substring by looking at the production rules.
4. Check for the Start Symbol:
o If the start symbol S appears in the entry [1, k] (the entry representing the entire string
w), then the string w is in the language generated by the grammar. Otherwise, it is not.
Complexity of CYK: The time complexity of CYK is O(k³), where k is the length of the input string w.
This makes CYK an efficient algorithm for solving the Membership Problem for CNF grammars.
Example:
S → AB
A→a
B→b
Let w = ab.
Using CYK, we would fill a table to check if w can be derived from the start symbol S. After filling the
table, we would check if S can generate the entire string, which it does in this case.
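A minimal Python sketch of the CYK table-filling procedure, written for these notes and assuming the grammar is already in CNF (the rule sets and function name are illustrative choices):

# CYK membership test for a CNF grammar.
# unit_rules:   pairs (A, a)      for productions A -> a
# binary_rules: pairs (A, (B, C)) for productions A -> BC
def cyk(w, unit_rules, binary_rules, start="S"):
    n = len(w)
    if n == 0:
        return False                      # handling of ε is omitted in this sketch
    # table[length][i] = set of non-terminals that derive the substring w[i : i + length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, ch in enumerate(w):            # substrings of length 1: use the rules A -> a
        for head, body in unit_rules:
            if body == ch:
                table[1][i].add(head)
    for length in range(2, n + 1):        # longer substrings: try every split point
        for i in range(n - length + 1):
            for split in range(1, length):
                for head, (b, c) in binary_rules:
                    if b in table[split][i] and c in table[length - split][i + split]:
                        table[length][i].add(head)
    return start in table[n][0]           # does the start symbol derive the whole string?

# Grammar from the example: S -> AB, A -> a, B -> b
unit_rules = [("A", "a"), ("B", "b")]
binary_rules = [("S", ("A", "B"))]
print(cyk("ab", unit_rules, binary_rules))   # True
print(cyk("ba", unit_rules, binary_rules))   # False

The three nested loops over substring length, starting position, and split point are where the O(k³) running time comes from.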
In the case of Greibach Normal Form (GNF), the production rules are structured such that each
production starts with a terminal symbol followed by a sequence of non-terminals. This form is well-
suited for top-down parsers, especially for recursive descent parsing, which can be used to solve the
Membership Problem.
For GNF, a top-down recursive descent parser can be employed to determine whether a string w is
generated by a grammar in GNF. Here's how it works:
The recursive descent parser will try to match the string w by recursively applying the production rules
starting from the start symbol. The parser attempts to break down w using the terminal symbols followed
by non-terminals, as dictated by the GNF rules.
Complexity of Top-Down Parsing:
A top-down parser may take exponential time in the worst case, especially if the grammar has a
lot of recursion. However, it is more efficient than brute-force methods when dealing with
grammars in GNF, as each production starts with a terminal symbol.
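A minimal backtracking recognizer for a GNF grammar can be sketched as follows (Python, written for these notes; the grammar S → aSB | aB, B → b for aⁿbⁿ with n ≥ 1 is an illustrative GNF example, not one from the text). Because every production consumes exactly one terminal, each recursive step advances the input position:

# Backtracking recognizer for a grammar in GNF, where every production has the
# shape A -> a B1 B2 ... Bn (one terminal followed by zero or more non-terminals).
def gnf_parse(rules, symbols, w, pos=0):
    """Try to derive w[pos:] from the list of grammar symbols `symbols`."""
    if not symbols:                         # nothing left to expand
        return pos == len(w)                # succeed only if all input was consumed
    head, rest = symbols[0], symbols[1:]
    for terminal, tail in rules[head]:      # try each production head -> terminal tail
        if pos < len(w) and w[pos] == terminal:
            if gnf_parse(rules, list(tail) + rest, w, pos + 1):
                return True
    return False                            # no production worked: backtrack

# Illustrative GNF grammar for a^n b^n (n >= 1): S -> aSB | aB, B -> b
rules = {
    "S": [("a", ["S", "B"]), ("a", ["B"])],
    "B": [("b", [])],
}
print(gnf_parse(rules, ["S"], "aabb"))   # True
print(gnf_parse(rules, ["S"], "aab"))    # False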
1. Conversion to CNF:
o Given a CFG, the process of converting it into Chomsky Normal Form involves
eliminating epsilon-productions, unit-productions, and ensuring all right-hand sides are
either two non-terminals or a single terminal. This can be a complex task for large
grammars.
o Problem: The conversion to CNF may lead to an exponential increase in the size of the
grammar, especially when the grammar has many rules.
2. Conversion to GNF:
o Converting a CFG to Greibach Normal Form can also be challenging because each
production must start with a terminal symbol followed by a sequence of non-terminals.
This conversion may require significant transformations, including handling left
recursion.
o Problem: Left recursion must be removed before converting to GNF, which can be non-
trivial.
3. Ambiguity in CNF and GNF:
o Both CNF and GNF are not immune to ambiguity. A CFG may still generate multiple
parse trees even when it's in CNF or GNF.
o Problem: Determining whether a CFG in CNF or GNF is ambiguous (i.e., generates multiple parse trees for the same string) is undecidable in general.