Chapter – 4
SYNTAX ANALYSIS
Introduction
The Role of the Parser
The syntax analyzer creates the syntactic structure of the given source program. This
syntactic structure is usually a parse tree. The syntax analyzer is also known as the parser.
The syntax of a programming language is described by a context-free grammar (CFG). We will
use BNF (Backus-Naur Form) notation in the description of CFGs.
The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar. If it does, the parser creates the parse tree of that
program; otherwise, the parser issues error messages.
A context-free grammar gives a precise syntactic specification of a programming
language. The design of the grammar is an initial phase of the design of a compiler, and a
grammar can be converted directly into a parser by some tools. The parser works on a stream
of tokens; the token is the smallest unit it handles.
The position of the parser in the compiler is shown below.
Representative Grammar
Some of the grammars that will be examined in this unit are presented here for ease of
reference. Constructs that begin with keywords like while or int are relatively easy to parse,
because the keyword guides the choice of the grammar production that must be applied to
match the input.
We concentrate on expressions, which present more of a challenge because of the
associativity and precedence of operators.
Associativity and precedence are captured in the following grammar, which describes
expressions, terms, and factors. E represents expressions consisting of terms separated by +
signs, T represents terms consisting of factors separated by * signs, and F represents factors,
which can be either parenthesized expressions or identifiers.
E→E+T|T
T→T*F|F
F → ( E ) | id
The above expression grammar belongs to the class of LR grammars that are suitable for
bottom-up parsing. This grammar can be adapted to handle additional operators and
additional levels of precedence. However, it cannot be used for top-down parsing because it
is left recursive.
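Immediate left recursion of the form A → Aα | β can be removed mechanically by rewriting it as A → βA′, A′ → αA′ | Ɛ. A minimal sketch of this transformation, assuming each alternative is encoded as a Python list of symbols (the empty list standing for Ɛ):

```python
def eliminate_immediate_left_recursion(head, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
         A  -> b1 A' | ... | bn A'
         A' -> a1 A' | ... | am A' | epsilon
       where epsilon is represented by the empty list."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    others = [alt for alt in alternatives if not alt or alt[0] != head]
    if not recursive:
        return {head: alternatives}            # no left recursion: unchanged
    new_head = head + "'"
    return {
        head: [alt + [new_head] for alt in others],
        new_head: [alt + [new_head] for alt in recursive] + [[]],
    }

# E -> E + T | T  becomes  E -> T E'  and  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```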
The following non-left-recursive grammar is used for top-down parsing.
E → T E’
E’ → + T E’ | Ɛ
T → F T’
T’ → * F T’ | Ɛ
F → ( E ) | id
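The non-left-recursive grammar above can be parsed top-down with one recursive routine per non-terminal and no backtracking. A minimal sketch, assuming the input has already been tokenized into a list of strings with identifiers collapsed to the single token 'id':

```python
# Recursive-descent predictive parser for:
#   E  -> T E'      E' -> + T E' | epsilon
#   T  -> F T'      T' -> * F T' | epsilon
#   F  -> ( E ) | id
# One token of lookahead decides which production to apply.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ['$']   # '$' marks end of input
        self.pos = 0

    def look(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t!r}, found {self.look()!r}")
        self.pos += 1

    def E(self):
        self.T()
        self.E_prime()

    def E_prime(self):
        if self.look() == '+':
            self.match('+'); self.T(); self.E_prime()
        # otherwise apply E' -> epsilon: consume nothing

    def T(self):
        self.F()
        self.T_prime()

    def T_prime(self):
        if self.look() == '*':
            self.match('*'); self.F(); self.T_prime()

    def F(self):
        if self.look() == '(':
            self.match('('); self.E(); self.match(')')
        else:
            self.match('id')

def parse(tokens):
    p = Parser(tokens)
    p.E()
    p.match('$')       # the whole input must be consumed
    return True

print(parse(['id', '+', 'id', '*', 'id']))   # True
```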
The following grammar treats + and * alike, so it is useful for illustrating techniques for
handling ambiguities during parsing.
E → E + E | E * E | ( E ) | id
Here, E represents expressions of all types. This grammar permits more than one parse tree for
expressions like a + b * c.
Syntax Error Handling
Common programming errors can occur at many different levels.
1. Lexical errors: include misspelling of identifiers, keywords, or operators.
Eg: the use of an identifier elipseSize instead of ellipseSize and missing quotes
around text intended as a string.
2. Syntactic errors: include misplaced semicolons or extra or missing braces.
Eg: in C or Java, the appearance of a case statement without an enclosing switch is a
syntactic error.
3. Semantic errors: include type mismatches between operators and operands.
Eg: a return statement in a Java method with result type void.
4. Logical errors: result from incorrect reasoning on the part of the programmer.
Eg: using the assignment operator = instead of the comparison operator ==.
Goals of the Parser
• Report the presence of errors clearly and accurately
• Recover from each error quickly enough to detect subsequent errors
• Add minimal overhead to the processing of correct programs
Phrase-Level Recovery
• A parser may perform local correction on the remaining input; i.e., it may replace a prefix of
the remaining input by some string that allows the parser to continue.
Ex: replace a comma by a semicolon, insert a missing semicolon.
• The choice of local correction is left to the compiler designer.
• It is used in several error-repairing compilers, as it can correct any input string.
• A drawback is the difficulty of coping with situations in which the actual error occurred
before the point of detection.
Error Productions
• By anticipating common errors that might be encountered, we can augment the grammar for
the language at hand with productions that generate the erroneous constructs.
• Then we can use the grammar augmented by these error productions to construct a parser.
• If an error production is used by the parser, we can generate appropriate error diagnostics
to indicate the erroneous construct that has been recognized in the input.
Global Correction
• We use algorithms that perform a minimal sequence of changes to obtain a globally least-cost
correction.
• Given an incorrect input string x and grammar G, these algorithms find a parse tree for
a related string y, such that the number of insertions, deletions, and changes of tokens
required to transform x into y is as small as possible.
• It is too costly to implement in terms of time and space, so these techniques are only of
theoretical interest.
Context-Free Grammars
• Inherently recursive structures of a programming language are defined by a context-free
grammar.
• In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of production rules of the following form:
• A → α, where A is a non-terminal and α is a string of terminals and non-terminals
(possibly the empty string)
– A start symbol (one of the non-terminals)
NOTATIONAL CONVENTIONS
1. Symbols used for terminals are:
a. Lower case letters early in the alphabet (such as a, b, c, . . .)
b. Operator symbols (such as +, *, . . . )
c. Punctuation symbols (such as parentheses, the comma, and so on)
d. The digits 0, 1, …, 9
e. Boldface strings and keywords (such as id or if) each of which represents a single terminal
symbol
2. Symbols used for non terminals are:
a. Uppercase letters early in the alphabet (such as A, B, C, …)
b. The letter S, which when it appears is usually the start symbol.
c. Lowercase, italic names (such as expr or stmt).
3. Lowercase Greek letters, for example α, β, γ, represent (possibly empty) strings of
grammar symbols.
4. Upper case letters late in the alphabet such as X, Y, Z represent grammar symbols,
either nonterminals or terminals.
5. Lowercase letters late in the alphabet, such as u, v, …, z, represent strings of
terminals.
6. A set of productions A → α1 | α2 | … | αk; we call α1, α2, …, αk the alternatives for A.
Example: using the above notations, list the terminals, non-terminals, and start symbol of the
following grammar:
E→E+T|E–T|T
T →T * F | T / F | F
F → ( E ) | id
Here the terminals are +, -, *, /, (, ) and id.
The non-terminals are E, T and F.
The start symbol is E.
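This classification can be automated: the non-terminals are exactly the symbols that appear as production heads, and every other symbol in a body is a terminal. A small sketch, assuming the grammar is encoded as a Python dict:

```python
def symbols(productions):
    """Non-terminals are the production heads; every other symbol
    appearing in a body is a terminal."""
    nonterminals = set(productions)
    terminals = {sym
                 for bodies in productions.values()
                 for body in bodies
                 for sym in body
                 if sym not in nonterminals}
    return nonterminals, terminals

grammar = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}
nts, ts = symbols(grammar)
print(sorted(nts))   # ['E', 'F', 'T']
print(sorted(ts))    # ['(', ')', '*', '+', '-', '/', 'id']
```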
Derivations
A derivation step replaces a non-terminal by the body of one of its productions. For example,
with the grammar E → E + T | T, T → T * F | F, F → ( E ) | id:
Example: E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + F ⇒ id + id is a leftmost derivation
of id + id.
Top-Down Parsing
Top down parsing can be viewed as the problem of constructing a parse tree for the input
string, starting from the root and creating the nodes of the parse tree in preorder.
Recursive-Descent Parsing
• Backtracking is needed (If a choice of a production rule does not work, we backtrack to try
other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient.
• It tries to find the left-most derivation.
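A standard textbook illustration of backtracking uses the grammar S → cAd, A → ab | a on the input cad: the parser first tries A → ab, fails to match d, backs up, and succeeds with A → a. A minimal sketch, where each routine returns the new input position on success or None on failure:

```python
# Backtracking recursive descent for  S -> c A d,  A -> a b | a.
# Each routine returns the input position after a successful match,
# or None on failure, so the caller can back up and try another alternative.

def S(s, i):
    if i < len(s) and s[i] == 'c':
        j = A(s, i + 1)
        if j is not None and j < len(s) and s[j] == 'd':
            return j + 1
    return None

def A(s, i):
    if s[i:i + 2] == 'ab':    # first alternative: A -> a b
        return i + 2
    if s[i:i + 1] == 'a':     # backtrack, second alternative: A -> a
        return i + 1
    return None

print(S('cad', 0) == len('cad'))   # True
```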
Predictive Parsing
• No backtracking
• Efficient
• Needs a special form of grammars (LL (1) grammars).
•Recursive Predictive Parsing is a special form of Recursive Descent parsing without
backtracking. Non-Recursive (Table Driven) Predictive Parser is also known as LL (1)
parser.
• When rewriting a non-terminal in a derivation step, a predictive parser can uniquely choose
a production rule by just looking at the current symbol in the input string.
• When we are trying to rewrite the non-terminal stmt, we can uniquely choose the production
rule by just looking at the current token.
• We eliminate the left recursion in the grammar and then left factor it. Even so, the resulting
grammar may not be suitable for predictive parsing (it may not be an LL(1) grammar).
Constructing LL(1) Parsing Tables
• Two functions are used in the construction of LL(1) parsing tables:
– FIRST and FOLLOW
For one sample grammar with start symbol S and non-terminal A, these sets are:
FIRST(S) = {a, c, e, Ɛ}
FIRST(A) = {a, c}
FOLLOW(S) = {$}
FOLLOW(A) = {b, d}
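FIRST and FOLLOW can be computed by a fixed-point iteration over the productions. A minimal sketch in Python, applied here to the non-left-recursive expression grammar E → T E', E' → + T E' | Ɛ, T → F T' , T' → * F T' | Ɛ, F → ( E ) | id (encoded as a dict, with the empty list standing for Ɛ):

```python
# Fixed-point computation of FIRST and FOLLOW for the grammar
#   E -> T E'    E' -> + T E' | epsilon
#   T -> F T'    T' -> * F T' | epsilon
#   F -> ( E ) | id

EPS = 'eps'

grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_string(syms, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in syms:
        f = first[sym] if sym in grammar else {sym}   # a terminal's FIRST is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)    # every symbol can derive epsilon (or the string is empty)
    return out

def first_sets():
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                f = first_of_string(body, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first

def follow_sets(start, first):
    follow = {A: set() for A in grammar}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue
                    tail = first_of_string(body[i + 1:], first)
                    add = tail - {EPS}
                    if EPS in tail:     # everything after sym can vanish
                        add |= follow[A]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow

first = first_sets()
follow = follow_sets("E", first)
print(sorted(first["E"]))    # ['(', 'id']
print(sorted(follow["F"]))   # ['$', ')', '*', '+']
```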
Note: For panic-mode error recovery, refer to example 4.36 in the prescribed textbook, along
with figures 4.22 and 4.23.
Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input starting from leaves towards
the root.
• A bottom-up parser tries to find the right-most derivation of the given input in the reverse
order.
• Bottom-up parsing is also known as shift-reduce parsing because its two main actions are
shift and reduce.
– At each shift action, the current symbol in the input string is pushed onto a stack.
– At each reduce action, the symbols at the top of the stack (this symbol sequence is the right
side of a production) are replaced by the non-terminal on the left side of that production.
– There are also two more actions: accept and error.
Shift-Reduce Parsing
• A shift-reduce parser tries to reduce the given input string into the starting symbol.
• At each reduction step, a substring of the input matching the right side of a production
rule is replaced by the non-terminal on the left side of that production rule.
• If the substring is chosen correctly, the right most derivation of that string is created in the
reverse order.
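The shift-reduce loop can be sketched for the expression grammar E → E + T | T, T → T * F | F, F → ( E ) | id. A real parser chooses between shifting and reducing with an LR parsing table; in this sketch the choice is hard-coded with one token of lookahead (an assumption made to keep the example short, not the table-driven method itself):

```python
# Miniature shift-reduce parser for E -> E + T | T, T -> T * F | F,
# F -> ( E ) | id. Reductions to E are delayed while '*' is the next
# token, so the handle T * F is reduced first (precedence of * over +).

RULES = [
    (["E", "+", "T"], "E"),
    (["T", "*", "F"], "T"),
    (["(", "E", ")"], "F"),
    (["id"], "F"),
    (["F"], "T"),
    (["T"], "E"),
]

def parse(tokens):
    stack, rest = [], tokens + ["$"]
    while True:
        for body, head in RULES:
            if stack[len(stack) - len(body):] == body:
                if head == "E" and rest[0] == "*":
                    continue            # delay: let T * F become the handle
                del stack[len(stack) - len(body):]
                stack.append(head)
                print("reduce ->", stack, rest)
                break
        else:                            # no reduction applies
            if rest[0] == "$":
                return stack == ["E"]    # accept iff reduced to start symbol
            stack.append(rest.pop(0))
            print("shift  ->", stack, rest)

print(parse(["id", "+", "id", "*", "id"]))   # True
```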
Handle
• Informally, a handle of a string is a substring that matches the right side of a production
rule.
– But not every substring that matches the right side of a production rule is a handle.
• If the grammar is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
• We will see that in a right-sentential form, the string to the right of the handle contains
only terminal symbols.
A Shift-Reduce Parser