On The Translation of Languages From Left To Right
is not of bounded right context, since the handle in both $ac^nd$ and $bc^nd$ is "d"; yet this grammar is certainly LR(0). A more interesting example is
Here the terminal strings are $\{a^nbc^n\}$, and the b must be reduced to S or A according as n is even or odd. This is another LR(0) grammar which fails to be of bounded right context.
In Section III we will give further examples and will discuss the
relevance of these concepts to the grammar for ALGOL 60. Section IV
contains a proof that the existence of k, such that a given grammar is
LR(k), is recursively undecidable.
Ginsburg and Greibach (1965) have defined the notion of a deterministic
language; we show in Section V that these are precisely the
languages for which there exists an LR(k) grammar, and thereby we
obtain a number of interesting consequences.
II. ANALYSIS OF LR(k) GRAMMARS
Given a grammar $\mathcal{G}$ and an integer $k \ge 0$, we will now give two ways
to test whether $\mathcal{G}$ is LR(k) or not. We may assume as usual that $\mathcal{G}$ does
not contain useless productions, i.e., for any A in I there are terminal
strings $\alpha, \beta, \gamma$ such that $S \Rightarrow \alpha A\gamma \Rightarrow \alpha\beta\gamma$.
The first method of testing is to construct another grammar $\mathcal{F}$ which
reflects all possible configurations of a handle and k characters to its
right. The intermediate symbols of $\mathcal{F}$ will be $[A; \alpha]$, where $\alpha$ is a k-letter
string on $T \cup \{\dashv\}$; and also $[p]$, where p is the number of a production in
$\mathcal{G}$. The terminal symbols of $\mathcal{F}$ will be $I \cup T \cup \{\dashv\}$.
For convenience we define $H_k(\sigma)$ to be the set of all k-letter strings $\beta$
over $T \cup \{\dashv\}$ such that $\sigma \Rightarrow \beta\gamma$ with respect to $\mathcal{G}$ for some $\gamma$; this is
the set of all possible initial strings of length k derivable from $\sigma$.
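The set of k-letter initial strings can be computed by a fixed-point iteration. The following is only a sketch under stated assumptions: symbols are Python strings, a string of symbols is a tuple, the end marker is written "-|", and the helper names (`concat_k`, `prefix_sets`, `H`) as well as the toy grammar S -> a S c | b are hypothetical illustrations, not part of the paper.

```python
END = "-|"  # assumed stand-in for the right-end marker

def concat_k(xs, ys, k):
    """All concatenations x + y, truncated to at most k symbols."""
    return {(x + y)[:k] for x in xs for y in ys}

def prefix_sets(productions, terminals, k):
    """For each intermediate, the terminal prefixes (length <= k) it derives."""
    first = {lhs: set() for lhs, _ in productions}
    changed = True
    while changed:  # iterate until no set grows any further
        changed = False
        for lhs, rhs in productions:
            acc = {()}
            for sym in rhs:
                step = {(sym,)} if sym in terminals else first.get(sym, set())
                acc = concat_k(acc, step, k)
            if not acc <= first[lhs]:
                first[lhs] |= acc
                changed = True
    return first

def H(sigma, productions, terminals, k):
    """k-letter strings that can begin a terminal string derived from sigma."""
    first = prefix_sets(productions, terminals, k)
    acc = {()}
    for sym in sigma:
        step = {(sym,)} if sym in terminals else first.get(sym, set())
        acc = concat_k(acc, step, k)
    return {s for s in acc if len(s) == k}  # keep only full-length prefixes

# Hypothetical toy grammar: S -> a S c | b
PRODS = [("S", ("a", "S", "c")), ("S", ("b",))]
TERMS = {"a", "b", "c", END}
```

For this toy grammar, `H(("S", END, END), PRODS, TERMS, 2)` yields the three pairs (b, -|), (a, b), and (a, a). Strings shorter than k are dropped, which is why the argument is normally padded with end markers.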
Let the pth production of $\mathcal{G}$ be
$[S; \dashv^k] \Rightarrow \theta[p]$ and $[S; \dashv^k] \Rightarrow \theta\varphi[q]$ implies $\varphi = \epsilon$ and $p = q$. (15)
But $\mathcal{F}$ is a regular grammar, and well-known methods exist for testing
Condition (15) in regular grammars. (Basically one first transforms $\mathcal{F}$
so that all of its productions have the form $Q_i \to aQ_j$, and then if $Q_0 =
[S; \dashv^k]$, one can systematically prepare a list of all pairs (i, j) such that
there exists a string $\alpha$ for which $Q_0 \Rightarrow \alpha Q_i$ and $Q_0 \Rightarrow \alpha Q_j$.)
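The list of pairs can be prepared by a breadth-first search over pairs of intermediates. This is a minimal sketch, assuming the transformed grammar is given as a dictionary from each $Q_i$ to its list of (terminal, $Q_j$) productions; the function name `reachable_pairs` and the sample data are hypothetical, and the final step of reading Condition (15) off the list is not shown.

```python
from collections import deque

def reachable_pairs(transitions, start):
    """Pairs (Qi, Qj) such that one string leads from start to both Qi and Qj."""
    seen = {(start, start)}
    queue = deque(seen)
    while queue:
        i, j = queue.popleft()
        for a, i2 in transitions.get(i, ()):
            for b, j2 in transitions.get(j, ()):
                # extend both derivations by the same terminal character
                if a == b and (i2, j2) not in seen:
                    seen.add((i2, j2))
                    queue.append((i2, j2))
    return seen

# Hypothetical productions Q0 -> aQ1, Q0 -> aQ2, Q0 -> bQ3, Q1 -> cQ3, Q2 -> cQ0
SAMPLE = {
    "Q0": [("a", "Q1"), ("a", "Q2"), ("b", "Q3")],
    "Q1": [("c", "Q3")],
    "Q2": [("c", "Q0")],
}
PAIRS = reachable_pairs(SAMPLE, "Q0")
```

The search terminates because there are only finitely many pairs, each of which enters the queue at most once.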
When k = 2, the grammar $\mathcal{F}$ corresponding to (2) is
(We thus have added to $\mathcal{S}'$ all productions we might begin to work on,
in addition to those we are already working on.)
Compute $\mathcal{S}'$ by Eq. (18) and then compute the new set $\mathcal{S}_{n+1}$ as follows:
properly take care of the most general case, this method is necessarily
complicated, for all of the relevant information must be saved. The
structure of this general method should shed some light on the important
special cases which arise when the LR(k) grammar is of a simpler
type.
We will not give a formal proof that this parsing method works, since
the reader may easily verify that each step preserves the assertions we
made about the state sets and the stack. The construction of all possible
state sets that can arise will terminate since there are finitely many of
these. The grammar will be LR(k) unless the Z sets of Eqs. (19)-(20)
are not disjoint for some possible state set. The parsing method just
described will terminate since any string in the language has a finite
derivation, and each execution of Step 2 either finds a step in the derivation
or reduces the length of the string not yet examined.
III. EXAMPLES
Now let us give three examples of applications to some nontrivial
languages. Consider first the grammar
In Table I, the symbol $21\dashv$ stands for the state $[2, 1; \dashv]$, and $41ab$
stands for two states $[4, 1; a]$ and $[4, 1; b]$. "Shift" means "perform the
shift left operation" mentioned in step 2; "reduce p" means "perform
the transformation (21) with production p." The first lines of Table I
are formed as follows: Given the initial state $\mathcal{S} = \{00\dashv\}$, we must form
$\mathcal{S}'$ according to Eq. (18). Since $X_{01} = B$ and $X_{02} = \dashv$ we must include
$10\dashv$ and $20\dashv$ in $\mathcal{S}'$. Since $X_{21} = L$ and $X_{22} = R$ we must include $30ab$,
$40ab$ in $\mathcal{S}'$ (a and b being the possible initial characters of $R\dashv$). Since
$X_{41} = L$ and $X_{42} = N$ we must, similarly, include $30ab$ and $40ab$ in $\mathcal{S}'$;
but these have already been included, and so $\mathcal{S}'$ is completely determined.
Now $Z = \{a\}$ in this case, so the only possibility in step 2 is to
have $Y_1 = a$ and shift. Step 3 is more interesting; if we ever get to
Step 3 with $\mathcal{S}_n = \mathcal{S}$ (this includes later events when a reduction (21) has
been performed) there are three possibilities for $X_{n+1}$. These are determined
by the seven states in $\mathcal{S}'$, and the righthand column is merely
an application of Eq. (23).

TABLE I
PARSING METHOD FOR GRAMMAR (26)

State set      Lookahead    Action
$01\dashv$     $\dashv$     stop
$22\dashv$     $\dashv$     reduce 2
$43ab$         a, b         reduce 4
$63\dashv$     $\dashv$     reduce 6
$84ab$         a, b         reduce 8
An important shortcut has been taken in Table I. Although it is
possible to go into the state set "$51\dashv\ 71b$", we have no entry for that
set; this happens because $51\dashv\ 71b$ is contained in $51\dashv\ 71ab$. A procedure
for a given state set must be valid for any of its subsets. (This implies less
error detection in Step 2, but we will soon justify that.) It is often
possible to take the union of several state sets for which the parsing
action does not conflict, thereby considerably shortening the parsing
algorithm generated by the construction of Section II.
When only one possibility occurs in Step 2 there is no need to test
the validity of $Y_1 \cdots Y_k$; for example in Table I line 1 there is no need
to make sure $Y_1 = a$. One need do no error detection until an attempt
to shift $Y_1 = \dashv$ left of the vertical line occurs. At this point the stack
will contain "$\mathcal{S}_0 S \mathcal{S}_1 \mid \dashv^k$" if and only if the input string was
well-formed; for we know a well-formed string will be parsed, and (by
definition!) a malformed string cannot possibly be reduced to "$S\dashv^k$" by
applying the productions in reverse. Thus, any or all error detection
may be saved until the end. (When k = 0, $\dashv$ must be appended at the
right in order to do this delayed error check.)
One could hardly write a paper about parsing without considering the
traditional example of arithmetic expressions. The following grammar is
typical:
This grammar has the terminal alphabet $\{a, -, *, (, ), \dashv\}$; for example,
the string "$a - (-a * a - a) \dashv$" belongs to the language. Table II shows
how our construction would produce a parsing method. In line 10, the
notation "4, 5, 6" appearing in the X column means rules 4, 5, and 6
apply to this state set also. Such "factoring" of rules is another way to
simplify the parsing routine produced by our construction, and the
reader will undoubtedly see other ways to simplify Table II.
By means of our construction it is possible to determine exactly what
information about the string being parsed is known at any given time.
Because of this detailed knowledge, it will be possible to study how much
of the information is not really essential (i.e., how much is redundant)
and thereby determine the "best possible" parsing method for a grammar,
in some sense. The two simplifications already mentioned (delayed
error checking, taking unions of compatible state sets) are simplifications
of this kind, and more study is needed to analyze this problem further.
In many cases it will not be necessary to store the state sets $\mathcal{S}_n$ in the
stack, since the states $\mathcal{S}_r$ which are used in the latter part of Step 2 can
often be determined by examining a few of the X's at the top of the
stack. Indeed, this will always be true if we have a bounded right context
grammar, as defined in Section I. Both grammars (26) and (27)
are of bounded context.
From Table I we can see how to recover the necessary state set information
without storing it in the stack. We need only consider those
state sets which have at least one intermediate character in the "$X_{n+1}$"
column, for otherwise the state set is never used by the parser. Then it is
immediately clear from Table I that $\{00\dashv\}$ is always at the bottom of
the stack, $\{21\dashv, 41ab\}$ is always to the right of L, $\{61\dashv, 81ab\}$ is always
to the right of b, and $\{62\dashv, 82ab\}$ is always to the right of N.
Grammar (27) is related to the definition of arithmetic expressions in
the ALGOL 60 language, and it is natural to ask whether ALGOL 60 is
an LR(k) language. The answer is a little difficult because the definition
of this language (see Naur (1963)) is not done completely in terms of
productions; there are "comment conventions" and occasional informal
explanations. The grammar cannot be LR(k) because it has a number
of syntactic ambiguities; for example, we have the production
<open string> → <open string> <open string>
which is always ambiguous. Another type of ambiguity arises in the
parsing of <identifier> as <actual parameter>. There are eight ways to do
this:
<actual parameter> → <array identifier> → <identifier>
<actual parameter> → <switch identifier> → <identifier>
<actual parameter> → <procedure identifier> → <identifier>
<actual parameter> → <expression> → <designational expression> → <identifier>
<actual parameter> → <expression> → <Boolean expression> → <variable> → <identifier>
<actual parameter> → <expression> → <Boolean expression> → <function designator> → <identifier>
<actual parameter> → <expression> → <arithmetic expression> → <variable> → <identifier>
<actual parameter> → <expression> → <arithmetic expression> → <function designator> → <identifier>

TABLE II
PARSING METHOD FOR GRAMMAR (27)

These syntactic ambiguities reflect bona fide semantic ambiguities,
if the identifier in question is a formal parameter to a procedure, for it is
then impossible to determine what sort of identifier will be the actual
argument in the absence of specifications. At the time the ALGOL 60
report was written, of course, the whole question of syntactic ambiguity
was just emerging, and the authors of that document naturally made
little attempt to avoid such ambiguities. In fact, the differentiation
between array identifiers, switch identifiers, etc. in this example was done
intentionally, to provide explanation along with the syntax (referring
to identifiers which have been declared in a certain way). In view of this,
a ninth alternative
<actual parameter> → <string> → <formal parameter> → <identifier>
might also have been included in the ALGOL 60 syntax (since section
4.7.5.1 specifically allows formal parameters whose actual parameter is a
string to be used as actual parameters, and this event is not reflected in
any of the eight possibilities above). The omission of this ninth alternative
is significant, since it indicates the philosophy of the ALGOL 60 re-
the Tag problem (see Cocke and Minsky (1964)) but no apparent simple
connection. We can, however, prove that the partial correspondence
problem is recursively unsolvable, using methods analogous to those
devised by Floyd (1964b) for dealing with the ordinary correspondence
problem and using the determinacy of Turing machines.
For this purpose, let us use the definition and notation for Turing
machines as given in Post (1947); we will construct a partial correspondence
problem for any Turing machine and any initial configuration. The
characters used in our partial correspondence problem will be
$q_i,\ \bar q_i,\ S_j,\ \bar S_j,\ h,\ \bar h, \quad 1 \le i \le R,\ 0 \le j \le m.$
If the initial configuration is
$S_{j_1} S_{j_2} \cdots S_{j_{k-1}}\, q_{i_1}\, S_{j_k} \cdots S_{j_n}$
the pair of strings
$(\bar h,\ \bar h\, h\, S_{j_1} \cdots S_{j_{k-1}}\, q_{i_1}\, S_{j_k} \cdots S_{j_n}\, h)$ (28)
will enter into our partial correspondence problem. We also add the
pairs
$(\bar h, h),\ (h, \bar h),\ (S_j, \bar S_j),\ (\bar S_j, S_j),\ (\bar q_i, q_i), \quad 1 \le i \le R,\ 0 \le j \le m.$ (29)
Finally, we give pairs determined by the quadruples of the Turing
machine:
Form of quadruple     Corresponding pairs, $0 \le t \le m$:
$q_i S_j L q_l$       $(h q_i S_j,\ \bar h \bar q_l \bar S_0 \bar S_j)$, $(S_t q_i S_j,\ \bar q_l \bar S_t \bar S_j)$
$q_i S_j R q_l$       $(q_i S_j h,\ \bar S_j \bar q_l \bar S_0 \bar h)$, $(q_i S_j S_t,\ \bar S_j \bar q_l \bar S_t)$     (30)
$q_i S_j S_k q_l$     $(q_i S_j,\ \bar q_l \bar S_k)$
Now it is easy to see that these corresponding pairs will simulate the
behavior of the Turing machine. Since the pair (28) is the only pair
having the same initial character, and since the pairs in (30) are the
only ones involving any $q_i$ in the lefthand string, the only possible
strings which can be initial substrings of both $\alpha_{i_1}\alpha_{i_2}\cdots$ and
$\beta_{i_1}\beta_{i_2}\cdots$ are initial substrings of
$\bar h\, \alpha_0 \bar\alpha_1 \alpha_1 \bar\alpha_2 \alpha_2 \cdots$, (31)
where $\alpha_0, \alpha_1, \alpha_2$, etc. represent the successive stages of the Turing
machine's tape (with h's placed at either end, and where $\bar\alpha$ is an obvious
notation signifying the "barring" of each letter of $\alpha$). For these pairs,
therefore, the partial correspondence problem has an affirmative answer if
and only if the Turing machine never halts. And the problem of telling if a
Turing machine will ever halt is, of course, well known to be recursively
unsolvable.
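Although the problem as a whole is unsolvable, agreement to any one fixed depth p can be checked by exhaustive search. The sketch below assumes the reading used later in this section: for a given p, an index sequence "agrees" when the concatenated $\alpha$'s and the concatenated $\beta$'s each have at least p characters and share their first p characters. The function name `agree_to_depth`, the 0-based indices, and the sample pairs are hypothetical.

```python
def agree_to_depth(pairs, p):
    """Return indices whose alpha- and beta-concatenations share p leading
    characters, or None if no index sequence does."""
    seen = {("", "")}
    frontier = [((), "", "")]
    while frontier:
        nxt = []
        for seq, a, b in frontier:
            for i, (ai, bi) in enumerate(pairs):
                # only the first p characters of each side matter
                a2, b2 = (a + ai)[:p], (b + bi)[:p]
                n = min(len(a2), len(b2))
                if a2[:n] != b2[:n] or (a2, b2) in seen:
                    continue  # mismatch, or this situation was already explored
                if n == p:
                    return seq + (i,)
                seen.add((a2, b2))
                nxt.append((seq + (i,), a2, b2))
        frontier = nxt
    return None

SAMPLE_PAIRS = [("ab", "a"), ("b", "ba")]  # hypothetical (alpha_i, beta_i)
```

For these sample pairs the search finds agreement to depth 2 but returns `None` at depth 3, so this particular instance has a negative answer. The search always stops because only finitely many truncated string pairs exist; what cannot be decided in general is whether agreement holds for every p.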
We will apply this result to LR(k) grammars as follows:
THEOREM. The problem of deciding, for a given grammar $\mathcal{G}$, whether or
not there exists a $k \ge 0$ such that $\mathcal{G}$ is LR(k), is recursively unsolvable.
This theorem is in contrast to the results of Section II, where we
showed the problem to be solvable when k is also given. To prove this
theorem we will reduce the partial correspondence problem to the LR(k)
problem for a particular class of grammars.
Let $(\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n)$ be pairs of strings entering into the partial
correspondence problem, and let
$X_1\ X_2\ \cdots\ X_n\ +$
be n + 1 characters distinct from those appearing among the $\alpha$'s and
$\beta$'s. Let $\mathcal{G}$ be the following grammar:
$S \to A, \quad S \to B, \quad A \to X_i + \alpha_i, \quad B \to X_i + \beta_i,$
$A \to X_i A \alpha_i, \quad B \to X_i B \beta_i, \quad 1 \le i \le n.$ (32)
Its sentences are the strings
$\{X_{i_m} \cdots X_{i_1} + \alpha_{i_1} \cdots \alpha_{i_m}\} \cup \{X_{i_m} \cdots X_{i_1} + \beta_{i_1} \cdots \beta_{i_m}\}$.
We will show $\mathcal{G}$ is LR(k) for some k if and only if the partial
correspondence problem has a negative answer. If the answer is affirmative,
for every p we have sentential forms $X_{i_m} \cdots X_{i_1} + \alpha_{i_1} \cdots \alpha_{i_m}$,
$X_{i_m} \cdots X_{i_1} + \beta_{i_1} \cdots \beta_{i_m}$ in which the first p characters following "+" agree.
The handle must include the "+" sign, but the p - q characters following
the handle do not tell us whether the production $A \to X_{i_1} + \alpha_{i_1}$ or
$B \to X_{i_1} + \beta_{i_1}$ is to be applied, if q is the maximum length of the
strings $\alpha_i, \beta_i$. Hence the grammar is not LR(p - q), and since p is
arbitrary it is not LR(k) for any k. On the other hand, if
the answer to the partial correspondence problem is negative, there is
a p for which, knowing $(i_1, \ldots, i_m)$ and the first p characters
of $\alpha_{i_1}\alpha_{i_2} \cdots \alpha_{i_m}\dashv^{\infty}$ or $\beta_{i_1}\beta_{i_2} \cdots \beta_{i_m}\dashv^{\infty}$, we can distinguish whether it
is a string of $\alpha$'s or a string of $\beta$'s, and therefore $\mathcal{G}$ is in fact a bounded
context grammar.
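The grammar (32) is mechanical to write down from the pairs. The sketch below builds its productions and the two sentences belonging to one index sequence. It assumes each character of an $\alpha_i$ or $\beta_i$ is a single Python character distinct from "+", "S", "A", "B", and that the marker $X_i$ is spelled "Xi"; the function names and sample pairs are hypothetical.

```python
def build_grammar(pairs):
    """Productions of grammar (32) as (left side, right-side tuple)."""
    prods = [("S", ("A",)), ("S", ("B",))]
    for i, (alpha, beta) in enumerate(pairs, start=1):
        x = f"X{i}"
        prods += [
            ("A", (x, "+", *alpha)),  # A -> X_i + alpha_i
            ("B", (x, "+", *beta)),   # B -> X_i + beta_i
            ("A", (x, "A", *alpha)),  # A -> X_i A alpha_i
            ("B", (x, "B", *beta)),   # B -> X_i B beta_i
        ]
    return prods

def sentence(pairs, indices, side):
    """X_im ... X_i1 + s_i1 ... s_im, with s = alpha (side 0) or beta (side 1).

    indices are 1-based, listed as i1, ..., im.
    """
    markers = [f"X{i}" for i in reversed(indices)]
    body = "".join(pairs[i - 1][side] for i in indices)
    return markers + ["+"] + list(body)

SAMPLE_PAIRS = [("ab", "a"), ("b", "ba")]  # hypothetical (alpha_i, beta_i)
GRAMMAR = build_grammar(SAMPLE_PAIRS)
```

With these sample pairs and indices (1, 2), the two sentences are X2 X1 + a b b and X2 X1 + a b a: the same markers precede "+", and the first two characters after "+" agree, so a parser needs more lookahead than that to choose between the A and B productions.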
type (ii) $Aq_i \to q_j$ if $q_i$ is not final, $Aq_i \to \bar q_j$ if $q_i$ is final, $A\bar q_i \to \bar q_j$.
type (iii) $Aq_i \to ABq_j$ if $q_i$ is not final, $Aq_i \to AB\bar q_j$ if $q_i$ is final, $A\bar q_i \to AB\bar q_j$.
One easily verifies that (39) cannot occur, and the same set of strings
is accepted; basically we get into a state $\bar q_j$ if the current string has been
accepted, and then we do not accept the string again, but return to an
unbarred state when the next rule of type (i) is used.
Once the DPDA has been modified to meet these assumptions, let it
have the states $q_0, \ldots, q_r$; we are ready to construct a grammar for
the language it accepts. We begin by defining the languages $L_{iAt}$ for
$0 \le i, t \le r$ and for all intermediates A of the DPDA:
$L_{iAt} = \{\alpha \mid Aq_i\alpha \to^{*} Aq \to q_t \text{ for some } q\}$ (40)
where no step in the derivation represented by "$\to^{*}$" affects the A appearing
at the left.
Construct the following productions for all rules (38) of the DPDA:
Rule     Productions for $\mathcal{G}$
Now remove all useless productions from $\mathcal{G}$, i.e., those which can never
appear in a derivation of a terminal string starting from $L_{0\vdash}$. We claim
the resulting grammar $\mathcal{G}$ is LR(1). This result could be proved using
either of the constructions in Section II, where the state sets have a
rather simple form, but for purposes of exposition we will give here a
more intuitive explanation which shows the connection between the
operation of the DPDA and the parsing process.
Consider any string $\alpha\dashv$ where $\alpha$ is accepted by the DPDA, and
consider the step-by-step behavior of the DPDA as it processes $\alpha$. At
the same time we will be building a partial derivation tree which reflects
all of the information known at a given stage of the parse. The nodes of
this partial tree will contain symbols [i, A, *] which means that in the
only possible parsing of the string the intermediate $L_{iAt}$, for some t =
0, 1, ..., r or t "blank", must fill that position. We will be "at" some
node [i, A, *] of the tree, meaning this particular node below the handle
is of interest, and at the same time the DPDA will contain the
configuration $\cdots Aq_i \cdots$.
All of this can be clarified by considering an example, so we will
consider the following "random" DPDA:
Rules of DPDA     Productions of $\mathcal{G}$ (useless ones deleted)
              .
              .
              .
       c   [4, A, *]
        \   /
      [2, A, *]
            \
       a   [1, A, *]
        \   /
      [2, A, *]                    (45)
            \
       a   [1, A, *]
        \   /
      [2, A, *]
            \
       a   [1, $\vdash$, *]
        \   /
      [0, $\vdash$, *]
We are now "at" node [4, A, *], signified by the three dots above it. At
this point the DPDA uses the rule $Aq_4 \to q_6$ and we transform the top
of tree (45) to
(Thus, two handles are recognized and then removed from the tree.)
Then the DPDA uses the rule $Aq_6 \to q_2$ and (46) becomes
[tree diagram (47)]
by reducing three more handles. When the rule $Aq_2b \to Aq_3$ is next ap-
              b   [3, A, *]
               \   /
   $L_{2A2}$   [2, A, *]
          \     /
         [1, A, *]
     a     /
      \   /
     [2, A, *]                     (48)
          \
         [1, $\vdash$, *]
     a     /
      \   /
     [0, $\vdash$, *]
Now $q_3$ is a final state and the next character is "$\dashv$", so we complete
the parsing; (48) becomes
              b   $L_{3A}$
               \   /
   $L_{2A2}$   $L_{2A}$
          \     /
         $L_{1A}$
     a     /
      \   /
     $L_{2A}$                      (49)
          \
     a   $L_{1\vdash}$
      \   /
     $L_{0\vdash}$
Having worked the example, we can consider the general case. Suppose
the DPDA is in the configuration $\cdots CAq_ia \cdots$, and suppose we are
at node [i, A, *] of the tree. If $q_i$ is a final state and a = "$\dashv$", by
condition (39) we must now complete the parsing, so we proceed to replace
each [i, A, *] in the tree by $L_{iA}$ until the root is reached (as in going from
(48) to (49)). If $q_i$ is not final or a $\ne$ "$\dashv$", there are three cases
depending on the pair $Aq_i$:
Case (i). The DPDA contains a rule of the form $Aq_ia \to Aq_j$. Then
the only possible parse must occur by changing
from                    to      a   [j, A, *]
        [i, A, *]                \   /
            \                  [i, A, *]
                                    \
[Tree diagrams for Cases (ii) and (iii)]
as we did while building tree (45).
Cases (i), (ii), (iii) are mutually exclusive by the definition of DPDA,
and the arguments are justified by the fact that our tree represents all
possible productions of the grammar that could conceivably work.
Notice that in the parsing we actually have almost an LR(0) grammar
since it was necessary to look at the character following the handle only
when $q_i$ was a final state, to see if the next character is "$\dashv$" or not.
As a consequence of our two theorems, we find a language can be
generated by an LR(k) grammar if and only if it is deterministic, if and
only if it can be generated by an LR(1) grammar.
The theorem cannot be improved to "LR(0) grammar", since ob-
REFERENCES
COCKE, J., AND MINSKY, M. (1964), Universality of Tag systems with P = 2.
J. Assoc. Comput. Mach. 11, 15-20.
EARLEY, J. (1964), "Generating Productions from BNF" (preliminary report).
Carnegie Institute of Technology.
EICKEL, J. (1964), Generation of parsing algorithms for Chomsky type 2 languages.
Tech. Hoch. München, Ber. #6401.
FLOYD, R. W. (1963), Syntactic analysis and operator precedence. J. Assoc.
Comput. Mach. 10, 316-333.
FLOYD, R. W. (1964a), Bounded context syntactic analysis. Commun. Assoc.
Comput. Mach. 7, 62-66.
FLOYD, R. W. (1964b), "New Proofs of Old Theorems in Logic and Formal
Linguistics." Computer Associates, Inc., Wakefield, Massachusetts.
GINSBURG, S., AND GREIBACH, S. (1965), "Deterministic Context-Free Languages"
(preliminary report). Am. Math. Soc. Not. 12, 246, 367.
IRONS, E. T. (1964), "Structural connections" in formal languages. Commun.
Assoc. Comput. Mach. 7, 67-71.
LYNCH, W. C. (1963), "Ambiguities in BNF Languages." Thesis, Univ. of
Wisconsin.
NAUR, P., ed. (1963), Revised Algol 60 report. Commun. Assoc. Comput. Mach. 6,
1-17.
PAUL, M. (1962), A general processor for certain formal languages. Proc. Symp.
Symbolic Languages in Data Processing, Rome, 1962. Gordon and Breach,
New York.
POST, E. L. (1947), Recursive unsolvability of a problem of Thue. J. Symbolic
Logic 12, 1-11.