Dissertation
Incremental LR(1) Parser Construction as an Aid to
--------------------------------------------------
Syntactical Extensibility
-------------------------
DISSERTATION
zur Erlangung des Grades des
Doktors der Naturwissenschaften
der Universität Dortmund
an der Abteilung Informatik
von
Georg Fischer
Dortmund
1980
Table of Contents
-----------------
Acknowledgements 3
Symbols and Abbreviations 4
1 Introduction 5
1.1 Survey of Previous Work 9
1.2 EXTRA - an EXtensible TRAnslator 12
1.2.1 EXTOL - The Extensible, Translation Oriented Language 15
1.2.2 Other Features of EXTRA 21
1.2.3 Example: Extension of the Metagrammar 24
2 Definitions and Notations 25
2.1 The Set Oriented Algorithmic Language SOL 25
2.2 Context-Free Grammars 28
2.3 LR(k) Grammars 30
2.4 Construction of LR(k) Parsers 32
2.5 Data Structures and Mappings 36
3 Algorithms for NQLALR(1) 42
3.1 Initialization and the Parsing Algorithm 44
3.2 The NQLALR(1) Method for the Computation of Lookaheads 47
3.3 Modification of the Grammar 54
3.3.1 Deletion of a Set of Productions 55
3.3.2 Insertion of a Production 57
3.3.3 Reduced Grammars 58
3.4 Modification of the Set of States 60
3.4.1 Compatibility of States 60
3.4.2 Computation of New Successors 63
3.4.3 Garbage Collection for States 65
3.5 Transmission of Lookaheads 67
4 Non-NQLALR(1) Grammars 69
4.1 Causes for Conflicts 70
4.2 Foundations of the LR(1) Construction 72
4.2.1 A Non-LALR(1) Grammar 73
4.2.2 Overloaded Conflicts 74
4.2.3 Lane Tracing 76
4.2.4 Strong Compatibility 77
4.3 Algorithms for LR(1) 79
4.3.1 The Graph of Lanes 80
4.3.2 Processing of Conflicts 84
4.3.3 Splitting of States 85
5 Implementation 93
6 Discussion 96
List of Algorithms 98
References 99
Zusammenfassung
---------------
Es werden Algorithmen zur Konstruktion und insbesondere zur
Veränderung von Syntaxanalysatoren für Programmiersprachen
beschrieben. Aus einer Grammatik wird die Tabelle für einen
LR(1)-Analysator - über die Zwischenstufen LR(0), NQLALR(1),
LALR(1) und allgemeines LR(1) - generiert. Die mächtigeren
(und teureren) Verfahren werden nur bei Bedarf eingesetzt.
Die "inkrementelle" Konstruktion geht aus (1) von einer Grammatik
und der zugehörigen Tabelle und (2) von einer Menge von Produk-
tionen, die (2.1) zu der Grammatik hinzugefügt oder (2.2) aus
ihr gestrichen werden sollen. Die existierende Tabelle wird dann
so modifiziert, daß sie der neuen Tabelle wieder entspricht.
Selbst wenn im Extremfall jede Produktion einzeln eingefügt
wird, ergaben Messungen an Grammatiken für übliche Programmier-
sprachen nur einen Rechenzeitfaktor zwischen 4 und 13 gegenüber
dem normalen "en-bloc"-Verfahren. Bei häufigen Änderungen
an der Grammatik dürfte deshalb das inkrementelle Verfahren
den Aufwand für die Parser-Konstruktion deutlich verringern.
Die Algorithmen sind ein wesentlicher Bestandteil des Systems
EXTRA, das die interaktive Beschreibung, Ausführung und
Abänderung von Übersetzern ermöglicht. Mit EXTRA können
zum Beispiel sehr einfach Vorübersetzer hergestellt werden,
die syntaktische Erweiterungen in Programmiersprachen verarbeiten.
Abstract
--------
Algorithms for the construction and especially for the modification
of parsers for programming languages are described. The table for
an LR(1) parser is generated from a grammar by the intermediate
steps LR(0), NQLALR(1), LALR(1) and general LR(1). The more powerful
(and more expensive) methods are used only when necessary.
The "incremental" construction starts with (1) a grammar and the
corresponding table and (2) with a set of productions which are
(2.1) to be added to or (2.2) to be deleted from the grammar. The
existing table then is modified such that it corresponds again to
the new grammar. Even if at the extreme the productions are added
one by one, measurements with grammars for usual programming
languages show only a factor in computing time between 4 and 13,
compared with the normal "en-bloc"-approach. For frequent changes
of the grammar of a programming language, the incremental method
therefore should significantly reduce the effort for the parser
construction.
The algorithms are a main component of the system EXTRA which
allows the interactive description, execution and modification
of translators. With EXTRA it is very easy, for example, to
build preprocessors which handle syntactical extensions in
programming languages.
3
Acknowledgements
----------------
This work would have been impossible without the aid of many
people.
Though my interest in compiler construction reaches back to my
studies until 1975 in München, they were really brought to light
by the project group "Compiler Construction Systems" in 1977/78
at the University of Dortmund. The work and discussions of the
ten students in that group strongly influenced the ideas and
design of the EXTRA system and its parser generator. I especially
thank Helmut Ludwigs, Udo Lipeck and Klaus Gröning for their
interest and their contributions.
Two members of the group, Manfred Schönhold and Horst Rüthing,
designed and implemented the transformational features of the
EXTRA system. Without their work, the system would not be
operational, and I thank them for their engagement.
My collegues at the Lehrstuhl 1 often participated at the evolution
of this work, and I thank them for their encouragement and their
critics.
For the typing of the manuscript I used an editor written by
Horst Dahmen. To the members of the project NCAP go my thanks for
the patience they always showed when I occupied one of their
terminals.
Werner Davids read an earlier version of the manuscript, and he
helped much in improving my English, but of course he is not
responsible for all the errors and weak constructions which I
introduced afterwards.
Prof. Dr. Bernd Reusch provided me the opportunity to write
this work. He and especially Dr. Gisbert Dittrich read it very
carefully, and I express my gratitude for their discussions,
their hints and the time they spent.
Finally, I would like to thank my wife, who showed amazing
patience and understanding while I was writing this dissertation .
4
Symbols and Abbreviations
-------------------------
The use of the line printer implies some restrictions on the
notation in this work. Subscripting is not possible; it is
mostly indicated by placing the single subscript to the right
of the subscripted element, or by the enclosure in square
brackets. Exponentiation is also omitted, or it is indicated
by the operator "**".
The following list gives an overview to the symbols used in this
work.
∈ element of
¬∈ not element of
c contained in or equal to
c≠ contained in but unequal to
x cartesian product
2**X powerset of X
Ø empty set
card( ) cardinality, number of elements
e empty word
X* set of all words over the alphabet X (e ∈ X*)
|x| length, number of symbols in word x
HEAD(x,k) first k symbols of x
TAIL(x,k) last k symbols of x
x**k k-fold concatenation of x
|| concatenated wwith
==> if ... then ...
<==> iff (if and only if)
:<==> by definition iff
--> mapping from ... to ...
|-> is mapped to
=> directly derives (in one step)
=>+ derives (in one or more steps)
=>* derives or is
-> shift )
=: reduce ) parser actions
=. accept )
?? error )
:= by definition equal to
& and
| or
¬ not
∀ for all
∃ exists
? for any
< less than
<= less than or equal
= equal
≠ not equal
> greater than
5
1 Introduction
---------------
Hundreds if not thousands of programming languages have been
invented since the beginning of computer programming. Most
of them were, if ever implemented, only used by a few people.
The costs and the mental effort for designing, implementing
and teaching a new language are very high. Therefore it seems
astonishing that there are always people left over which cannot
be satisfied by any of the (say) twenty languages besides COBOL
and FORTRAN which are in wide-spread use today.
One reason for this fact perhaps is the continuous appearance
of new application areas on the one and of new computer
architectures on the other side. It seems quite natural that
programming languages which are the link between both fields
cannot remain unaffected by these changes.
There may be another, perhaps even better explanation: the
psychological one. The ordinary programmer has no means to
develop a new language. The computer scientist, however, because
of his education is not willing to let his ideas be narrowed
by the awkward constructions of the available languages, and he
soon becomes very attracted by the idea to invent (and perhaps
implement) his own language.
Therefore the number of programming languages must be proportional
to the number of research institutions which provide suitable
freedom and financial support.
From this explanation it is obvious that the flood cannot be
dammed up by fixed languages such as PL/1 or ADA, as mighty as
the organization behind the language might be.
Some language designers very early tried to attack the problem
more basically, and they introduced extensibility into programming
languages. By 1970 it seemed that this would be the solution to
the Tower of Babel in computer science (cf. for example Christensen
and Shaw [1969], Schuman [1971]), and proposals for the
extensibility of nearly every aspect in a language were made.
In the meantime this enthusiasm has faded out. Extensibility
introduces serious implementation problems. The extensible
features (modes and operator definition) in ALGOL68, for
example, are one explanation for the delay (if not failure) of
that language.
It might also be argued that extensibility is inherently
dangerous, because the programs written by one person become
unreadable to everybody else. Though this danger cannot be
ignored, it seems not to be a problem of extensible languages
per se.
To some extent the same problem occurs with the heavy use
of macros in assembler language programming, or even with
procedures in high level languages. Both techniques are
recommended today, and suitable organizational actions
were developped to cope with the mentioned problem.
6
It may be the point that the compilers will have no difficulty
to read extended languages as long as they are given the
definitions of the extensions.
In this work another attempt is made to support one special
aspect of extensibility, namely syntactical extensibility.
Given a base language with some context-free grammar, the more
experienced user should be enabled to define his own notation
for statements, declarations, operators and so on. Together
with the new syntax he should describe how to translate the
extensions into constructions of the base language.
The goto-less programming debate shows a good example where
only syntactical extensibility is needed. The "structured"
constructions such as IF-THEN-ELSE, DO-WHILE, CASE etc. only
serve to clarify the program and to decrease the amount of
clerical work. Their semantics may be described by conditional
goto statements not only in theory, but in fact a compiler
translates them in several such gotos. With a syntactical
extension facility in a language it is an easy task to define
missing control flow statements.
Often syntactical extensibility is done with several restrictions,
for example
(1) extensions must be defined before the entire program,
(2) they are processed in a separate ("preprocessor") pass,
(3) the applications of the extended constructions have to
start with a special delimiter or with a unique keyword,
(4) the constructions may not be nested,
(5) extended constructions in the result of the translation
process are not recognized again,
(6) extensions are allowed only for specific constructions,
e.g. statements,
and so on.
Such restrictions of course facilitate the implementation at the
expense of generality and ease of use.
In section 1.2 the translation system EXTRA is described which
dispenses with most of these requirements. The most notable
exceptions are the conditions that all input into the system -
base grammar, extension definitions, and programs to be
translated - can be described by one grammar, and that this
grammar must be LR(1).
LR(1) parsing was chosen because today it seems to be the
most reasonable compromise between efficient implementation
and ease of grammar design. Most unambiguous context-free
are LR(1) without any transformation - a statement which
is not true for other modern parsing methods, LL(1) for
example.
Only a functional description of the full EXTRA system is given
7
in section 1.2. The rest of this work concentrates on one major
component of the system, namely the incremental LR(1) parser
constructor, while the implementation of the translation
mechanisms is described by Rüthing and Schönhold [1980].
Incremental construction means the following:
(1) Given an LR(1) parsing table corresponding to a
context-free grammar, and
(2) given a set of productions (at the extreme a single
production) to be
(2.1) inserted into the grammar or
(2.2) deleted from the grammar,
(3) modify the parsing table such that it corresponds
to the new grammar.
The incremental approach may also be illustrated by an analogy
between grammar design and ordinary programming. The parser
constructor the corresponds to the language processor. There
are three possibilities to process a program after a change:
(1) all statements are recompiled,
(2) the new statements only are recompiled and the result
is linked to the result of the old statements,
(3) new and old statements are always interpreted anew.
Possibility (1) is the usual approach for languages such as
ALGOL60, COBOL, FORTRAN and so on. Possibility (3) is most often
used with BASIC, but interpretation is considerably slower than
execution of compiled code. Possibility (2) should be the most
economic one if compilation and execution times are considered
together, but this approach implies great difficulties if the
user is allowed to manipulate block structure, declarations
and other non-local constructions.
With a list of restrictions (2) is implemented in IBM's PL/1
Checkout Compiler [IBM 1973], the general case is treated by
Schneider [1975] and Celentano [1978]. The attribute "incremental"
in the title of this work was taken from the latter two references,
though here grammars resp. parsing tables are incremented, while
Schneider and Celentano increment programs resp. code.
It is clear that an incremental parser constructor is the main
prerequisite for syntactical extensibility. Though the statement
of the problem is very simple, its solution is not trivial, since
even the algorithms for usual LR(1) parser construction are
rather complex.
The reader should note that LR(1) was postulated in contrast
to LALR(1). Small tables for all LR(1) grammars will be
constructed in the steps LR(0), NQLALR(1) ("not quite"
LALR(1)), LALR(1) and finally LR(1), where the last or the
last two steps may be skipped if they are not necessary.
8
In the following section we will first survey the literature on
the two areas which are linked by this work, namely syntactical
extensibility and LR(k) parser construction.
Section 1.2 is an updated version of Fischer [1979]. It describes
a syntactically extensible translation system implemented at
the University of Dortmund by Horst Rüthing, Manfred Schönhold
and myself in 1979.
Section 2 introduces several notations for algorithms and grammars.
LR(k) grammars are defined, and the data structures which are
built and modified by the algorithms are presented.
In section 3 the first set of algorithms for the incremental
construction of parsing tables given. This section relies on
the NQLALR(1) method for the construction of lookaheads, and it
should already produce conflict-free tables for most natural,
unambiguous grammars.
If conflicts do occur, the algorithms of section 4 may be used
to determine whether the grammar is LALR(1) or LR(1), and the
corresponding parsing tables may be constructed.
The implemented version of the algorithms were applied to a set
of grammars of programming languages, and several measurements
were carried out. The results are reported in section 5.
In section 6 finally the approach is summarized and discussed.
9
1.1 Survey of Previous Work
----------------------------
Syntactical Extensibility
The field of extensibility in programming languages is too wide
to be covered thoroughly in this work. For a bibliography the
reader is referred to Metzner [1979]. Good introductions to the
field are also given in the books of Cole [1976] and Brown [1974].
Here we will only discuss a historical root of extensibility,
and several references with a specific connection to this work.
There is one extension mechanism which is in widespread use since
the beginnning of programming: the macro in assembler languages.
According to Brown [1974] "a macro is a facility for
replacing one sequence of symbols by another. A piece of
computer software for supporting macro activity is called
a macro processor".
The macro concept evolved from simple replacement schemes to
rather sophisticated possibilities for the generation of program
text, and it seems that the macro is one of the weapons which
allowed assembler languages to withstand high-level programming
languages so long.
High-level languages usually have no or only weak built-in
macro processing possibilities; the compile-time facilities
of PL/1 are one of the exceptions.
About 1965 the macro concept was made independent of any base
language. In GPM (Strachey [1965]) and TRAC (Mooers and Deutsch
[1965]) fixed characters are used to surround a macro call and
to separate the parameters, but in ML/1 (Brown [1967]) a macro
call already starts with the name of the macro only, and the
other delimiters may be chosen freely. In STAGE2 (Waite [1973])
finally a pattern matching is performed on every input line, and
therefore even infix operators can be handled.
As Cole [1976] points out, all replacement schemes discussed so
far are effective before or during the lexical analysis phase
in a compiler or assembler.
The next step is to build the macro facilities in the syntax
analysis pass,as was proposed by Cheatham [1966] and by
Leavenworth [1966]. In high-level languages this approach should
be much more efficient, but of course it complicates the parsing
process considerably.
Wegbreit [1970] for example describes a language with several
aspects of extensibility. He uses Earley's parsing method
(Earley [1970]), but the time estimations rise from O(n**3) to
O(n**8) because of syntactical extensibility.
The title of this work could have been obtained from the thesis
of Vidart [1974] by the replacement of one letter. Vidart modifies
the tables of an LL(1) parser to achieve syntactical extensibility.
The productions added by the user may destroy the LL(1) property
of the old grammar, therefore the system applies a transformation
to the grammar in order to ensure LL(1).
10
There are, however, limitations pertinent to LL(1) parsing: the
Boolean expression grammar in Knuth [1971] is an example for an
LR(0) grammar whose language cannot be generated by any LL(k)
grammar.
The LANG-PAK system of Heindel and Roberto [1975] is the last
mentioned here. LANG-PAK was designed for the design of small
application languages, and the interactive use is especially
well supported. Nonterminals may be defined, undefined (!) and
examined, trial sentences may be input to test the current
grammar, the grammar may be stored in and fetched from a file
at any stage, and semantics may be connected with the productions.
The parsing method is called "left-to-right, top-down, fast-back"
what should again be LL(1). The system is remarkably small
(756 lines of FORTRAN) and fully operational.
LR(k) Parser Construction
LR(k) theory was introduced in the fundamental paper of Knuth
[1965] which is reported in more detail in sections 2.3 and
2.4. Unfortunately Knuth's algorithm does not directly lead
to a practical method for automatic LR(k) parser construction:
for realistic grammars his algorithm would get into serious time
and space problems.
During the past fifteen years several methods with an acceptable
complexity have been developped.
Korenjak [1969] uses an approach which is also known from
precedence parsing: first tables are computed with Knuth's
algorithm for several important subgrammars (e.g. for statements,
declarations, expressions etc.), and then the tables are linked
together to form a table for the entire grammar. Appropriate
choice of the subgrammars and error treatment at the boundary
between two subtables are two difficulties with this method.
Almost all other practical methods follow an advice of DeRemer
[1969], which is based on the following observations:
(1) LR(0) parsing tables are reasonably small and easy
to compute, but most natural grammars are not LR(0).
(2) The conflicts occurring in the LR(0) table can most
often be resolved if the parser is given a "lookahead"
of length 1 (or k >= 1 in the general case).
DeRemer therefore first computes the LR(0) set of states. For
the computation of the lookaheads there are several different
methods.
SLR(k) is the first successful method for the automatic
construction of tables for LR(k) parsing tables. The lookahead
for a reduction to nonterminal 'A' is determined from the
strings which may follow on 'A' in any sentential form deriveable
from the start-symbol. This method was implemented and evaluated
by DeRemer [1969,1971] and by Anderson, Eve and Horning [1973].
All grammars for which the LR(0) table can be made conflict-free
11
by the addition of some lookahead are called LALR(k) by DeRemer
[1969]. This definition comprises more grammars than SLR(k),
and especially LALR(1) is the standard method today. It was
first implemented by LaLonde [1971], and as SLR(k) it was shown
to be superior to precedence parser construction and to Korenjak's
method (LaLonde, Lee and Horning [1971]). The method is also
thoroughly discussed by Aho and Ullman [1972,1977].
There is a third method, termed NQLALR(k) ("not quite" LALR(k))
by DeRemer and Pennello [1979], which is a little bit less
powerful than LALR(k). Here the lookaheads are taken from states
determined by a tree of predecessor states. Probably several
implementations which claimed LALR(k) have in fact used this
method, and because of its relative simplicity and efficiency
it is also used in this work (cf. sections 3.2, 3.5).
In the meantime work concentrates on the most efficient LALR
method. Pager [1977 LTA] and DeRemer and Pennello [1979] must
be cited here.
Pager [1977 PGM] describes a method which produces conflict-free
parsing tables with a minimal number of states for all LR(k)
grammars. In principle states that would be constructed by
Knuth's method are merged as with LALR, NQLALR and SLR, but
Pager avoids the case that the merging introduces a conflict
which would be not present in the LR(k) table. The corresponding
checks must be made throughout the construction, however, and
therefore the method is inefficient if the grammar turns out
to be LALR(k).
To achieve general LR(1) another idea appearing already in DeRemer
[1969] is used in this work: first an LALR(1) table is constructed,
but then the constructor tries to remove any remaining conflicts
by splitting of some states. In section 4.3 we obtain the minimal
table of Pager [1977 PGM], but probably with less effort in the
case of a few LALR(1) conflicts.
12
1.2 EXTRA - an EXtensible TRAnslator
-------------------------------------
Introduction
The EXTRA systems allows the interactive description and use of
a translator for LR(1) languages. The current description
may successively be extended, and test translations may be
interspersed with the definition phases. This approach signifi-
cantly decreases the costs for the implementation and maintenance
of a translator.
Originally the system was designed for the tranlation between
high-level programming languages.
Usual text editors are not suited to this task, because
they cannot handle nested structures. A string manipulation
language such as SNOBOL4 (Griswold, Poage and Polonsky
[1971]) could handle context-free languages, but its back-
tracking pattern matching process is too slow for realistic
translations.
What we would look for is an easy-to-define front-end of an
ordinary compiler which would read and analyze the source text
possibly in one pass, and by a syntax-directed translation in
the sense of Lewis and Stearns [1968] would translate it into the
desired target text.
Another design issue of the system originates from the obser-
vation that the costs for using known parser constructors are
unnecessary high in realistic compiler construction projects.
One might object that the parser needs to be generated
only a very few times. In practice this is not true, because
often the design of new languages (and hence of its grammars)
is not yet completed when the compiler construction is
started. Furthermore the implementor himself wishes to change
the grammar because he needs additional hooks for semantics,
and finally inconsistencies and weak points in the language
may cause additional constructor runs.
In contrary to the repeated sacrifice of some minutes of CPU-time
for the change of a single production, the EXTRA system has a
"steadiness" property: local changes of the grammar only cause
local changes in the internal parsing and translation tables.
We will refer to this as the incremental construction method in
contrast to the usual en-bloc construction.
There are other properties which ease the use of the EXTRA
system. The problem of the displacement of numbers for semantic
routines after a change of the grammar is avoided, since the
translation is defined almost entirely in terms of the grammar.
Grammar and source text are read in and analyzed by the same
system components; this unifies the notation and reduces
the size of the system and its data structures. There is an
escaping mechanism that allows the activation of user written
program segments, and there are some commands that enable the
interactive use of the system.
13
General Structure of the System
The EXTRA system has much similarity to a very simple
translator as it is shown in figure 1.2-1.
There is a scanner that reads the characters of the source
text and builds symbols.
The string of symbols is syntactically analyzed by the
parser; this component builds a syntax tree.
Parts of the tree may be modified by the tree transformator.
Finally the leaves of the modified tree are collected and put
to an external medium by a small output program.
---------------------------------------------------------------
Source Text Transformational
| Grammar
↓ |
Scanner | |
| | |
↓ | |
Parser <---------- Table <---- Parser <-----+
| | Constructor |
| | |
↓ | |
Tree Transformator <-- Table <---- Transformator <---+
| | Constructor
| |
↓ |
Output Program
|
↓
Target Text
---------------------------------------------------------------
Figure 1.2-1: Structure of a Simple Translator
Parser and tree transformator are best driven by tables that
contain the language dependent information. These tables are
constructed from a transformational grammar (DeRemer [1974])
by a parser constructor and a transformator constructor in a
separate step prior to the actual translation of source text.
As shown by figure 1.2-2, this separation is almost entirely
overcome in the EXTRA system, and this is one of the main
features of the system.
14
---------------------------------------------------------------
Source Text and
Transformational
Grammar
|
↓
Scanner
|
↓
Parser <---------- Table <---- Parser
| Constructor
| ↑
| |
| +-------------------------+
| | |
↓ ↓ ↓
Tree Transformator <-- Table <---- Transformator
| Constructor
↓
Output Program
|
↓
Target Text
---------------------------------------------------------------
Figure 1.2-2: Structure of the EXTRA System
The input to EXTRA is a mixture of parts of the transformational
grammar and of the source text to be translated.
Both components of the input are read by one scanner, and they
are syntactically analyzed by the same parser. The tree trans-
formator is the first that recognizes the meaning of the
grammatical components. They are stored by special semantic
actions of the transformator, and at definite moments the
syntax directed translation is stopped, the tables are modified
such that they reflect the new grammar, and then the translation
is resumed.
15
1.2.1 EXTOL - The Extensible, Translation Oriented Language
------------------------------------------------------------
Description of Syntax
Along with the babylonian proliferation of new programming
languages goes a variety of notations for grammars; cf. the
title of Wirth's paper [1977]. We will adopt a mixture of that
proposal (for delimiters '.', '=' and '|') and of PL/1 (for
identifiers, numbers and strings).
The entire input for EXTRA has the following raw structure:
... [ ... ] ... [ ... ] ... [ ... ] ...
In square brackets, portions of the transformational grammar are
given; between these packages, source text that is to be trans-
lated may be interspersed. This enables the interactive testing
and correction of a defintion.
Since we would like to analyze the entire input by the same parser,
there should be one grammar that describes that input. But this
grammar must be brought into the system, i.e. it must be read
in, and then there must be a description (= grammar) for it,
and this description must be brought into the system ... .
Therefore the grammar for the first package in '[' and ']' is
defined by a fixed procedure in the EXTRA system: the metaparser.
This procedure is able to read the metagrammar that starts with
the following rules:
EOF IDENTIFIER NUMBER STRING
[AXIOM = EXTRA_INPUT
.EXTRA_INPUT = MEMBERETIES '[' GRAMMAR ']' SOURCE_TEXT
=> #1
| EXTRA_INPUT '[' GRAMMAR ']' SOURCE_TEXT
=> #1
.GRAMMAR = RULES => #2
.RULES = RULE
| RULES '.' RULE
.RULE = LEFT_SIDE '=' RIGHT_SIDES => #4
| LEFT_SIDE '-' RIGHT_SIDES => #5
.LEFT_SIDE = IDENTIFIER => #3
.RIGHT_SIDES = RIGHT_SIDE
| RIGHT_SIDES '|' RIGHT_SIDE
.RIGHT_SIDE = SYNTAX_PART SEMANTIC_PART
.SYNTAX_PART = MEMBERETIES => #6
.SOURCE_TEXT = /* extensions by the user */
.MEMBERETIES = => #7
| MEMBERETIES MEMBER
.MEMBER = PRIMARY
.PRIMARY = IDENTIFIER => #8
| STRING => #9
| NUMBER => #8
/* (the metagrammar is continued below) */
As the reader may guess, the grammar above is written in the
notation that it describes.
Once the metagrammar is read in, it is used to construct a LR(1)
16
parsing table. From there on, the LR(1) parser in the EXTRA
system may use this table to read more grammar; that grammar then
is used to modify the table and so on.
The constructions '=> #n' on the right side are produced by the
rules for SEMANTIC_PART that are given below. These constructions
are quite important, they activate the so-called special semantic
actions which are described in detail in section 1.2.2.
If in the following parse a string of symbols is reduced
to EXTRA_INPUT, for example, the tree transformator will
perform action #1, in which case the translation of
SOURCE_TEXT is put out, and the system is prepared to scan
GRAMMAR in the following.
Actions #3, #6, #7, #8 and #9 are used to store the
LEFT_SIDE and the MEMBERs of the RIGHT_SIDEs of a rule. #4
indicates that the rule is to be inserted in the grammar,
while #5 will delete it from the grammar.
#2 finally reorganizes the entire parsing table such that
it reflects the new grammar, and it prepares the system to
scan SOURCE_TEXT in the following.
Though it does not occur in the metagrammar, we will use the
term "production" for a LEFT_SIDE together with one of its
RIGHT_SIDEs.
Several special cases must be noted if a production is inserted
(deleted):
(1) The LEFT_SIDE of the production is not yet (no more)
accessible from the start-symbol AXIOM : the production
will not occur in the parsing table.
(2) The LEFT_SIDE does not yet (no more) produce any
terminal string : the production will occur in the
parsing table (applications of this anomaly are discussed
in connection with the backspacing of the parser, below.
(3) There is already a rule in the grammar with the same
LEFT_SIDE : the new production is united with the old
ones.
(4) The first (last) production with a certain symbol as
LEFT_SIDE is inserted (deleted) : the symbol switches
from terminal to nonterminal (from nonterminal to terminal).
(5) The production is already (is not) in the grammar :
it is ignored, possibly associated semantics are
changed (an error message is issued).
The first line of the metagrammar determines the names of several
symbols that are built from characters by the scanner, and that
"do not stand for themselves". Note that the association of the
names to the symbols is done by the ordering.
Identifiers that do not occur in the first line or as a LEFT_SIDE
stand for themselves, i.e. they are keywords (cf. section 1.2.2,
scanner). Strings stand for their "content", they are used to
separate (sequences of) special characters from the delimiters
17
of the grammatical notation.
Simple Tree Transformations
It is an easy task to build a syntax tree during the parse:
if the parser reduces to a production, a node for the LEFT_SIDE
is inserted, and the MEMBERs of the RIGHT_SIDE are made the
sons of that node.
The simplest way to define a translation is to modify the sons
and their ordering when the subtree for the LEFT_SIDE is built.
The notation for such modifications is straightforward: after
the production we write "what it shall become", i.e. we note
a '=>' and then the sequence of symbols that are to become the
new sons of the LEFT_SIDE:
.SEMANTIC_PART = TRANSLATIONETY GENERATIONETIES => #10
.TRANSLATIONETY = => #11
| '=>' ELEMENTETIES => #12
.ELEMENTETIES = => #13
| ELEMENTS
.ELEMENTS = ELEMENT
| ELEMENTS ELEMENT
.ELEMENT = STRING => #14
| IDENTIFIER => #15
| NUMBER => #16
| IDENTIFIER ':' NUMBER => #17
| '#' NUMBER => #18
While the SYNTAX_PART of a production is used to modify the
parsing table, the SEMANTIC_PART is translated to some instructions
that are interpreted by the tree transformator every time the
parser reduces to that production.
These instructions are generated parallel to the recognition of
the components of SEMANTIC_PART, and they are later interpreted
from left to right. By action #2 (after a ']') they are somewhat
modified, and several consistency checks are made.
Again we note some special cases:
(1) An empty TRANSLATIONETY (#11) indicates that the subtree
for the production remains unchanged.
(2) If ELEMENTETIES is empty (#13), however, the translation
of the production is the empty string.
(3) Every ELEMENT that occurs as a MEMBER in the SYNTAX_PART
stands for a copy of that MEMBER. If it is a nonterminal,
the entire subtree corresponding to the nonterminal is
copied. (The first occurrence of the ELEMENT is in fact
not copied, but "taken").
(4) If the ELEMENT occurs more than once as a MEMBER, the
different occurrences can be distinguished by suffixing
the ordinal number (':1', ':2' etc.).
18
(5) If the ELEMENT does not occur as a MEMBER, it must
either be a terminal or an accumulator (see below).
(6) For an ELEMENT of the form '#' NUMBER an instruction
is generated by action #18 that results in the
activation of a piece of code in the EXTRA system
during the translation.
Accumulators
Simple tree transformations suffice, for example, to translate
arithmetic expressions from infix to prefix or postfix notation,
but often it is necessary to transport portions of the syntax
tree over a greater distance.
In EXTRA so-called accumulators are used to store subtrees during
the translation of a production; later in the parsing process
these subtrees can be identified by the IDENTIFIER of the accumu-
lator, and a copy of the stored subtree can be inserted in the
translation, or it can be used to build another accumulator.
The notation for the definition and the application of accumulators
stresses the resemblance of theses constructions to nonterminals:
.GENERATIONETIES=
| GENERATIONETIES GENERATION
.GENERATION = '->' ACCUMULATOR '=' ELEMENTETIES => #23
.ACCUMULATOR = => #24
| IDENTIFIER => #25
An example of the power of a single accumulator is the reversion
of a list:
[SOURCE_TEXT = LIST => REVERSE
.LIST = IDENTIFIER -> REVERSE = IDENTIFIER
| LIST ',' IDENTIFIER -> REVERSE = IDENTIFIER ',' REVERSE
]
Every accumulator represents not only one subtree, but a stack
of trees. All accumulators in the SEMANTIC_PART of a production
are processed in three steps:
(1) For every occurrence of (the IDENTIFIER of) an
accumulator as an ELEMENT a copy of the element on
the top of the corresponding stack is made, and these
copies are used to build the subtrees specified by
the ELEMENTETIES of the TRANSLATIONETY and the
GENERATIONs.
(2) The top elements of all accumulators that are used as
an ELEMENT are deleted from the stacks.
(3) For every GENERATION with a nonempty ACCUMULATOR the
subtree corresponding to ELEMENTETIES is pushed onto
the stack.
With the empty alternative for ACCUMULATOR it is possible
explicitly to delete incarnations of accumulators (only step
(2) is applied).
19
Accumulators are a little bit more powerful than a stack, since
it is also possible to access subtrees which lay deeper in the
stack by the ':' NUMBER construction for ELEMENTs; ':1' denotes
the top element, ':2' the element below the top and so on.
These elements are also deleted in step (2).
It is clear that the user always must be aware of the order in
which productions are applied by the bottom-up parser when he
uses accumulators.
The transformator constructor checks whether there may be SOURCE_
TEXTs which lead to the use of an accumulator whose stack is
empty, or which lead to incarnations that will be never used.
Such checks are made when the tables are reorganized (action #2).
Backspacing of the Parser
Sometimes it is necessary further to modify parts of the tree
that have already been generated, or to perform special semantic
actions for such parts. Instead of inventing operations that
access and manipulate subtrees of the MEMBERs and ELEMENTs in a
production, we modify the parser, since it should be the only
analyzing mechanism in the system.
EXTOL provides for two additional forms of ELEMENTs:
.ELEMENT = '@' => #19
| COMBINED_SYMBOL COMBINED_ELEM
.COMBINED_SYMBOL= IDENTIFIER => #21
.COMBINED_ELEM = '(' ELEMENTS ')' => #22
]
Action #19 backspaces the read-head of the parser. All ELEMENTs
to the right of the '@' are prefixed to the remaining SOURCE_
TEXT, and they are reanalyzed. In the implementation the parser
switches between the scanner and a simple procedure that extracts
the next leaf from the tree.
The '@' may not be used in a GENERATION; this restriction is not
checked by the metagrammar, but by the transformator constructor.
If parts of the subtree behind '@' need no further examination,
they may be combined to a nonterminal COMBINED_SYMBOL with the
aid of the second alternative for ELEMENT during the generation
of the tree. The parser later will only find that nonterminal.
This feature may be used to overcome the loss of efficiency intro-
duced by the backspacing.
COMBINED_SYMBOLs necessitate nonterminal lookahead symbols in
the LR(1) parsing table (the definition of the function FIRSTk
may be generalized without difficulty so that sentential forms
are the result instead of terminal strings). The following
example shows, however, that nonterminal lookaheads may destroy
the LR(1) property of a grammar:
[AXIOM = A B x | D B y
.B =
.A = z
20
.D = z
]
The corresponding parsing table has a conflict for nonterminal
'B' which is resolved if only terminal lookaheads ('x' and 'y')
are used. In such cases the user must dispense with COMBINED_
SYMBOLs.
Backspacing of course is very powerful, because the entire SOURCE_
TEXT can be processed by several passes. It implies, however, also
a serious problem: the portion of the tree that is reanalyzed
must be derivable from the current grammmar. Syntax errors
occurring during reanalysis are more serious than those during
the usual parse: the former are not caused by the programmer who
wrote the SOURCE_TEXT, but by the designer of GRAMMAR. The system
will perhaps check for the possibility of such errors in the
future; at present the decidability problems are uunsolved, and
the responsibility for correct backspacing is up to the user.
There is a trick which makes it possible to separate the language
that may be input from the one that is read during reanalysis:
use unbased nonterminals, i.e. nonterminals which generate no
terminal string. The parser has no problem to shift such
nonterminals or to use them as lookaheads, but it cannot occur
during the normal parse of SOURCE_TEXT.
21
1.2.2 Other Features of EXTRA
------------------------------
There are some features of the EXTRA system that are not
directly related to the transformational grammar. The user
may wish to change and extend these features, but he can do
this only by modifying the PL/1 source-code of the EXTRA system.
Scanner
As already mentioned, a fixed scanner is used to read both the
grammar and the source program. In its present form it is
PL/1-oriented. A precise description would need a lot of
additional description, therefore only some hints are given here
with the aid of the following rules:
[LETTER = A | B | C | D | E | F | G | H | I | J
| K | L | M | N | O | P | Q | R | S | T
| U | V | W | X | Y | Z | '_'
.DIGIT = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
.QUOTE = ''''
.IDENTIFIER = LETTER
| IDENTIFIER LETTER
| IDENTIFIER DIGIT
.NUMBER = DIGIT
| NUMBER DIGIT
.OPEN_STRING = /* a sequence of zero or more characters
different from QUOTE */
.STRING = QUOTE OPEN_STRING QUOTE
-> CONTENT = OPEN_STRING
| STRING QUOTE OPEN_STRING QUOTE
-> CONTENT = CONTENT QUOTE:1 OPEN_STRING
/* this is a COMMENT */
]
The scanner returns the symbols IDENTIFIER, NUMBER and STRING.
COMMENTs end after the first '*/', they may contain several '/*'.
The length of an IDENTIFIER, NUMBER or STRING is only limited
by the storage available in the EXTRA system.
End of lines are always ignored; spaces and COMMENTs are
terminators for all symbols except STRING. A STRING is terminated
by a QUOTE that is not followed by another QUOTE.
COMMENTs in STRINGs and vice versa are not recognized.
There is a problem concerning the sequence of the characters
I D E N T I F I E R , because it is used in several quite
different ways:
(1) in the metagrammar, it denotes names of nonterminals
(syntactical variables), accumulators and terminals
that consist of letters (keywords),
(2) in the grammar given by the user, it denotes an arbitrary
identifier in the SOURCE_TEXT to be translated,
22
(3) in the SOURCE_TEXT it is just such an identifier.
Despite this confusion there is only one restriction that affects
the SOURCE_TEXT:
Keywords may not be used as identifiers.
This restriction does not seem to be serious, since many programming
languages (COBOL, BASIC, SIMULA67, PASCAL, ADA) have it, though
PL/1 and FORTRAN do not have it.
There is another convention for strings and (sequences of)
interpunctuation characters:
Initially the scanner does not recognize any character that
is no letter and no digit and that is not enclosed in
QUOTEs.
The scanner enters the CONTENT of every string in a special
list, however, and if it subsequently reads an interpunc-
tuation character, it tries to extract a symbol from the
input equal to the longest sequence of characters that
was already stored in this list.
This possiblity does not affect the rules for MEMBER and ELEMENT
in the metagrammar: terminals that are not keywords should
always be enclosed in QUOTEs.
From the input
':' '=' ':=' ::= '::=' ::=
the scanner would thus extract the symbols
STRING STRING STRING : := STRING ::=
Special Semantic Actions
ELEMENTs of the form '#' NUMBER cause the tree transformator
to execute a program segment that is located in the EXTRA system
via NUMBER. By this escape mechanism all internal data structures
and the full power (and danger) of PL/1 are available during the
translation. Of course this feature should be used only for
tasks that cannot be accomplished by the means of EXTOL.
A series of numbers is used to connect the construction of the
parser table and the table for the tree transformator with the
rules of the metagrammar, as can be seen from section 1.2.1:
'#2' for example reorganizes the entire parser table.
The semantic actions expect a fixed ordering of the MEMBERS of
the production to which they are applied. By careful modifications
the user could, however, change for example the delimiters of the
metagrammar.
For the higher numbers (beginning at 35) the user has to provide
his own PL/1-code, and he must recompile at least one procedure
and relink the EXTRA system.
23
Typical actions that can only be accomplished by the
'# n'-constructs are:
(1) Create a new identifier for a temporary variable that
was introduced by the translation.
(2) Convert a label identifier to a FORTRAN statement
number.
(3) Insert a symbol that causes the output program to begin
a new line, or to tabulate to a specified column
position.
(4) Initialize or increment a counter, create a symbol that
corresponds to the value of that counter.
(5) Store attributes of identifiers in a symbol table,
modify symbols on the parse stack according to their
attributes.
Output program
The tree transformator always knows the portion of the tree that
may be given out because it is surely unaffected by the following
translation. For every symbol in the leaves of that portion, the
transformator calls the output program.
The output program may perform some additional translation, but
it should only have a memory limited to a few adjacent symbols.
The output program is thus the counterpart to the scanner; it is
responsible for the external representation of the target text,
for formatting, file peculiarities and so on.
For a translation to FORTRAN, for example, it would autonomously
create continuation cards, and it would map identifiers such that
their length is at most six.
Commands
EXTRA offers a few commands that facilitate the interactive use
and the communication with the environment of the system:
[RULE = COMMAND
.COMMAND = LOAD IDENTIFIER => #30
| STORE IDENTIFIER => #31
| READ IDENTIFIER => #32
| WRITE IDENTIFIER => #33
]
The IDENTIFIERs always designate a file.
Without the COMMANDs the user would have been forced always to
start with the metagrammar. By STORE he may save the current
description in internal form on a file, and later he can LOAD and
further extend or modify it. READ redirects the entire input
to the specified file, and WRITE redirects the output medium.
24
1.2.3 Example: Extension of the Metagrammar
--------------------------------------------
Various extensions of the BNF notation have been proposed in
the literatur in order to allow regular expressions on the
right side of productions. In the EXTRA system such extensions
may be defined precisely, and with the backspacing facility the
expanded rules can also be inserted in the parsing table.
We need two special semantic actions which are easy to write:
#50 changes the preceeding ELEMENT to a new identifier,
#51 is analogous to #8, it replaces the MEMBER in the
production which is currently built by the preceeding
ELEMENT.
Furthermore, we need two accumulators:
[EXTRA_INPUT - EXTRA_INPUT '[' GRAMMAR ']' SOURCE_TEXT
.EXTRA_INPUT = EXTRA_INPUT SUB GRAMMAR BUS SOURCE_TEXT
.SUB = '['
-> APPEND = -> GEN = ' ' #50
/* initialize the accumulators */
.BUS = ']' -> = APPEND GEN /* delete all acumulators */
.MEMBER = PRIMARY '*' => GEN #51 /* "zero or more" */
-> APPEND = APPEND '.' GEN '='
'|' GEN PRIMARY
-> GEN = ' ' #50
| PRIMARY '+' => GEN #51 /* "one or more" */
-> APPEND = APPEND '.' GEN '=' PRIMARY
'|' GEN PRIMARY
-> GEN = ' ' #50
| PRIMARY '?' => GEN #51 /* "optional" */
-> APPEND = APPEND '.' GEN '='
'|' PRIMARY
-> GEN = ' ' #50
| PRIMARY '/' PRIMARY /* "separated by" */
=> GEN #51
-> APPEND = APPEND '.' GEN '=' PRIMARY:1
'|' GEN PRIMARY:2 PRIMARY:1
-> GEN =' ' #50
.PRIMARY = '(' RIGHT_SIDES ')' => GEN #51
-> APPEND '.' GEN '=' RIGHT_SIDES
-> GEN = ' ' #50
.RULES = RULE => RULE @ APPEND
-> APPEND =
| RULES '.' RULE => RULES '.' RULE @ APPEND
-> APPEND =
]
The accumulator GEN always containns the next identifier for the
nonterminals that must be created, while APPEND contains the
generated rules. Since APPEND is reanalyzed when it is appended
to RULES, the generated rules are also inserted in the grammar,
and the parser constructor will insert them in the table without
at all being affected by the extensions.
25
2 Definitions and Notations
----------------------------
It is assumed throughout this work that the reader has some
familiarity with formal language theory, LR(k) parsing,
translation and programming. Several notations which are
essential for the understanding of sections 3 and 4 will be
presented in the following subsections.
2.1 The Set Oriented Algorithmic Language SOL
----------------------------------------------
The construction of LR(k) parsers is a very dynamical process that
probably cannot be described by static conditions, sets and
formulae. Moreover, the construction process is rather complex,
and a verbal description either would not allow the necessary
accuracy, or it would become unreadably long.
Because of the recursive and parallel nature of wide parts of the
construction method it would have been an interesting task to use
a functional programming style as proposed for example by Backus
[1978]. Functional languages are just evolving at present,
however, and it is uncertain whether a completely new notation
would enhance the clarity of the description of a complex
algorithm. Furthermore it is an aim of this work to show that the
incremental construction can be done efficiently with todays means;
the functional description would therefore be too great a departure
from the actual implementation mentioned in section 5.
It is common practice to formulate algorithms in a variant of a
well known programming language, possibly enriched by statements
in a natural language.
In this work PL/1 is used as base language. As can be seen from
the grammar below, only very few elements of that language were
retained: lexical level, assignment, GOTO, IF-THEN-ELSE, DO-WHILE-
END, procedures and calls. Several extensions were made to allow
quantors and operations on sets, and to abbreviate some
constructions that occur very often.
Lexical Level of SOL
The notation for the individual symbols of the language is much
the same as in PL/1. Comments in algorithms are enclosed in
'/*' and '*/', for example:
/* this is a comment */
In order to distinguish between words of the English language
and identifiers with a fixed meaning, the latter are enclosed
in quotation marks in the text if confusion is possible.
Keywords of SOL and identifiers for variables, sets, mappings
consist of a sequence of lower case letters, digits and
26
underscores ('_'). Names of algorithms (i.e. PL/1 procedures)
and of some important sets are marked out by the use of upper
case letters.
Syntax of SOL
Using the conventions of the EXTRA system (cf. section 1.2,
especially the extensions of 1.2.3), the syntax of the SOL
language is described by the following rules. Some remarks
concerning the semantics are given verbally.
[SOURCE_TEXT = ALGORITHM
.ALGORITHM = IDENTIFIER ':' PROC FORMAL_PARAMETERS_PACK? ';'
(DCL DECLARATION/',' ';')?
ALGORITHM*
(STATEMENT ';')*
END IDENTIFIER ';'
.FORMAL_PARAMETERS_PACK = '(' (IDENTIFIER (':=' '*')?)/',' ')'
The construction ':=*' indicates that the value of a formal
parameter may be changed in the algorithm; in the absence of
this suffix it will be used only.
A formal parameter may be the name of a (variable) algorithm.
This construction is rather often used in sections 3 and 4 to
avoid duplication of parts of algorithms.
.DECLARATION = (IDENTIFIER | '(' IDENTIFIER/',' ')')
('∈' | 'c') IDENTIFIER
Declarations indicate the domain of objects. They are given only
when this domain cannot easily be concluded from the usage (or the
IDENTIFIER) of the object.
.STATEMENT = ASSIGNMENT
| GOTO IDENTIFIER
| CALL_STATEMENT
| DO_GROUP
| CONDITIONAL_STATEMENT
| LABEL ':' STATEMENT
.LABEL = IDENTIFIER
.ASSIGNMENT = VARIABLE ':=' EXPRESSION
| VARIABLE OP ':=' EXPRESSION
=> VARIABLE ':=' VARIABLE OP '(' EXPRESSION ')'
For example, 'A u:= {b};' means that 'b' becomes an element of the
set 'A'.
.VARIABLE = IDENTIFIER ('(' EXPRESSION/',' ')')?
The IDENTIFIER may be a set, state, symbol and so on. If it is
followed by a list of expressions enclosed in parentheses, it
usually is a mapping. Assignment to such a mapping means the
(re-) definition of the value for a specific set of arguments.
In the implementation, where all states, symbols etc. are
realized by natural numbers, these mappings are of course
ordinary arrays.
27
.EXPRESSION = PRIMARY/OP
.PRIMARY = 0 | 1 /* natural numbers */
| SET
| 'Ø' /* the empty set */
| VARIABLE
| TUPLE
.SET = '{' IDENTIFIER '|' BOOLEAN_EXPRESSION '}'
.OP = '+' /* add */
| '-' /* subtract */
| 'u' /* unite with */
| 'n' /* cut with */
| '\' /* delete (set difference) */
.TUPLE = '(' COMPONENT/',' ')'
.COMPONENT = EXPRESSION/'|'
| (IDENTIFIER ':=')? '*'
In a BOOLEAN_EXPRESSION, a COMPONENT which is '*' may have any
value; this value is assigned to the named object if '*' is
preceeded by IDENTIFIER ':='. For TUPLEs used in a CALL_STATEMENT,
':=*' again stands for a possible change of the actual parameter.
.CALL_STATEMENT = IDENTIFIER TUPLE
.DO_GROUP = GROUP_HEAD ';'
(STATEMENT ';')*
END
.GROUP_HEAD = DO
| DO WHILE BOOLEAN_EXPRESSION
| FOR variable ':=' expression TO expression
(BY expression)? DO
| FOR ('∀' | '?') QUALIFIED_EXPRESSION DO
.QUALIFIED_EXPRESSION = IDENTIFIER ('∈' EXPRESSION)?
| IDENTIFIER ('∈' EXPRESSION)?
'(' BOOLEAN_EXPRESSION ')'
The qualificator 'FOR ?' chooses any one of the objects fulfilling
the condition. The STATEMENTs in such a DO_GROUP are executed at
most once.
.CONDITIONAL_STATEMENT = IF BOOLEAN_EXPRESSION THEN
(STATEMENT | DO_GROUP) ';'
(ELSE
(STATEMENT | DO_GROUP) ';')?
.BOOLEAN_EXPRESSION = BOOLEAN_PRIMARY/BOOL_OP
.BOOLEAN_PRIMARY = RELATION
| '∃' QUALIFIED_EXPRESSION
| '¬' BOOLEAN_PRIMARY
.BOOL_OP = '&' /* and */
| '|' /* (inclusive) or */
.RELATION = EXPRESSION REL_OP EXPRESSION
.REL_OP = '<'
| '<=' /* less or equal */
| '='
| '≠'
| '>'
| '>=' /* greater or equal */
| '∈' /* element of */
| '¬∈' /* not element of */
| 'c' /* contained in */
]
28
2.2 Context-Free Grammars
--------------------------
A vocabulary V is a finite, nonempty set of symbols. A word w
over V is a finite sequence of symbols in V. The length of w is
denoted by |w|. There is one word e with |e| = 0, the empty word.
V* is the set of all words over V, including e. For k >= 0
HEAD(w,k) (TAIL(w,k)) gives the string consisting of the first
(last) k symbols in w. w**k is the k-fold concatenation of w.
We restrict to such context-free grammars (called "reasonable")
which are relevant for the following of this paper.
Definition 2.2-1:
A reasonable context-free grammar RCFG is a 4-tuple
G = (nonts,terms,prods,start),
where
(1) nonts is a finite, nonempty set of nonterminals,
(2) terms is a finite, nonempty set of terminals,
(3) prods c (nonts x (nonts u terms)*)
is a finite, nonempty set of productions,
(4) start ∈ nonts is the start symbol of G.
The vocabulary (set of symbols) of G is defined by
(5) symbs := nonts u terms.
In addition, we require for every RCFG:
(6) nonts n terms = Ø,
(7) ∀ A ∈ symbs
(A ∈ nonts <==> ∃ w ∈ symbs* ((A,w) ∈ prods)),
(8) ∀ y ∈ symbs
(∃ A,x,z ∈ symbs* ((A,x y z) ∈ prods) |
∃ x ∈ symbs* ((y,x) ∈ prods)),
(9) (see below, section 2.3).
If (A,w) ∈ prods, we will call 'A' the left side and 'w' the
right side of the production, and we will also write
A = w
in accordance with the notation introduced in section 1.2. The
usual "->" sign is reserved in this work for the shift action
of LR parsers.
Definition 2.2-1 excludes several cases which cause problems in
practice. Henceforth we will most often omit the prefixes
"reasonable" and "context-free" since we will not consider other
grammars.
29
Definition 2.2-2:
A grammar G defines relations in symbs* x symbs* by
(1) w1 => w2 :<==> (w1,w2) ∈ "=>" :<==>
∃ (A,w) ∈ prods ∃ u,v ∈ symbs*
(w1 = u A v & w2 = u w v)
"directly derives"
(2) w =>+ w' :<==>
∃ n > 0 ∃ w0...wn
(w = w0 => w1 => ... => wn = w')
"derives"
(3) w =>* w' :<==> w =>+ w' | w = w'
"derives or is"
A sequence w0 => ... => wn is called a derivation (of wn
from w0).
A word w ∈ symbs* is called nullable iff w =>* e.
If start =>* w, w is called a sentential form, and if in
addition w ∈ terms*, w is called a sentence. The set of
all sentences
L(G) := {t ∈ terms* | start =>+ t}
is the language generated by G.
For w ∈ L(G) there may be several derivations from 'start'.
Some of them are excluded by the convention always to replace
the rightmost nonterminal in w1 of (1), i.e. we require
v ∈ terms*. In the following we will only use such rightmost
derivations, if not explictly stated otherwise.
The remaining variety of derivations leads us to a property of
the grammar.
Definition 2.2-3:
For a grammar G = (nonts,terms,prods,start) let r and l(A)
for every A ∈ nonts be new symbols not in terms u nonts.
The structure grammar of G is defined by
SG(G) := (nonts,terms u {l(A) | A ∈ nonts} u {r}
,{(A, l(A) w r) | (A,w) ∈ prods}, start).
To every derivation start => ... => w ∈ L(G) in G uniquely
corresponds a derivation start => ... => st(w) ∈ L(SG(G)) in
SG; this word st(w) is called the structure tree of w.
G is called unambiguous iff for every w ∈ L(G) and for all
derivations start => ... => w in G the corresponding
derivations in SG lead to the same structure tree. Otherwise
G is called ambiguous.
30
2.3 LR(k) Grammars
-------------------
One could hardly write a paper containing the string "LR"
without reference to the original work of Knuth [1965].
This section and the following are rather closely patterned
after that article.
Let us first for a fixed k >= 0 append the following requirement
to definition 2.2-1 for reasonable context-free grammars:
(9) ∃ axiom,eof ∈ symbs
((start, axiom eof**k) ∈ prods &
∀ (A,w) ∈ prods ((A,w) ≠ (start, axiom eof**k) ==>
(A ≠ start & ¬ ∃ x,y ∈ symbs*
(w = x start y | w = x eof y))))
The endmarker 'eof' (other authors use "$", "-|", "#" or "|?_;"
for this symbol) ensures that generally a lookahead string of
length k is available. From (5) to (9) we can conclude that
eof ∈ terms.
Requirement (9) also excludes derivations of the form
start =>+ start
which would collide with the unambiguity of LR grammars (cf.
Harrison and Havel [1974]). Henceforth we will only consider
grammars that fulfill restriction (9).
We are interested in bottom-up parsers which try to reconstruct
the derivation of a sentence while reading the latter from left
to right. Such a parser sometimes must reduce a sequence of
symbols to the left side of a production. For the points at
which such reductions may take place we will use a special term:
Definition 2.3-1:
For a grammar G let x ∈ symbs* be a sentential form.
Any pair (n,p) is called a handle of x iff
∃ u,w ∈ symbs* ∃ v ∈ terms* ∃ A ∈ nonts
(p = (A,w) ∈ prods &
n = |u w| &
start =>* u A v => u w v = x).
31
LR(k) grammars are such that if the parser tries to reconstruct
the derivation, the handle can always be uniquely determined
if the parser looks ahead k symbols:
Definition 2.3-2 (Knuth [1965]):
A RCFG G is said to be translatable from left to right
with bound k (briefly "an LR(k) grammar") iff
∀ x ∈ symbs* ∀ v1,v2 ∈ terms*
∀ p1,p2 ∈ prods ∀ n1,n2 >= 0
(
(n1 = |x| &
(n1,p1) is a handle of x v1 &
(n2,p2) is a handle of x v2 &
HEAD(v1,k) = HEAD(v2,k)
) ==>
(n1 = n2 & p1 = p2)
).
A grammar is an LR grammar if there is a k such that the
grammar is an LR(k) grammar. A language is an LR language
if there is an LR grammar that generates it.
Though Knuth caused an avalanche of literature on LR grammars,
the main theoretical knowledge remained rather fixed during the
time. The following collection of important results on LR
grammars is cited from Harrison and Havel [1974]:
(1) For a given grammar and a given k >= 0 it is decidable
whether the grammar is an LR(k) grammar or not.
(2) It is undecidable whether a given grammar is an LR
grammar or not.
(3) Every LR grammar is unambiguous.
(4) The language generated by an LR grammar is deterministic.
The DPDA can be effectively constructed.
(5) Any deterministic language has an LR(1) grammar.
(6) Any LR grammar is equivalent to an LR(1) grammar.
Results (1) to (5) are all from Knuth [1965]. (6) is a
consequence of (4) and (5); a direct transformation is described
by Mickunas, Lancaster and Schneider [1976].
Knuth's second proof for (1) is the key to the practical
utilization of LR grammars, since it gives a method how to
construct parsers for this class of grammars.
32
2.4 Construction of LR(k) Parsers
----------------------------------
First we need two functions on words:
Definition 2.4-1:
For a grammar G, k >= 0 and a word w ∈ symbs* let
(1) FIRST(w,k) :=
{t ∈ terms**k | ∃ (Ai,wi) ∈ prods ∃ ui ∈ symbs*
∃ vi ∈ terms* for i = 0...n
(w = u0 A0 v0 => u0 w0 v0 = u1 A1 v1 => ...
=> un wn vn = t vn)},
(2) EFF(w,k) :=
{t ∈ terms**k | ... /* same as above */ ...
=> un wn vn = t vn &
∀ i = 0...n (wi ≠ e | ui ≠ e))}.
The e-free FIRST-function EFF thus disallows derivation steps
where a nonterminal at the extreme left is replaced by the empty
word.
For the construction we furthermore need a notion for the possible
situations during the parse:
Definition 2.4-2:
For a grammar G and k >= 0 a configuration c is a 3-tuple
c = (p, j, t),
where p = (A,w) ∈ prods, 1 <= j <= |w|+1, t ∈ terms**k.
The string 't' is called the context (or lookahead) of the
configuration.
A state is a set of configurations.
Configurations are more intuitively written in the form
c = [A = u @v, t],
where u v = w; the marker '@' stands before the 'j'th symbol
in 'w'.
The string before the marker corresponds to the portion of 'p'
which is already recognized; the marker symbolizes the read-head
of the parser, and 't' is a string of terminals which are
expected after the reduction to 'A'.
The terminology is not at all consistent in the literature; what
we here (and DeRemer [1971], McKeeman et al. [1970], Pager
[1977]) call a "configuration" is also called "item" (Aho and
Ullman [1972 etc.]), "state" (Knuth [1965]) or even "partial
state" (Korenjak [1969]).
The "items" in this work (see the next section) are a combination
of the configurations and the GOTO and ACTION tables of Aho and
Ullman [1972,1977]. They do not contain contexts unless the
33
marker '@' is at the extreme right.
Knuth's famous second proof for the decidability of the LR(k)
property of a grammar G describes the construction of the set
of states of an LR(k) parser as if it were always carried out
parallel to the parsing of a string 'input':
LR_K_ALGORITHM: proc(G,k,input);
dcl stack ∈ states*,(state,state2) ∈ states,
symbol ∈ symbs, (i,j,m,act) ∈ NAT,
Z: NAT --> 2**(symbs**k), lng: NAT --> NAT,
left: NAT --> symbs;
step0: /* initialization */
state := {[start = @axiom eof**k,eof**k]};
/* the start-state */
stack := state;
step1: /* compute the closure of 'state' */
state2 := state u {[A2 = u2 @v2, t2] | ∃ A1 ∈ nonts
∃ u1,v1 ∈ symbs* ∃ t1 ∈ terms**k
([A1 = u1 @A2 v1, t1] ∈ state &
t2 ∈ FIRST(v1 t1,k))}; /* 18 */
if state2 ≠ state then do;
state := state2;
goto step 1;
end;
step2: /* determine the next action of the parser */
m := 0; Z(0) := Ø
for ∀ [A = u @v, t] ∈ state do;
if v ≠ e then
Z(0) u:= EFF(v t,k); /* 19 */
else do;
m := m + 1; Z(m) := t; /* 20 */
lng(m) := |u v|; left(m) := A;
end;
end;
act := m + 1;
for i := 0 to m do;
for j := i+1 to m do;
if Z(i) n Z(j) ≠ Ø then
/* G is not LR(k) */ goto done;
end;
if HEAD(input,k) ∈ Z(i) then act := i;
end; /* for i */
if act = 0 then do; /* shift */
symbol := HEAD(input,1);
input := TAIL(input, |input| - 1);
end;
else
if act <= m then do; /* reduce */
symbol := left(m);
stack := HEAD(stack, |stack| - lng(act)); /* 21 */
state := TAIL(stack,1);
end;
else
/* input ¬∈ L(G) : error */ goto done;
step3: /* compute the successor state */
state2 := {[A = u symbol @v, t] |
[A = u @symbol v, t] ∈ state}; /* 23 */
state := state2;
stack := stack || state;
if state ≠ {[start = axiom @eof**k,eof**k]} then
34
goto step1;
else
/* input ∈ L(G) : accept */
done:
end LR_K_ALGORITHM;
The construction terminates: since 'symbs' and 'prods' are finite,
the powerset of the set of all possible configurations is finite,
but this is the set of all possible states.
In line (2) the context is irrelevant because the parser will
already accept the sentence when it has read 'axiom', and then
"sees" eof**k.
The numbering of the steps and equations is from Knuth [1965].
The algorithm above is about as long as Knuth's description,
but it avoids any "...", renaming of subscripts, and verbal
formulations such as "defined recursively as the smallest set
satisfying ..." or "Z,Z0...Zp must all be disjoint".
In contrast to Knuth we did not stack symbols, because this is
only necessary if the semantic information behind the symbols
is needed when reducing a production.
In practice it is neither possible nor necessary to perform
steps 1 and 3 and the check for disjointness of the Z(i) always
anew for the same states.
The algorithm usually is divided in two parts: a constructor
which precomputes the states, their closure and especially
a parsing table, i.e. a mapping from states and lookahead
strings of length k to parser actions.
This table then is used by the parser (or parsing algorithm)
to analyze specific sentences.
For every pair (state,lookahead) the table must provide
information whether
(1) to shift, and what state will be the successor,
(2) to reduce, and to what production (only the left
side and the length of the right side are of interest),
(3) to accept,
(4) to announce error.
In practice it is always a problem to store the parsing table.
Since entries for case (4) occur very often, they are usually
omitted, and an error is announced if the mapping is undefined
for a pair.
The factor 'k' has a very strong influence on the size of the
table, therefore virtually all practical methods use k = 1
(only few real grammars are LR(0)).
Knuth already pointed out that it is possible to identify
states with respect to the parser actions as long as the
35
Z(i) remain disjoint. This merging leads to the LALR(k)
construction methods and parsing tables (note that the parsing
algorithm remains the same).
The only difference between Knuth's LR parsing and LALR parsing
may be that with LALR an error is detected some reductions later
than with an LR(k) table. However, the parser will never shift
an erroneous symbol with either table.
Definition 2.4-3:
The core of a state 'q' is the set of marked productions:
CORE(q) := {(p,j) | ∃ t ∈ terms**k ((p,j,t) ∈ q}.
A configuration group is a 3-tuple
(p,j,{t1...tn}),
where (p,j,t1) ... (p,j,tn) are configurations with a common
core.
The canonical LALR(k) construction method is defined by the
following informal
Algorithm 2.4-4:
(1) Construct a LR(k) set of states and the corresponding
parsing table.
(2) As long as there are two different states with equal
cores, identify the states, i.e. union their sets of
configurations together with the corresponding parser
actions, and replace all occurrences of either state
by the resulting state.
(3) Check whether the parsing table continues to define
a unique mapping - if so (if not), the grammar is
(is not) LALR(k).
This algorithm only serves for definitional purposes - again it
is impracticable because it relies on an LR(k) table which must
first be computed and stored.
The tables obtained with LALR(k) for k >= 1 all have the same
number of states as the LR(0) parsing table of the grammar. This
number is acceptably small for realistic grammars, and therefore
there has been great interest in algorithms for LALR(k) which
avoid the previous construction of an LR(k) table.
36
2.5 Data Structures and Mappings
---------------------------------
The description of the incremental construction is complicated
by the fact that the sets 'nonts', 'terms' and 'prods' which
constitute a grammar may be changed between two reorganization
steps.
Sometimes we will need new symbols, productions, states and so on.
Therefore we will distinguish between the "possible" and the
"current" sets:
Definition 2.5-1:
(1) SYM is an denumerable set of possible symbols,
(2) PRO c SYM x SYM* is an denumerable set of possible
productions,
(3) STA is an denumerable set of possible states,
(4) ACT = {shift,reduce,accept,error,oldsh,newsh} is the
set of parser actions,
(5) NAT = {0,1,2,3,...} is the well-known set of natural
numbers,
(6) ITE c SYM x PRO x NAT x ACT x STA is an denumerable set
of possible items,
(7) nonts c SYM, terms c SYM, states c STA, and items c ITE
are finite sets of current nonterminals, terminals,
states and items, respectively.
Since we are mainly interested in the construction of the parsing
table, we will not mention the changes of 'nonts', 'terms' and
'prods', but we will assume that the current grammar always
fulfills conditions (5) to (9) of definition 2.2-1 during a
reorganization step.
The changes in the current sets of states and items, i.e. the
creation of new and the deletion of superfluous elements will be
explicitly stated, however.
Though these changes may seem trivial, they strongly
influence the implementation. Creation may always lead to
exhaustion of the available storage, and deleted elements
should be reuseable, i.e. they should be collected in free
lists.
As shown in figure 2.5-2, the sets of definition 2.5-1 are related
by several mappings. It is the main task of the construction to
adjust the values of these mappings after a change in the grammar.
37
--------------------------------------------------------------
pronuc
stapre
--- stanum ------------------------->
states --- staite -->> items NAT
↑ | <---itesuc----- | -----itepos---> ↑
↑ | | |
| | | |
| stasym iteact |
| | | |
| | itesym ↓ itepro prolng
symsta | ACT |
| | |
| | |
| | |
| ↓ <-------------- prolef -------------- |
symbs prods
--------------- sympro ------------>>
promem
A --> B : mapping from A to B
A -->> B : mapping from A to 2**B
A,B --> C : mapping from A x B to C
--------------------------------------------------------------
Figure 2.5-2: Mappings between Data Structures
In the following the mappings are collected with respect to their
domains, and several explanations and properties are given.
Some mappings are not really necessary for the construction, but
they are introduced because they shorten the description, and
especially because they enhance the efficiency of the algorithms.
Productions
Definition 2.5-3:
(1) prolef: prods --> symbs
(A,w) |-> A
"the left side",
(2) prolng: prods --> NAT
(A,w) |-> |w|
"the length of the right side",
(3) promem: prods x NAT --> symbs
((A,w),i) |-> if 1 <= i <= |w|
then TAIL(HEAD(w,i),1)
else eop
"the 'i'th member of the right side".
38
A fourth mapping
(4) pronuc: prods x NAT --> 2**states
is defined in section 3.4.1.
The symbol 'eop' ("end of production") is assumed to be different
from 'start' and 'eof', and like these symbols it cannot occur
in any production that is inserted or deleted.
Items
The main components of the parsing table are the items which in
this work are a combination of the configurations and of the
parser actions.
Definition 2.5-4:
An item is a 5-tuple of the form
(0) (symb,prod,pos,act,succ)
∈ symbs x prods x NAT x ACT x states
The individual components are obtained by the mappings
(1) itesym: items --> symbs
"the marked symbol resp. the lookahead",
(2) itepro: items --> prods
"the production which is marked",
(3) itepos: items --> NAT
"the position of the marker",
(4) iteact: items --> ACT
"the action of the parser",
(5) itesuc: items --> states
"the successor state".
During the construction the components 'prod' and 'pos' will
always define a marked production, therefore for any 'item' we
will have
1 <= pos <= prolng(prod) + 1.
(As in definition 2.4-2, the marker '@' is before the 'pos'th
symbol in the right side of 'prod', and again items are sometimes
more intuitively written in the form
[A = u @v].
If pos <= prolng(prod), we have a shift-item of the form
(promem(prod,pos),prod,pos,shift,state)
and the successor 'state' is defined.
39
If pos = prolng(prod) + 1, we have a reduce-item
(la,prod,pos,reduce,*),
where 'la' is the lookahead (-symbol because we only use k = 1).
The successor '*' is formally used as a state, but it is irrelevant.
There is exactly one item
(eof,prod,3,accept,*)
in every parsing table (cf. section 3.1).
The actions 'oldsh' and 'newsh' are only used during the
reorganization to mark some shift-items.
In the examples parsing tables are often given in an abbreviated
and linearized form.
Example 2.5-5:
For the grammar
[axiom = S
.S = S a S b
.S =
]
the following parsing table is obtained with NQLALR(1):
2 axiom -> 3 eof =.
| S -> 4 eof =: axiom,1
| | a -> 5 S -> 6 a -> 5 ...
| | | b -> 7 eof =: S,4
| | | a =: S,4
| | | b =: S,4
| | a =: S,0
| | b =: S,0
| a =: S,0
States are numbered 2,3... (the implemented system always assigns
number 2 to the start state, and all examples were computed by
this system). Every entry has the form
state symbol action
where action is one of the following:
-> q : shift 'symbol' and goto state 'q',
=: A,n : for lookahead 'symbol' reduce; the left side
is 'A', pop 'n' elements off the stack,
=. : accept.
If there is more than one action in a state, following lines are
used. The state number is not repeated in this case, but the
indentation and the vertical bars indicate to what state an action
belongs.
40
If a state is followed by three dots, its actions were already
given in other lines.
Though the marked productions are not given in this representation
of the parsing table, they can be concluded easily if the shift
arrows are traced backwards.
States
Definition 2.5-6:
(1) staite: states --> 2**items
"the set of items"
(2) stapre: states --> 2**states
stateb ∈ stapre(statea) :<==>
]?= (*,*,*,shift,stateb) ∈ staite(statea)
"all predecessor states"
(3) stasym: states --> symbs
stasym(state) = symb :<==>
∃ (*,prod,pos,*,*) ∈ staite(state)
(pos >= 2 & symb = promem(prod,pos-1))
"the symbol that accesses the state"
A fourth mapping
(4) stanum: state --> NAT
is defined in section 3.4.1.
A symbol 'symb' is said to be used in a state 'state' iff
∃ (symb,*,*,*,*) ∈ staite(state).
As the terms "successor" and "predecessor" already indicate,
states may be seen as nodes of a directed graph whose edges
are marked with the shifted symbols:
Definition 2.5-7:
Given an LR(k) parsing table with a set of states 'states',
the corresponding transition graph is a 3-tuple
TG = (states,E,M),
where
(1) E c states x states
(statea,stateb) ∈ E :<==>
∃ (*,*,*,shift,stateb) ∈ staite(statea)
<==> statea ∈ stapre(stateb),
(2) M: E --> symbs
M(statea,stateb) = symb :<==>
∃ (symb,*,*,shift,stateb) ∈ staite(statea)
<==> stasym(stateb) = symb.
41
One edge in the graph may represent several shift-items in
'statea'. There may be cycles in the graph, there is a root
(the start-state), and there may be leaves (states with no
shift-items).
Symbols
Definition 2.5-8:
(1) sympro: symbs --> 2**prods
A |-> {p ∈ prods | A = prolef(p)}
"all productions for the symbol",
(2) symsta: symbs --> 2**states
x |-> {state ∈ states | x = stasym(state)}
"all states accessed by the symbol".
A symbol 'symb' is a nonterminal if sympro(symb) ≠ Ø,
otherwise it is a terminal.
42
3 Algorithms for NQLALR(1)
---------------------------
We are now ready to describe the algorithms for the incremental
construction of an LR(1) parser. For the computation of lookaheads,
we will use a simple method that lies between SLR(1) and LALR(1);
it was termed NQLALR(1) ("Not Quite LALR(1)") by DeRemer and
Pennello [1979].
This method is especially well suited for the incremental
construction; its weakness with respect to the class of grammars
for which it is applicable will be discussed and overcome in
section 4.
Let us look back to the system described in section 1.2. If we
concentrate on the parser constructor, such a system has the
following structure:
EXTRA: proc;
dcl (delete_set,insert_set) c PRO;
INIT;
do while true;
delete_set :=
/* a set of productions to be deleted from the grammar */;
insert_set :=
/* a set of productions to be inserted in the grammar */;
REORGANIZE;
end;
end EXTRA;
INIT creates a very simple grammar and a corresponding parsing
table.
Algorithm REORGANIZE is the parser constructor; it reorganizes
the parsing table such that it corresponds again to the new
grammar. This step may be repeated several times.
The design of algorithm REORGANIZE is the main theme of this
work. The algorithm consists of a series of steps:
REORGANIZE: proc;
dcl changed ∈ BOOL
,(lagar_set,laget_set,laput_set,successors_set
,conflict_set) c STA
,closure_set c SYM;
changed := true;
do while changed;
PROEDIT(delete_set,* ,PRODEL);
PROEDIT(insert_set,changed:=*,PROINS);
LAGAR;
SUCCESSORS;
end;
GARBAGE;
LAGET;
LAPUT;
CONFLICT;
end REORGANIZE;
PRODEL and PROINS are used to change the grammar and to
prepare the computation of the new set of states. As long as
43
a production became accessible from the start-symbol, the
switch 'changed' is true, and there may be other productions
which can be inserted now.
LAGAR deletes lookaheads that may become invalid by the
changes.
The transition graph is changed by SUCCESSORS. Thereby we
may be left with some states which are no longer accessible
from the start-state. Such states are deleted by GARBAGE.
With LAGET and LAPUT the lookahead-sets of reductions are
(re-)computed.
CONFLICT finally looks at conflicts and tries to remove them.
It may detect that states must be splitted for this purpose.
All algorithms in REORGANIZE are closely interrelated; therefore
it is difficult to describe them in a sequential ordering.
We begin with the initialization in section 3.1.
The computation of lookaheads is most severely affected by the
incremental approach. In section 3.2, the NQLALR(1) method for
that computation is presented and justified, and it is shown
how the entire loookahead complex can be separated from the
modifications of the transition graph for states.
Thereafter the modification of the grammar in section 3.3 may
be understood. In section 3.4 the modification of the transition
graph is described, and in section 3.5 the lookaheads are
(re-) computed.
All questions which arise if conflicts occur are put aside until
section 4.
44
3.1 Initialization and the Parsing Algorithm
---------------------------------------------
The incremental construction of a LR(1) parsing table starts
with a very simple grammar that has only one production.
symbs := {start,axiom,eof}
prods := {prod1}
prod1 := (start, eof axiom eof)
We define the corresponding parsing table to consist of two
states with only one item:
2 axiom -> 3 eof =.
The complete initialization is done by
INIT: proc;
/* initialization of the grammar */
sympro(start) := {prod1};
promem(prod1,1) := eof;
promem(prod1,2) := axiom;
promem(prod1,3) := eof;
/* initialization of the parsing table */
ALLOC(eof ,state2:=*);
ALLOC(axiom,state3:=*);
staite(state2) := {(axiom,prod1,2,shift,state3)};
staite(state3) := {(axiom,prod1,3,accept,*)};
stapre(state3) := {state2};
pronuc(prod1,2):= {state2};
pronuc(prod1,3):= {state3};
end INIT;
Algorithm ALLOC is a small algorithm which creates a new
state that is accessed by the first parameter. It is given
in section 3.4.1.
This choice of the initial grammar and parsing table needs some
remarks:
(1) The first member 'eof' of the production has the only
purpose to make the item in 'state2' a nucleus-item
(with a marker-position >= 2).
(2) The symbols 'start' and 'eof' are not accessible to the
user of the constructor, they cannot occur in any
production which is added later, and 'prod1' cannot be
deleted.
(3) The only way to extend the grammar now is to add a
production with 'axiom' on the left side.
(4) 'state2' has no predecessor, therefore no lookaheads
can be received by this state.
45
The parsing algorithm consists of a loop in which one of the
actions 'shift', 'reduce', 'accept' or 'error' is performed;
the loop is left by action 'accept':
PARSE: proc;
dcl top ∈ NAT, state ∈ states, (left,symb) ∈ symbs;
state := state2;
top := 1;
stack(top) := state;
read:
SCAN(symbol:=*); /* get next symbol form the input word */
next_action:
DELTA(state,symbol,item:=*);
if iteact(item) =
shift
then do;
state := itesuc(item);
top +:= 1; stack(top) := state;
goto read;
end; else if iteact(item) =
reduce
then do;
left := prolef(itepro(item));
top -:= prolng(itepro(item));
state := stack(top);
DELTA(state,left,item:=*);
state := itesuc(item);
top +:= 1; stack(top) := state;
goto next_action;
end; else if iteact(item) =
error
then do;
if symb ≠ eof then do;
/* issue an error-message and
try next input-symbol */
goto read;
end;
/* else 'eof' after error or
iteact(item) =
accept: stop parsing */
end PARSE;
The algorithm uses a mapping
stack: NAT --> states
which represents the well-known parse stack of states. We do
not stack symbols because the shifted symbol can always
be determined from a state by the mapping 'stasym'.
46
The parser as well as the constructor need a small algorithm
which computes the next action to be performed by the parser:
DELTA: proc(state,symbol,item:=*);
if ∃ item1 ∈ staite(state) (item1 = (symbol,*,*,*,*))
then do;
if ∃ item2 ∈ staite(state) (item2 ≠ item1 &
item2 = (symbol,*,*,*,*)
then /* conflict for 'symbol' */
DECIDE(state,symbol,item:=*);
else
item := item1;
end;
else
item := (symbol,*,*,error,*);
end DELTA;
DELTA returns an item with action 'error' for a (state,symbol)-
pair that is not stored in the parser-table. DECIDE is called
if there is more than one possible action. The latter algorithm
may resolve the conflict by selecting the shift-item in favour
of reduce-items, or by using semantic information, but it is not
described here.
If we look at the initial grammar and if we assume that the first
'eof' has already been read by some magic, i.e. the remaining
input string is 'axiom eof', then the parser will perform only
two actions:
+----------+-------------------+---------------+------------+
| actual | stack contents | remaining | action |
| state | | input string | |
+----------+-------------------+---------------+------------+
| state2 | | axiom eof | -> state3 |
| state3 | state2 | eof | =. |
+----------+-------------------+---------------+------------+
If the grammar is extended by some productions with 'axiom' on
the left side, we have to add the closure items to 'state2',
and we must compute additional states; in the input 'axiom'
may be replaced by a string derivable from that symbol, and
the parser will perform additional actions. These actions are
to be inserted before the actions in the table above; note
that the table always ends with those two actions.
47
3.2 The NQLALR(1) Method for the Computation of Lookaheads
-----------------------------------------------------------
The NQLALR(1) method is very simple in concept. Let us assume
that we have already constructed an LR(0) parsing table, and
that we want to compute the lookaheads for a reduction to
production 'prod' in state 'statez'.
This approach is much the same as in the SLR(1) method
of DeRemer [1971]. With SLR(1), we would take all terminals
which may follow on the left side of the production in any
sentential form derivable from the start symbol.
Here we imagine the possible situations of the parser. If
prod = (left, symb[1]...symb[n]),
and if the parser performs the reduction for 'prod' in 'statez',
it will pop a state 'statea' from the stack and it will go to
that state afterwards. Here it will always shift 'left' and go
to a state 'stateb':
---------------------------------------------------------------
symb[1] symb[n]
statea --------> ... --------> statez
|
| left
↓
stateb
---------------------------------------------------------------
Figure 3.2-1: A subgraph typical for the NQLALR(1) method
If we assume that the lookaheads in 'stateb' have already been
computed by the usual LALR(1) method, then it is obvious that
the terminals which are expected by the parser in 'stateb' have
something to do with the lookaheads we want to compute:
In the situation of figure 3.2-1, every symbol that is not
used in 'stateb' in a shift action or as lookahead in a
reduction is no valid lookahead for a reduction to 'prod'
in 'statez'.
The NQLALR(1) method puts this statement into positive form:
Every symbol that is used in 'stateb' ... is taken as a
lookahead for 'prod' in 'statez'.
The method starts with the parsing table obtained by the LR(0)
method, and it transmits the used terminals from "emitting"
states ('stateb') to "receiving" states ('statez'). The process
iterates as long as any state obtains an additional terminal
(initially only shifted terminals are transmitted). A precise
definition of this process is the algorithm LAPUT in section 3.5.
48
The "Not Quite" prefix is justified by
Lemma 3.2-1:
There are grammars which are LALR(1), but not NQLALR(1).
Proof: DeRemer and Pennello [1979] give such a grammar.
Another example is the 5th grammar from Mickunas, Lancaster
and Schneider [1976]:
[axiom = S
.S = o A o | o B q | q B o
.A = c | c A
.B = c | c B
]
Here NQLALR(1) would produce the following transition graph:
2 axiom -> 3 eof =.
| S -> 4 eof =: axiom,1
| o -> 5 A -> 7 o -> 12 eof =: S,3
| | B -> 8 q -> 13 eof =: S,3
| | c -> 9 o =: A,1
| | o =: B,1
| | A -> 14 o =: A,2
| | B -> 15 o =: B,2
| | | q =: B,2
| | q =: B,1
| | c -> 9 ...
| q -> 6 B -> 10 o -> 16 eof =: S,3
| c -> 11 o =: B,1
| B -> 15 ...
| c -> 11 ...
The action 'o =: B,1' in state 9 (and hence the conflict)
is not present with LALR(1).
Lemma 3.2-2:
Every NQLALR(1) grammar is an LALR(1) grammar.
Proof: Compared with LALR(1), we obtain at most some
additional lookaheads with NQLALR(1), but not fewer.
We had no conflict with NQLALR(1), therefore we will have
none with LALR(1).
As mentioned in section 2.4, additional lookaheads will only
cause the parser to perform some additional reductions before
the detection of an error, but an erroneous symbol is never
shifted. Therefore LR(1) parsers also operate correctly when
they use conflict-free parsing tables constructed by NQLALR(1).
Figure 3.2-1 may be interpreted in two ways:
(1) a terminal in 'stateb' must be emitted to all reductions
reductions for productions 'prod' in states 'statez', or
(2) a reduction for a production 'prod' receives as its
lookaheads the terminals of all appropriate states
'stateb'.
49
An en-bloc construction method would need (1) only, but for the
interactive construction (2) is also important.
The two interpretations lead to two mappings:
emitters: states --> 2** (states x prods)
receivers: states x prods --> 2** states
With 'emitters', we could take the terminals of a state and "put"
them to a set of reductions in states, while with 'receivers' we
can "get" the terminals of a set of states for a reduction in a
state.
To compute these mappings, we must be able to trace the horizontal
sequence of transitions in figure 3.2-1 in both directions.
From left to right ("forward") this trace is done by:
FORWARD: proc(statea,prod,posa,APPLY);
/* move the marker to the end of the production */
dcl (statei,statej) ∈ states, posi ∈ NAT, symi ∈ symbs;
statei := statea;
posi := posa;
do while posi <= prolng(prod);
symi := promem(prod,posi);
if (symi,prod,posi,shift,statej:=*) ∈ staite(statei)
then do;
statei := statej;
posi +:= 1;
end;
else
goto done;
end;
APPLY(statea,statei,prod);
done:
end FORWARD;
Starting with a production 'prod' marked at position 'posa'
in state 'statea', the algorithm tries to move the marker to
the end of the production. If all transitions for the currently
marked members of the production already exist, FORWARD finally
calls APPLY.
50
The other direction is traced by:
BACKWARD: proc(statez,length,APPLY);
/* move the marker to the beginning of the production */
dcl (seti,setj) c states, (statea,statej) ∈ states;
setj := {statez};
do i := 1 to length;
seti := Ø
for ∀ statej ∈ setj do;
seti u:= stapre(statej);
end;
setj := seti;
end;
for ∀ statea ∈ setj do;
APPLY(statea);
end;
end BACKWARD;
The starting nodes of all paths of length 'length' that end
in state 'statez' are determined, and APPLY is called for
the states 'statea' at these nodes.
A realization of the mappings 'emitters' and 'receivers' by a
static data structure would consume too much storage, and it
would lead to difficult reorganization problems when the
transition graph is changed.
Instead, two algorithms are used which compute an element of
the powersets on the right side, and which call an algorithm
APPLY with every state in that element.
EMITTERS: proc(statez,prod,APPLY);
/* where do lookaheads come from ? */
dcl setb c states;
ECOPY: proc(statea);
DELTA(statea,prolef(prod),item:=*);
setb u := {itesuc(item)};
end ECOPY;
setb := Ø
BACKWARD(statez,prolng(prod),ECOPY);
for ∀ stateb ∈ setb do;
APPLY(stateb,statez,prod);
end;
end EMITTERS;
We move the marker which is at the end of 'prod' back before
the first member by the aid of BACKWARD. In all states
reached by BACKWARD we shift the left side of 'prod'; the
resulting states are the "emitters", and APPLY is called for
them.
51
RECEIVERS: proc(stateb,APPLY);
/* where do lookaheads go to ? */
dcl setz c states;
RCOPY: proc(statez);
setz u:= {statez};
end RCOPY;
for ∀ prod ∈ sympro(stasym(stateb)) do;
setz := Ø
for ∀ statea ∈ stapre(stateb) do;
FORWARD(statea,prod,1,RCOPY);
end;
for ∀ statez ∈ setz do;
APPLY(stateb,statez,prod);
end;
end;
end RECEIVERS;
State 'stateb' is either accessed by a nonterminal or by a
terminal. In the latter case, no lookaheads are transmitted.
For a nonterminal, lookaheads are transmitted along the paths
which are built for the various productions for that
nonterminal, and that start at the predecessors of 'stateb'.
APPLY is called for the corresponding reduce items in the
"receiving" state.
Note that the calls of APPLY in the two algorithms have the same
parameter lists.
During the interactive modification there are four operations
which may affect lookaheads that were previously computed by
the NQLALR(1) method:
(SD) a s?_ymbol is no longer used in a state, it is
d?_eleted,
(SI) a new s?_ymbol is used in a state, it is i?_nserted,
(ED) an e?_dge of the transition graph is d?_eleted,
(EI) a new e?_dge is i?_nserted.
For symbols the relevant state is always the emitting state
'stateb' of figure 3.2-1.
For edges there are four subcases depending on the interpretation
of the edge as one of the h?_orizontal edges in figure 3.2-1,
or as the v?_ertical one.
In the following the appropriate actions for cases (SD), (SI),
(EDH), (EDV) are described. Cases (EIH) and (EIV) are treated
by ITEMINS in section 3.3.2 and by LAGET and LAPUT respectively
(section 3.5).
52
(SD) A Symbol is Deleted from a State
Algorithm LAGAR processes an entire set of states that have
lost some symbols:
LAGAR: proc;
/* garbage collection for lookaheads */
dcl set c states;
do while lagar_set ≠ Ø
set := lagar_set;
lagar_set := Ø
for ∀ stateb ∈ set do;
RECEIVERS(stateb,LADEL);
end;
end;
end LAGAR;
LAGAR propagates the deletion of lookaheads in a state
'stateb' to all states 'statez' that received lookaheads
from 'stateb'.
Since there is no easy way to know whether a lookahead 'symb'
in 'statez' steems only from 'stateb', or whether it is also
received from another state, LADEL is used to delete all
reduce-items for the reductions which receive lookaheads from
'stateb':
LADEL: proc(*,statez,prod);
dcl lost ∈ BOOL;
lost := false;
for ∀ itemz ∈ staite(statez)
(itemz = (symbz:=*,prod,prolng(prod)+1,reduce,*))
do;
if symbz ≠ eop then
staite(statez) \:= {itemz};
if ¬ ∃ (symz,*,*,*,*) ∈ staite(statez) then
lost := true;
end;
staite(statez) u:=
{(eop,prod,prolng(prod)+1,reduce,*)};
laget_set u:= {statez};
if lost then
lagar_set u:= {statez};
end LADEL;
LADEL replaces all reduce-items for a production 'prod' in a
state 'statez' by one reduce-item with the special lookahead
'eop'. The latter is reexpanded by LAGET (cf. section 3.5).
The first parameter '*' is necessary because LADEL is used as
parameter APPLY to RECEIVERS.
If state 'statez' at least once loses the last item with
a certain symbol, LAGAR must be activated for 'statez', too.
LADEL only deletes lookaheads which are different from 'eop'.
A state which has only reductions under 'eop' will not again
be stored in 'lagar_set', and the loop in LAGAR will thus
terminate.
53
(SI) A Symbol is Inserted in a State
We cannot transmit the symbol to the receiving states until the
transition graph has completely been reorganized. Therefore
we simply remember that 'stateb' must emit all its symbols:
laput_set u:= {stateb};
(EDH) A Horizontal Edge is Deleted
A closer look to an edge shows the following situation:
statei = statej =
{... {...
[l1, x1 @symbi y1] symbi [l1, x1 symbi @y1],
... --------> ...
[ln, xn @symbi yn], [ln, xn symbi @yn],
...} ...}
i.e. the edge may represent several shift items. We have to move
the marker forward to the end of all productions which have
symbol 'symbi' marked in 'statei'. The reductions which are reached
then can no longer be assumed to receive their lookaheads
through the edge, and since we do not know whether they receive
them from elsewhere, we delete all those lookaheads:
EDGEDEL: proc(statei,statej);
/* lookaheads are no longer transmitted through an edge */
for ∀ item ∈ statei
(item = (symbi:=*,prod:=*,pos:=*,shift | oldsh,statej))
do;
FORWARD(statej,prod,pos+1,LADEL);
end;
for ∀ prod ∈ sympro(stasym(statej)) do;
FORWARD(statei,prod,1,LADEL);
end;
end EDGEDEL;
The second for-loop of this algorithm already solves the next
task:
(EDV) The Vertical Edge is Deleted
We start with the marker before the first member of all
productions of the symbol that accesses 'statej' (= 'stateb')
and we move the marker forward to the end of the productions.
The lookaheads of the reductions reached then are all deleted
by LADEL.
The algorithms and operations above are used at various places
in the following sections. They isolate the difficulties of
lookaheads in connection with thhe incremental construction, so
that we may now concentrate on the modification of the grammar
and of the transition graph.
54
3.3 Modification of the Grammar
--------------------------------
Both tasks, the deletion and the insertion of a set of
productions, have several similarities. In this section
the algorithms are presented which apply to both tasks,
while the differences are treated in two subsections.
We will use the term "edit" for one of the operations "delete"
or "insert". Editing is done with the aid of
PROEDIT: proc(prod_set,changed:=*,APPLY);
dcl left ∈ SYM;
changed := false;
for ∀ prod ∈ prod_set do;
left := prolef(prod);
for ∀ stateb ∈ symsta(left) do;
changed := true;
prod_set \:= {prod};
for ∀ statea ∈ stapre(stateb) do;
APPLY(statea,prod);
end;
end;
end;
end PROEDIT;
For every production in 'prod_set' all states are determined
which shift the left side of that production. At least these
states are affected by the change of the grammar, and
another algorithm APPLY is called for each of them. In
'changed' it is noted whether there was any such state,
i.e. whether 'left' is already reachable from the start
symbol (cf. section 3.3.3). APPLY is either PRODEL or PROINS.
If a state shifts the left side of a production, then it must
contain an item with the marker before the first member of the
right side of that production. Generally it contains the
closure, i.e. for every marked nonterminal the state contains
all productions of that nonterminal, with the marker before
the first member. The closure is computed with the aid of
CLOSURE: proc(state,APPLY);
/* compute the closure items of a set of symbols */
dcl set c SYM;
do while closure_set ≠ Ø
set := closure_set;
closure_set := Ø
for ∀ symb ∈ set do;
for ∀ prod ∈ sympro(symb) do;
APPLY(state,prod,1);
end;
end;
end;
end CLOSURE;
The algorithm calls another algorithm APPLY for every item
in the closure; it is assumed that APPLY inserts the marked
symbol in 'closure_set'. APPLY is either ITEMMARK (for
deletion) or ITEMINS (for insertion of a production).
55
3.3.1 Deletion of a Set of Productions
---------------------------------------
At least three types of data in the parsing table are computed
by a closure operation: lookaheads, items and states.
Generally the algorithms for deletion in such closed data
structures appear to be more complex than their counterparts
for insertion. If an element is inserted, we only have to insert
the closure of that element, but if an element is deleted, we
cannot delete all its closure elements, since some of them may
yet be in the closure of other remaining elements.
In the case of lookaheads (algorithms LAGAR, LADEL in section 3.2)
we deleted the closure and we noted that it should later be
recomputed by LAGET. It is not necessary to be more careful,
since lookaheads do not (yet, see section 4) affect closure-items
and states.
Unnecessaray deletion of closure-items, however, would cause
unnecessary deletion of lookaheads, so it is advisable to be more
cautious here.
Algorithm PRODEL is given a state and a production with the marker
before the first member.
PRODEL: proc(state,prod);
dcl lost ∈ BOOL;
closure_set := Ø
for ∀ item ∈ staite(state) do;
if itepos(item) >= 2 then do; /* nucleus-item */
if iteact(item) = shift then
closure_set u:= {itesym(item)};
end;
else
if iteact(item) = shift then iteact(item) = oldsh; else
if iteact(item) = reduce then iteact(item) = oldre;
end;
CLOSURE(state,ITEMMARK);
lost := false;
for ∀ item ∈ staite(state) do;
if iteact(item) = oldre then do; /* empty production */
staite(state) \:= {item};
if ¬ ∃ (itesym(item),*,*,shift | reduce,*)
∈ staite(state)
then lost := true;
end;
else
if iteact(item) = oldsh then do;
EDGEDEL(state,itesuc(item));
if ¬ ∃ (itesym(item),*,*,shift | reduce,*)
∈ staite(state)
then lost := true;
end;
end;
if lost then
lagar_set u:= {state};
if prolng(prod) > 0 then /* no empty production */
successors_set u:= {state};
end PRODEL;
56
The closure is updated in three steps:
(1) all closure-items are marked to be "old",
(2) the closure of the nucleus-items is recomputed,
and the resulting items are marked with their
normal actions,
(3) by SUCCESSORS in section 3.4 the remaining "old"
items are deleted.
For all "old" shift-items the corresponding edge will be
deleted from the transition graph; therefore by EDGEDEL
we delete all lookaheads which are transmitted through this
edge.
If there are symbols which are no longer used in the state
('lost' = true), the state must be processed by LAGAR.
PRODEL only deletes the item with the marker before 'prod';
in GARBAGE we will delete all states which had a nucleus-item
for 'prod'.
Algorithm ITEMMARK marks all items in the new closure with their
usual actions ('shift' resp. 'reduce').
ITEMMARK: proc(state,prod,*);
for ∀ item ∈ staite(state)
(item = (symb:=*,prod,1,act:=*,*))
do;
if act = oldre then
iteact(item) := reduce;
else
if act = oldsh then do;
iteact(item) := shift;
closure_set u:= {symb};
end;
end;
end ITEMMARK;
57
3.3.2 Insertion of a Production
--------------------------------
The analogon to PRODEL for insertion is
PROINS: proc(state,prod);
closure_set := Ø
ITEMINS(state,prod,1);
CLOSURE(state,ITEMINS);
for ∀ symb (∃ (symb,*,*,newsh,*) ∈ staite(state))
do;
if ∃ (symb,*,*,shift,stateb:=*) ∈ staite(state) then
EDGEDEL(state,stateb);
end;
if prolng(prod) > 0 then
successors_set u:= {state};
end PROINS;
After the insertion of the item which has the marker before
the first member of 'prod' the closure of the marked symbol
is also inserted. If there are old items which already shift
a symbol 'symb', the state will need a different 'symb'-
successor, therefore we delete all lookaheads transmitted
through the existing edge.
PROINS is always called after PRODEL, so we need not care for
'oldsh' items.
The algorithm for the insertion of a single item is more
complicated because it is also used later in SUCCESSORS to
insert items which have the marker not before the first member.
ITEMINS: proc(state,prod,pos);
/* insert an item in a state */
symb := promem(prod,pos);
if ¬ ∃ (symb,prod,pos,*,*) ∈ staite(state) then do;
item := (symb,prod,pos,newsh,*);
if symb ≠ eop then do; /* shift item */
if ¬ ∃ (symb,*,*,shift | newsh,*) ∈ staite(state)
then
closure_set u:= {symb};
if ¬ ∃ (symb,*,*,shift | newsh | reduce,*)
∈ staite(state)
then
laput_set u:= {state};
successors_set u:= {state};
end; /* shift item */
else do; /* reduce item */
iteact(item) := reduce;
laget_set u:= {state};
end; /* reduce item */
if ∃ (symb,*,*,reduce,*) ∈ staite(state) then
conflict_set u:= {state};
staite(state) u:= {item};
end;
end ITEMINS;
Symbol 'symb' is the marked member. If the item is not yet
in the state, it is inserted, and if 'symb' was not used
previously in the state, it must be transmitted as lookahead
58
by LAPUT. If a conflict is introduced by the new item, the
state must later be processed by CONFLICT (see section 4).
New reduce-items must get their lookaheads by the aid of
LAGET.
3.3.3 Reduced Grammars
-----------------------
Context-free grammars are usually required to be reduced:
Definition 3.3.3-1:
A grammar
G = (nonts,terms,prods,start)
is called reduced iff the following two conditions are
fulfilled:
(R1) ∀ A ∈ nonts ∃ form1,form2 ∈ symbs*
(start =>* form1 A form2)
('A' is reachable),
(R2) ∀ A ∈ nonts ∃ word ∈ terms*
(A =>* word)
('A' is based).
We will now discuss the two conditions separately.
Reachability
(R1) is in a certain sense automatically ensured by any LR
construction method:
Lemma 3.3.3-2:
∀ A ∈ nonts
(∃ stateb ∈ states (stasym(stateb) = A) ==> (R1))
Proof: For every predecessor 'statea' of 'stateb' there
is a path of shift-transitions from the start-state to
'statea'; 'form1' is the concatenation of all symbols
which are shifted along that path. Assume that the parser
has read a string such that it is in 'stateb' now: its
last action was a shift of 'A' from 'statea' to 'stateb'.
For every state reached by the parser there is a string
form2 ∈ terms* which continues the word consumed so far
to a sentence (the "correct prefix" property).
59
Basement
(R2) is not ensured in our construction. There are three reasons
to allow unbased nonterminals:
(1) We do not want to require the user to insert the
productions in a special ordering; he may later insert
a production which bases the nonterminal.
(2) In the EXTRA system presented in section 1.2 unbased
nonterminals are useful in connection with the backspacing
of the parser. They are used to introduce parts of the
grammar which can only be generated, but not be input.
(3) It seems that there is no easy way to check (R2) by
looking at the parsing table.
Unbased nonterminals may lead to states which do not use any
terminal, but not necessarily, as can be seen from
Example 3.3.3-3:
[axiom = S
.S = A
.A = b C | d
.C = d C
]
2 axiom -> 3 eof =.
| S -> 4 eof =: axiom,1
| A -> 5 eof =: S,1
| b -> 6 C -> 8 eof =: A,2
| d -> 7 eof =: A,1
60
3.4 Modification of the Set of States
--------------------------------------
3.4.1 Compatibility of States
------------------------------
The construction process for LR parsing tables terminates
because a new successor state will be created only if the table
does not yet contain an appropriate state. In the latter
case, the state already in the table is made the successor.
In practice a new state is always created, then it is
decided whether any existing state is compatible with the
new one, and if such a state exists, the new one will be deleted
and the existing state will be taken as successor.
The various construction methods differ in the compatibility
criteria (cf. Pager [1977 PGM]) they use, and especially in the
field between LALR and LR these criteria are quite important.
The criteria are rated from weak to strong: the strongest
criterion which is used by SLR(k), NQLALR(k) and LALR(k) is the
LR(0) criterion: Two states are compatible iff their cores
(sets of marked productions without contexts) are equal.
The weakest criterion is
Knuth's criterion: Two states are compatible iff they are
equal, i.e. iff their sets of configurations (with contexts)
are equal.
These two criteria have the outstanding advantage that they are
applied very easily. Unfortunately Knuth's criterion produces
impractical large sets of states for grammars of programming
languages, thus the LR(0) criterion is the one which is used in
almost all implementations.
Pager [1977 PGM], however, presents two other criteria that lie
between the extremes: parsers can be generated for all LR(k)
grammars, and the resulting sets of states are acceptably small.
From his weak and strong compatibility criteria the latter is
especially interesting because it ensures a minimal number of
states. We will come back to this issue in section 4.
Let us now see how we can decide whether two states are
compatible. A first aid is
Lemma 3.4.1-1 (Pager [1977 PGM]):
Two states are compatible iff their sets of nucleus
configurations are equal.
Proof: The closure configurations of a state depend entirely
on the nucleus configurations of that state.
This result applies also to the LR(0) criterion, where contexts
are not regarded.
Nevertheless, the search for a compatible state remains very
61
time consuming if it is applied to every state that has been
previously generated. One way to reduce the number of candidates
is well known (cf. Aho and Ullman [1977]):
If the new state 'statex' is accessed by 'symb', only the
states in 'symsta(symb)' need to be tested for compatibility.
(Poof: A state which is accessed by a symbol different from
'symb' cannot have the same nucleus as 'statex', because the
accessing symbol is the one before the marker in a nucleus
item.)
For symbols such as 'identifier' or ',' the sets 'symsta' could be
rather lengthy.
Lemma 3.4.1-1 normally is used only to decrease the effort for the
comparison of two states. We will use it here, however, to build a
data structure which stores the candidate states:
pronuc: prods x NAT --> 2**states
(prod,pos) |-> {state | ∃ (*,prod,pos,*,*)
∈ staite(state) & pos >= 2 }
"the states which have the marked production in
their nucleus"
stanum: states --> NAT
state |-> card({(prod,pos) | ∃ (*,prod,pos,*,*)
∈ staite(state) & pos >= 2})
"the number of nucleus items in a state"
The following algorithm determines for a state 'statex' a state
'statey' which is compatible with 'statex'; at the extreme 'statey'
is the same as 'statex'.
COMPATIBLE: proc(statea,statex,statey:=*);
/* LR(0) compatibility criterion */
dcl set c STA;
for ? (*,prod1:=*,pos1:=*,*,*) ∈ staite(statex)
(pos1 >= 2);
do;
set := pronuc(prod1,pos1);
for ∀ item ∈ staite(statex)
(item = (*,prod:=*,pos:=*,*,*) & pos >= 2
& prod ≠ prod1 & pos ≠ pos1)
do;
set n:= pronuc(prod,pos);
end;
end;
set \:= {state | state ∈ set &
stanum(state) ≠ stanum(statex)};
if set = Ø then do;
for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(statex) do;
pronuc(prod,pos) u:= {statex};
end;
statey := statex;
end;
else
for ? state ∈ set do;
statey := state;
end;
end COMPATIBLE;
62
In the case that the nucleus has only one element, the
statement in the first 'for'-loop is not executed at all.
Otherwise the dissection ensures that only states are
retained whose nuclei contain the nucleus of 'statex',
and after the for-loop equality of the nuclei is required.
The mapping 'pronuc' is adjusted in the case where 'statey'
is equal to 'statex'.
Pager [1977 PGM] writes that the average number of nucleus-items
in the parsers he constructed was 1.7, therefore the sets 'pronuc'
should be very small, and COMPATIBLE should be rather efficiently
find the right state.
A modified version of algorithm COMPATIBLE is described in
section 4.3.3. Only that version makes use of the parameter
'statea'.
During the change of the transition graph new states must be
created, and because of the incremental approach we also have
to delete superfluous states. We use small separate algorithms
for these operations.
ALLOC: proc(symb,state:=*);
for ? state ∈ STA \ states do;
states u:= {state};
stanum(state) := 0;
staite(state) := Ø
stapre(state) := Ø
stasym(state) := symb;
symsta(symb) u:= {state};
end;
end ALLOC;
ALLOC creates a new state which is properly connected to
the other data structures, but which has an empty item set.
The state is accessed by 'symb'.
FREE: proc(state);
symsta(stasym(state)) \:= {state};
states \:= {state};
successors_set \:= {state};
laget_set \:= {state};
laput_set \:= {state};
lagar_set \:= {state};
conflict_set \:= {state};
hidden_set \:= {state};
for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(state)
(pos >= 2) /* nucleus-item */
do;
pronuc(prod,pos) \:= {state};
end;
end FREE;
FREE deletes a state and all its traces in the other data
structures. It also updates the requests for a processing
by other algorithms.
63
3.4.2 Computation of New Successors
------------------------------------
Algorithm SUCCESSORS is the heart of the incremental construction
method:
SUCCESSORS: proc;
/* compute new successors */
dcl set c STA;
REDINS: prod(*,statez,prod); /* insert a reduction */
ITEMINS(statez,prod,prolng(prod)+1);
end REDINS;
ill_set := Ø
do while successors_set ≠ Ø
set := successors_set;
successors_set := Ø
for ∀ statea ∈ set do;
for ∀ symb
(∃ (symb,*,*,oldsh | newsh,*) ∈ staite(statea))
do;
if ∃ (symb,*,*,shift | oldsh,stateb:=*)
∈ staite(statea)
then do;
stapre(stateb) \:= {statea};
ill_set u:= {stateb};
staite(statea) \:= {item | item = (symb,*,*,oldsh,*)
& item ∈ staite(statea)};
end;
ALLOC(symb,statex);
closure_set := Ø
for ∀ (symb,prod:=*,pos:=*,shift | newsh,*)
∈ staite(statea)
do;
ITEMINS(statex,prod,pos+1);
/* changes 'successors_set' */
stanum(statex) +:=1;
end;
COMPATIBLE(statea,statex,stateb:=*);
stapre(stateb) u:= {statea};
if statex = stateb then do;
CLOSURE(stateb,ITEMINS);
end;
else do;
FREE(statex);
laput_set u:= {stateb};
for ∀ item ∈ staite(statea)
(item = (symb,prod:=*,pos:=*,shift | newsh,*))
do;
FORWARD(stateb,prod,pos+1,REDINS);
iteact(item) := shift;
itesuc(item) := stateb;
end;
end;
end; /* ∀ symb */
end;
end;
end SUCCESSORS;
64
Given a production 'prod' to be deleted or inserted,
algorithm PROEDIT determined all states 'statea' which
contain an item with the marker before the first member
of 'prod', and it corrected the closure of these states
and collected them in 'successors_set'.
Furthermore, all lookaheads that might have become invalid by
the change were already deleted by LAGAR.
We now have to process the items with actions 'oldsh' resp.
'newsh', were the former must be deleted, and the latter
must get an appropriate successor state.
Deletion could not take place until now, because other-
wise paths in the transition graph would have been
destroyed while they are yet needed by LAGAR for the
deletion of possibly invalid lookaheads.
All items which shift the same symbol 'symb' must be regarded
together, since they must get the same successor state 'stateb'.
Only in the case of a single item for 'symb' with action
'newsh' such a successor does not yet exist - in the other
cases the edge from 'statea' to 'stateb' is deleted. Since
'stateb' thereby may become inaccessible at all, it is
noted in 'ill_set' for a processing by GARBAGE.
We could have tried to modify the nucleus of 'stateb'
in order to reflect the change of the set of items in
'statea' which shift 'symb'. We cannot exclude, however,
that there are other predecessors of 'stateb' which did
not change.
65
3.4.3 Garbage Collection for States
------------------------------------
Algorithm SUCCESSORS noted the states which possibly got
inaccessible from the start-state in 'ill_set'. We now determine
in four steps whether they are really superfluous.
GARBAGE: proc;
/* delete superfluous old states */
dcl (target,sound_set) c STA;
EXHAUST: proc(set:=*,target:=*,APPLY);
/* propagate something to all successors */
dcl source c STA;
source := set;
do while source ≠ Ø
target := Ø
for ∀ state ∈ source do;
SUCC(state,APPLY);
end;
source := target \ set;
set u:= target;
end;
end EXHAUST;
SUCC: proc(statea,APPLY);
/* do something with all direct successors */
for ∀ (*,*,*,shift,stateb:=*) ∈ staite(statea) do;
if stateb ¬∈ ill_set then
APPLY(statea,stateb);
end SUCC;
ADD: proc(*,stateb);
target u:= {stateb};
end ADD;
PREDEL: proc(statea,stateb);
stapre(stateb) \:= {statea};
end PREDEL;
/* 1. all successors of "ill" states get "ill" */
EXHAUST(ill_set:=*,target:=*,ADD);
/* 2. all states with a "sound" predecessor become "sound" */
sound_set := Ø
for ∀ stateb ∈ ill_set do;
if ∃ statea ∈ stapre(stateb) (statea ¬∈ ill_set)
then
sound_set u:= {stateb};
end;
/* 3. all successors of "sound" states become "sound" */
EXHAUST(sound_set:=*,target:=*,ADD);
/* 4. ill states which did not become "sound" are garbage */
lagar_set := Ø
ill_set \:= sound_set;
for ∀ statea ∈ ill_set do;
SUCC(statea,EDGEDEL);
end;
LAGAR;
for ∀ stateb ∈ ill_set do;
SUCC(statea,PREDEL);
66
FREE(statea);
end;
end GARBAGE;
Again we must be careful not to delete any edge before not all
lookaheads which were transmitted through the edges are deleted.
Note that the transition graph may contain cycles; the problem
is the decision whether a deleted edge was the last one leading
to a cycle.
Of course this is the classical garbage collection problem which
could have been solved by the usual marking algorithm (mark all
successors of the start-state and then delete the unmarked
states). That approach would, however, conflict with the
steadiness condition for the incremental construction. The
algorithm above limits the amount of work for small deletions
in the transition graph to a reasonable size.
In an en-bloc construction (without deletion of productions),
which is also possible with the algorithms of this work,
GARBAGE and LAGAR are not activated at all.
67
3.5 Transmission of Lookaheads
-------------------------------
When the new transition graph is built, we may (re-) compute the
lookaheads. The basic operation of the NQLALR(1) method for
that computation is the following operation:
LAINS: proc(stateb,statez,prod);
/* insert lookaheads for reductions */
for ∀ symb ≠ eop ((symb,*,*,*,*) ∈ staite(stateb)
/* " & sympro(symb) = Ø " for conventional LR(1) */ )
do;
item := (symb,prod,prolng(prod)+1,reduce,*);
if item ¬∈ staite(statez) then do;
if ∃ (symb,*,*,*,*) ∈ staite(statez) then
conflict_set u:= {statez};
else
laput_set u:= {statez};
end;
staite(statez) u:= {item};
end;
end LAINS;
LAINS takes all symbols 'symb' used in a state 'stateb'
and generates reduce-items of the form
(symb,prod,prolng(prod)+1,reduce,*)
in state 'statez'.
For conventional LR(1), only terminals 'symb' must be taken;
for LR(1)-like parsing of sentential forms, the nonterminals
of 'stateb' are copied, too .
If 'symb' was not yet present in 'statez', it might be
necessary to transmit it to other states. State 'statez'
must therefore later be processed by LAPUT.
A processing by CONFLICT is necessary if a conflict is introduced
by the new item.
Algorithm LAGET computes a first approximation to the lookahead
sets of new reductions, or of reductions for which the lookaheads
were deleted by LADEL (see section 3.2). In both cases such
reductions are recognized by the special lookahead 'eop':
LAGET: proc;
/* get lookaheads */
for ∀ statez ∈ laget_set do;
for ∀ item ∈ staite(statez)
(item = (eop,prod:=*,*,reduce,*))
do;
EMITTERS(statez,prod,LAINS);
staite(statez) \:= {item};
end;
end;
end LAGET;
The states which contain new reductions were stored in
'laget_set' by ITEMINS; the fact that the reduce-item needs
68
lookaheads is known from the special lookahead 'eop'.
The states that "emit" the necessary lookaheads are found
by EMITTERS, and LAINS is used to generate the appropriate
reduce-items in 'statez'.
Every time when the set of symbols used in a state was incremented,
that state was put in 'laput_set'. Now we are ready to transmit
the symbols used in those states to the receiving reductions:
LAPUT: proc;
do while laput_set ≠ Ø
set := laput_set;
laput_set := Ø
for ∀ stateb ∈ set do;
RECEIVERS(stateb,LAINS);
end;
end;
end LAPUT;
LAPUT may put the receiving state 'statez' again in
'laput_set', but only if 'statez' did not yet contain the
reduce-item. The recursive process will thus terminate
at the latest if all reductions in all states have all
symbols of 'symbs' as lookaheads.
69
4 Non-NQLALR(1) Grammars
-------------------------
For a deterministic parsing procedure there must be a mapping
from pairs of states and symbols to parser actions. The
construction method xLR(k) (x = "", "S", "NQLA", "LA") may,
however, lead to states with two or more different actions for
one symbol. Such states are called inadequate, or they are said
to have a conflict under that symbol. In this case the grammar
is not xLR(k).
The actions involved in a conflict may only be 'shift' and 'reduce'
or 'reduce' to at least two different productions.
As was mentioned in section 3.1, there is only one state
with action 'accept' in every parsing table constructed
by the algorithms in this work, and this state is never
changed during the incremental construction. Two 'shift'
actions for the same symbol lead to the same state by
definition.
Definition 4-1:
A conflict in a parsing table T constructed by a xLR(1)
method is a 4-tuple
(state,symb,prod1,prod2) ∈ states x symbs x prods x prods
such that for act ∈ {shift,reduce} there are two items
(symb,prod1,*,act ,*) ∈ state and
(symb,prod2,*,reduce,*) ∈ state.
Symbol 'symb' is called the conflicting lookahead, 'prod1'
and 'prod2' are the conflicting reductions.
From the possible defects a grammar might have when it is input
in a parser constructor for the first time, clerical errors and
unreachable or unbased nonterminals are rather easily detected
and removed, even by the unexperienced grammar designer.
Conflicts are also easily diagnosed in the parsing table, but
usually it is difficult to understand them. Therefore the
constructor should not produce conflicts wherever possible,
and it should give precise diagnostics for the remaining
conflicts.
In section 4.1, we first will discuss possible causes for
conflicts.
In section 4.2, the basic definitions and ideas for the removal
of NQLALR(1) conflicts are given, succeeded in section 4.3 by
the algorithms that try to remove the conflicts by splitting of
states.
70
4.1 Causes for Conflicts
-------------------------
There may be the following causes for a conflict:
(C1) the grammar is ambiguous,
(C2) the grammar is unambiguous, but not LR(k) for k >= 0,
(C3) the grammar is LR(k) with k >= 2,
(C4) the grammar is LR(1), but not LALR(1),
(C5) the conflict results from the NQLALR(1) method; it
would not occur with LALR(1).
For several reasons to be explained in the following we require
the grammars to be LR(1), and we therefore may combine causes
(C1), (C2) and (C3) to
(C123) the grammar is not LR(1).
The five causes need some remarks to motivate our approach.
(C1) Ambiguity
This cause occurs rather often. The ambiguity usually is not
intended by the user, therefore he should receive a precise
description of the possible different derivations.
The original grammars of some important programming languages
(ALGOL60, PASCAL) produce avalanches of (intended) ambiguities
that are to be resolved by semantic information, mostly about
identifiers. Knuth [1965] noted that in ALGOL60 for example
there are 8 derivations
=>+ .
The "dangling else" construction is a representative of another
class of intended ambiguities. The additional verbal rule ("an
ELSE matches the last preceeding THEN") might have its counter-
part in a conflict resolution strategem in the parsing algorithm
("shift in favour of a reduction"). Such approaches are suggested
by Aho, Johnson and Ullman [1975], and they are used in their
YACC parser constructor.
(C2) unambiguous and not LR
Knuth [1965] points out that nearly every unambiguous grammar
falls in one of the classes LR, RL or "parsable from both ends".
Cause (C2) is unlikely to occur in grammars of todays programming
languages.
71
(C3) k > 1
This cause is rare, but it occurs for example in a rule quoted
from the COBOL standard [ANSI X3.23-1974]:
communication_description_entry =
CD cd-name ';' FOR INITIAL? INPUT (
((';' SYMBOLIC QUEUE IS data-name)?
(';' SYMBOLIC SUB-QUEUE-1 IS data-name)?
(';' SYMBOLIC SUB-QUEUE-2 IS data-name)?
(';' SYMBOLIC SUB-QUEUE-3 IS data-name)?
(';' MESSAGE DATE IS data-name)?
(';' MESSAGE TIME IS data-name)?
(';' SYMBOLIC SOURCE IS data-name)?
(';' TEXT LENGTH IS data-name)?
(';' END KEY IS data-name)?
(';' STATUS KEY IS data-name)?
(';' MESSAGE COUNT IS data-name)? )?
(data-name/',')? )? '.'
Again the notational conventions of section 1.2.3 were used.
A lookahead ';' does not tell how much options were omitted,
but lookaheads of length 3 would suffice. It is rather tedious
to rewrite this rule in order to make it LR(1) without losing
the ordering of the options prescribed by the rule above.
On the other side, if we accept another structure of the parse
tree, such a rewriting is always possible by the transformations
of Mickunas, Lancaster and Schneider [1976], so that it seems
reasonable to require k <= 1.
There is another difficulty which speaks against k >= 2 : the
increased complexity of the parsing table. If generally lookahead
strings of length k are stored, much space will be wasted; if
longer lookaheads are only used in states where they are
neccessary, the parsing algorithm will be complicated.
(C4) LR(1) but not LALR(1)
One motivation for a general LR(1) method is rather an aestethical
one - the LR(1) property of a grammar seems more "natural", while the
conditions for LALR, SLR and NQLALR seem to be deduced from the
particular construction algorithms.
In addition, Pager [1977 PGM] argues that there are application
areas, for example natural language processing, where most of the
grammars produce conflicts of the classes (C1), (C2) and (C3).
It is very disadvantageous then to have a lot of additional
conflicts caused by a weak construction method.
In the following it is shown how to remove conflicts of the (C4)
category.
(C5) Conflict caused by NQLALR(1)
Conflicts of this type show that the NQLALR(1) method is inferior
72
to the LALR(1) method with respect to generality. Therefore the
algorithms in section 4.3 look whether such a conflict would not
have been present if LALR(1) had been employed, and in this case
the superfluous lookahead is simply deleted.
4.2 Foundations of the LR(1) Construction
------------------------------------------
From the discussion in the previous section it seems reasonable
to construct parsing tables for all LR(1) grammars, not more
and not less.
There are at least two ways that lead to this summit:
LR(1)
/ \
GF / \ Pager
1980 / \ [1977 PGM]
/ \
LALR(1)/ \
NQLALR(1)/ \
/ \
LR(0)/ \
/ \
/ \
/ \
While Pager [1977 PGM] described a direct attack by a genuine
LR(1) method which does not rely on a LR(0) construction, in this
work we have the additional burden of the incremental construction,
and therefore three intermediate camps are used.
Section 3 led us to the second of these. We will now first
look for a way from LALR(1) to LR(1); the step from NQLALR(1)
to LALR(1) then will be found at once.
73
4.2.1 A Non-LALR(1) Grammar
----------------------------
Aho and Ullman [1972] give an example (7.27) for a grammar which
is LR(1) but not LALR(1):
Example 4.2.1-1:
[axiom = S
.S = A a | B b | d A b | d B a
.A = c
.B = c
]
LALR(1) produces the following transition graph for this grammar:
2 axiom -> 3 eof =.
| S -> 4 eof =: axiom,1
| A -> 5 a -> 9 eof =: S,2
| d -> 6 A -> 10 b -> 13 eof =: S,3
| | B -> 11 a -> 14 eof =: S,3
| | c -> 8 a =: A,1
| | a =: B,1
| | b =: A,1
| | b =: B,1
| B -> 7 b -> 12 eof =: S,2
| c -> 8 ...
State 8 has conflicts for 'a' and 'b', but we could replace
it by two conflict-free states 8' and 8":
2 axiom -> 3 eof =.
| S -> 4 eof =: axiom,1
| A -> 5 a -> 9 eof =: S,2
| d -> 6 A -> 10 b -> 13 eof =: S,3
| | B -> 11 a -> 14 eof =: S,3
| | c -> 8' b =: A,1
| | a =: B,1
| B -> 7 b -> 12 eof =: S,2
| c -> 8" a =: A,1
| b =: B,1
The LR(0) compatibility criterion used by LALR(1) sometimes is
too strong, because the identification of some states
constructed by Knuth's LR(1) method may lead to conflicts
that were not present before.
In the example, it was possible to cancel the effect of the
LALR(1) merging of states. In some cases it can be seen from
the outset that such a splitting of states is not possible.
74
4.2.2 Overloaded Conflicts
---------------------------
Aho and Ullman [1977] proved
Lemma 4.2.2-1:
A grammar for which LALR(1) produces a shift-reduce
conflict cannot be LR(1).
Proof: Assume we have a LR(1) grammar. With LALR(1), let a
conflict occur in 'statei' for the configurations
i1 = [A1 = x1 @a y1, c1] and
i2 = [A2 = x2 @ , a].
The conflict is not present with LR(1), so it must originate
from merging a set of at least two states. All states in this
set have an configuration with the same core as i1, and at
least one state 'statej' of this set has an configuration
j2 = [A2 = x2 @ , a]
But this 'statej' would have the same conflict, and therefore
the grammar could not be LR(1).
A similar observation is made for reduce-reduce conflicts.
For this purpose it facilitates the presentation if the various
lookaheads and reductions in a state 'q' are denoted by a boolean
(m,n) matrix 'Mq' with
Mq(ri, sj) = 1 (0) if sj (not) is lookahead for ri
for all reductions r1...rm and all symbols s1...sn in state 'q'.
In every row and column of such a matrix there is at least one
bit set, because every reduction has at least one lookahead,
and we will not invent columns for symbols that do not occur
as lookahead in the state. A column with more than one bit set
indicates a reduce-reduce conflict.
An inadequate state 'q' with two reduce-reduce conflicts may be
represented by
q | s1 s2
---+------
r1 | 1 1
r2 | 1 1 ,
it could have resulted from merging two adequate states
q1 | s1 s2 q2 | s1 s2
---+------ ---+------
r1 | 0 1 and r1 | 1 0
r2 | 1 0 r2 | 0 1 ,
75
but if, for example, a third reduction were involved,
q' | s1 s2
---+------
r1 | 1 1
r2 | 1 1
r3 | 1 1
such a partitioning would not be possible. This leads to
Definition 4.2.2-2:
A boolean (m,n) matrix M with
∀ i ∈ {1...m} ∃ j ∈ {1...n} (M(i,j) = 1)
is called overloaded iff there does not exist a finite,
nonempty set of (m,n) matrices {M1...Ml} such that
(S1) M = M1 v M2 v ... v Ml
(S2) ∀ k ∈ {1...l} ∀ i ∈ {1...m}
¬ ∃ j ∈ {1...n} (Mk(i,j) = 1)
(S3) ∀ k ∈ {1...l} ∀ j ∈ {1...n}
¬ ∃ i1, i2 ∈ {1...m}
(i1 ≠ i2 & Mk(i1,j) = 1 & Mk(i2,j) = 1)
Obviously this property is decideable for every (n,m) matrix:
build the finite set of all (m,n) matrices, take every subset
{M1...Ml} of this set, and check it for the three conditions.
There are 2**(2**(m*n)) such subsets, and it would therefore be
very desireable to have an algorithm with a better complexity,
e.g. a polynomial one.
Even a good algorithm, however, would provide only a partial
answer to our original problem: if the matrix corresponding
to the state is not overloaded, we are not yet sure that
splitting is possible, and how this splitting should be done.
Therefore this way is not pursued further.
76
4.2.3 Lane Tracing
-------------------
A method which has much similarity to NQLALR, but which computes
correct lookaheads for all LALR grammars, is the lane tracing
algorithm described in Pager [1977 LTA]. As for NQLALR, the
contexts of items need not to be stored, but the lookaheads of
a reduction are computed from an existing set of items.
Definition 4.2.3-1:
For an item i1 = [A = x @a y], where x ∈ V+, a ∈ V,
y ∈ V*, the item
[A = x a @y]
is called a transition successor of i1, while, if a ∈ N,
denoted by |w|. There is one word __________________
A lane is a sequence i[1]...i[m], where for j = 1...m-1
i[j+1] is a (transition or closure) successor of i[j].
The lanes represent the paths along which contexts are propagated
in the canonical LR(1) method.
Definition 4.2.3-2:
The abbreviated form of a lane i[1]...i[m] is a subsequence
i[j1]...i[jn] such that for k = 1...n-1
(1) i[j[k]+1] is a closure successor of i[j[k]],
(2) i[l+1] is a transition successor of i[l] for
l = j[k+2]...j[k+1]-1.
If i1...in is a lane in abbreviated form with items
i[j] = [A[j] = x[j] @a[j] y[j]], j = 1...n,
then we will call y[n]...y[1] the string generated
by the lane.
The importance of these definitions is motivated by
Theorem 4.2.3-3:
The set of lookaheads for a reduce-item i is
{t ∈ T | ∃ l ∃ w ∈ V* (l is a lane ending with i
and w is the string generated by its
abbreviated form and t ∈ FIRST1(w) }.
Proof: Though lane-tracing starts at a reduction and follows
the edges of the transition graph backwards, the definitions
above are a mere reformulation of the rules for the
propagation of context-sets in the canonical LR(1) method.
Pager states that the number of possible lanes rises exponentially
with the number of items in the parsing table, and the essence of
his algorithm lies in the fact that the work he needs is at most
77
a linear function of the number of items in the parsing table.
In section 4.3 the idea of lane tracing is used to determine the
correct lookaheads for single reductions which are involved in a
conflict. Since the number of such reductions should be small,
it suffices to use a straight-forward lane tracing algorithm.
4.2.4 Strong Compatibility
---------------------------
Pager [1977 PGM] describes a method which constructs parsing
tables of minimal size from all LR(1) grammars. In the following
some results of that paper are summarized.
Definition 4.2.4-1:
For i=1,2 let pi[0] ∈ P be marked productions
pi[0] = (Ai[0], xi[0] yi[0]),
where the marker is before yi[0].
A string w ∈ V* is called a shared descendant of y1[0]
and y2[0] iff for i=1,2 and for ni ∈ NAT there are (possibly
empty) sequences of productions
pi[j] = (Ai[j], yi[j]), j = 1...ni,
and sequences of strings ui[j], j = 0...ni, such that
(1) ui[j-1] yi[j-1] = ui[j] Ai[j] =>
ui[j] yi[j] for j = 1...ni, and
(2) ui[0] = e and ui[ni] yi[ni] = w and
(3) the last member of the sequence p1[0]...p1[n1] is
different from the last member of p2[0]...p2[n2].
78
During the construction Pager uses the following special
compatibility criterion:
Definition 4.2.4-2:
For i=1,2 let the states
Si = {[Aj = xj @yj, Ui[j]] | j = 1...r}
be two sets of configuration groups with a common core.
S1 and S2 are called strongly compatible iff for
1 <= j1 < j2 <= r
(1) U1[j1] n U2[j2] = Ø & U2[j1] n U1[j2] = Ø, or
(2) the scanned strings y[j1] and y[j2] do not share
a descendant.
This criterion allows it to merge as much states as possible, as
Pager explains by the following intuitive argument:
Let, for example, c ∈ u1[j1] n u2[j2] for some c, j1, j2,
and let y[[j1] and y[j2] share a descendant, in contrast to
conditions (1) and (2). Let (B[j1], w[j1]) and (B[j2], w[j2])
be the corresponding different, last members of the sequences.
Then there will be a state in the parsing table which contains
the two configurations
[B[j1] = w[j1], c] and
[B[j2] = w[j2], c].
But this is a conflict.
For a formal proof for the minimality of the parsing table the
reader is referred to the original paper.
Though the result of this method is optimmal, its implementation
needs a lot of computational effort. The context-sets must be
stored for every item, and for every new state it must be checked
whether the state is strongly compatible to an existing state.
Therefore this method seemed to be unsuitable for an incremental
construction. We will, however, make use of definition 4.2.4-2
when we try to achieve general LR(1) in an afterthought.
79
4.3 Algorithms for LR(1)
-------------------------
Because of its simplicity and its efficiency the NQLALR(1)
method was chosen to compute the mass of lookaheads. For
most unambiguous grammars it should produce a conflict-free
parsing table.
Only if conflicts occur we have to be more careful. Our general
procedure then will be the following:
(1) Take a pair of conflicting items (shift-reduce or
reduce-reduce conflict).
(2) Use the lane tracing algorithm to determine the real
origins of the conflicting lookahead for the reduce-
item(s).
(3) Perhaps there is no origin at all - this indicates a
NQLALR(1)-specific conflict, and the corresponding
reduce-item may simply be deleted.
(4) For a reduce-reduce conflict look whether all lanes
for the two reductions diverge at some points. Recompute
the successors of the states at those bifurcations,
since successors were not strongly compatible.
(5) Repeat steps (1) to (4) until all pairs of conflicting
items are processed.
Surprisingly step (2) provides a solution to both the NQLALR(1)-
and the LALR(1)-specific problems. We will treat this step first.
80
4.3.1 The Graph of Lanes
-------------------------
The example in section 3.2 showed that it is the error of the
NQLALR(1) method to take all terminals from reduce-items as
lookaheads of other reductions.
This time we are interested in the origins of a single lookahead
'symb' of a single reduction. Again we go back the edges of the
transition graph until we reach the states 'stateb' where the
marker is before the left side 'left' of the production.
With the canonical LALR(1) method these states would contain some
configuration groups
ci = [Ai = xi @left yi, Ui], i = 1...n
Symbol 'symb' is a valid lookahead for the reduction to 'left'
if it is in FIRST1(yi Ui) for some 'i'. We first look whether
'symb' is in any FIRST1(yi), but if not, and if yi =>* e,
we would need the 'Ui' which are not directly available in our
method.
If yi =>* e, we must further go back, moving the marker in the
production for 'Ai' to the left, until we reach the states with
the marker before 'Ai'. Here the above argument applies again.
Stated with Pager's terminology, we trace the lanes until it is
clear whether 'symb' is in FIRST1 of the strings generated by
the abbreviated form of the lanes.
In algorithm EMITTERS it sufficed to compute a set of states
which "emitted" the lookahead. Now we are forced to build a
graph of lanes explicitly, since a state may be visited several
times by different items during the tracing process.
Definition 4.3.1-1:
A graph of lanes is a 2-tuple GL = (nodes,edges), where
(1) nodes c states x prods x NAT x ACT
is a finite set of nodes (which corresponds to a set
of items in states),
(2) edges c nodes x nodes
is a relation between nodes.
We will use the following mappings on nodes:
(3) nodsta : nodes --> states,
nodpro : nodes --> prods,
nodpos : nodes --> NAT,
nodact : nodes --> ACT,
nodsuc : nodes --> 2**nodes,
nodpre : nodes --> 2**nodes.
For n1 ∈ nodes we will combine the values of these mappings
to 6-tuples of the form
(sta, pro, pos, act, suc, pre).
81
For any n1,n2 ∈ nodes we require:
(4) (n1 ∈ nodpre(n2) & n2 ∈ nodsuc(n2)) <==>
(n1,n2) ∈ edges,
(5) (*,nodpro(n1),nodpos(n1),*,*) ∈ nodsta(n1),
(6) n1 ∈ nodpre(n2) <==>
∃ (*,nodpro(n1),nodpos(n1),shift,nodsta(n2))
∈ staite(nodsta(n1))
"n2 is a transition successor of n1"
|
(nodsta(n1) = nodsta(n2) & nodpos(n2) = 1 &
promem(nodpro(n1),nodpos(n1)) = prolef(nodpro(n2)))
"n2 is a closure successor of n1"
Algorithm LANE_GRAPH builds such a graph of lanes. It is given
a set 'nodes' initially containing a single node (the "root")
which corresponds to a conflicting reduction, and the conflicting
lookahead 'symb'.
LANE_GRAPH: proc(symb,nodes:=*,ORIGIN);
dcl (source,target) c (nodes1 u nodes2)
,(is_new, is_first, is_null) ∈ BOOL
,NODINS proc;
source := nodes;
do while source ≠ Ø
target := Ø
for ∀ nodeb ∈ source (nodeb =
(stateb:=*,prod:=*,pos:=*,act:=*,*,*))
do;
if pos >= 2 then do;
pos -:= 1;
for ∀ statea ∈ stapre(stateb) do;
NODINS(statea,prod,pos,nodea:=*,is_new:=*,nodeb);
if is_new then
target u:= {nodea};
end;
end; /* pos >= 2 */
else do; /* pos = 1 */
for ∀ itema ∈ staite(stateb) (itema =
(prolef(prod),proda:=*,posa:=*,shift,statec:=*))
do;
NODINS(stateb,proda,posa,nodea:=*,is_new:=*,nodeb);
NULLABLE(proda,posa+1,symb,is_first:=*,is_null:=*);
if is_first then
ORIGIN(nodea);
else
if is_null & is_new then
target u:= {nodea};
/* else '= error' : do nothing */
end; /* ∀ itema */
end; /* pos = 1 */
end;
source := target;
end;
end LANE_GRAPH;
82
If the marker is not before the first member, it is moved
one position left, and all predecessors are inserted in the
graph by the aid of NODINS. Otherwise, the items 'itema'
which shift the left side of 'prod' are considered.
Corresponding nodes are inserted by 'NODINS'. Algorithm
NULLABLE is then used to determine whether
(1) symb ∈ FIRST1(yi) <==> is_first = true,
(2) yi =>* e <==> is_null = true.
If (1), an origin for the lookahead is found, and one of
the algorithms ORIGIN1, ORIGIN2 is called. If (2), and if
the graph did not yet contain 'nodea', the lane must further
be traced backwards.
Algorithm NODINS is given a node 'nodeb' and an item (determined
by 'statea', 'proda' and 'posa') which shall be inserted as a
predecessor node 'nodea'. In 'is_new' we note whether the graph
already contained that node.
NODINS: proc(statea,proda,posa,nodea:=*,is_new:=*,nodeb);
/* insert a node in the graph of lanes */
/* for 'nodes' see LANE_GRAPH */
if ∃ node ∈ nodes
(node = (statea,proda,posa,*,*,*))
then do;
nodea := node;
nodsuc(node) u:= {nodeb};
is_new := false;
end;
else do;
nodea := (statea,prod,pos,error,{nodeb},Ø);
nodes u:= {nodea};
is_new := true;
end;
nodpre(nodeb) u:= {nodea};
end NODINS;
Though the requirements for NULLABLE are easily stated, the
algorithm itself is rather lengthy. In the following description
a mapping
stack: NAT --> prods x NAT x BOOL x BOOL
is used. The first two projections
stkpro: NAT --> prods
stkpos: NAT --> NAT
define a marked production, the third projection
stkexp: NAT --> BOOL
tells whether the marked symbol was already pushed onto the
stack, while the last projection
stknul: NAT --> BOOL
is 'true' iff the marked symbol is known to be nullable.
83
NULLABLE: proc(proda,posa,symbol,is_first:=*,is_null:=*);
dcl prod ∈ PRO, pos ∈ NAT, symb ∈ SYM;
stack(0) := (*,*,true,false); /* dummy element */
top := 1;
is_first := false;
stack(top) := (proda,posa,false,false);
do while top >= 1;
if ¬ stkexp(top) then do;
stkexp(top) := true;
symb := promem(stkpro(top),stkpos(top));
if symb = symbol then do;
is_first := true;
goto done;
end;
else
if symb ≠ eop then
for ∀ prod ∈ sympro(symb) do;
if ¬ ∃ i
(1 <= i & i <= top & stack(i) = (prod,1,*,*))
then do;
top +:= 1;
stack(top) := (prod,1,false,false);
end;
end; /* for ∀ prod */
else do; /* symb = eop */
i := top - 1; /* look for the expanded nonterminal */
do while ¬ stkexp(i); i -:= 1; end;
stknul(i) := true;
end; /* symb = eop */
end; /* ¬ stkexp(top) */
else do; /* stkexp(top) */
if stknull(top) then do;
stkpos(top) +:= 1;
stknul(top) := false;
end;
else
top -:= 1;
end; /* stkexp(top) */
end; /* while top >= 1 */
is_null := stknul(0);
done:
end NULLABLE;
The interesting marked production is stored in 'stack(1)',
the information whether the marked portion is nullable is
accumulated in the dummy element 'stack(0)'. Marked
nonterminals are now expanded, and if they are nullable,
the marker is moved right by one position, until eventually
the end of the production will be reached. In this case, the
"father"-symbol is nullable, too.
If 'symbol' is in FIRST1 of the marked portion, an origin has
been found, and the algorithm is terminated at once.
84
4.3.2 Processing of Conflicts
------------------------------
The following algorithm performs steps (1), (2) and (3)
mentioned in the introduction to section 4.3:
SELECT_PAIR: proc(state);
/* select and process a pair of conflicting items */
dcl (real_la1, real_la2) ∈ BOOL
,(ORIGIN1, ORIGIN2) proc;
do while ∃ item1 ∈ staite(state)
(∃ item2 ∈ staite(state) &
itesym(item1) = itesym(item2) &
(itepro(item1) ≠ itepro(item2) |
itepos(item1) ≠ itepos(item2)) &
iteact(item2) = reduce
);
symb := itesym(item1);
if iteact(item1) = shift then
nodes1 := {(state,itepro(item1),itepos(item1)
,shift,Ø,Ø)};
real_la1 := true;
end; /* shift-reduce conflict */
else do;
nodes1 := {(state,itepro(item1),itepos(item1)
,error,Ø,Ø)};
real_la1 := false;
LANE_GRAPH(symb,nodes1:=*,ORIGIN1);
end;
if real_la1 then do;
nodes2 := {(state,itepro(item2),itepos(item2)
,error,Ø,Ø)};
real_la2 := false;
LANE_GRAPH(symb,nodes2:=*,ORIGIN2);
if ¬ real_la2 then
staite(state) \:= {item2};
else
goto done;
end; /* real_la1 */
else
staite(state) \:= {item1};
end; /* while ∃ item ... */
conflict_set \:= {state};
done:
end SELECT_PAIR;
We take a pair of conflicting items, and for the
reduce-item(s) we build the graph(s) of lanes by the aid of
LANE_GRAPH. If this algorithm finds any origin for the
lookahead of reduce_item 'i', it sets 'real_lai' to 'true'
and calls algorithm ORIGINi for i=1,2. If no origin is found,
the lookahead is deleted, and another conflicting pair is
processed.
85
4.3.3 Splitting of States
--------------------------
Finally we have to attack the remaining steps (4) and (5) of our
general procedure.
In example 4.2-1, it was easy to see how state 8 must be splitted
to remove the conflicts. But how can this be seen in the general
case?
First we note that the presence of any conflict indicates that
both conditions of Pager's strong compatibility (definition
4.2.4-2) must have been violated during the construction process.
The presence of shared descendants (condition (2)) is a property
of the core of the states, and since the set of cores depends
only on the LR(0) parsing table, any LR(k) table for the grammar
under concern will have states where some scanned strings share
a descendant.
The only possibility we have is to avoid the violation of
condition (1): the context-sets must be made disjoint.
If we imagine the states and the edges of the graphs of lanes
passing through them, we have to look for bifurcations as shown
in figure 4.3.3-1.
---------------------------------------------------------------
x
... statec -----------+
|
x ↓
... statea ------> stateb ------> ... ------> root
---------------------------------------------------------------
Figure 4.3.3-1: Bifurcation before Splitting
While 'stateb' contains items which propagate the conflicting
lookahead to both reductions, 'statea' and 'statec' contain only
items which propagate it to only one of the reductions.
In this case, the x-successor of 'statec' is not strongly
compatible with the x-successor of 'statea', and therefore
'stateb' (and all states between 'stateb' and the root) should
be splitted. The resulting situation is shown in figure 4.3.3-2.
---------------------------------------------------------------
x
... statea ------> stateb1 -----> ... ------> root1
x
... statea ------> stateb2 -----> ... ------> root2
---------------------------------------------------------------
Figure 4.3.3-2: Situation after Splitting
86
The general case of course is not as simple as suggested by the
figures. For a single conflict we have to look simultaneously at
all pairs of lanes, and the conflict is removeable only if the
pairs always bifurcate at some states. Moreover, the lanes may
have cycles.
The existence of bifurcations for all lanes is a necessary
condition for a grammar to be LR(1):
Lemma 4.3.3-3:
Let TLA be a parsing table constructed by LALR(1) from a
grammar G.
If TLA contains a reduce-reduce conflict under symbol 'symb'
in state 'root', and if there is a pair of lanes which do
not separate (as indicated by figure 4.3.3-1), then G is not
LR(1).
Proof: Assume G were LR(1). Since the lanes do not separate,
there must be a state 'statec' common to both lanes which is
origin of at least one of the lanes. This state would have
two configurations
[A1 = x1 @ l1 y1, c1] and
[A2 = x2 @ l2 y2, c2],
where, for example,
symb ∈ FIRST1(y1) &
symb ∈ FIRST1(y2 c2).
By the argument of the proof of lemma 4.2.2-1 it is clear
that, even in the parsing table T constructed from G by
Knuth's LR(1) method, there must be a state 'stated'
containing two configurations
[A1 = x1 @ l1 y1, d1] and
[A2 = x2 @ l2 y2, d2],
where
symb ∈ FIRST1(y2 d2).
But this state ensures that the conflict of 'root' also
exists in some state of T.
The "bifurcation states" 'stateb' are easily detected during the
construction of the graph of lanes. Whenever LANE_GRAPH reaches
an origin of a lane, it calls one of the algorithms ORIGIN1,
ORIGIN2.
ORIGIN1 is called when a lane for the first reduction is traced.
It simply marks the node for the origin with action 'shift', while
all successors of this node are marked with 'reduce'.
87
ORIGIN1: proc(origin);
/* process an origin of the lookahead for the 1st reduction */
dcl (source,target) c nodes1;
real_la1 := true;
nodact(origin) := shift;
source := nodsuc(origin);
do while source ≠ Ø
target := Ø
for ∀ node ∈ source do;
hidden_set u:= {nodsta(node)};
if nodact(node) = error then do;
nodact(node) := reduce;
target u:= nodsuc(node);
end;
end;
source := target;
end;
end ORIGIN1;
Symbol 'symb' is really a lookahead for the 1st reduction.
The node for the origin is marked with 'shift', while all
its successors are marked with 'reduce'.
ORIGIN2 detects the bifurcation states while tracing a lane
forwards (from the origin to the root), and it prepares the
splitting process to be done by algorithm CONFLICT, below.
ORIGIN2: proc(origin);
/* process an origin of the lookahead for the 2nd reduction */
dcl (source,target) c nodes2;
real_la2 := true;
if ¬ ∃ node1 ∈ nodes1
(nodsta(node1) = nodsta(origin) &
nodact(node1) ≠ error)
then do; /* will the path join later? */
source := {origin};
do while source ≠ Ø
target := Ø
for ∀ nodea ∈ source do;
for ∀ nodeb ∈ nodsuc(nodea) do;
if ∃ node1 ∈ nodes1
(nodsta(node1) = nodsta(nodeb) &
nodact(node1) ≠ error)
then do;
if nodact(node1) = reduce then
/* look for an origin for the 1st reduction */
SEPARATE(nodea,nodeb);
else
removeable := false;
end;
else /* ∀ node1 (nodact(node1) = error) */
target u:= {nodeb};
end;
end;
source := target;
end;
end;
else
removeable := false;
end ORIGIN2;
88
Symbol 'symb' is really a lookahead for the 2nd reduction.
If we have found an originator for the second reduction, we will
have to look whether the path from this node joins a path for
the first reduction before the root (there the paths will
always join). There are three cases:
(1) The second originator already lies on a path for
the first reduction - the conflict cannot be
resolved.
(2) The two paths join at 'node2', and there lies no
originator for the first reduction between 'node2'
and the root - the state corresponding to 'node2'
can be splitted.
(3) As for (2), but there lies an originator for the
first reduction between 'node2' and the root - the
conflict cannot be resolved.
The latter two cases are distinguished by the following
subalgorithm.
SEPARATE: proc(nodea,nodeb);
/* separate two incompatible nodes */
dcl (source,target) c nodes2;
source := {nodeb}; hidden_set u:= source;
do while source ≠ Ø
/* look for an origin for the 1st reduction */
target := Ø
for ∀ node ∈ source do;
if ¬ ∃ node1 ∈ nodes1
(nodsta(node1) = nodsta(node) &
nodact(node1) = shift)
then
target u:= nodsuc(node);
else do; /* origin found */
removeable := false;
goto done;
end;
end;
source := target; hidden_set u:= source;
end;
statea := nodsta(nodea);
stateb := nodsta(nodeb);
for ∀ item ∈ staite(statea)
(item = (*,*,*,shift,stateb))
do;
iteact(item) := newsh;
end;
successors_set u:= {statea};
done:
end SEPARATE;
All states which must be splitted are collected in
'hidden_set'. If the lanes are separable, 'removeable' is
not set to 'false', and the deletion of the edge from
'statea' to 'stateb' is prepared: 'statea' is noted in
'successors_set', and the invalid shift-actions are
marked with 'newsh'.
89
When computing the new successors it must be avoided that the
lanes join again, i.e. that the conflict reappears. Therefore
a set 'sic_set' is introduced which provides information on the
states which are strongly incompatible (cf. definition 4.2.4-2):
(x,y) ∈ sic_set means that a conflict would reoccur if 'y' is
chosen as successor of 'x'.
This set is maintained and used in a modified version of algorithm
COMPATIBLE called by SUCCESSORS during the splitting phase (cf.
section 3.4.1):
COMPATIBLE: proc(statea,statex,statey:=*);
/* Pager's strong compatibility criterion */
dcl set c STA;
for ? (*,prod1:=*,pos1:=*,*,*) ∈ staite(statex)
(pos1 >= 2);
do;
set := pronuc(prod1,pos1);
for ∀ item ∈ staite(statex)
(item = (*,prod:=*,pos:=*,*,*) & pos >= 2
& prod ≠ prod1 & pos ≠ pos1)
do;
set n:= pronuc(prod,pos);
end;
end;
set \:= {state |
state ∈ set & stanum(state) ≠ stanum(statex)};
| for ∀ state ∈ set n hidden_set do;
| sic_set u:= {(statea,state)};
| end;
| set \:= {state | (statea,state) ∈ sic_set};
if set = Ø then do;
for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(statex) do;
pronuc(prod,pos) u:= {statex};
end;
statey := statex;
end;
else
for ? state ∈ set do;
statey := state;
end;
end COMPATIBLE;
First all states with the same core as 'statex' are collected
in 'set' (in the non-LALR case there may be more than one
such state).
The successor of 'statea' will be not strongly compatible
with any of these states which is in 'hidden_set', since
(1) it shares a descendant (because of the conflict) and
(2) it shares a context (the conflicting lookahead) with
some configuration in the states. This information is stored,
and another successor is chosen.
90
After all these preparations we are now ready for CONFLICT, the
algorithm which tries to remove (NQ-) LALR(1)-conflicts by
splitting of some states.
CONFLICT: proc;
dcl removeable ∈ BOOL, hidden_set c STA
,sic_set c STA x STA;
for ∀ state ∈ conflict_set do;
next_pair:
removeable := true;
hidden_set := Ø
successors_set := Ø
SELECT_PAIR(state);
if state ∈ conflict_set then do;
if removeable then do;
for ∀ statec ∈ successors_set do;
for ∀ itemc ∈ staite(statec)
(iteact(item) = newsh)
do;
EDGEDEL(statec,stateb);
end;
end; /* ∀ statec */
LAGAR;
SUCCESSORS;
LAGET;
LAPUT;
end; /* removeable */
else do; /* ¬ removeable */
for ∀ statec ∈ successors_set do;
for ∀ itemc ∈ staite(statec)
(item = (*,*,*,newsh,stateb:=*))
do;
iteact(itemc) := shift;
end;
end; /* ∀ statec */
hidden_set := Ø
successors_set := Ø
goto done;
end; /* ¬ removeable */
goto next_pair;
end; /* if state ∈ conflict_set */
end; /* for ∀ state ... */
done:
sic_set := Ø
end CONFLICT;
The condition after the call of SELECT_PAIR is necessary
because this algorithm deletes 'state' from 'conflict_set'
iff 'state' contains no or only NQLALR(1)-conflicts.
If the condition is 'true', an LALR(1)-conflicts was found,
and if it is 'removeable', the conflicting lookahead will be
deleted for the second reduction by LAGAR as a consequence
of 'EDGEDEL(statec,stateb)'. If the conflict is not
'removeable', the preparations for splitting are undone,
and CONFLICT terminates immediately without inspection
of other, perhaps removeable conflicts.
It seems appropriate finally to show the correct working of
the algorithm in more detail:
91
Theorem 4.3.3-4:
Applied to a parsing table T3 constructed from a current
grammar G by the algorithms of section 3,
(1) algorithm CONFLICT terminates and
(2) produces a conflict-free parsing table T4 if G is LR(1),
and in this case
(3) the number of states in T4 is minimal as long as
productions are not deleted by REORGANIZE.
Proof - General Remarks:
The property of all conflicts in a table T3 to be 'removeable'
is not affected by the splitting process: splitting will
duplicate lanes, but no new lookaheads are introduced. If
'removeable' turns out to be 'false' after some splitting
steps, this could have been seen when SELECT_PAIR were
applied to the corresponding conflict first.
It should be clear that CONFLICT really removes all conflicts
in a particular 'state'. The conflicts also cannot be
reintroduced in 'state', because the modified version of
COMPATIBLE will not allow conflict-generating lanes to lead
again to 'state'.
The lanes provide complete information on contexts, therefore
we may again consider configurations instead of items. In the
following 'CFS(state)' is used to denote the set of
configurations of 'state'.
Every splitting of a state 'state4' results in two states
'state41', 'state42' such that
CORE(state41) = CORE(state42) = CORE(state4),
and, for example,
state41 ∈ hidden_set &
state42 ¬∈ hidden_set.
The new states each lost the configuration(s) which is (are)
responsible for the lookahead of the conflicting reductions:
CFS(state41) c≠ CFS(state4) &
CFS(state42) c≠ CFS(state4) &
CFS(state41) ≠ CFS(state42)
Furthermore, for i=1,2 there are states 'statei' in the
parsing table T constructed from G by LR(1) with
CORE(statei) = CORE(state4i) &
CFS (statei) c CFS (state4i), i=1,2
(otherwise a parser using T4 would announce 'error' for
some w ∈ L(G)).
92
Proof of (1):
The algorithm terminates at once if 'removeable' is set to
'false'. Therefore we may now assume that 'removeable' is
always 'true'.
The number of states with different configuration sets is
finite for a finite grammar. Therefore an infinite execution
of the loop over 'conflict_set' would create at least two
different states with identical configuration sets. These
states must be strongly incompatible because COMPATIBLE did
not identify them, but then the two states must have different
configuration sets. This contradiction yields (1).
Proof of (2):
From lemma 4.3.3-3 we know that an LR(1) grammar ensures
bifurcation-points for all pairs of lanes, i.e. 'removeable'
is always 'true'. Therefore CONFLICT will be terminated by
exhaustion of 'conflict_set', but that implies a conflict-
free resulting parsing table.
Proof of (3):
COMPATIBLE only creates a new state if the existing states
are all stronly incompatible with the desired successor.
Therefore the resulting table will not contain any different,
but strongly compatible states, and (3) follows from Pager
[1977 PGM].
Deletion of productions could, however, remove the cause for
an incompatibility, and the algorithms will not reidentify
the states in this case - this is the reason for the
restriction in (3).
93
5 Implementation
-----------------
The algorithms in sections 3 and 4 were expanded to PL/1
procedures in a rather straightforward way.
The sets and mappings were represented by arrays or, if the
deletion of elements is necessary, by linked (and sometimes
ordered) lists.
All arrays were confined to one dimension because of storage
problems. For example, marked productions were represented
by a single integer. The avoidance of dimensions >= 2 perhaps
influenced the design of the entire parser constructor.
The quantors '∀' and '∃' were expanded to WHILE-loops
over some set. The operations 'u:=' and '\:=' result in a
processing of two ordered linked lists.
Deletion of productions (mainly algorithms PRODEL and ITEMMARK)
and the splitting of states (algorithm SEPARATE) are not yet
implemented.
Table 5-1 gives an impression of the size of various parts of
the implemented system.
+-----------------------------------------------------------+
| 1: section 3, NQLALR(1) 2008 |
| 2: section 4, LALR(1) 415 |
| 3: scanning, test-printout etc. 1748 |
+-----------------------------------------------------------+
| 4: 1+2+3, running incremental constructor 4171 |
| |
| 5: transformator constructor 1087 |
| 6: parser, transformator, output 2431 |
+-----------------------------------------------------------+
| 7: 4+5+6, running EXTRA system 7689 |
+-----------------------------------------------------------+
Table 5-1: Number of PL/1 source lines for components of
the EXTRA system
The constructor was initially tested with a set of some twenty
small "artificial" grammars from the literature which contain
a number of peculiarities (empty right sides, left and right
recursion, NQLALR(1)-, LALR(1)-, LR(1)-conflicts etc.).
A major claim of this work is the feasibility of the incremental
construction in practical applications. Therefore the constructor
was also used to construct parsing tables from ten "natural"
grammars of small to medium size.
The grammars are too long to be reproduced in this work.
As the names indicate, most of the grammars describe
programming languages or subsets thereof. Two criteria for
the selection of a grammar were unambiguity and the top-
down property explained below.
Table 5-2 gives several figures counted from these grammars:
94
sym number of symbols in the grammar (terminals and
nonterminals)
pro number of productions (right sides of rules)
gram length of the grammar, i.e.
sum(∀ p ∈ prods) (prolng(p) + 1)
sta0 number of states of the LR(0) (and hence SLR(k),
NQLALR(k), LALR(k)) parsing table
ite number of items (including reduce-items for
nonterminal lookaheads)
+-----------+---------------------------------------+
| | sym | pro | gram | sta0 | ite |
+-----------+---------------------------------------+
| BASIC | 175 158 418 234 4979 |
| FORTRAN77 | 239 305 964 513 3488 |
| MIDICAP | 286 310 887 459 8935 |
| MINCOBOL | 327 273 991 586 5705 |
| MINPL1 | 233 185 658 400 5343 |
| PAL | 110 81 267 157 2903 |
| PLM | 138 131 404 219 3572 |
| SPL | 145 135 395 220 3356 |
| WIRTH | 42 26 78 43 231 |
| XPL | 121 104 324 182 2524 |
+-----------+---------------------------------------+
Table 5-2: Size of the grammars and their parsing tables
Finally the constructor was instrumented to measure the times
used for the construction.
The grammars were all processed twice. First, all productions
were inserted en-bloc, and the table was reorganized only once.
Second, the table was reorganized every time after the insertion
of a single production.
The latter measurement is fair only if the grammar is inserted
"top-down", i.e. if the left side of every production just
inserted is already reacheable from the start-symbol.
Otherwise the insertion of the production would be delayed,
and later greater portions of the grammar would be inserted
en-bloc.
The productions of all tested grammars are ordered such that
this property is fulfilled.
95
Table 5-3 below shows the following figures:
pro number of productions inserted
t∀ time for one reorganization after the insertion
of all productions
t1 time for 'pro' reorganizations after the
insertion of every single production
t1/pro average time for the insertion of one
production
t1/t∀ "fragmentation factor"
All times are given in CPU-seconds on a Siemens 7.748 computer
which is nearly half as fast as an IBM 370/158. The times
include initialization of the system, reading of the grammar,
print-out and so on.
+-----------+-------+-------+-------+------+-------+
| | pro | t∀ | t1 |t1/pro| t1/t∀ |
+-----------+-------+-------+-------+------+-------+
| BASIC | 158 30.0 242.5 1.5 8.1 |
| FORTRAN77 | 305 33.8 396.2 1.3 11.7 |
| MIDICAP | 310 86.5 751.0 2.4 8.7 |
| MINCOBOL | 273 27.9 339.1 1.2 12.2 |
| MINPL1 | 185 72.3 335.4 1.8 4.6 |
| PAL | 81 66.6 288.3 3.6 4.4 |
| PLM | 131 34.9 260.0 2.0 7.4 |
| SPL | 135 35.1 225.9 1.7 6.4 |
| WIRTH | 26 4.8 36.6 1.4 7.6 |
| XPL | 104 27.7 182.2 1.8 6.6 |
+-----------+--------------------------------------+
Table 5-3: Construction times
96
6 Discussion
-------------
A new way has been shown for the construction of LR(1) parsing
tables. In contrary to the usual methods the parsing table is
not simply constructed, but it results from the modification
of an existing table. The modification process is smooth, i.e.
small changes in the grammar cause only small changes in the
parsing table.
Because of this property the constructor is well suited for a
translator system with syntactical extension facilities. The
incremental approach should also save much construction time
during the stepwise development of a large LR(1) grammar.
Incremental construction has been done previously by Vidart [1974]
and Heindel and Roberto [1975], but for a less general class of
grammars, namely LL(1).
LR(1) today seems to be the largest class of grammars which can
be parsed efficiently, and an incremental constructor for this
class was described in this work - as to my knowledge for the
first time.
The measurements reported in section 5 indicate that the implemented
version of the algorithms indeed has the postulated steadiness
property. The average time taken for the insertion of a single
additional production is rather independent of the size of the
grammar, and even if the incremental development is carried to
extremes (i.e. reorganization after every single production),
construction time increases only by a factor of 5 to 13.
The times for an ordinary en-bloc construction are also quite
competitive with those of several other LALR(1) constructors
in use at the University of Dortmund.
Section 3 is based on the NQLALR(1) method for the computation
of lookaheads. Though I recognized the NQLALR(1)-specific
problems in June 1979, DeRemer and Pennello [1979] were the first
who documented them. In the paper they present another, correct
LALR(1) method, and perhaps this method could be used in an
incremental constructor, too.
In contrary to the global relations used by DeRemer and Pennello
the basic step of the NQLALR method is a relatively local
operation, so that the effect of deletions and insertions may
remain under control. This is one of the reasons why I adhered
to NQLALR.
The second reason is that the algorithms for NQLALR are relatively
simple: it is not necessary to store any contexts with shift-
configurations resp. -items, and the rather complicated function
FIRST1 is avoided.
The last reason is that the NQLALR problem, if it ever occurs
in a grammar, can be solved by a correct tracing of the particular
lookahead. Basically the tracing process is that of Pager
[1977 LTA], but I ignored his efficiency considerations, because
97
they again rely on global properties of the lanes. I assume that
in general NQLALR(1)-conflicts are rare in natural, unambiguous
grammars, and therefore the tracing of only a few lanes should
be necessary.
Lane-tracing invents to do the last step - state splitting in
order to achieve general LR(1). This step is also mentioned by
Pager [1977 LTA], but there the algorithm of Pager [1977 PGM]
is used, and only a few details are given.
Algorithm CONFLICT incorporates the splitting step in the
incremental construction, and its termination and correctness
is proved at the end of section 4.
Since several parts of the algorithms in section 4 are
automatically skipped if they are not necessary, the attribute
"incremental" may also be applied to the way in which the
constructor exploits more and more general classes of grammars.
This design of a general LR(1) parser constructor seems to be
another, new result of this work.
Sometimes the reader may have been astonished at the particular
realization of an algorithm. Of course there may be other and
perhaps much more efficient possibilities. When making suggestions
for alterations and improvements the reader should note, however,
the strong mutual connections between all data structures.
For example, successors cannot be changed before all invalid
lookaheads have been deleted. New lookaheads cannot be
determined before all successors are recomputed.
Some peculiarities are caused by the desire to avoid mappings
with two or more arguments, because those would result in
storage and retrieval problems for large grammars.
The amount of detail in the two algorithmic sections might be
another source of critics. I had the following reasons to choose
this particular form of the presentation:
(1) I showed the feasibility of the incremental approach
by measurements obtained with an actual implementation,
therefore the description should not be too far away
from that implementation.
(2) This work should serve as a guide for other implemen-
tations; in my opinion other descriptions in the
literature (e.g. LaLonde [1971], Pager [1977 LTA],
Vidart [1974]) are too imprecise for this purpose.
(3) Other possible implementors should get an impression
of the size of the problem.
Nevertheless I thank the reader for the patience he showed until
here.
98
List of Algorithms
------------------
Algorithm defined in section
--------- ------------------
ALLOC 3.4.1
BACKWARD 3.2
CLOSURE 3.3
COMPATIBLE 3.4, 4.3.3
CONFLICT 4.3.3
DELTA 3.1
EMITTERS 3.2
EXTRA 3
FORWARD 3.2
FREE 3.4.1
GARBAGE 3.4.3
INIT 3.1
ITEMINS 3.3.2
ITEMMARK 3.3.1
LADEL 3.2
LAGAR 3.2
LAGET 3.5
LAINS 3.5
LANE_GRAPH 4.3.1
LAPUT 3.5
NODINS 4.3.1
NULLABLE 4.3.1
ORIGIN1 4.3.3
ORIGIN2 4.3.3
PARSE 3.1
PRODEL 3.3.1
PROEDIT 3.3
PROINS 3.3.2
RECEIVERS 3.2
REORGANIZE 3
SELECT_PAIR 4.3.2
SEPARATE 4.3.3
99
References
----------
Aho, A.V., and Ullman, J. [1972]
The theory of parsing, translation, and compiling.
Two volumes. Prentice Hall, Princeton, N.Y. (1972)
Aho, A.V., Johnson, S.C., and Ullman, J. [1975]
Deterministic parsing of ambiguous grammars.
Comm. ACM 18, 8 (Aug. 1975) 441-452
Aho, A.V., and Ullman, J. [1977]
Principles of compiler design.
Addison-Wesley, London (1977)
Anderson, T., Eve, J., and Horning, J.J. [1973]
Efficient LR(1) parsers.
Acta Inf. 2 (1973) 12-39
[ANSI X3.23-1974]
American national standard programming language COBOL.
American National Standards Institute, New York (1978)
Backus, J. [1978]
Can programming be liberated from the von Neumann style?
A functional style and its algebra of programs.
Comm. ACM 21, 8 (Aug. 1978) 613-641
Brown, P.J. [1967]
The ML/1 macro processor.
Comm. ACM 10, 10 (Oct. 1967) 618-623
Brown, P.J. [1974]
Macro processors.
John Wiley & Sons, London (1974)
Celentano, A. [1978]
Incremental LR parsers.
Acta Inf. 10 (1978) 307-321
Cheatham, T.E. [1966]
The introduction of definitional facilities into higher
level programming languages.
Proc. AFIPS Fall Joint Computer Conference, Vol. 29 (1966)
623-637
Christensen, C., and Shaw, C.J. (Eds.) [1969]
Proceedings of the extensible languages symposium.
ACM SIGPLAN Notices 4, 8 (Aug. 1969)
Cole, A.J. [1976]
Macro processors.
Cambridge University Press, Cambridge (1976)
DeRemer, F.L. [1969]
Practical translators for LR(k) languages.
Ph.D. Thesis, Electrical Engineering Department, M.I.T.,
Cambridge, Massachusetts (1969)
DeRemer, F.L. [1971]
100
Simple LR(k) grammars.
Comm. ACM 14, 7 (July 1971) 453-460
DeRemer, F.L. [1974]
Transformational grammars.
In: Compiler Construction - An Advanced Course.
Springer Lecture Notes in Computer Science, Vol. 21,
Springer-Verlag, Berlin Heidelberg New York (1974) 121-145
DeRemer, F.L., and Pennello, T.J. [1979]
Efficient computation of LALR(1) look-ahead sets.
Proc. SIGPLAN Symposium on Compiler Construction,
Denver, Colorado, 1979.
ACM SIGPLAN Notices 14, 8 (Aug. 1979) 176-187
Errata: ACM SIGPLAN Notices 15, 2 (Feb. 1980) 29
Earley, J. [1970]
An efficient context-free parsing algorithm.
Comm. ACM 13, 2 (Feb. 1970) 94-102
Fischer, G. [1979]
EXTRA - ein erweiterbarer Übersetzer
(In German: EXTRA - an extensible translator).
Proceedings GI - 9. Jahrestagung, Bonn Oct. 1979.
Informatik Fachberichte 19, Springer Verlag,
Berlin Heidelberg New York (1979) 240-251
Griswold, R.E., Poage, J.F., and Polonsky, I.P. [1971]
The SNOBOL4 programming language.
Prentice Hall, Inc., Englewood Cliffs (1971)
Harrison, M.A., and Havel, I.M. [1974]
On the parsing of deterministic languages.
JACM 21, 4 (Oct. 1974) 525-548
Heindel, L.E., Roberto, J.T. [1975]
LANG-PAK - an interactive language design system.
American Elsevier Publ. Company, Inc.
New York, London, Amsterdam (1975)
[IBM 1973]
OS PL/I Checkout Compiler: TSO User's Guide.
IBM Form No. SC33-0033
Knuth, D.E. [1965]
On the translation of languages from left to right.
Inf. Contr. 8 (Oct. 1965) 607-639
Knuth, D.E. [1971]
Top-down syntax analysis.
Acta Inf. 1 (1971) 79-110
Korenjak, A.J. [1969]
A practical method for constructing LR(k) processors.
Comm. ACM 12, 11 (Nov. 1969) 613-623
LaLonde, W.R. [1971]
An efficient LALR parser generator.
Technical Report CSRG-2, Department of Computer Science,
University of Toronto, Canada (1971)
101
LaLonde, W.R., Lee, E.S., and Horning, J.J. [1971]
An LALR(k) parser generator.
Proc. IFIP Congress 71. North-Holland Publ. Company,
Amsterdam (1972) 513-518
Leavenworth, B.M. [1966]
Syntax macros and extended translation.
Comm. ACM 9, 11 (Nov. 1966) 790-793
Lewis, P.M. II and Stearns, R.E. [1968]
Syntax directed transductions.
Journ. ACM 15, 3 (July 1968) 465-488
McKeeman, W.M., Horning, J.J., and Wortman, D.B. [1970]
A compiler generator.
Prentice Hall, Inc., Englewood Cliffs (1970)
Metzner, J.R. [1979]
A graded bibliography on macro systems and extensible
languages.
ACM SIGPLAN Notices 14, 1 (Jan. 1979) 57-68
Mickunas, M.D., Lancaster, R.L., and Schneider, V.B. [1976]
Transforming LR(k) grammars to LR(1), SLR(1) and (1,1)
bounded right context grammars.
Journ. ACM 23, 3 (July 1976) 511-533
Mooers, C.N., and Deutsch, L.P. [1965]
TRAC, a text handling language.
Proc. 20th ACM National Conference (1965) 229-246
Pager, D. [1977 LTA]
The lane tracing algorithm and ways of enhancing its
efficiency.
Inf. Sci. 12, 1 (1977) 19-42
Pager, D. [1977 PGM]
A practical general method for constructing LR(k) parsers.
Acta Inf. 7 (1977) 249-268
Rüthing, H., and Schönhold, M. [1980]
Syntax-gesteuerte Übersetzung mit Hilfe von
LR(1)-Analysatoren.
(In German: Syntax-directed translation with the aid
of LR(1)-analyzers.)
Diplomarbeit, Department of Computer Science, University
of Dortmund, (Feb. 1980)
Schneider, H.J. [1975]
Syntax-directed description of incremental compilers.
Lecture Notes in Computer Science, Vol. 26,
Springer-Verlag, Berlin Heidelberg New York (1975) 192-201
Schuman, S. (Ed.) [1971]
Proceedings of the international symposium on extensible
languages.
ACM SIGPLAN Notices 6, 12 (Dec. 1971)
Strachey, C. [1965]
A general purpose macro processor.
Comp. Journal 8 (1965) 225-241
102
Vidart, J. [1974]
Extensions syntaxiques dans une contexte LL(1)
(In French: Syntactical extensions in a LL(1)-context).
These pour obtenir le grade de docteur du troisieme cycle,
specialite: informatique. Universite de Grenoble, France
(1974)
Waite, W.M. [1973]
Implementing software for non-numeric applications.
Prentice Hall, Englewood Cliffs (1973)
Wegbreit, B. [1970]
Studies in extensible programming languages.
Ph.D. Dissertation, Harvard University, Cambridge,
Massachusetts (1970)
Wirth, N. [1977]
What can we do about the unnecessary diversity of notations
for syntactic definitions?
Comm. ACM 20, 11 (Nov. 1977) 822