Incremental LR(1) Parser Construction as an Aid to
    --------------------------------------------------

                Syntactical Extensibility
                -------------------------
 
 
                      DISSERTATION

               zur Erlangung des Grades des

             Doktors der Naturwissenschaften

                der Universität Dortmund

               an der Abteilung Informatik

                          von



                     Georg Fischer
 




                       Dortmund

                         1980 



Table of Contents ----------------- Acknowledgements 3 Symbols and Abbreviations 4 1 Introduction 5 1.1 Survey of Previous Work 9 1.2 EXTRA - an EXtensible TRAnslator 12 1.2.1 EXTOL - The Extensible, Translation Oriented Language 15 1.2.2 Other Features of EXTRA 21 1.2.3 Example: Extension of the Metagrammar 24 2 Definitions and Notations 25 2.1 The Set Oriented Algorithmic Language SOL 25 2.2 Context-Free Grammars 28 2.3 LR(k) Grammars 30 2.4 Construction of LR(k) Parsers 32 2.5 Data Structures and Mappings 36 3 Algorithms for NQLALR(1) 42 3.1 Initialization and the Parsing Algorithm 44 3.2 The NQLALR(1) Method for the Computation of Lookaheads 47 3.3 Modification of the Grammar 54 3.3.1 Deletion of a Set of Productions 55 3.3.2 Insertion of a Production 57 3.3.3 Reduced Grammars 58 3.4 Modification of the Set of States 60 3.4.1 Compatibility of States 60 3.4.2 Computation of New Successors 63 3.4.3 Garbage Collection for States 65 3.5 Transmission of Lookaheads 67 4 Non-NQLALR(1) Grammars 69 4.1 Causes for Conflicts 70 4.2 Foundations of the LR(1) Construction 72 4.2.1 A Non-LALR(1) Grammar 73 4.2.2 Overloaded Conflicts 74 4.2.3 Lane Tracing 76 4.2.4 Strong Compatibility 77 4.3 Algorithms for LR(1) 79 4.3.1 The Graph of Lanes 80 4.3.2 Processing of Conflicts 84 4.3.3 Splitting of States 85 5 Implementation 93 6 Discussion 96 List of Algorithms 98 References 99
Zusammenfassung --------------- Es werden Algorithmen zur Konstruktion und insbesondere zur Veränderung von Syntaxanalysatoren für Programmiersprachen beschrieben. Aus einer Grammatik wird die Tabelle für einen LR(1)-Analysator - über die Zwischenstufen LR(0), NQLALR(1), LALR(1) und allgemeines LR(1) - generiert. Die mächtigeren (und teureren) Verfahren werden nur bei Bedarf eingesetzt. Die "inkrementelle" Konstruktion geht aus (1) von einer Grammatik und der zugehörigen Tabelle und (2) von einer Menge von Produk- tionen, die (2.1) zu der Grammatik hinzugefügt oder (2.2) aus ihr gestrichen werden sollen. Die existierende Tabelle wird dann so modifiziert, daß sie der neuen Tabelle wieder entspricht. Selbst wenn im Extremfall jede Produktion einzeln eingefügt wird, ergaben Messungen an Grammatiken für übliche Programmier- sprachen nur einen Rechenzeitfaktor zwischen 4 und 13 gegenüber dem normalen "en-bloc"-Verfahren. Bei häufigen Änderungen an der Grammatik dürfte deshalb das inkrementelle Verfahren den Aufwand für die Parser-Konstruktion deutlich verringern. Die Algorithmen sind ein wesentlicher Bestandteil des Systems EXTRA, das die interaktive Beschreibung, Ausführung und Abänderung von Übersetzern ermöglicht. Mit EXTRA können zum Beispiel sehr einfach Vorübersetzer hergestellt werden, die syntaktische Erweiterungen in Programmiersprachen verarbeiten. Abstract -------- Algorithms for the construction and especially for the modification of parsers for programming languages are described. The table for an LR(1) parser is generated from a grammar by the intermediate steps LR(0), NQLALR(1), LALR(1) and general LR(1). The more powerful (and more expensive) methods are used only when necessary. The "incremental" construction starts with (1) a grammar and the corresponding table and (2) with a set of productions which are (2.1) to be added to or (2.2) to be deleted from the grammar. The existing table then is modified such that it corresponds again to the new grammar. Even if at the extreme the productions are added one by one, measurements with grammars for usual programming languages show only a factor in computing time between 4 and 13, compared with the normal "en-bloc"-approach. For frequent changes of the grammar of a programming language, the incremental method therefore should significantly reduce the effort for the parser construction. The algorithms are a main component of the system EXTRA which allows the interactive description, execution and modification of translators. With EXTRA it is very easy, for example, to build preprocessors which handle syntactical extensions in programming languages.
3
Acknowledgements ---------------- This work would have been impossible without the aid of many people. Though my interest in compiler construction reaches back to my studies until 1975 in München, they were really brought to light by the project group "Compiler Construction Systems" in 1977/78 at the University of Dortmund. The work and discussions of the ten students in that group strongly influenced the ideas and design of the EXTRA system and its parser generator. I especially thank Helmut Ludwigs, Udo Lipeck and Klaus Gröning for their interest and their contributions. Two members of the group, Manfred Schönhold and Horst Rüthing, designed and implemented the transformational features of the EXTRA system. Without their work, the system would not be operational, and I thank them for their engagement. My collegues at the Lehrstuhl 1 often participated at the evolution of this work, and I thank them for their encouragement and their critics. For the typing of the manuscript I used an editor written by Horst Dahmen. To the members of the project NCAP go my thanks for the patience they always showed when I occupied one of their terminals. Werner Davids read an earlier version of the manuscript, and he helped much in improving my English, but of course he is not responsible for all the errors and weak constructions which I introduced afterwards. Prof. Dr. Bernd Reusch provided me the opportunity to write this work. He and especially Dr. Gisbert Dittrich read it very carefully, and I express my gratitude for their discussions, their hints and the time they spent. Finally, I would like to thank my wife, who showed amazing patience and understanding while I was writing this dissertation .
4
Symbols and Abbreviations ------------------------- The use of the line printer implies some restrictions on the notation in this work. Subscripting is not possible; it is mostly indicated by placing the single subscript to the right of the subscripted element, or by the enclosure in square brackets. Exponentiation is also omitted, or it is indicated by the operator "**". The following list gives an overview to the symbols used in this work. ∈ element of ¬∈ not element of c contained in or equal to c≠ contained in but unequal to x cartesian product 2**X powerset of X Ø empty set card( ) cardinality, number of elements e empty word X* set of all words over the alphabet X (e ∈ X*) |x| length, number of symbols in word x HEAD(x,k) first k symbols of x TAIL(x,k) last k symbols of x x**k k-fold concatenation of x || concatenated wwith ==> if ... then ... <==> iff (if and only if) :<==> by definition iff --> mapping from ... to ... |-> is mapped to => directly derives (in one step) =>+ derives (in one or more steps) =>* derives or is -> shift ) =: reduce ) parser actions =. accept ) ?? error ) := by definition equal to & and | or ¬ not ∀ for all ∃ exists ? for any < less than <= less than or equal = equal ≠ not equal > greater than
5
1 Introduction --------------- Hundreds if not thousands of programming languages have been invented since the beginning of computer programming. Most of them were, if ever implemented, only used by a few people. The costs and the mental effort for designing, implementing and teaching a new language are very high. Therefore it seems astonishing that there are always people left over which cannot be satisfied by any of the (say) twenty languages besides COBOL and FORTRAN which are in wide-spread use today. One reason for this fact perhaps is the continuous appearance of new application areas on the one and of new computer architectures on the other side. It seems quite natural that programming languages which are the link between both fields cannot remain unaffected by these changes. There may be another, perhaps even better explanation: the psychological one. The ordinary programmer has no means to develop a new language. The computer scientist, however, because of his education is not willing to let his ideas be narrowed by the awkward constructions of the available languages, and he soon becomes very attracted by the idea to invent (and perhaps implement) his own language. Therefore the number of programming languages must be proportional to the number of research institutions which provide suitable freedom and financial support. From this explanation it is obvious that the flood cannot be dammed up by fixed languages such as PL/1 or ADA, as mighty as the organization behind the language might be. Some language designers very early tried to attack the problem more basically, and they introduced extensibility into programming languages. By 1970 it seemed that this would be the solution to the Tower of Babel in computer science (cf. for example Christensen and Shaw [1969], Schuman [1971]), and proposals for the extensibility of nearly every aspect in a language were made. In the meantime this enthusiasm has faded out. Extensibility introduces serious implementation problems. The extensible features (modes and operator definition) in ALGOL68, for example, are one explanation for the delay (if not failure) of that language. It might also be argued that extensibility is inherently dangerous, because the programs written by one person become unreadable to everybody else. Though this danger cannot be ignored, it seems not to be a problem of extensible languages per se. To some extent the same problem occurs with the heavy use of macros in assembler language programming, or even with procedures in high level languages. Both techniques are recommended today, and suitable organizational actions were developped to cope with the mentioned problem.
6 It may be the point that the compilers will have no difficulty to read extended languages as long as they are given the definitions of the extensions. In this work another attempt is made to support one special aspect of extensibility, namely syntactical extensibility. Given a base language with some context-free grammar, the more experienced user should be enabled to define his own notation for statements, declarations, operators and so on. Together with the new syntax he should describe how to translate the extensions into constructions of the base language. The goto-less programming debate shows a good example where only syntactical extensibility is needed. The "structured" constructions such as IF-THEN-ELSE, DO-WHILE, CASE etc. only serve to clarify the program and to decrease the amount of clerical work. Their semantics may be described by conditional goto statements not only in theory, but in fact a compiler translates them in several such gotos. With a syntactical extension facility in a language it is an easy task to define missing control flow statements. Often syntactical extensibility is done with several restrictions, for example (1) extensions must be defined before the entire program, (2) they are processed in a separate ("preprocessor") pass, (3) the applications of the extended constructions have to start with a special delimiter or with a unique keyword, (4) the constructions may not be nested, (5) extended constructions in the result of the translation process are not recognized again, (6) extensions are allowed only for specific constructions, e.g. statements, and so on. Such restrictions of course facilitate the implementation at the expense of generality and ease of use. In section 1.2 the translation system EXTRA is described which dispenses with most of these requirements. The most notable exceptions are the conditions that all input into the system - base grammar, extension definitions, and programs to be translated - can be described by one grammar, and that this grammar must be LR(1). LR(1) parsing was chosen because today it seems to be the most reasonable compromise between efficient implementation and ease of grammar design. Most unambiguous context-free are LR(1) without any transformation - a statement which is not true for other modern parsing methods, LL(1) for example. Only a functional description of the full EXTRA system is given
7 in section 1.2. The rest of this work concentrates on one major component of the system, namely the incremental LR(1) parser constructor, while the implementation of the translation mechanisms is described by Rüthing and Schönhold [1980]. Incremental construction means the following: (1) Given an LR(1) parsing table corresponding to a context-free grammar, and (2) given a set of productions (at the extreme a single production) to be (2.1) inserted into the grammar or (2.2) deleted from the grammar, (3) modify the parsing table such that it corresponds to the new grammar. The incremental approach may also be illustrated by an analogy between grammar design and ordinary programming. The parser constructor the corresponds to the language processor. There are three possibilities to process a program after a change: (1) all statements are recompiled, (2) the new statements only are recompiled and the result is linked to the result of the old statements, (3) new and old statements are always interpreted anew. Possibility (1) is the usual approach for languages such as ALGOL60, COBOL, FORTRAN and so on. Possibility (3) is most often used with BASIC, but interpretation is considerably slower than execution of compiled code. Possibility (2) should be the most economic one if compilation and execution times are considered together, but this approach implies great difficulties if the user is allowed to manipulate block structure, declarations and other non-local constructions. With a list of restrictions (2) is implemented in IBM's PL/1 Checkout Compiler [IBM 1973], the general case is treated by Schneider [1975] and Celentano [1978]. The attribute "incremental" in the title of this work was taken from the latter two references, though here grammars resp. parsing tables are incremented, while Schneider and Celentano increment programs resp. code. It is clear that an incremental parser constructor is the main prerequisite for syntactical extensibility. Though the statement of the problem is very simple, its solution is not trivial, since even the algorithms for usual LR(1) parser construction are rather complex. The reader should note that LR(1) was postulated in contrast to LALR(1). Small tables for all LR(1) grammars will be constructed in the steps LR(0), NQLALR(1) ("not quite" LALR(1)), LALR(1) and finally LR(1), where the last or the last two steps may be skipped if they are not necessary.
8 In the following section we will first survey the literature on the two areas which are linked by this work, namely syntactical extensibility and LR(k) parser construction. Section 1.2 is an updated version of Fischer [1979]. It describes a syntactically extensible translation system implemented at the University of Dortmund by Horst Rüthing, Manfred Schönhold and myself in 1979. Section 2 introduces several notations for algorithms and grammars. LR(k) grammars are defined, and the data structures which are built and modified by the algorithms are presented. In section 3 the first set of algorithms for the incremental construction of parsing tables given. This section relies on the NQLALR(1) method for the construction of lookaheads, and it should already produce conflict-free tables for most natural, unambiguous grammars. If conflicts do occur, the algorithms of section 4 may be used to determine whether the grammar is LALR(1) or LR(1), and the corresponding parsing tables may be constructed. The implemented version of the algorithms were applied to a set of grammars of programming languages, and several measurements were carried out. The results are reported in section 5. In section 6 finally the approach is summarized and discussed.
9
1.1 Survey of Previous Work ---------------------------- Syntactical Extensibility The field of extensibility in programming languages is too wide to be covered thoroughly in this work. For a bibliography the reader is referred to Metzner [1979]. Good introductions to the field are also given in the books of Cole [1976] and Brown [1974]. Here we will only discuss a historical root of extensibility, and several references with a specific connection to this work. There is one extension mechanism which is in widespread use since the beginnning of programming: the macro in assembler languages. According to Brown [1974] "a macro is a facility for replacing one sequence of symbols by another. A piece of computer software for supporting macro activity is called a macro processor". The macro concept evolved from simple replacement schemes to rather sophisticated possibilities for the generation of program text, and it seems that the macro is one of the weapons which allowed assembler languages to withstand high-level programming languages so long. High-level languages usually have no or only weak built-in macro processing possibilities; the compile-time facilities of PL/1 are one of the exceptions. About 1965 the macro concept was made independent of any base language. In GPM (Strachey [1965]) and TRAC (Mooers and Deutsch [1965]) fixed characters are used to surround a macro call and to separate the parameters, but in ML/1 (Brown [1967]) a macro call already starts with the name of the macro only, and the other delimiters may be chosen freely. In STAGE2 (Waite [1973]) finally a pattern matching is performed on every input line, and therefore even infix operators can be handled. As Cole [1976] points out, all replacement schemes discussed so far are effective before or during the lexical analysis phase in a compiler or assembler. The next step is to build the macro facilities in the syntax analysis pass,as was proposed by Cheatham [1966] and by Leavenworth [1966]. In high-level languages this approach should be much more efficient, but of course it complicates the parsing process considerably. Wegbreit [1970] for example describes a language with several aspects of extensibility. He uses Earley's parsing method (Earley [1970]), but the time estimations rise from O(n**3) to O(n**8) because of syntactical extensibility. The title of this work could have been obtained from the thesis of Vidart [1974] by the replacement of one letter. Vidart modifies the tables of an LL(1) parser to achieve syntactical extensibility. The productions added by the user may destroy the LL(1) property of the old grammar, therefore the system applies a transformation to the grammar in order to ensure LL(1).
10 There are, however, limitations pertinent to LL(1) parsing: the Boolean expression grammar in Knuth [1971] is an example for an LR(0) grammar whose language cannot be generated by any LL(k) grammar. The LANG-PAK system of Heindel and Roberto [1975] is the last mentioned here. LANG-PAK was designed for the design of small application languages, and the interactive use is especially well supported. Nonterminals may be defined, undefined (!) and examined, trial sentences may be input to test the current grammar, the grammar may be stored in and fetched from a file at any stage, and semantics may be connected with the productions. The parsing method is called "left-to-right, top-down, fast-back" what should again be LL(1). The system is remarkably small (756 lines of FORTRAN) and fully operational. LR(k) Parser Construction LR(k) theory was introduced in the fundamental paper of Knuth [1965] which is reported in more detail in sections 2.3 and 2.4. Unfortunately Knuth's algorithm does not directly lead to a practical method for automatic LR(k) parser construction: for realistic grammars his algorithm would get into serious time and space problems. During the past fifteen years several methods with an acceptable complexity have been developped. Korenjak [1969] uses an approach which is also known from precedence parsing: first tables are computed with Knuth's algorithm for several important subgrammars (e.g. for statements, declarations, expressions etc.), and then the tables are linked together to form a table for the entire grammar. Appropriate choice of the subgrammars and error treatment at the boundary between two subtables are two difficulties with this method. Almost all other practical methods follow an advice of DeRemer [1969], which is based on the following observations: (1) LR(0) parsing tables are reasonably small and easy to compute, but most natural grammars are not LR(0). (2) The conflicts occurring in the LR(0) table can most often be resolved if the parser is given a "lookahead" of length 1 (or k >= 1 in the general case). DeRemer therefore first computes the LR(0) set of states. For the computation of the lookaheads there are several different methods. SLR(k) is the first successful method for the automatic construction of tables for LR(k) parsing tables. The lookahead for a reduction to nonterminal 'A' is determined from the strings which may follow on 'A' in any sentential form deriveable from the start-symbol. This method was implemented and evaluated by DeRemer [1969,1971] and by Anderson, Eve and Horning [1973]. All grammars for which the LR(0) table can be made conflict-free
11 by the addition of some lookahead are called LALR(k) by DeRemer [1969]. This definition comprises more grammars than SLR(k), and especially LALR(1) is the standard method today. It was first implemented by LaLonde [1971], and as SLR(k) it was shown to be superior to precedence parser construction and to Korenjak's method (LaLonde, Lee and Horning [1971]). The method is also thoroughly discussed by Aho and Ullman [1972,1977]. There is a third method, termed NQLALR(k) ("not quite" LALR(k)) by DeRemer and Pennello [1979], which is a little bit less powerful than LALR(k). Here the lookaheads are taken from states determined by a tree of predecessor states. Probably several implementations which claimed LALR(k) have in fact used this method, and because of its relative simplicity and efficiency it is also used in this work (cf. sections 3.2, 3.5). In the meantime work concentrates on the most efficient LALR method. Pager [1977 LTA] and DeRemer and Pennello [1979] must be cited here. Pager [1977 PGM] describes a method which produces conflict-free parsing tables with a minimal number of states for all LR(k) grammars. In principle states that would be constructed by Knuth's method are merged as with LALR, NQLALR and SLR, but Pager avoids the case that the merging introduces a conflict which would be not present in the LR(k) table. The corresponding checks must be made throughout the construction, however, and therefore the method is inefficient if the grammar turns out to be LALR(k). To achieve general LR(1) another idea appearing already in DeRemer [1969] is used in this work: first an LALR(1) table is constructed, but then the constructor tries to remove any remaining conflicts by splitting of some states. In section 4.3 we obtain the minimal table of Pager [1977 PGM], but probably with less effort in the case of a few LALR(1) conflicts.
12
1.2 EXTRA - an EXtensible TRAnslator ------------------------------------- Introduction The EXTRA systems allows the interactive description and use of a translator for LR(1) languages. The current description may successively be extended, and test translations may be interspersed with the definition phases. This approach signifi- cantly decreases the costs for the implementation and maintenance of a translator. Originally the system was designed for the tranlation between high-level programming languages. Usual text editors are not suited to this task, because they cannot handle nested structures. A string manipulation language such as SNOBOL4 (Griswold, Poage and Polonsky [1971]) could handle context-free languages, but its back- tracking pattern matching process is too slow for realistic translations. What we would look for is an easy-to-define front-end of an ordinary compiler which would read and analyze the source text possibly in one pass, and by a syntax-directed translation in the sense of Lewis and Stearns [1968] would translate it into the desired target text. Another design issue of the system originates from the obser- vation that the costs for using known parser constructors are unnecessary high in realistic compiler construction projects. One might object that the parser needs to be generated only a very few times. In practice this is not true, because often the design of new languages (and hence of its grammars) is not yet completed when the compiler construction is started. Furthermore the implementor himself wishes to change the grammar because he needs additional hooks for semantics, and finally inconsistencies and weak points in the language may cause additional constructor runs. In contrary to the repeated sacrifice of some minutes of CPU-time for the change of a single production, the EXTRA system has a "steadiness" property: local changes of the grammar only cause local changes in the internal parsing and translation tables. We will refer to this as the incremental construction method in contrast to the usual en-bloc construction. There are other properties which ease the use of the EXTRA system. The problem of the displacement of numbers for semantic routines after a change of the grammar is avoided, since the translation is defined almost entirely in terms of the grammar. Grammar and source text are read in and analyzed by the same system components; this unifies the notation and reduces the size of the system and its data structures. There is an escaping mechanism that allows the activation of user written program segments, and there are some commands that enable the interactive use of the system.
13 General Structure of the System The EXTRA system has much similarity to a very simple translator as it is shown in figure 1.2-1. There is a scanner that reads the characters of the source text and builds symbols. The string of symbols is syntactically analyzed by the parser; this component builds a syntax tree. Parts of the tree may be modified by the tree transformator. Finally the leaves of the modified tree are collected and put to an external medium by a small output program. --------------------------------------------------------------- Source Text Transformational | Grammar ↓ | Scanner | | | | | ↓ | | Parser <---------- Table <---- Parser <-----+ | | Constructor | | | | ↓ | | Tree Transformator <-- Table <---- Transformator <---+ | | Constructor | | ↓ | Output Program | ↓ Target Text --------------------------------------------------------------- Figure 1.2-1: Structure of a Simple Translator Parser and tree transformator are best driven by tables that contain the language dependent information. These tables are constructed from a transformational grammar (DeRemer [1974]) by a parser constructor and a transformator constructor in a separate step prior to the actual translation of source text. As shown by figure 1.2-2, this separation is almost entirely overcome in the EXTRA system, and this is one of the main features of the system.
14 --------------------------------------------------------------- Source Text and Transformational Grammar | ↓ Scanner | ↓ Parser <---------- Table <---- Parser | Constructor | ↑ | | | +-------------------------+ | | | ↓ ↓ ↓ Tree Transformator <-- Table <---- Transformator | Constructor ↓ Output Program | ↓ Target Text --------------------------------------------------------------- Figure 1.2-2: Structure of the EXTRA System The input to EXTRA is a mixture of parts of the transformational grammar and of the source text to be translated. Both components of the input are read by one scanner, and they are syntactically analyzed by the same parser. The tree trans- formator is the first that recognizes the meaning of the grammatical components. They are stored by special semantic actions of the transformator, and at definite moments the syntax directed translation is stopped, the tables are modified such that they reflect the new grammar, and then the translation is resumed.
15
1.2.1 EXTOL - The Extensible, Translation Oriented Language ------------------------------------------------------------ Description of Syntax Along with the babylonian proliferation of new programming languages goes a variety of notations for grammars; cf. the title of Wirth's paper [1977]. We will adopt a mixture of that proposal (for delimiters '.', '=' and '|') and of PL/1 (for identifiers, numbers and strings). The entire input for EXTRA has the following raw structure: ... [ ... ] ... [ ... ] ... [ ... ] ... In square brackets, portions of the transformational grammar are given; between these packages, source text that is to be trans- lated may be interspersed. This enables the interactive testing and correction of a defintion. Since we would like to analyze the entire input by the same parser, there should be one grammar that describes that input. But this grammar must be brought into the system, i.e. it must be read in, and then there must be a description (= grammar) for it, and this description must be brought into the system ... . Therefore the grammar for the first package in '[' and ']' is defined by a fixed procedure in the EXTRA system: the metaparser. This procedure is able to read the metagrammar that starts with the following rules: EOF IDENTIFIER NUMBER STRING [AXIOM = EXTRA_INPUT .EXTRA_INPUT = MEMBERETIES '[' GRAMMAR ']' SOURCE_TEXT => #1 | EXTRA_INPUT '[' GRAMMAR ']' SOURCE_TEXT => #1 .GRAMMAR = RULES => #2 .RULES = RULE | RULES '.' RULE .RULE = LEFT_SIDE '=' RIGHT_SIDES => #4 | LEFT_SIDE '-' RIGHT_SIDES => #5 .LEFT_SIDE = IDENTIFIER => #3 .RIGHT_SIDES = RIGHT_SIDE | RIGHT_SIDES '|' RIGHT_SIDE .RIGHT_SIDE = SYNTAX_PART SEMANTIC_PART .SYNTAX_PART = MEMBERETIES => #6 .SOURCE_TEXT = /* extensions by the user */ .MEMBERETIES = => #7 | MEMBERETIES MEMBER .MEMBER = PRIMARY .PRIMARY = IDENTIFIER => #8 | STRING => #9 | NUMBER => #8 /* (the metagrammar is continued below) */ As the reader may guess, the grammar above is written in the notation that it describes. Once the metagrammar is read in, it is used to construct a LR(1)
16 parsing table. From there on, the LR(1) parser in the EXTRA system may use this table to read more grammar; that grammar then is used to modify the table and so on. The constructions '=> #n' on the right side are produced by the rules for SEMANTIC_PART that are given below. These constructions are quite important, they activate the so-called special semantic actions which are described in detail in section 1.2.2. If in the following parse a string of symbols is reduced to EXTRA_INPUT, for example, the tree transformator will perform action #1, in which case the translation of SOURCE_TEXT is put out, and the system is prepared to scan GRAMMAR in the following. Actions #3, #6, #7, #8 and #9 are used to store the LEFT_SIDE and the MEMBERs of the RIGHT_SIDEs of a rule. #4 indicates that the rule is to be inserted in the grammar, while #5 will delete it from the grammar. #2 finally reorganizes the entire parsing table such that it reflects the new grammar, and it prepares the system to scan SOURCE_TEXT in the following. Though it does not occur in the metagrammar, we will use the term "production" for a LEFT_SIDE together with one of its RIGHT_SIDEs. Several special cases must be noted if a production is inserted (deleted): (1) The LEFT_SIDE of the production is not yet (no more) accessible from the start-symbol AXIOM : the production will not occur in the parsing table. (2) The LEFT_SIDE does not yet (no more) produce any terminal string : the production will occur in the parsing table (applications of this anomaly are discussed in connection with the backspacing of the parser, below. (3) There is already a rule in the grammar with the same LEFT_SIDE : the new production is united with the old ones. (4) The first (last) production with a certain symbol as LEFT_SIDE is inserted (deleted) : the symbol switches from terminal to nonterminal (from nonterminal to terminal). (5) The production is already (is not) in the grammar : it is ignored, possibly associated semantics are changed (an error message is issued). The first line of the metagrammar determines the names of several symbols that are built from characters by the scanner, and that "do not stand for themselves". Note that the association of the names to the symbols is done by the ordering. Identifiers that do not occur in the first line or as a LEFT_SIDE stand for themselves, i.e. they are keywords (cf. section 1.2.2, scanner). Strings stand for their "content", they are used to separate (sequences of) special characters from the delimiters
17 of the grammatical notation. Simple Tree Transformations It is an easy task to build a syntax tree during the parse: if the parser reduces to a production, a node for the LEFT_SIDE is inserted, and the MEMBERs of the RIGHT_SIDE are made the sons of that node. The simplest way to define a translation is to modify the sons and their ordering when the subtree for the LEFT_SIDE is built. The notation for such modifications is straightforward: after the production we write "what it shall become", i.e. we note a '=>' and then the sequence of symbols that are to become the new sons of the LEFT_SIDE: .SEMANTIC_PART = TRANSLATIONETY GENERATIONETIES => #10 .TRANSLATIONETY = => #11 | '=>' ELEMENTETIES => #12 .ELEMENTETIES = => #13 | ELEMENTS .ELEMENTS = ELEMENT | ELEMENTS ELEMENT .ELEMENT = STRING => #14 | IDENTIFIER => #15 | NUMBER => #16 | IDENTIFIER ':' NUMBER => #17 | '#' NUMBER => #18 While the SYNTAX_PART of a production is used to modify the parsing table, the SEMANTIC_PART is translated to some instructions that are interpreted by the tree transformator every time the parser reduces to that production. These instructions are generated parallel to the recognition of the components of SEMANTIC_PART, and they are later interpreted from left to right. By action #2 (after a ']') they are somewhat modified, and several consistency checks are made. Again we note some special cases: (1) An empty TRANSLATIONETY (#11) indicates that the subtree for the production remains unchanged. (2) If ELEMENTETIES is empty (#13), however, the translation of the production is the empty string. (3) Every ELEMENT that occurs as a MEMBER in the SYNTAX_PART stands for a copy of that MEMBER. If it is a nonterminal, the entire subtree corresponding to the nonterminal is copied. (The first occurrence of the ELEMENT is in fact not copied, but "taken"). (4) If the ELEMENT occurs more than once as a MEMBER, the different occurrences can be distinguished by suffixing the ordinal number (':1', ':2' etc.).
18 (5) If the ELEMENT does not occur as a MEMBER, it must either be a terminal or an accumulator (see below). (6) For an ELEMENT of the form '#' NUMBER an instruction is generated by action #18 that results in the activation of a piece of code in the EXTRA system during the translation. Accumulators Simple tree transformations suffice, for example, to translate arithmetic expressions from infix to prefix or postfix notation, but often it is necessary to transport portions of the syntax tree over a greater distance. In EXTRA so-called accumulators are used to store subtrees during the translation of a production; later in the parsing process these subtrees can be identified by the IDENTIFIER of the accumu- lator, and a copy of the stored subtree can be inserted in the translation, or it can be used to build another accumulator. The notation for the definition and the application of accumulators stresses the resemblance of theses constructions to nonterminals: .GENERATIONETIES= | GENERATIONETIES GENERATION .GENERATION = '->' ACCUMULATOR '=' ELEMENTETIES => #23 .ACCUMULATOR = => #24 | IDENTIFIER => #25 An example of the power of a single accumulator is the reversion of a list: [SOURCE_TEXT = LIST => REVERSE .LIST = IDENTIFIER -> REVERSE = IDENTIFIER | LIST ',' IDENTIFIER -> REVERSE = IDENTIFIER ',' REVERSE ] Every accumulator represents not only one subtree, but a stack of trees. All accumulators in the SEMANTIC_PART of a production are processed in three steps: (1) For every occurrence of (the IDENTIFIER of) an accumulator as an ELEMENT a copy of the element on the top of the corresponding stack is made, and these copies are used to build the subtrees specified by the ELEMENTETIES of the TRANSLATIONETY and the GENERATIONs. (2) The top elements of all accumulators that are used as an ELEMENT are deleted from the stacks. (3) For every GENERATION with a nonempty ACCUMULATOR the subtree corresponding to ELEMENTETIES is pushed onto the stack. With the empty alternative for ACCUMULATOR it is possible explicitly to delete incarnations of accumulators (only step (2) is applied).
19 Accumulators are a little bit more powerful than a stack, since it is also possible to access subtrees which lay deeper in the stack by the ':' NUMBER construction for ELEMENTs; ':1' denotes the top element, ':2' the element below the top and so on. These elements are also deleted in step (2). It is clear that the user always must be aware of the order in which productions are applied by the bottom-up parser when he uses accumulators. The transformator constructor checks whether there may be SOURCE_ TEXTs which lead to the use of an accumulator whose stack is empty, or which lead to incarnations that will be never used. Such checks are made when the tables are reorganized (action #2). Backspacing of the Parser Sometimes it is necessary further to modify parts of the tree that have already been generated, or to perform special semantic actions for such parts. Instead of inventing operations that access and manipulate subtrees of the MEMBERs and ELEMENTs in a production, we modify the parser, since it should be the only analyzing mechanism in the system. EXTOL provides for two additional forms of ELEMENTs: .ELEMENT = '@' => #19 | COMBINED_SYMBOL COMBINED_ELEM .COMBINED_SYMBOL= IDENTIFIER => #21 .COMBINED_ELEM = '(' ELEMENTS ')' => #22 ] Action #19 backspaces the read-head of the parser. All ELEMENTs to the right of the '@' are prefixed to the remaining SOURCE_ TEXT, and they are reanalyzed. In the implementation the parser switches between the scanner and a simple procedure that extracts the next leaf from the tree. The '@' may not be used in a GENERATION; this restriction is not checked by the metagrammar, but by the transformator constructor. If parts of the subtree behind '@' need no further examination, they may be combined to a nonterminal COMBINED_SYMBOL with the aid of the second alternative for ELEMENT during the generation of the tree. The parser later will only find that nonterminal. This feature may be used to overcome the loss of efficiency intro- duced by the backspacing. COMBINED_SYMBOLs necessitate nonterminal lookahead symbols in the LR(1) parsing table (the definition of the function FIRSTk may be generalized without difficulty so that sentential forms are the result instead of terminal strings). The following example shows, however, that nonterminal lookaheads may destroy the LR(1) property of a grammar: [AXIOM = A B x | D B y .B = .A = z
20 .D = z ] The corresponding parsing table has a conflict for nonterminal 'B' which is resolved if only terminal lookaheads ('x' and 'y') are used. In such cases the user must dispense with COMBINED_ SYMBOLs. Backspacing of course is very powerful, because the entire SOURCE_ TEXT can be processed by several passes. It implies, however, also a serious problem: the portion of the tree that is reanalyzed must be derivable from the current grammmar. Syntax errors occurring during reanalysis are more serious than those during the usual parse: the former are not caused by the programmer who wrote the SOURCE_TEXT, but by the designer of GRAMMAR. The system will perhaps check for the possibility of such errors in the future; at present the decidability problems are uunsolved, and the responsibility for correct backspacing is up to the user. There is a trick which makes it possible to separate the language that may be input from the one that is read during reanalysis: use unbased nonterminals, i.e. nonterminals which generate no terminal string. The parser has no problem to shift such nonterminals or to use them as lookaheads, but it cannot occur during the normal parse of SOURCE_TEXT.
21
1.2.2 Other Features of EXTRA ------------------------------ There are some features of the EXTRA system that are not directly related to the transformational grammar. The user may wish to change and extend these features, but he can do this only by modifying the PL/1 source-code of the EXTRA system. Scanner As already mentioned, a fixed scanner is used to read both the grammar and the source program. In its present form it is PL/1-oriented. A precise description would need a lot of additional description, therefore only some hints are given here with the aid of the following rules: [LETTER = A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | '_' .DIGIT = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 .QUOTE = '''' .IDENTIFIER = LETTER | IDENTIFIER LETTER | IDENTIFIER DIGIT .NUMBER = DIGIT | NUMBER DIGIT .OPEN_STRING = /* a sequence of zero or more characters different from QUOTE */ .STRING = QUOTE OPEN_STRING QUOTE -> CONTENT = OPEN_STRING | STRING QUOTE OPEN_STRING QUOTE -> CONTENT = CONTENT QUOTE:1 OPEN_STRING /* this is a COMMENT */ ] The scanner returns the symbols IDENTIFIER, NUMBER and STRING. COMMENTs end after the first '*/', they may contain several '/*'. The length of an IDENTIFIER, NUMBER or STRING is only limited by the storage available in the EXTRA system. End of lines are always ignored; spaces and COMMENTs are terminators for all symbols except STRING. A STRING is terminated by a QUOTE that is not followed by another QUOTE. COMMENTs in STRINGs and vice versa are not recognized. There is a problem concerning the sequence of the characters I D E N T I F I E R , because it is used in several quite different ways: (1) in the metagrammar, it denotes names of nonterminals (syntactical variables), accumulators and terminals that consist of letters (keywords), (2) in the grammar given by the user, it denotes an arbitrary identifier in the SOURCE_TEXT to be translated,
22 (3) in the SOURCE_TEXT it is just such an identifier. Despite this confusion there is only one restriction that affects the SOURCE_TEXT: Keywords may not be used as identifiers. This restriction does not seem to be serious, since many programming languages (COBOL, BASIC, SIMULA67, PASCAL, ADA) have it, though PL/1 and FORTRAN do not have it. There is another convention for strings and (sequences of) interpunctuation characters: Initially the scanner does not recognize any character that is no letter and no digit and that is not enclosed in QUOTEs. The scanner enters the CONTENT of every string in a special list, however, and if it subsequently reads an interpunc- tuation character, it tries to extract a symbol from the input equal to the longest sequence of characters that was already stored in this list. This possiblity does not affect the rules for MEMBER and ELEMENT in the metagrammar: terminals that are not keywords should always be enclosed in QUOTEs. From the input ':' '=' ':=' ::= '::=' ::= the scanner would thus extract the symbols STRING STRING STRING : := STRING ::= Special Semantic Actions ELEMENTs of the form '#' NUMBER cause the tree transformator to execute a program segment that is located in the EXTRA system via NUMBER. By this escape mechanism all internal data structures and the full power (and danger) of PL/1 are available during the translation. Of course this feature should be used only for tasks that cannot be accomplished by the means of EXTOL. A series of numbers is used to connect the construction of the parser table and the table for the tree transformator with the rules of the metagrammar, as can be seen from section 1.2.1: '#2' for example reorganizes the entire parser table. The semantic actions expect a fixed ordering of the MEMBERS of the production to which they are applied. By careful modifications the user could, however, change for example the delimiters of the metagrammar. For the higher numbers (beginning at 35) the user has to provide his own PL/1-code, and he must recompile at least one procedure and relink the EXTRA system.
23 Typical actions that can only be accomplished by the '# n'-constructs are: (1) Create a new identifier for a temporary variable that was introduced by the translation. (2) Convert a label identifier to a FORTRAN statement number. (3) Insert a symbol that causes the output program to begin a new line, or to tabulate to a specified column position. (4) Initialize or increment a counter, create a symbol that corresponds to the value of that counter. (5) Store attributes of identifiers in a symbol table, modify symbols on the parse stack according to their attributes. Output program The tree transformator always knows the portion of the tree that may be given out because it is surely unaffected by the following translation. For every symbol in the leaves of that portion, the transformator calls the output program. The output program may perform some additional translation, but it should only have a memory limited to a few adjacent symbols. The output program is thus the counterpart to the scanner; it is responsible for the external representation of the target text, for formatting, file peculiarities and so on. For a translation to FORTRAN, for example, it would autonomously create continuation cards, and it would map identifiers such that their length is at most six. Commands EXTRA offers a few commands that facilitate the interactive use and the communication with the environment of the system: [RULE = COMMAND .COMMAND = LOAD IDENTIFIER => #30 | STORE IDENTIFIER => #31 | READ IDENTIFIER => #32 | WRITE IDENTIFIER => #33 ] The IDENTIFIERs always designate a file. Without the COMMANDs the user would have been forced always to start with the metagrammar. By STORE he may save the current description in internal form on a file, and later he can LOAD and further extend or modify it. READ redirects the entire input to the specified file, and WRITE redirects the output medium.
24
1.2.3 Example: Extension of the Metagrammar -------------------------------------------- Various extensions of the BNF notation have been proposed in the literatur in order to allow regular expressions on the right side of productions. In the EXTRA system such extensions may be defined precisely, and with the backspacing facility the expanded rules can also be inserted in the parsing table. We need two special semantic actions which are easy to write: #50 changes the preceeding ELEMENT to a new identifier, #51 is analogous to #8, it replaces the MEMBER in the production which is currently built by the preceeding ELEMENT. Furthermore, we need two accumulators: [EXTRA_INPUT - EXTRA_INPUT '[' GRAMMAR ']' SOURCE_TEXT .EXTRA_INPUT = EXTRA_INPUT SUB GRAMMAR BUS SOURCE_TEXT .SUB = '[' -> APPEND = -> GEN = ' ' #50 /* initialize the accumulators */ .BUS = ']' -> = APPEND GEN /* delete all acumulators */ .MEMBER = PRIMARY '*' => GEN #51 /* "zero or more" */ -> APPEND = APPEND '.' GEN '=' '|' GEN PRIMARY -> GEN = ' ' #50 | PRIMARY '+' => GEN #51 /* "one or more" */ -> APPEND = APPEND '.' GEN '=' PRIMARY '|' GEN PRIMARY -> GEN = ' ' #50 | PRIMARY '?' => GEN #51 /* "optional" */ -> APPEND = APPEND '.' GEN '=' '|' PRIMARY -> GEN = ' ' #50 | PRIMARY '/' PRIMARY /* "separated by" */ => GEN #51 -> APPEND = APPEND '.' GEN '=' PRIMARY:1 '|' GEN PRIMARY:2 PRIMARY:1 -> GEN =' ' #50 .PRIMARY = '(' RIGHT_SIDES ')' => GEN #51 -> APPEND '.' GEN '=' RIGHT_SIDES -> GEN = ' ' #50 .RULES = RULE => RULE @ APPEND -> APPEND = | RULES '.' RULE => RULES '.' RULE @ APPEND -> APPEND = ] The accumulator GEN always containns the next identifier for the nonterminals that must be created, while APPEND contains the generated rules. Since APPEND is reanalyzed when it is appended to RULES, the generated rules are also inserted in the grammar, and the parser constructor will insert them in the table without at all being affected by the extensions.
25
2 Definitions and Notations ---------------------------- It is assumed throughout this work that the reader has some familiarity with formal language theory, LR(k) parsing, translation and programming. Several notations which are essential for the understanding of sections 3 and 4 will be presented in the following subsections. 2.1 The Set Oriented Algorithmic Language SOL ---------------------------------------------- The construction of LR(k) parsers is a very dynamical process that probably cannot be described by static conditions, sets and formulae. Moreover, the construction process is rather complex, and a verbal description either would not allow the necessary accuracy, or it would become unreadably long. Because of the recursive and parallel nature of wide parts of the construction method it would have been an interesting task to use a functional programming style as proposed for example by Backus [1978]. Functional languages are just evolving at present, however, and it is uncertain whether a completely new notation would enhance the clarity of the description of a complex algorithm. Furthermore it is an aim of this work to show that the incremental construction can be done efficiently with todays means; the functional description would therefore be too great a departure from the actual implementation mentioned in section 5. It is common practice to formulate algorithms in a variant of a well known programming language, possibly enriched by statements in a natural language. In this work PL/1 is used as base language. As can be seen from the grammar below, only very few elements of that language were retained: lexical level, assignment, GOTO, IF-THEN-ELSE, DO-WHILE- END, procedures and calls. Several extensions were made to allow quantors and operations on sets, and to abbreviate some constructions that occur very often. Lexical Level of SOL The notation for the individual symbols of the language is much the same as in PL/1. Comments in algorithms are enclosed in '/*' and '*/', for example: /* this is a comment */ In order to distinguish between words of the English language and identifiers with a fixed meaning, the latter are enclosed in quotation marks in the text if confusion is possible. Keywords of SOL and identifiers for variables, sets, mappings consist of a sequence of lower case letters, digits and
26 underscores ('_'). Names of algorithms (i.e. PL/1 procedures) and of some important sets are marked out by the use of upper case letters. Syntax of SOL Using the conventions of the EXTRA system (cf. section 1.2, especially the extensions of 1.2.3), the syntax of the SOL language is described by the following rules. Some remarks concerning the semantics are given verbally. [SOURCE_TEXT = ALGORITHM .ALGORITHM = IDENTIFIER ':' PROC FORMAL_PARAMETERS_PACK? ';' (DCL DECLARATION/',' ';')? ALGORITHM* (STATEMENT ';')* END IDENTIFIER ';' .FORMAL_PARAMETERS_PACK = '(' (IDENTIFIER (':=' '*')?)/',' ')' The construction ':=*' indicates that the value of a formal parameter may be changed in the algorithm; in the absence of this suffix it will be used only. A formal parameter may be the name of a (variable) algorithm. This construction is rather often used in sections 3 and 4 to avoid duplication of parts of algorithms. .DECLARATION = (IDENTIFIER | '(' IDENTIFIER/',' ')') ('∈' | 'c') IDENTIFIER Declarations indicate the domain of objects. They are given only when this domain cannot easily be concluded from the usage (or the IDENTIFIER) of the object. .STATEMENT = ASSIGNMENT | GOTO IDENTIFIER | CALL_STATEMENT | DO_GROUP | CONDITIONAL_STATEMENT | LABEL ':' STATEMENT .LABEL = IDENTIFIER .ASSIGNMENT = VARIABLE ':=' EXPRESSION | VARIABLE OP ':=' EXPRESSION => VARIABLE ':=' VARIABLE OP '(' EXPRESSION ')' For example, 'A u:= {b};' means that 'b' becomes an element of the set 'A'. .VARIABLE = IDENTIFIER ('(' EXPRESSION/',' ')')? The IDENTIFIER may be a set, state, symbol and so on. If it is followed by a list of expressions enclosed in parentheses, it usually is a mapping. Assignment to such a mapping means the (re-) definition of the value for a specific set of arguments. In the implementation, where all states, symbols etc. are realized by natural numbers, these mappings are of course ordinary arrays.
27 .EXPRESSION = PRIMARY/OP .PRIMARY = 0 | 1 /* natural numbers */ | SET | 'Ø' /* the empty set */ | VARIABLE | TUPLE .SET = '{' IDENTIFIER '|' BOOLEAN_EXPRESSION '}' .OP = '+' /* add */ | '-' /* subtract */ | 'u' /* unite with */ | 'n' /* cut with */ | '\' /* delete (set difference) */ .TUPLE = '(' COMPONENT/',' ')' .COMPONENT = EXPRESSION/'|' | (IDENTIFIER ':=')? '*' In a BOOLEAN_EXPRESSION, a COMPONENT which is '*' may have any value; this value is assigned to the named object if '*' is preceeded by IDENTIFIER ':='. For TUPLEs used in a CALL_STATEMENT, ':=*' again stands for a possible change of the actual parameter. .CALL_STATEMENT = IDENTIFIER TUPLE .DO_GROUP = GROUP_HEAD ';' (STATEMENT ';')* END .GROUP_HEAD = DO | DO WHILE BOOLEAN_EXPRESSION | FOR variable ':=' expression TO expression (BY expression)? DO | FOR ('∀' | '?') QUALIFIED_EXPRESSION DO .QUALIFIED_EXPRESSION = IDENTIFIER ('∈' EXPRESSION)? | IDENTIFIER ('∈' EXPRESSION)? '(' BOOLEAN_EXPRESSION ')' The qualificator 'FOR ?' chooses any one of the objects fulfilling the condition. The STATEMENTs in such a DO_GROUP are executed at most once. .CONDITIONAL_STATEMENT = IF BOOLEAN_EXPRESSION THEN (STATEMENT | DO_GROUP) ';' (ELSE (STATEMENT | DO_GROUP) ';')? .BOOLEAN_EXPRESSION = BOOLEAN_PRIMARY/BOOL_OP .BOOLEAN_PRIMARY = RELATION | '∃' QUALIFIED_EXPRESSION | '¬' BOOLEAN_PRIMARY .BOOL_OP = '&' /* and */ | '|' /* (inclusive) or */ .RELATION = EXPRESSION REL_OP EXPRESSION .REL_OP = '<' | '<=' /* less or equal */ | '=' | '≠' | '>' | '>=' /* greater or equal */ | '∈' /* element of */ | '¬∈' /* not element of */ | 'c' /* contained in */ ]
28
2.2 Context-Free Grammars -------------------------- A vocabulary V is a finite, nonempty set of symbols. A word w over V is a finite sequence of symbols in V. The length of w is denoted by |w|. There is one word e with |e| = 0, the empty word. V* is the set of all words over V, including e. For k >= 0 HEAD(w,k) (TAIL(w,k)) gives the string consisting of the first (last) k symbols in w. w**k is the k-fold concatenation of w. We restrict to such context-free grammars (called "reasonable") which are relevant for the following of this paper. Definition 2.2-1: A reasonable context-free grammar RCFG is a 4-tuple G = (nonts,terms,prods,start), where (1) nonts is a finite, nonempty set of nonterminals, (2) terms is a finite, nonempty set of terminals, (3) prods c (nonts x (nonts u terms)*) is a finite, nonempty set of productions, (4) start ∈ nonts is the start symbol of G. The vocabulary (set of symbols) of G is defined by (5) symbs := nonts u terms. In addition, we require for every RCFG: (6) nonts n terms = Ø, (7) ∀ A ∈ symbs (A ∈ nonts <==> ∃ w ∈ symbs* ((A,w) ∈ prods)), (8) ∀ y ∈ symbs (∃ A,x,z ∈ symbs* ((A,x y z) ∈ prods) | ∃ x ∈ symbs* ((y,x) ∈ prods)), (9) (see below, section 2.3). If (A,w) ∈ prods, we will call 'A' the left side and 'w' the right side of the production, and we will also write A = w in accordance with the notation introduced in section 1.2. The usual "->" sign is reserved in this work for the shift action of LR parsers. Definition 2.2-1 excludes several cases which cause problems in practice. Henceforth we will most often omit the prefixes "reasonable" and "context-free" since we will not consider other grammars.
29 Definition 2.2-2: A grammar G defines relations in symbs* x symbs* by (1) w1 => w2 :<==> (w1,w2) ∈ "=>" :<==> ∃ (A,w) ∈ prods ∃ u,v ∈ symbs* (w1 = u A v & w2 = u w v) "directly derives" (2) w =>+ w' :<==> ∃ n > 0 ∃ w0...wn (w = w0 => w1 => ... => wn = w') "derives" (3) w =>* w' :<==> w =>+ w' | w = w' "derives or is" A sequence w0 => ... => wn is called a derivation (of wn from w0). A word w ∈ symbs* is called nullable iff w =>* e. If start =>* w, w is called a sentential form, and if in addition w ∈ terms*, w is called a sentence. The set of all sentences L(G) := {t ∈ terms* | start =>+ t} is the language generated by G. For w ∈ L(G) there may be several derivations from 'start'. Some of them are excluded by the convention always to replace the rightmost nonterminal in w1 of (1), i.e. we require v ∈ terms*. In the following we will only use such rightmost derivations, if not explictly stated otherwise. The remaining variety of derivations leads us to a property of the grammar. Definition 2.2-3: For a grammar G = (nonts,terms,prods,start) let r and l(A) for every A ∈ nonts be new symbols not in terms u nonts. The structure grammar of G is defined by SG(G) := (nonts,terms u {l(A) | A ∈ nonts} u {r} ,{(A, l(A) w r) | (A,w) ∈ prods}, start). To every derivation start => ... => w ∈ L(G) in G uniquely corresponds a derivation start => ... => st(w) ∈ L(SG(G)) in SG; this word st(w) is called the structure tree of w. G is called unambiguous iff for every w ∈ L(G) and for all derivations start => ... => w in G the corresponding derivations in SG lead to the same structure tree. Otherwise G is called ambiguous.
30
2.3 LR(k) Grammars ------------------- One could hardly write a paper containing the string "LR" without reference to the original work of Knuth [1965]. This section and the following are rather closely patterned after that article. Let us first for a fixed k >= 0 append the following requirement to definition 2.2-1 for reasonable context-free grammars: (9) ∃ axiom,eof ∈ symbs ((start, axiom eof**k) ∈ prods & ∀ (A,w) ∈ prods ((A,w) ≠ (start, axiom eof**k) ==> (A ≠ start & ¬ ∃ x,y ∈ symbs* (w = x start y | w = x eof y)))) The endmarker 'eof' (other authors use "$", "-|", "#" or "|?_;" for this symbol) ensures that generally a lookahead string of length k is available. From (5) to (9) we can conclude that eof ∈ terms. Requirement (9) also excludes derivations of the form start =>+ start which would collide with the unambiguity of LR grammars (cf. Harrison and Havel [1974]). Henceforth we will only consider grammars that fulfill restriction (9). We are interested in bottom-up parsers which try to reconstruct the derivation of a sentence while reading the latter from left to right. Such a parser sometimes must reduce a sequence of symbols to the left side of a production. For the points at which such reductions may take place we will use a special term: Definition 2.3-1: For a grammar G let x ∈ symbs* be a sentential form. Any pair (n,p) is called a handle of x iff ∃ u,w ∈ symbs* ∃ v ∈ terms* ∃ A ∈ nonts (p = (A,w) ∈ prods & n = |u w| & start =>* u A v => u w v = x).
31 LR(k) grammars are such that if the parser tries to reconstruct the derivation, the handle can always be uniquely determined if the parser looks ahead k symbols: Definition 2.3-2 (Knuth [1965]): A RCFG G is said to be translatable from left to right with bound k (briefly "an LR(k) grammar") iff ∀ x ∈ symbs* ∀ v1,v2 ∈ terms* ∀ p1,p2 ∈ prods ∀ n1,n2 >= 0 ( (n1 = |x| & (n1,p1) is a handle of x v1 & (n2,p2) is a handle of x v2 & HEAD(v1,k) = HEAD(v2,k) ) ==> (n1 = n2 & p1 = p2) ). A grammar is an LR grammar if there is a k such that the grammar is an LR(k) grammar. A language is an LR language if there is an LR grammar that generates it. Though Knuth caused an avalanche of literature on LR grammars, the main theoretical knowledge remained rather fixed during the time. The following collection of important results on LR grammars is cited from Harrison and Havel [1974]: (1) For a given grammar and a given k >= 0 it is decidable whether the grammar is an LR(k) grammar or not. (2) It is undecidable whether a given grammar is an LR grammar or not. (3) Every LR grammar is unambiguous. (4) The language generated by an LR grammar is deterministic. The DPDA can be effectively constructed. (5) Any deterministic language has an LR(1) grammar. (6) Any LR grammar is equivalent to an LR(1) grammar. Results (1) to (5) are all from Knuth [1965]. (6) is a consequence of (4) and (5); a direct transformation is described by Mickunas, Lancaster and Schneider [1976]. Knuth's second proof for (1) is the key to the practical utilization of LR grammars, since it gives a method how to construct parsers for this class of grammars.
32
2.4 Construction of LR(k) Parsers ---------------------------------- First we need two functions on words: Definition 2.4-1: For a grammar G, k >= 0 and a word w ∈ symbs* let (1) FIRST(w,k) := {t ∈ terms**k | ∃ (Ai,wi) ∈ prods ∃ ui ∈ symbs* ∃ vi ∈ terms* for i = 0...n (w = u0 A0 v0 => u0 w0 v0 = u1 A1 v1 => ... => un wn vn = t vn)}, (2) EFF(w,k) := {t ∈ terms**k | ... /* same as above */ ... => un wn vn = t vn & ∀ i = 0...n (wi ≠ e | ui ≠ e))}. The e-free FIRST-function EFF thus disallows derivation steps where a nonterminal at the extreme left is replaced by the empty word. For the construction we furthermore need a notion for the possible situations during the parse: Definition 2.4-2: For a grammar G and k >= 0 a configuration c is a 3-tuple c = (p, j, t), where p = (A,w) ∈ prods, 1 <= j <= |w|+1, t ∈ terms**k. The string 't' is called the context (or lookahead) of the configuration. A state is a set of configurations. Configurations are more intuitively written in the form c = [A = u @v, t], where u v = w; the marker '@' stands before the 'j'th symbol in 'w'. The string before the marker corresponds to the portion of 'p' which is already recognized; the marker symbolizes the read-head of the parser, and 't' is a string of terminals which are expected after the reduction to 'A'. The terminology is not at all consistent in the literature; what we here (and DeRemer [1971], McKeeman et al. [1970], Pager [1977]) call a "configuration" is also called "item" (Aho and Ullman [1972 etc.]), "state" (Knuth [1965]) or even "partial state" (Korenjak [1969]). The "items" in this work (see the next section) are a combination of the configurations and the GOTO and ACTION tables of Aho and Ullman [1972,1977]. They do not contain contexts unless the
33 marker '@' is at the extreme right. Knuth's famous second proof for the decidability of the LR(k) property of a grammar G describes the construction of the set of states of an LR(k) parser as if it were always carried out parallel to the parsing of a string 'input': LR_K_ALGORITHM: proc(G,k,input); dcl stack ∈ states*,(state,state2) ∈ states, symbol ∈ symbs, (i,j,m,act) ∈ NAT, Z: NAT --> 2**(symbs**k), lng: NAT --> NAT, left: NAT --> symbs; step0: /* initialization */ state := {[start = @axiom eof**k,eof**k]}; /* the start-state */ stack := state; step1: /* compute the closure of 'state' */ state2 := state u {[A2 = u2 @v2, t2] | ∃ A1 ∈ nonts ∃ u1,v1 ∈ symbs* ∃ t1 ∈ terms**k ([A1 = u1 @A2 v1, t1] ∈ state & t2 ∈ FIRST(v1 t1,k))}; /* 18 */ if state2 ≠ state then do; state := state2; goto step 1; end; step2: /* determine the next action of the parser */ m := 0; Z(0) := Ø for ∀ [A = u @v, t] ∈ state do; if v ≠ e then Z(0) u:= EFF(v t,k); /* 19 */ else do; m := m + 1; Z(m) := t; /* 20 */ lng(m) := |u v|; left(m) := A; end; end; act := m + 1; for i := 0 to m do; for j := i+1 to m do; if Z(i) n Z(j) ≠ Ø then /* G is not LR(k) */ goto done; end; if HEAD(input,k) ∈ Z(i) then act := i; end; /* for i */ if act = 0 then do; /* shift */ symbol := HEAD(input,1); input := TAIL(input, |input| - 1); end; else if act <= m then do; /* reduce */ symbol := left(m); stack := HEAD(stack, |stack| - lng(act)); /* 21 */ state := TAIL(stack,1); end; else /* input ¬∈ L(G) : error */ goto done; step3: /* compute the successor state */ state2 := {[A = u symbol @v, t] | [A = u @symbol v, t] ∈ state}; /* 23 */ state := state2; stack := stack || state; if state ≠ {[start = axiom @eof**k,eof**k]} then
34 goto step1; else /* input ∈ L(G) : accept */ done: end LR_K_ALGORITHM; The construction terminates: since 'symbs' and 'prods' are finite, the powerset of the set of all possible configurations is finite, but this is the set of all possible states. In line (2) the context is irrelevant because the parser will already accept the sentence when it has read 'axiom', and then "sees" eof**k. The numbering of the steps and equations is from Knuth [1965]. The algorithm above is about as long as Knuth's description, but it avoids any "...", renaming of subscripts, and verbal formulations such as "defined recursively as the smallest set satisfying ..." or "Z,Z0...Zp must all be disjoint". In contrast to Knuth we did not stack symbols, because this is only necessary if the semantic information behind the symbols is needed when reducing a production. In practice it is neither possible nor necessary to perform steps 1 and 3 and the check for disjointness of the Z(i) always anew for the same states. The algorithm usually is divided in two parts: a constructor which precomputes the states, their closure and especially a parsing table, i.e. a mapping from states and lookahead strings of length k to parser actions. This table then is used by the parser (or parsing algorithm) to analyze specific sentences. For every pair (state,lookahead) the table must provide information whether (1) to shift, and what state will be the successor, (2) to reduce, and to what production (only the left side and the length of the right side are of interest), (3) to accept, (4) to announce error. In practice it is always a problem to store the parsing table. Since entries for case (4) occur very often, they are usually omitted, and an error is announced if the mapping is undefined for a pair. The factor 'k' has a very strong influence on the size of the table, therefore virtually all practical methods use k = 1 (only few real grammars are LR(0)). Knuth already pointed out that it is possible to identify states with respect to the parser actions as long as the
35 Z(i) remain disjoint. This merging leads to the LALR(k) construction methods and parsing tables (note that the parsing algorithm remains the same). The only difference between Knuth's LR parsing and LALR parsing may be that with LALR an error is detected some reductions later than with an LR(k) table. However, the parser will never shift an erroneous symbol with either table. Definition 2.4-3: The core of a state 'q' is the set of marked productions: CORE(q) := {(p,j) | ∃ t ∈ terms**k ((p,j,t) ∈ q}. A configuration group is a 3-tuple (p,j,{t1...tn}), where (p,j,t1) ... (p,j,tn) are configurations with a common core. The canonical LALR(k) construction method is defined by the following informal Algorithm 2.4-4: (1) Construct a LR(k) set of states and the corresponding parsing table. (2) As long as there are two different states with equal cores, identify the states, i.e. union their sets of configurations together with the corresponding parser actions, and replace all occurrences of either state by the resulting state. (3) Check whether the parsing table continues to define a unique mapping - if so (if not), the grammar is (is not) LALR(k). This algorithm only serves for definitional purposes - again it is impracticable because it relies on an LR(k) table which must first be computed and stored. The tables obtained with LALR(k) for k >= 1 all have the same number of states as the LR(0) parsing table of the grammar. This number is acceptably small for realistic grammars, and therefore there has been great interest in algorithms for LALR(k) which avoid the previous construction of an LR(k) table.
36
2.5 Data Structures and Mappings --------------------------------- The description of the incremental construction is complicated by the fact that the sets 'nonts', 'terms' and 'prods' which constitute a grammar may be changed between two reorganization steps. Sometimes we will need new symbols, productions, states and so on. Therefore we will distinguish between the "possible" and the "current" sets: Definition 2.5-1: (1) SYM is an denumerable set of possible symbols, (2) PRO c SYM x SYM* is an denumerable set of possible productions, (3) STA is an denumerable set of possible states, (4) ACT = {shift,reduce,accept,error,oldsh,newsh} is the set of parser actions, (5) NAT = {0,1,2,3,...} is the well-known set of natural numbers, (6) ITE c SYM x PRO x NAT x ACT x STA is an denumerable set of possible items, (7) nonts c SYM, terms c SYM, states c STA, and items c ITE are finite sets of current nonterminals, terminals, states and items, respectively. Since we are mainly interested in the construction of the parsing table, we will not mention the changes of 'nonts', 'terms' and 'prods', but we will assume that the current grammar always fulfills conditions (5) to (9) of definition 2.2-1 during a reorganization step. The changes in the current sets of states and items, i.e. the creation of new and the deletion of superfluous elements will be explicitly stated, however. Though these changes may seem trivial, they strongly influence the implementation. Creation may always lead to exhaustion of the available storage, and deleted elements should be reuseable, i.e. they should be collected in free lists. As shown in figure 2.5-2, the sets of definition 2.5-1 are related by several mappings. It is the main task of the construction to adjust the values of these mappings after a change in the grammar.
37 -------------------------------------------------------------- pronuc stapre --- stanum -------------------------> states --- staite -->> items NAT ↑ | <---itesuc----- | -----itepos---> ↑ ↑ | | | | | | | | stasym iteact | | | | | | | itesym ↓ itepro prolng symsta | ACT | | | | | | | | | | | ↓ <-------------- prolef -------------- | symbs prods --------------- sympro ------------>> promem A --> B : mapping from A to B A -->> B : mapping from A to 2**B A,B --> C : mapping from A x B to C -------------------------------------------------------------- Figure 2.5-2: Mappings between Data Structures In the following the mappings are collected with respect to their domains, and several explanations and properties are given. Some mappings are not really necessary for the construction, but they are introduced because they shorten the description, and especially because they enhance the efficiency of the algorithms. Productions Definition 2.5-3: (1) prolef: prods --> symbs (A,w) |-> A "the left side", (2) prolng: prods --> NAT (A,w) |-> |w| "the length of the right side", (3) promem: prods x NAT --> symbs ((A,w),i) |-> if 1 <= i <= |w| then TAIL(HEAD(w,i),1) else eop "the 'i'th member of the right side".
38 A fourth mapping (4) pronuc: prods x NAT --> 2**states is defined in section 3.4.1. The symbol 'eop' ("end of production") is assumed to be different from 'start' and 'eof', and like these symbols it cannot occur in any production that is inserted or deleted. Items The main components of the parsing table are the items which in this work are a combination of the configurations and of the parser actions. Definition 2.5-4: An item is a 5-tuple of the form (0) (symb,prod,pos,act,succ) ∈ symbs x prods x NAT x ACT x states The individual components are obtained by the mappings (1) itesym: items --> symbs "the marked symbol resp. the lookahead", (2) itepro: items --> prods "the production which is marked", (3) itepos: items --> NAT "the position of the marker", (4) iteact: items --> ACT "the action of the parser", (5) itesuc: items --> states "the successor state". During the construction the components 'prod' and 'pos' will always define a marked production, therefore for any 'item' we will have 1 <= pos <= prolng(prod) + 1. (As in definition 2.4-2, the marker '@' is before the 'pos'th symbol in the right side of 'prod', and again items are sometimes more intuitively written in the form [A = u @v]. If pos <= prolng(prod), we have a shift-item of the form (promem(prod,pos),prod,pos,shift,state) and the successor 'state' is defined.
39 If pos = prolng(prod) + 1, we have a reduce-item (la,prod,pos,reduce,*), where 'la' is the lookahead (-symbol because we only use k = 1). The successor '*' is formally used as a state, but it is irrelevant. There is exactly one item (eof,prod,3,accept,*) in every parsing table (cf. section 3.1). The actions 'oldsh' and 'newsh' are only used during the reorganization to mark some shift-items. In the examples parsing tables are often given in an abbreviated and linearized form. Example 2.5-5: For the grammar [axiom = S .S = S a S b .S = ] the following parsing table is obtained with NQLALR(1): 2 axiom -> 3 eof =. | S -> 4 eof =: axiom,1 | | a -> 5 S -> 6 a -> 5 ... | | | b -> 7 eof =: S,4 | | | a =: S,4 | | | b =: S,4 | | a =: S,0 | | b =: S,0 | a =: S,0 States are numbered 2,3... (the implemented system always assigns number 2 to the start state, and all examples were computed by this system). Every entry has the form state symbol action where action is one of the following: -> q : shift 'symbol' and goto state 'q', =: A,n : for lookahead 'symbol' reduce; the left side is 'A', pop 'n' elements off the stack, =. : accept. If there is more than one action in a state, following lines are used. The state number is not repeated in this case, but the indentation and the vertical bars indicate to what state an action belongs.
40 If a state is followed by three dots, its actions were already given in other lines. Though the marked productions are not given in this representation of the parsing table, they can be concluded easily if the shift arrows are traced backwards. States Definition 2.5-6: (1) staite: states --> 2**items "the set of items" (2) stapre: states --> 2**states stateb ∈ stapre(statea) :<==> ]?= (*,*,*,shift,stateb) ∈ staite(statea) "all predecessor states" (3) stasym: states --> symbs stasym(state) = symb :<==> ∃ (*,prod,pos,*,*) ∈ staite(state) (pos >= 2 & symb = promem(prod,pos-1)) "the symbol that accesses the state" A fourth mapping (4) stanum: state --> NAT is defined in section 3.4.1. A symbol 'symb' is said to be used in a state 'state' iff ∃ (symb,*,*,*,*) ∈ staite(state). As the terms "successor" and "predecessor" already indicate, states may be seen as nodes of a directed graph whose edges are marked with the shifted symbols: Definition 2.5-7: Given an LR(k) parsing table with a set of states 'states', the corresponding transition graph is a 3-tuple TG = (states,E,M), where (1) E c states x states (statea,stateb) ∈ E :<==> ∃ (*,*,*,shift,stateb) ∈ staite(statea) <==> statea ∈ stapre(stateb), (2) M: E --> symbs M(statea,stateb) = symb :<==> ∃ (symb,*,*,shift,stateb) ∈ staite(statea) <==> stasym(stateb) = symb.
41 One edge in the graph may represent several shift-items in 'statea'. There may be cycles in the graph, there is a root (the start-state), and there may be leaves (states with no shift-items). Symbols Definition 2.5-8: (1) sympro: symbs --> 2**prods A |-> {p ∈ prods | A = prolef(p)} "all productions for the symbol", (2) symsta: symbs --> 2**states x |-> {state ∈ states | x = stasym(state)} "all states accessed by the symbol". A symbol 'symb' is a nonterminal if sympro(symb) ≠ Ø, otherwise it is a terminal.
42
3 Algorithms for NQLALR(1) --------------------------- We are now ready to describe the algorithms for the incremental construction of an LR(1) parser. For the computation of lookaheads, we will use a simple method that lies between SLR(1) and LALR(1); it was termed NQLALR(1) ("Not Quite LALR(1)") by DeRemer and Pennello [1979]. This method is especially well suited for the incremental construction; its weakness with respect to the class of grammars for which it is applicable will be discussed and overcome in section 4. Let us look back to the system described in section 1.2. If we concentrate on the parser constructor, such a system has the following structure: EXTRA: proc; dcl (delete_set,insert_set) c PRO; INIT; do while true; delete_set := /* a set of productions to be deleted from the grammar */; insert_set := /* a set of productions to be inserted in the grammar */; REORGANIZE; end; end EXTRA; INIT creates a very simple grammar and a corresponding parsing table. Algorithm REORGANIZE is the parser constructor; it reorganizes the parsing table such that it corresponds again to the new grammar. This step may be repeated several times. The design of algorithm REORGANIZE is the main theme of this work. The algorithm consists of a series of steps: REORGANIZE: proc; dcl changed ∈ BOOL ,(lagar_set,laget_set,laput_set,successors_set ,conflict_set) c STA ,closure_set c SYM; changed := true; do while changed; PROEDIT(delete_set,* ,PRODEL); PROEDIT(insert_set,changed:=*,PROINS); LAGAR; SUCCESSORS; end; GARBAGE; LAGET; LAPUT; CONFLICT; end REORGANIZE; PRODEL and PROINS are used to change the grammar and to prepare the computation of the new set of states. As long as
43 a production became accessible from the start-symbol, the switch 'changed' is true, and there may be other productions which can be inserted now. LAGAR deletes lookaheads that may become invalid by the changes. The transition graph is changed by SUCCESSORS. Thereby we may be left with some states which are no longer accessible from the start-state. Such states are deleted by GARBAGE. With LAGET and LAPUT the lookahead-sets of reductions are (re-)computed. CONFLICT finally looks at conflicts and tries to remove them. It may detect that states must be splitted for this purpose. All algorithms in REORGANIZE are closely interrelated; therefore it is difficult to describe them in a sequential ordering. We begin with the initialization in section 3.1. The computation of lookaheads is most severely affected by the incremental approach. In section 3.2, the NQLALR(1) method for that computation is presented and justified, and it is shown how the entire loookahead complex can be separated from the modifications of the transition graph for states. Thereafter the modification of the grammar in section 3.3 may be understood. In section 3.4 the modification of the transition graph is described, and in section 3.5 the lookaheads are (re-) computed. All questions which arise if conflicts occur are put aside until section 4.
44
3.1 Initialization and the Parsing Algorithm --------------------------------------------- The incremental construction of a LR(1) parsing table starts with a very simple grammar that has only one production. symbs := {start,axiom,eof} prods := {prod1} prod1 := (start, eof axiom eof) We define the corresponding parsing table to consist of two states with only one item: 2 axiom -> 3 eof =. The complete initialization is done by INIT: proc; /* initialization of the grammar */ sympro(start) := {prod1}; promem(prod1,1) := eof; promem(prod1,2) := axiom; promem(prod1,3) := eof; /* initialization of the parsing table */ ALLOC(eof ,state2:=*); ALLOC(axiom,state3:=*); staite(state2) := {(axiom,prod1,2,shift,state3)}; staite(state3) := {(axiom,prod1,3,accept,*)}; stapre(state3) := {state2}; pronuc(prod1,2):= {state2}; pronuc(prod1,3):= {state3}; end INIT; Algorithm ALLOC is a small algorithm which creates a new state that is accessed by the first parameter. It is given in section 3.4.1. This choice of the initial grammar and parsing table needs some remarks: (1) The first member 'eof' of the production has the only purpose to make the item in 'state2' a nucleus-item (with a marker-position >= 2). (2) The symbols 'start' and 'eof' are not accessible to the user of the constructor, they cannot occur in any production which is added later, and 'prod1' cannot be deleted. (3) The only way to extend the grammar now is to add a production with 'axiom' on the left side. (4) 'state2' has no predecessor, therefore no lookaheads can be received by this state.
45 The parsing algorithm consists of a loop in which one of the actions 'shift', 'reduce', 'accept' or 'error' is performed; the loop is left by action 'accept': PARSE: proc; dcl top ∈ NAT, state ∈ states, (left,symb) ∈ symbs; state := state2; top := 1; stack(top) := state; read: SCAN(symbol:=*); /* get next symbol form the input word */ next_action: DELTA(state,symbol,item:=*); if iteact(item) = shift then do; state := itesuc(item); top +:= 1; stack(top) := state; goto read; end; else if iteact(item) = reduce then do; left := prolef(itepro(item)); top -:= prolng(itepro(item)); state := stack(top); DELTA(state,left,item:=*); state := itesuc(item); top +:= 1; stack(top) := state; goto next_action; end; else if iteact(item) = error then do; if symb ≠ eof then do; /* issue an error-message and try next input-symbol */ goto read; end; /* else 'eof' after error or iteact(item) = accept: stop parsing */ end PARSE; The algorithm uses a mapping stack: NAT --> states which represents the well-known parse stack of states. We do not stack symbols because the shifted symbol can always be determined from a state by the mapping 'stasym'.
46 The parser as well as the constructor need a small algorithm which computes the next action to be performed by the parser: DELTA: proc(state,symbol,item:=*); if ∃ item1 ∈ staite(state) (item1 = (symbol,*,*,*,*)) then do; if ∃ item2 ∈ staite(state) (item2 ≠ item1 & item2 = (symbol,*,*,*,*) then /* conflict for 'symbol' */ DECIDE(state,symbol,item:=*); else item := item1; end; else item := (symbol,*,*,error,*); end DELTA; DELTA returns an item with action 'error' for a (state,symbol)- pair that is not stored in the parser-table. DECIDE is called if there is more than one possible action. The latter algorithm may resolve the conflict by selecting the shift-item in favour of reduce-items, or by using semantic information, but it is not described here. If we look at the initial grammar and if we assume that the first 'eof' has already been read by some magic, i.e. the remaining input string is 'axiom eof', then the parser will perform only two actions: +----------+-------------------+---------------+------------+ | actual | stack contents | remaining | action | | state | | input string | | +----------+-------------------+---------------+------------+ | state2 | | axiom eof | -> state3 | | state3 | state2 | eof | =. | +----------+-------------------+---------------+------------+ If the grammar is extended by some productions with 'axiom' on the left side, we have to add the closure items to 'state2', and we must compute additional states; in the input 'axiom' may be replaced by a string derivable from that symbol, and the parser will perform additional actions. These actions are to be inserted before the actions in the table above; note that the table always ends with those two actions.
47
3.2 The NQLALR(1) Method for the Computation of Lookaheads ----------------------------------------------------------- The NQLALR(1) method is very simple in concept. Let us assume that we have already constructed an LR(0) parsing table, and that we want to compute the lookaheads for a reduction to production 'prod' in state 'statez'. This approach is much the same as in the SLR(1) method of DeRemer [1971]. With SLR(1), we would take all terminals which may follow on the left side of the production in any sentential form derivable from the start symbol. Here we imagine the possible situations of the parser. If prod = (left, symb[1]...symb[n]), and if the parser performs the reduction for 'prod' in 'statez', it will pop a state 'statea' from the stack and it will go to that state afterwards. Here it will always shift 'left' and go to a state 'stateb': --------------------------------------------------------------- symb[1] symb[n] statea --------> ... --------> statez | | left ↓ stateb --------------------------------------------------------------- Figure 3.2-1: A subgraph typical for the NQLALR(1) method If we assume that the lookaheads in 'stateb' have already been computed by the usual LALR(1) method, then it is obvious that the terminals which are expected by the parser in 'stateb' have something to do with the lookaheads we want to compute: In the situation of figure 3.2-1, every symbol that is not used in 'stateb' in a shift action or as lookahead in a reduction is no valid lookahead for a reduction to 'prod' in 'statez'. The NQLALR(1) method puts this statement into positive form: Every symbol that is used in 'stateb' ... is taken as a lookahead for 'prod' in 'statez'. The method starts with the parsing table obtained by the LR(0) method, and it transmits the used terminals from "emitting" states ('stateb') to "receiving" states ('statez'). The process iterates as long as any state obtains an additional terminal (initially only shifted terminals are transmitted). A precise definition of this process is the algorithm LAPUT in section 3.5.
48 The "Not Quite" prefix is justified by Lemma 3.2-1: There are grammars which are LALR(1), but not NQLALR(1). Proof: DeRemer and Pennello [1979] give such a grammar. Another example is the 5th grammar from Mickunas, Lancaster and Schneider [1976]: [axiom = S .S = o A o | o B q | q B o .A = c | c A .B = c | c B ] Here NQLALR(1) would produce the following transition graph: 2 axiom -> 3 eof =. | S -> 4 eof =: axiom,1 | o -> 5 A -> 7 o -> 12 eof =: S,3 | | B -> 8 q -> 13 eof =: S,3 | | c -> 9 o =: A,1 | | o =: B,1 | | A -> 14 o =: A,2 | | B -> 15 o =: B,2 | | | q =: B,2 | | q =: B,1 | | c -> 9 ... | q -> 6 B -> 10 o -> 16 eof =: S,3 | c -> 11 o =: B,1 | B -> 15 ... | c -> 11 ... The action 'o =: B,1' in state 9 (and hence the conflict) is not present with LALR(1). Lemma 3.2-2: Every NQLALR(1) grammar is an LALR(1) grammar. Proof: Compared with LALR(1), we obtain at most some additional lookaheads with NQLALR(1), but not fewer. We had no conflict with NQLALR(1), therefore we will have none with LALR(1). As mentioned in section 2.4, additional lookaheads will only cause the parser to perform some additional reductions before the detection of an error, but an erroneous symbol is never shifted. Therefore LR(1) parsers also operate correctly when they use conflict-free parsing tables constructed by NQLALR(1). Figure 3.2-1 may be interpreted in two ways: (1) a terminal in 'stateb' must be emitted to all reductions reductions for productions 'prod' in states 'statez', or (2) a reduction for a production 'prod' receives as its lookaheads the terminals of all appropriate states 'stateb'.
49 An en-bloc construction method would need (1) only, but for the interactive construction (2) is also important. The two interpretations lead to two mappings: emitters: states --> 2** (states x prods) receivers: states x prods --> 2** states With 'emitters', we could take the terminals of a state and "put" them to a set of reductions in states, while with 'receivers' we can "get" the terminals of a set of states for a reduction in a state. To compute these mappings, we must be able to trace the horizontal sequence of transitions in figure 3.2-1 in both directions. From left to right ("forward") this trace is done by: FORWARD: proc(statea,prod,posa,APPLY); /* move the marker to the end of the production */ dcl (statei,statej) ∈ states, posi ∈ NAT, symi ∈ symbs; statei := statea; posi := posa; do while posi <= prolng(prod); symi := promem(prod,posi); if (symi,prod,posi,shift,statej:=*) ∈ staite(statei) then do; statei := statej; posi +:= 1; end; else goto done; end; APPLY(statea,statei,prod); done: end FORWARD; Starting with a production 'prod' marked at position 'posa' in state 'statea', the algorithm tries to move the marker to the end of the production. If all transitions for the currently marked members of the production already exist, FORWARD finally calls APPLY.
50 The other direction is traced by: BACKWARD: proc(statez,length,APPLY); /* move the marker to the beginning of the production */ dcl (seti,setj) c states, (statea,statej) ∈ states; setj := {statez}; do i := 1 to length; seti := Ø for ∀ statej ∈ setj do; seti u:= stapre(statej); end; setj := seti; end; for ∀ statea ∈ setj do; APPLY(statea); end; end BACKWARD; The starting nodes of all paths of length 'length' that end in state 'statez' are determined, and APPLY is called for the states 'statea' at these nodes. A realization of the mappings 'emitters' and 'receivers' by a static data structure would consume too much storage, and it would lead to difficult reorganization problems when the transition graph is changed. Instead, two algorithms are used which compute an element of the powersets on the right side, and which call an algorithm APPLY with every state in that element. EMITTERS: proc(statez,prod,APPLY); /* where do lookaheads come from ? */ dcl setb c states; ECOPY: proc(statea); DELTA(statea,prolef(prod),item:=*); setb u := {itesuc(item)}; end ECOPY; setb := Ø BACKWARD(statez,prolng(prod),ECOPY); for ∀ stateb ∈ setb do; APPLY(stateb,statez,prod); end; end EMITTERS; We move the marker which is at the end of 'prod' back before the first member by the aid of BACKWARD. In all states reached by BACKWARD we shift the left side of 'prod'; the resulting states are the "emitters", and APPLY is called for them.
51 RECEIVERS: proc(stateb,APPLY); /* where do lookaheads go to ? */ dcl setz c states; RCOPY: proc(statez); setz u:= {statez}; end RCOPY; for ∀ prod ∈ sympro(stasym(stateb)) do; setz := Ø for ∀ statea ∈ stapre(stateb) do; FORWARD(statea,prod,1,RCOPY); end; for ∀ statez ∈ setz do; APPLY(stateb,statez,prod); end; end; end RECEIVERS; State 'stateb' is either accessed by a nonterminal or by a terminal. In the latter case, no lookaheads are transmitted. For a nonterminal, lookaheads are transmitted along the paths which are built for the various productions for that nonterminal, and that start at the predecessors of 'stateb'. APPLY is called for the corresponding reduce items in the "receiving" state. Note that the calls of APPLY in the two algorithms have the same parameter lists. During the interactive modification there are four operations which may affect lookaheads that were previously computed by the NQLALR(1) method: (SD) a s?_ymbol is no longer used in a state, it is d?_eleted, (SI) a new s?_ymbol is used in a state, it is i?_nserted, (ED) an e?_dge of the transition graph is d?_eleted, (EI) a new e?_dge is i?_nserted. For symbols the relevant state is always the emitting state 'stateb' of figure 3.2-1. For edges there are four subcases depending on the interpretation of the edge as one of the h?_orizontal edges in figure 3.2-1, or as the v?_ertical one. In the following the appropriate actions for cases (SD), (SI), (EDH), (EDV) are described. Cases (EIH) and (EIV) are treated by ITEMINS in section 3.3.2 and by LAGET and LAPUT respectively (section 3.5).
52 (SD) A Symbol is Deleted from a State Algorithm LAGAR processes an entire set of states that have lost some symbols: LAGAR: proc; /* garbage collection for lookaheads */ dcl set c states; do while lagar_set ≠ Ø set := lagar_set; lagar_set := Ø for ∀ stateb ∈ set do; RECEIVERS(stateb,LADEL); end; end; end LAGAR; LAGAR propagates the deletion of lookaheads in a state 'stateb' to all states 'statez' that received lookaheads from 'stateb'. Since there is no easy way to know whether a lookahead 'symb' in 'statez' steems only from 'stateb', or whether it is also received from another state, LADEL is used to delete all reduce-items for the reductions which receive lookaheads from 'stateb': LADEL: proc(*,statez,prod); dcl lost ∈ BOOL; lost := false; for ∀ itemz ∈ staite(statez) (itemz = (symbz:=*,prod,prolng(prod)+1,reduce,*)) do; if symbz ≠ eop then staite(statez) \:= {itemz}; if ¬ ∃ (symz,*,*,*,*) ∈ staite(statez) then lost := true; end; staite(statez) u:= {(eop,prod,prolng(prod)+1,reduce,*)}; laget_set u:= {statez}; if lost then lagar_set u:= {statez}; end LADEL; LADEL replaces all reduce-items for a production 'prod' in a state 'statez' by one reduce-item with the special lookahead 'eop'. The latter is reexpanded by LAGET (cf. section 3.5). The first parameter '*' is necessary because LADEL is used as parameter APPLY to RECEIVERS. If state 'statez' at least once loses the last item with a certain symbol, LAGAR must be activated for 'statez', too. LADEL only deletes lookaheads which are different from 'eop'. A state which has only reductions under 'eop' will not again be stored in 'lagar_set', and the loop in LAGAR will thus terminate.
53 (SI) A Symbol is Inserted in a State We cannot transmit the symbol to the receiving states until the transition graph has completely been reorganized. Therefore we simply remember that 'stateb' must emit all its symbols: laput_set u:= {stateb}; (EDH) A Horizontal Edge is Deleted A closer look to an edge shows the following situation: statei = statej = {... {... [l1, x1 @symbi y1] symbi [l1, x1 symbi @y1], ... --------> ... [ln, xn @symbi yn], [ln, xn symbi @yn], ...} ...} i.e. the edge may represent several shift items. We have to move the marker forward to the end of all productions which have symbol 'symbi' marked in 'statei'. The reductions which are reached then can no longer be assumed to receive their lookaheads through the edge, and since we do not know whether they receive them from elsewhere, we delete all those lookaheads: EDGEDEL: proc(statei,statej); /* lookaheads are no longer transmitted through an edge */ for ∀ item ∈ statei (item = (symbi:=*,prod:=*,pos:=*,shift | oldsh,statej)) do; FORWARD(statej,prod,pos+1,LADEL); end; for ∀ prod ∈ sympro(stasym(statej)) do; FORWARD(statei,prod,1,LADEL); end; end EDGEDEL; The second for-loop of this algorithm already solves the next task: (EDV) The Vertical Edge is Deleted We start with the marker before the first member of all productions of the symbol that accesses 'statej' (= 'stateb') and we move the marker forward to the end of the productions. The lookaheads of the reductions reached then are all deleted by LADEL. The algorithms and operations above are used at various places in the following sections. They isolate the difficulties of lookaheads in connection with thhe incremental construction, so that we may now concentrate on the modification of the grammar and of the transition graph.
54
3.3 Modification of the Grammar -------------------------------- Both tasks, the deletion and the insertion of a set of productions, have several similarities. In this section the algorithms are presented which apply to both tasks, while the differences are treated in two subsections. We will use the term "edit" for one of the operations "delete" or "insert". Editing is done with the aid of PROEDIT: proc(prod_set,changed:=*,APPLY); dcl left ∈ SYM; changed := false; for ∀ prod ∈ prod_set do; left := prolef(prod); for ∀ stateb ∈ symsta(left) do; changed := true; prod_set \:= {prod}; for ∀ statea ∈ stapre(stateb) do; APPLY(statea,prod); end; end; end; end PROEDIT; For every production in 'prod_set' all states are determined which shift the left side of that production. At least these states are affected by the change of the grammar, and another algorithm APPLY is called for each of them. In 'changed' it is noted whether there was any such state, i.e. whether 'left' is already reachable from the start symbol (cf. section 3.3.3). APPLY is either PRODEL or PROINS. If a state shifts the left side of a production, then it must contain an item with the marker before the first member of the right side of that production. Generally it contains the closure, i.e. for every marked nonterminal the state contains all productions of that nonterminal, with the marker before the first member. The closure is computed with the aid of CLOSURE: proc(state,APPLY); /* compute the closure items of a set of symbols */ dcl set c SYM; do while closure_set ≠ Ø set := closure_set; closure_set := Ø for ∀ symb ∈ set do; for ∀ prod ∈ sympro(symb) do; APPLY(state,prod,1); end; end; end; end CLOSURE; The algorithm calls another algorithm APPLY for every item in the closure; it is assumed that APPLY inserts the marked symbol in 'closure_set'. APPLY is either ITEMMARK (for deletion) or ITEMINS (for insertion of a production).
55
3.3.1 Deletion of a Set of Productions --------------------------------------- At least three types of data in the parsing table are computed by a closure operation: lookaheads, items and states. Generally the algorithms for deletion in such closed data structures appear to be more complex than their counterparts for insertion. If an element is inserted, we only have to insert the closure of that element, but if an element is deleted, we cannot delete all its closure elements, since some of them may yet be in the closure of other remaining elements. In the case of lookaheads (algorithms LAGAR, LADEL in section 3.2) we deleted the closure and we noted that it should later be recomputed by LAGET. It is not necessary to be more careful, since lookaheads do not (yet, see section 4) affect closure-items and states. Unnecessaray deletion of closure-items, however, would cause unnecessary deletion of lookaheads, so it is advisable to be more cautious here. Algorithm PRODEL is given a state and a production with the marker before the first member. PRODEL: proc(state,prod); dcl lost ∈ BOOL; closure_set := Ø for ∀ item ∈ staite(state) do; if itepos(item) >= 2 then do; /* nucleus-item */ if iteact(item) = shift then closure_set u:= {itesym(item)}; end; else if iteact(item) = shift then iteact(item) = oldsh; else if iteact(item) = reduce then iteact(item) = oldre; end; CLOSURE(state,ITEMMARK); lost := false; for ∀ item ∈ staite(state) do; if iteact(item) = oldre then do; /* empty production */ staite(state) \:= {item}; if ¬ ∃ (itesym(item),*,*,shift | reduce,*) ∈ staite(state) then lost := true; end; else if iteact(item) = oldsh then do; EDGEDEL(state,itesuc(item)); if ¬ ∃ (itesym(item),*,*,shift | reduce,*) ∈ staite(state) then lost := true; end; end; if lost then lagar_set u:= {state}; if prolng(prod) > 0 then /* no empty production */ successors_set u:= {state}; end PRODEL;
56 The closure is updated in three steps: (1) all closure-items are marked to be "old", (2) the closure of the nucleus-items is recomputed, and the resulting items are marked with their normal actions, (3) by SUCCESSORS in section 3.4 the remaining "old" items are deleted. For all "old" shift-items the corresponding edge will be deleted from the transition graph; therefore by EDGEDEL we delete all lookaheads which are transmitted through this edge. If there are symbols which are no longer used in the state ('lost' = true), the state must be processed by LAGAR. PRODEL only deletes the item with the marker before 'prod'; in GARBAGE we will delete all states which had a nucleus-item for 'prod'. Algorithm ITEMMARK marks all items in the new closure with their usual actions ('shift' resp. 'reduce'). ITEMMARK: proc(state,prod,*); for ∀ item ∈ staite(state) (item = (symb:=*,prod,1,act:=*,*)) do; if act = oldre then iteact(item) := reduce; else if act = oldsh then do; iteact(item) := shift; closure_set u:= {symb}; end; end; end ITEMMARK;
57
3.3.2 Insertion of a Production -------------------------------- The analogon to PRODEL for insertion is PROINS: proc(state,prod); closure_set := Ø ITEMINS(state,prod,1); CLOSURE(state,ITEMINS); for ∀ symb (∃ (symb,*,*,newsh,*) ∈ staite(state)) do; if ∃ (symb,*,*,shift,stateb:=*) ∈ staite(state) then EDGEDEL(state,stateb); end; if prolng(prod) > 0 then successors_set u:= {state}; end PROINS; After the insertion of the item which has the marker before the first member of 'prod' the closure of the marked symbol is also inserted. If there are old items which already shift a symbol 'symb', the state will need a different 'symb'- successor, therefore we delete all lookaheads transmitted through the existing edge. PROINS is always called after PRODEL, so we need not care for 'oldsh' items. The algorithm for the insertion of a single item is more complicated because it is also used later in SUCCESSORS to insert items which have the marker not before the first member. ITEMINS: proc(state,prod,pos); /* insert an item in a state */ symb := promem(prod,pos); if ¬ ∃ (symb,prod,pos,*,*) ∈ staite(state) then do; item := (symb,prod,pos,newsh,*); if symb ≠ eop then do; /* shift item */ if ¬ ∃ (symb,*,*,shift | newsh,*) ∈ staite(state) then closure_set u:= {symb}; if ¬ ∃ (symb,*,*,shift | newsh | reduce,*) ∈ staite(state) then laput_set u:= {state}; successors_set u:= {state}; end; /* shift item */ else do; /* reduce item */ iteact(item) := reduce; laget_set u:= {state}; end; /* reduce item */ if ∃ (symb,*,*,reduce,*) ∈ staite(state) then conflict_set u:= {state}; staite(state) u:= {item}; end; end ITEMINS; Symbol 'symb' is the marked member. If the item is not yet in the state, it is inserted, and if 'symb' was not used previously in the state, it must be transmitted as lookahead
58 by LAPUT. If a conflict is introduced by the new item, the state must later be processed by CONFLICT (see section 4). New reduce-items must get their lookaheads by the aid of LAGET.
3.3.3 Reduced Grammars ----------------------- Context-free grammars are usually required to be reduced: Definition 3.3.3-1: A grammar G = (nonts,terms,prods,start) is called reduced iff the following two conditions are fulfilled: (R1) ∀ A ∈ nonts ∃ form1,form2 ∈ symbs* (start =>* form1 A form2) ('A' is reachable), (R2) ∀ A ∈ nonts ∃ word ∈ terms* (A =>* word) ('A' is based). We will now discuss the two conditions separately. Reachability (R1) is in a certain sense automatically ensured by any LR construction method: Lemma 3.3.3-2: ∀ A ∈ nonts (∃ stateb ∈ states (stasym(stateb) = A) ==> (R1)) Proof: For every predecessor 'statea' of 'stateb' there is a path of shift-transitions from the start-state to 'statea'; 'form1' is the concatenation of all symbols which are shifted along that path. Assume that the parser has read a string such that it is in 'stateb' now: its last action was a shift of 'A' from 'statea' to 'stateb'. For every state reached by the parser there is a string form2 ∈ terms* which continues the word consumed so far to a sentence (the "correct prefix" property).
59 Basement (R2) is not ensured in our construction. There are three reasons to allow unbased nonterminals: (1) We do not want to require the user to insert the productions in a special ordering; he may later insert a production which bases the nonterminal. (2) In the EXTRA system presented in section 1.2 unbased nonterminals are useful in connection with the backspacing of the parser. They are used to introduce parts of the grammar which can only be generated, but not be input. (3) It seems that there is no easy way to check (R2) by looking at the parsing table. Unbased nonterminals may lead to states which do not use any terminal, but not necessarily, as can be seen from Example 3.3.3-3: [axiom = S .S = A .A = b C | d .C = d C ] 2 axiom -> 3 eof =. | S -> 4 eof =: axiom,1 | A -> 5 eof =: S,1 | b -> 6 C -> 8 eof =: A,2 | d -> 7 eof =: A,1
60
3.4 Modification of the Set of States -------------------------------------- 3.4.1 Compatibility of States ------------------------------ The construction process for LR parsing tables terminates because a new successor state will be created only if the table does not yet contain an appropriate state. In the latter case, the state already in the table is made the successor. In practice a new state is always created, then it is decided whether any existing state is compatible with the new one, and if such a state exists, the new one will be deleted and the existing state will be taken as successor. The various construction methods differ in the compatibility criteria (cf. Pager [1977 PGM]) they use, and especially in the field between LALR and LR these criteria are quite important. The criteria are rated from weak to strong: the strongest criterion which is used by SLR(k), NQLALR(k) and LALR(k) is the LR(0) criterion: Two states are compatible iff their cores (sets of marked productions without contexts) are equal. The weakest criterion is Knuth's criterion: Two states are compatible iff they are equal, i.e. iff their sets of configurations (with contexts) are equal. These two criteria have the outstanding advantage that they are applied very easily. Unfortunately Knuth's criterion produces impractical large sets of states for grammars of programming languages, thus the LR(0) criterion is the one which is used in almost all implementations. Pager [1977 PGM], however, presents two other criteria that lie between the extremes: parsers can be generated for all LR(k) grammars, and the resulting sets of states are acceptably small. From his weak and strong compatibility criteria the latter is especially interesting because it ensures a minimal number of states. We will come back to this issue in section 4. Let us now see how we can decide whether two states are compatible. A first aid is Lemma 3.4.1-1 (Pager [1977 PGM]): Two states are compatible iff their sets of nucleus configurations are equal. Proof: The closure configurations of a state depend entirely on the nucleus configurations of that state. This result applies also to the LR(0) criterion, where contexts are not regarded. Nevertheless, the search for a compatible state remains very
61 time consuming if it is applied to every state that has been previously generated. One way to reduce the number of candidates is well known (cf. Aho and Ullman [1977]): If the new state 'statex' is accessed by 'symb', only the states in 'symsta(symb)' need to be tested for compatibility. (Poof: A state which is accessed by a symbol different from 'symb' cannot have the same nucleus as 'statex', because the accessing symbol is the one before the marker in a nucleus item.) For symbols such as 'identifier' or ',' the sets 'symsta' could be rather lengthy. Lemma 3.4.1-1 normally is used only to decrease the effort for the comparison of two states. We will use it here, however, to build a data structure which stores the candidate states: pronuc: prods x NAT --> 2**states (prod,pos) |-> {state | ∃ (*,prod,pos,*,*) ∈ staite(state) & pos >= 2 } "the states which have the marked production in their nucleus" stanum: states --> NAT state |-> card({(prod,pos) | ∃ (*,prod,pos,*,*) ∈ staite(state) & pos >= 2}) "the number of nucleus items in a state" The following algorithm determines for a state 'statex' a state 'statey' which is compatible with 'statex'; at the extreme 'statey' is the same as 'statex'. COMPATIBLE: proc(statea,statex,statey:=*); /* LR(0) compatibility criterion */ dcl set c STA; for ? (*,prod1:=*,pos1:=*,*,*) ∈ staite(statex) (pos1 >= 2); do; set := pronuc(prod1,pos1); for ∀ item ∈ staite(statex) (item = (*,prod:=*,pos:=*,*,*) & pos >= 2 & prod ≠ prod1 & pos ≠ pos1) do; set n:= pronuc(prod,pos); end; end; set \:= {state | state ∈ set & stanum(state) ≠ stanum(statex)}; if set = Ø then do; for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(statex) do; pronuc(prod,pos) u:= {statex}; end; statey := statex; end; else for ? state ∈ set do; statey := state; end; end COMPATIBLE;
62 In the case that the nucleus has only one element, the statement in the first 'for'-loop is not executed at all. Otherwise the dissection ensures that only states are retained whose nuclei contain the nucleus of 'statex', and after the for-loop equality of the nuclei is required. The mapping 'pronuc' is adjusted in the case where 'statey' is equal to 'statex'. Pager [1977 PGM] writes that the average number of nucleus-items in the parsers he constructed was 1.7, therefore the sets 'pronuc' should be very small, and COMPATIBLE should be rather efficiently find the right state. A modified version of algorithm COMPATIBLE is described in section 4.3.3. Only that version makes use of the parameter 'statea'. During the change of the transition graph new states must be created, and because of the incremental approach we also have to delete superfluous states. We use small separate algorithms for these operations. ALLOC: proc(symb,state:=*); for ? state ∈ STA \ states do; states u:= {state}; stanum(state) := 0; staite(state) := Ø stapre(state) := Ø stasym(state) := symb; symsta(symb) u:= {state}; end; end ALLOC; ALLOC creates a new state which is properly connected to the other data structures, but which has an empty item set. The state is accessed by 'symb'. FREE: proc(state); symsta(stasym(state)) \:= {state}; states \:= {state}; successors_set \:= {state}; laget_set \:= {state}; laput_set \:= {state}; lagar_set \:= {state}; conflict_set \:= {state}; hidden_set \:= {state}; for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(state) (pos >= 2) /* nucleus-item */ do; pronuc(prod,pos) \:= {state}; end; end FREE; FREE deletes a state and all its traces in the other data structures. It also updates the requests for a processing by other algorithms.
63
3.4.2 Computation of New Successors ------------------------------------ Algorithm SUCCESSORS is the heart of the incremental construction method: SUCCESSORS: proc; /* compute new successors */ dcl set c STA; REDINS: prod(*,statez,prod); /* insert a reduction */ ITEMINS(statez,prod,prolng(prod)+1); end REDINS; ill_set := Ø do while successors_set ≠ Ø set := successors_set; successors_set := Ø for ∀ statea ∈ set do; for ∀ symb (∃ (symb,*,*,oldsh | newsh,*) ∈ staite(statea)) do; if ∃ (symb,*,*,shift | oldsh,stateb:=*) ∈ staite(statea) then do; stapre(stateb) \:= {statea}; ill_set u:= {stateb}; staite(statea) \:= {item | item = (symb,*,*,oldsh,*) & item ∈ staite(statea)}; end; ALLOC(symb,statex); closure_set := Ø for ∀ (symb,prod:=*,pos:=*,shift | newsh,*) ∈ staite(statea) do; ITEMINS(statex,prod,pos+1); /* changes 'successors_set' */ stanum(statex) +:=1; end; COMPATIBLE(statea,statex,stateb:=*); stapre(stateb) u:= {statea}; if statex = stateb then do; CLOSURE(stateb,ITEMINS); end; else do; FREE(statex); laput_set u:= {stateb}; for ∀ item ∈ staite(statea) (item = (symb,prod:=*,pos:=*,shift | newsh,*)) do; FORWARD(stateb,prod,pos+1,REDINS); iteact(item) := shift; itesuc(item) := stateb; end; end; end; /* ∀ symb */ end; end; end SUCCESSORS;
64 Given a production 'prod' to be deleted or inserted, algorithm PROEDIT determined all states 'statea' which contain an item with the marker before the first member of 'prod', and it corrected the closure of these states and collected them in 'successors_set'. Furthermore, all lookaheads that might have become invalid by the change were already deleted by LAGAR. We now have to process the items with actions 'oldsh' resp. 'newsh', were the former must be deleted, and the latter must get an appropriate successor state. Deletion could not take place until now, because other- wise paths in the transition graph would have been destroyed while they are yet needed by LAGAR for the deletion of possibly invalid lookaheads. All items which shift the same symbol 'symb' must be regarded together, since they must get the same successor state 'stateb'. Only in the case of a single item for 'symb' with action 'newsh' such a successor does not yet exist - in the other cases the edge from 'statea' to 'stateb' is deleted. Since 'stateb' thereby may become inaccessible at all, it is noted in 'ill_set' for a processing by GARBAGE. We could have tried to modify the nucleus of 'stateb' in order to reflect the change of the set of items in 'statea' which shift 'symb'. We cannot exclude, however, that there are other predecessors of 'stateb' which did not change.
65
3.4.3 Garbage Collection for States ------------------------------------ Algorithm SUCCESSORS noted the states which possibly got inaccessible from the start-state in 'ill_set'. We now determine in four steps whether they are really superfluous. GARBAGE: proc; /* delete superfluous old states */ dcl (target,sound_set) c STA; EXHAUST: proc(set:=*,target:=*,APPLY); /* propagate something to all successors */ dcl source c STA; source := set; do while source ≠ Ø target := Ø for ∀ state ∈ source do; SUCC(state,APPLY); end; source := target \ set; set u:= target; end; end EXHAUST; SUCC: proc(statea,APPLY); /* do something with all direct successors */ for ∀ (*,*,*,shift,stateb:=*) ∈ staite(statea) do; if stateb ¬∈ ill_set then APPLY(statea,stateb); end SUCC; ADD: proc(*,stateb); target u:= {stateb}; end ADD; PREDEL: proc(statea,stateb); stapre(stateb) \:= {statea}; end PREDEL; /* 1. all successors of "ill" states get "ill" */ EXHAUST(ill_set:=*,target:=*,ADD); /* 2. all states with a "sound" predecessor become "sound" */ sound_set := Ø for ∀ stateb ∈ ill_set do; if ∃ statea ∈ stapre(stateb) (statea ¬∈ ill_set) then sound_set u:= {stateb}; end; /* 3. all successors of "sound" states become "sound" */ EXHAUST(sound_set:=*,target:=*,ADD); /* 4. ill states which did not become "sound" are garbage */ lagar_set := Ø ill_set \:= sound_set; for ∀ statea ∈ ill_set do; SUCC(statea,EDGEDEL); end; LAGAR; for ∀ stateb ∈ ill_set do; SUCC(statea,PREDEL);
66 FREE(statea); end; end GARBAGE; Again we must be careful not to delete any edge before not all lookaheads which were transmitted through the edges are deleted. Note that the transition graph may contain cycles; the problem is the decision whether a deleted edge was the last one leading to a cycle. Of course this is the classical garbage collection problem which could have been solved by the usual marking algorithm (mark all successors of the start-state and then delete the unmarked states). That approach would, however, conflict with the steadiness condition for the incremental construction. The algorithm above limits the amount of work for small deletions in the transition graph to a reasonable size. In an en-bloc construction (without deletion of productions), which is also possible with the algorithms of this work, GARBAGE and LAGAR are not activated at all.
67
3.5 Transmission of Lookaheads ------------------------------- When the new transition graph is built, we may (re-) compute the lookaheads. The basic operation of the NQLALR(1) method for that computation is the following operation: LAINS: proc(stateb,statez,prod); /* insert lookaheads for reductions */ for ∀ symb ≠ eop ((symb,*,*,*,*) ∈ staite(stateb) /* " & sympro(symb) = Ø " for conventional LR(1) */ ) do; item := (symb,prod,prolng(prod)+1,reduce,*); if item ¬∈ staite(statez) then do; if ∃ (symb,*,*,*,*) ∈ staite(statez) then conflict_set u:= {statez}; else laput_set u:= {statez}; end; staite(statez) u:= {item}; end; end LAINS; LAINS takes all symbols 'symb' used in a state 'stateb' and generates reduce-items of the form (symb,prod,prolng(prod)+1,reduce,*) in state 'statez'. For conventional LR(1), only terminals 'symb' must be taken; for LR(1)-like parsing of sentential forms, the nonterminals of 'stateb' are copied, too . If 'symb' was not yet present in 'statez', it might be necessary to transmit it to other states. State 'statez' must therefore later be processed by LAPUT. A processing by CONFLICT is necessary if a conflict is introduced by the new item. Algorithm LAGET computes a first approximation to the lookahead sets of new reductions, or of reductions for which the lookaheads were deleted by LADEL (see section 3.2). In both cases such reductions are recognized by the special lookahead 'eop': LAGET: proc; /* get lookaheads */ for ∀ statez ∈ laget_set do; for ∀ item ∈ staite(statez) (item = (eop,prod:=*,*,reduce,*)) do; EMITTERS(statez,prod,LAINS); staite(statez) \:= {item}; end; end; end LAGET; The states which contain new reductions were stored in 'laget_set' by ITEMINS; the fact that the reduce-item needs
68 lookaheads is known from the special lookahead 'eop'. The states that "emit" the necessary lookaheads are found by EMITTERS, and LAINS is used to generate the appropriate reduce-items in 'statez'. Every time when the set of symbols used in a state was incremented, that state was put in 'laput_set'. Now we are ready to transmit the symbols used in those states to the receiving reductions: LAPUT: proc; do while laput_set ≠ Ø set := laput_set; laput_set := Ø for ∀ stateb ∈ set do; RECEIVERS(stateb,LAINS); end; end; end LAPUT; LAPUT may put the receiving state 'statez' again in 'laput_set', but only if 'statez' did not yet contain the reduce-item. The recursive process will thus terminate at the latest if all reductions in all states have all symbols of 'symbs' as lookaheads.
69
4 Non-NQLALR(1) Grammars ------------------------- For a deterministic parsing procedure there must be a mapping from pairs of states and symbols to parser actions. The construction method xLR(k) (x = "", "S", "NQLA", "LA") may, however, lead to states with two or more different actions for one symbol. Such states are called inadequate, or they are said to have a conflict under that symbol. In this case the grammar is not xLR(k). The actions involved in a conflict may only be 'shift' and 'reduce' or 'reduce' to at least two different productions. As was mentioned in section 3.1, there is only one state with action 'accept' in every parsing table constructed by the algorithms in this work, and this state is never changed during the incremental construction. Two 'shift' actions for the same symbol lead to the same state by definition. Definition 4-1: A conflict in a parsing table T constructed by a xLR(1) method is a 4-tuple (state,symb,prod1,prod2) ∈ states x symbs x prods x prods such that for act ∈ {shift,reduce} there are two items (symb,prod1,*,act ,*) ∈ state and (symb,prod2,*,reduce,*) ∈ state. Symbol 'symb' is called the conflicting lookahead, 'prod1' and 'prod2' are the conflicting reductions. From the possible defects a grammar might have when it is input in a parser constructor for the first time, clerical errors and unreachable or unbased nonterminals are rather easily detected and removed, even by the unexperienced grammar designer. Conflicts are also easily diagnosed in the parsing table, but usually it is difficult to understand them. Therefore the constructor should not produce conflicts wherever possible, and it should give precise diagnostics for the remaining conflicts. In section 4.1, we first will discuss possible causes for conflicts. In section 4.2, the basic definitions and ideas for the removal of NQLALR(1) conflicts are given, succeeded in section 4.3 by the algorithms that try to remove the conflicts by splitting of states.
70
4.1 Causes for Conflicts ------------------------- There may be the following causes for a conflict: (C1) the grammar is ambiguous, (C2) the grammar is unambiguous, but not LR(k) for k >= 0, (C3) the grammar is LR(k) with k >= 2, (C4) the grammar is LR(1), but not LALR(1), (C5) the conflict results from the NQLALR(1) method; it would not occur with LALR(1). For several reasons to be explained in the following we require the grammars to be LR(1), and we therefore may combine causes (C1), (C2) and (C3) to (C123) the grammar is not LR(1). The five causes need some remarks to motivate our approach. (C1) Ambiguity This cause occurs rather often. The ambiguity usually is not intended by the user, therefore he should receive a precise description of the possible different derivations. The original grammars of some important programming languages (ALGOL60, PASCAL) produce avalanches of (intended) ambiguities that are to be resolved by semantic information, mostly about identifiers. Knuth [1965] noted that in ALGOL60 for example there are 8 derivations =>+ . The "dangling else" construction is a representative of another class of intended ambiguities. The additional verbal rule ("an ELSE matches the last preceeding THEN") might have its counter- part in a conflict resolution strategem in the parsing algorithm ("shift in favour of a reduction"). Such approaches are suggested by Aho, Johnson and Ullman [1975], and they are used in their YACC parser constructor. (C2) unambiguous and not LR Knuth [1965] points out that nearly every unambiguous grammar falls in one of the classes LR, RL or "parsable from both ends". Cause (C2) is unlikely to occur in grammars of todays programming languages.
71 (C3) k > 1 This cause is rare, but it occurs for example in a rule quoted from the COBOL standard [ANSI X3.23-1974]: communication_description_entry = CD cd-name ';' FOR INITIAL? INPUT ( ((';' SYMBOLIC QUEUE IS data-name)? (';' SYMBOLIC SUB-QUEUE-1 IS data-name)? (';' SYMBOLIC SUB-QUEUE-2 IS data-name)? (';' SYMBOLIC SUB-QUEUE-3 IS data-name)? (';' MESSAGE DATE IS data-name)? (';' MESSAGE TIME IS data-name)? (';' SYMBOLIC SOURCE IS data-name)? (';' TEXT LENGTH IS data-name)? (';' END KEY IS data-name)? (';' STATUS KEY IS data-name)? (';' MESSAGE COUNT IS data-name)? )? (data-name/',')? )? '.' Again the notational conventions of section 1.2.3 were used. A lookahead ';' does not tell how much options were omitted, but lookaheads of length 3 would suffice. It is rather tedious to rewrite this rule in order to make it LR(1) without losing the ordering of the options prescribed by the rule above. On the other side, if we accept another structure of the parse tree, such a rewriting is always possible by the transformations of Mickunas, Lancaster and Schneider [1976], so that it seems reasonable to require k <= 1. There is another difficulty which speaks against k >= 2 : the increased complexity of the parsing table. If generally lookahead strings of length k are stored, much space will be wasted; if longer lookaheads are only used in states where they are neccessary, the parsing algorithm will be complicated. (C4) LR(1) but not LALR(1) One motivation for a general LR(1) method is rather an aestethical one - the LR(1) property of a grammar seems more "natural", while the conditions for LALR, SLR and NQLALR seem to be deduced from the particular construction algorithms. In addition, Pager [1977 PGM] argues that there are application areas, for example natural language processing, where most of the grammars produce conflicts of the classes (C1), (C2) and (C3). It is very disadvantageous then to have a lot of additional conflicts caused by a weak construction method. In the following it is shown how to remove conflicts of the (C4) category. (C5) Conflict caused by NQLALR(1) Conflicts of this type show that the NQLALR(1) method is inferior
72 to the LALR(1) method with respect to generality. Therefore the algorithms in section 4.3 look whether such a conflict would not have been present if LALR(1) had been employed, and in this case the superfluous lookahead is simply deleted.
4.2 Foundations of the LR(1) Construction ------------------------------------------ From the discussion in the previous section it seems reasonable to construct parsing tables for all LR(1) grammars, not more and not less. There are at least two ways that lead to this summit: LR(1) / \ GF / \ Pager 1980 / \ [1977 PGM] / \ LALR(1)/ \ NQLALR(1)/ \ / \ LR(0)/ \ / \ / \ / \ While Pager [1977 PGM] described a direct attack by a genuine LR(1) method which does not rely on a LR(0) construction, in this work we have the additional burden of the incremental construction, and therefore three intermediate camps are used. Section 3 led us to the second of these. We will now first look for a way from LALR(1) to LR(1); the step from NQLALR(1) to LALR(1) then will be found at once.
73
4.2.1 A Non-LALR(1) Grammar ---------------------------- Aho and Ullman [1972] give an example (7.27) for a grammar which is LR(1) but not LALR(1): Example 4.2.1-1: [axiom = S .S = A a | B b | d A b | d B a .A = c .B = c ] LALR(1) produces the following transition graph for this grammar: 2 axiom -> 3 eof =. | S -> 4 eof =: axiom,1 | A -> 5 a -> 9 eof =: S,2 | d -> 6 A -> 10 b -> 13 eof =: S,3 | | B -> 11 a -> 14 eof =: S,3 | | c -> 8 a =: A,1 | | a =: B,1 | | b =: A,1 | | b =: B,1 | B -> 7 b -> 12 eof =: S,2 | c -> 8 ... State 8 has conflicts for 'a' and 'b', but we could replace it by two conflict-free states 8' and 8": 2 axiom -> 3 eof =. | S -> 4 eof =: axiom,1 | A -> 5 a -> 9 eof =: S,2 | d -> 6 A -> 10 b -> 13 eof =: S,3 | | B -> 11 a -> 14 eof =: S,3 | | c -> 8' b =: A,1 | | a =: B,1 | B -> 7 b -> 12 eof =: S,2 | c -> 8" a =: A,1 | b =: B,1 The LR(0) compatibility criterion used by LALR(1) sometimes is too strong, because the identification of some states constructed by Knuth's LR(1) method may lead to conflicts that were not present before. In the example, it was possible to cancel the effect of the LALR(1) merging of states. In some cases it can be seen from the outset that such a splitting of states is not possible.
74
4.2.2 Overloaded Conflicts --------------------------- Aho and Ullman [1977] proved Lemma 4.2.2-1: A grammar for which LALR(1) produces a shift-reduce conflict cannot be LR(1). Proof: Assume we have a LR(1) grammar. With LALR(1), let a conflict occur in 'statei' for the configurations i1 = [A1 = x1 @a y1, c1] and i2 = [A2 = x2 @ , a]. The conflict is not present with LR(1), so it must originate from merging a set of at least two states. All states in this set have an configuration with the same core as i1, and at least one state 'statej' of this set has an configuration j2 = [A2 = x2 @ , a] But this 'statej' would have the same conflict, and therefore the grammar could not be LR(1). A similar observation is made for reduce-reduce conflicts. For this purpose it facilitates the presentation if the various lookaheads and reductions in a state 'q' are denoted by a boolean (m,n) matrix 'Mq' with Mq(ri, sj) = 1 (0) if sj (not) is lookahead for ri for all reductions r1...rm and all symbols s1...sn in state 'q'. In every row and column of such a matrix there is at least one bit set, because every reduction has at least one lookahead, and we will not invent columns for symbols that do not occur as lookahead in the state. A column with more than one bit set indicates a reduce-reduce conflict. An inadequate state 'q' with two reduce-reduce conflicts may be represented by q | s1 s2 ---+------ r1 | 1 1 r2 | 1 1 , it could have resulted from merging two adequate states q1 | s1 s2 q2 | s1 s2 ---+------ ---+------ r1 | 0 1 and r1 | 1 0 r2 | 1 0 r2 | 0 1 ,
75 but if, for example, a third reduction were involved, q' | s1 s2 ---+------ r1 | 1 1 r2 | 1 1 r3 | 1 1 such a partitioning would not be possible. This leads to Definition 4.2.2-2: A boolean (m,n) matrix M with ∀ i ∈ {1...m} ∃ j ∈ {1...n} (M(i,j) = 1) is called overloaded iff there does not exist a finite, nonempty set of (m,n) matrices {M1...Ml} such that (S1) M = M1 v M2 v ... v Ml (S2) ∀ k ∈ {1...l} ∀ i ∈ {1...m} ¬ ∃ j ∈ {1...n} (Mk(i,j) = 1) (S3) ∀ k ∈ {1...l} ∀ j ∈ {1...n} ¬ ∃ i1, i2 ∈ {1...m} (i1 ≠ i2 & Mk(i1,j) = 1 & Mk(i2,j) = 1) Obviously this property is decideable for every (n,m) matrix: build the finite set of all (m,n) matrices, take every subset {M1...Ml} of this set, and check it for the three conditions. There are 2**(2**(m*n)) such subsets, and it would therefore be very desireable to have an algorithm with a better complexity, e.g. a polynomial one. Even a good algorithm, however, would provide only a partial answer to our original problem: if the matrix corresponding to the state is not overloaded, we are not yet sure that splitting is possible, and how this splitting should be done. Therefore this way is not pursued further.
76
4.2.3 Lane Tracing ------------------- A method which has much similarity to NQLALR, but which computes correct lookaheads for all LALR grammars, is the lane tracing algorithm described in Pager [1977 LTA]. As for NQLALR, the contexts of items need not to be stored, but the lookaheads of a reduction are computed from an existing set of items. Definition 4.2.3-1: For an item i1 = [A = x @a y], where x ∈ V+, a ∈ V, y ∈ V*, the item [A = x a @y] is called a transition successor of i1, while, if a ∈ N, denoted by |w|. There is one word __________________ A lane is a sequence i[1]...i[m], where for j = 1...m-1 i[j+1] is a (transition or closure) successor of i[j]. The lanes represent the paths along which contexts are propagated in the canonical LR(1) method. Definition 4.2.3-2: The abbreviated form of a lane i[1]...i[m] is a subsequence i[j1]...i[jn] such that for k = 1...n-1 (1) i[j[k]+1] is a closure successor of i[j[k]], (2) i[l+1] is a transition successor of i[l] for l = j[k+2]...j[k+1]-1. If i1...in is a lane in abbreviated form with items i[j] = [A[j] = x[j] @a[j] y[j]], j = 1...n, then we will call y[n]...y[1] the string generated by the lane. The importance of these definitions is motivated by Theorem 4.2.3-3: The set of lookaheads for a reduce-item i is {t ∈ T | ∃ l ∃ w ∈ V* (l is a lane ending with i and w is the string generated by its abbreviated form and t ∈ FIRST1(w) }. Proof: Though lane-tracing starts at a reduction and follows the edges of the transition graph backwards, the definitions above are a mere reformulation of the rules for the propagation of context-sets in the canonical LR(1) method. Pager states that the number of possible lanes rises exponentially with the number of items in the parsing table, and the essence of his algorithm lies in the fact that the work he needs is at most
77 a linear function of the number of items in the parsing table. In section 4.3 the idea of lane tracing is used to determine the correct lookaheads for single reductions which are involved in a conflict. Since the number of such reductions should be small, it suffices to use a straight-forward lane tracing algorithm.
4.2.4 Strong Compatibility --------------------------- Pager [1977 PGM] describes a method which constructs parsing tables of minimal size from all LR(1) grammars. In the following some results of that paper are summarized. Definition 4.2.4-1: For i=1,2 let pi[0] ∈ P be marked productions pi[0] = (Ai[0], xi[0] yi[0]), where the marker is before yi[0]. A string w ∈ V* is called a shared descendant of y1[0] and y2[0] iff for i=1,2 and for ni ∈ NAT there are (possibly empty) sequences of productions pi[j] = (Ai[j], yi[j]), j = 1...ni, and sequences of strings ui[j], j = 0...ni, such that (1) ui[j-1] yi[j-1] = ui[j] Ai[j] => ui[j] yi[j] for j = 1...ni, and (2) ui[0] = e and ui[ni] yi[ni] = w and (3) the last member of the sequence p1[0]...p1[n1] is different from the last member of p2[0]...p2[n2].
78 During the construction Pager uses the following special compatibility criterion: Definition 4.2.4-2: For i=1,2 let the states Si = {[Aj = xj @yj, Ui[j]] | j = 1...r} be two sets of configuration groups with a common core. S1 and S2 are called strongly compatible iff for 1 <= j1 < j2 <= r (1) U1[j1] n U2[j2] = Ø & U2[j1] n U1[j2] = Ø, or (2) the scanned strings y[j1] and y[j2] do not share a descendant. This criterion allows it to merge as much states as possible, as Pager explains by the following intuitive argument: Let, for example, c ∈ u1[j1] n u2[j2] for some c, j1, j2, and let y[[j1] and y[j2] share a descendant, in contrast to conditions (1) and (2). Let (B[j1], w[j1]) and (B[j2], w[j2]) be the corresponding different, last members of the sequences. Then there will be a state in the parsing table which contains the two configurations [B[j1] = w[j1], c] and [B[j2] = w[j2], c]. But this is a conflict. For a formal proof for the minimality of the parsing table the reader is referred to the original paper. Though the result of this method is optimmal, its implementation needs a lot of computational effort. The context-sets must be stored for every item, and for every new state it must be checked whether the state is strongly compatible to an existing state. Therefore this method seemed to be unsuitable for an incremental construction. We will, however, make use of definition 4.2.4-2 when we try to achieve general LR(1) in an afterthought.
79
4.3 Algorithms for LR(1) ------------------------- Because of its simplicity and its efficiency the NQLALR(1) method was chosen to compute the mass of lookaheads. For most unambiguous grammars it should produce a conflict-free parsing table. Only if conflicts occur we have to be more careful. Our general procedure then will be the following: (1) Take a pair of conflicting items (shift-reduce or reduce-reduce conflict). (2) Use the lane tracing algorithm to determine the real origins of the conflicting lookahead for the reduce- item(s). (3) Perhaps there is no origin at all - this indicates a NQLALR(1)-specific conflict, and the corresponding reduce-item may simply be deleted. (4) For a reduce-reduce conflict look whether all lanes for the two reductions diverge at some points. Recompute the successors of the states at those bifurcations, since successors were not strongly compatible. (5) Repeat steps (1) to (4) until all pairs of conflicting items are processed. Surprisingly step (2) provides a solution to both the NQLALR(1)- and the LALR(1)-specific problems. We will treat this step first.
80
4.3.1 The Graph of Lanes ------------------------- The example in section 3.2 showed that it is the error of the NQLALR(1) method to take all terminals from reduce-items as lookaheads of other reductions. This time we are interested in the origins of a single lookahead 'symb' of a single reduction. Again we go back the edges of the transition graph until we reach the states 'stateb' where the marker is before the left side 'left' of the production. With the canonical LALR(1) method these states would contain some configuration groups ci = [Ai = xi @left yi, Ui], i = 1...n Symbol 'symb' is a valid lookahead for the reduction to 'left' if it is in FIRST1(yi Ui) for some 'i'. We first look whether 'symb' is in any FIRST1(yi), but if not, and if yi =>* e, we would need the 'Ui' which are not directly available in our method. If yi =>* e, we must further go back, moving the marker in the production for 'Ai' to the left, until we reach the states with the marker before 'Ai'. Here the above argument applies again. Stated with Pager's terminology, we trace the lanes until it is clear whether 'symb' is in FIRST1 of the strings generated by the abbreviated form of the lanes. In algorithm EMITTERS it sufficed to compute a set of states which "emitted" the lookahead. Now we are forced to build a graph of lanes explicitly, since a state may be visited several times by different items during the tracing process. Definition 4.3.1-1: A graph of lanes is a 2-tuple GL = (nodes,edges), where (1) nodes c states x prods x NAT x ACT is a finite set of nodes (which corresponds to a set of items in states), (2) edges c nodes x nodes is a relation between nodes. We will use the following mappings on nodes: (3) nodsta : nodes --> states, nodpro : nodes --> prods, nodpos : nodes --> NAT, nodact : nodes --> ACT, nodsuc : nodes --> 2**nodes, nodpre : nodes --> 2**nodes. For n1 ∈ nodes we will combine the values of these mappings to 6-tuples of the form (sta, pro, pos, act, suc, pre).
81 For any n1,n2 ∈ nodes we require: (4) (n1 ∈ nodpre(n2) & n2 ∈ nodsuc(n2)) <==> (n1,n2) ∈ edges, (5) (*,nodpro(n1),nodpos(n1),*,*) ∈ nodsta(n1), (6) n1 ∈ nodpre(n2) <==> ∃ (*,nodpro(n1),nodpos(n1),shift,nodsta(n2)) ∈ staite(nodsta(n1)) "n2 is a transition successor of n1" | (nodsta(n1) = nodsta(n2) & nodpos(n2) = 1 & promem(nodpro(n1),nodpos(n1)) = prolef(nodpro(n2))) "n2 is a closure successor of n1" Algorithm LANE_GRAPH builds such a graph of lanes. It is given a set 'nodes' initially containing a single node (the "root") which corresponds to a conflicting reduction, and the conflicting lookahead 'symb'. LANE_GRAPH: proc(symb,nodes:=*,ORIGIN); dcl (source,target) c (nodes1 u nodes2) ,(is_new, is_first, is_null) ∈ BOOL ,NODINS proc; source := nodes; do while source ≠ Ø target := Ø for ∀ nodeb ∈ source (nodeb = (stateb:=*,prod:=*,pos:=*,act:=*,*,*)) do; if pos >= 2 then do; pos -:= 1; for ∀ statea ∈ stapre(stateb) do; NODINS(statea,prod,pos,nodea:=*,is_new:=*,nodeb); if is_new then target u:= {nodea}; end; end; /* pos >= 2 */ else do; /* pos = 1 */ for ∀ itema ∈ staite(stateb) (itema = (prolef(prod),proda:=*,posa:=*,shift,statec:=*)) do; NODINS(stateb,proda,posa,nodea:=*,is_new:=*,nodeb); NULLABLE(proda,posa+1,symb,is_first:=*,is_null:=*); if is_first then ORIGIN(nodea); else if is_null & is_new then target u:= {nodea}; /* else '= error' : do nothing */ end; /* ∀ itema */ end; /* pos = 1 */ end; source := target; end; end LANE_GRAPH;
82 If the marker is not before the first member, it is moved one position left, and all predecessors are inserted in the graph by the aid of NODINS. Otherwise, the items 'itema' which shift the left side of 'prod' are considered. Corresponding nodes are inserted by 'NODINS'. Algorithm NULLABLE is then used to determine whether (1) symb ∈ FIRST1(yi) <==> is_first = true, (2) yi =>* e <==> is_null = true. If (1), an origin for the lookahead is found, and one of the algorithms ORIGIN1, ORIGIN2 is called. If (2), and if the graph did not yet contain 'nodea', the lane must further be traced backwards. Algorithm NODINS is given a node 'nodeb' and an item (determined by 'statea', 'proda' and 'posa') which shall be inserted as a predecessor node 'nodea'. In 'is_new' we note whether the graph already contained that node. NODINS: proc(statea,proda,posa,nodea:=*,is_new:=*,nodeb); /* insert a node in the graph of lanes */ /* for 'nodes' see LANE_GRAPH */ if ∃ node ∈ nodes (node = (statea,proda,posa,*,*,*)) then do; nodea := node; nodsuc(node) u:= {nodeb}; is_new := false; end; else do; nodea := (statea,prod,pos,error,{nodeb},Ø); nodes u:= {nodea}; is_new := true; end; nodpre(nodeb) u:= {nodea}; end NODINS; Though the requirements for NULLABLE are easily stated, the algorithm itself is rather lengthy. In the following description a mapping stack: NAT --> prods x NAT x BOOL x BOOL is used. The first two projections stkpro: NAT --> prods stkpos: NAT --> NAT define a marked production, the third projection stkexp: NAT --> BOOL tells whether the marked symbol was already pushed onto the stack, while the last projection stknul: NAT --> BOOL is 'true' iff the marked symbol is known to be nullable.
83 NULLABLE: proc(proda,posa,symbol,is_first:=*,is_null:=*); dcl prod ∈ PRO, pos ∈ NAT, symb ∈ SYM; stack(0) := (*,*,true,false); /* dummy element */ top := 1; is_first := false; stack(top) := (proda,posa,false,false); do while top >= 1; if ¬ stkexp(top) then do; stkexp(top) := true; symb := promem(stkpro(top),stkpos(top)); if symb = symbol then do; is_first := true; goto done; end; else if symb ≠ eop then for ∀ prod ∈ sympro(symb) do; if ¬ ∃ i (1 <= i & i <= top & stack(i) = (prod,1,*,*)) then do; top +:= 1; stack(top) := (prod,1,false,false); end; end; /* for ∀ prod */ else do; /* symb = eop */ i := top - 1; /* look for the expanded nonterminal */ do while ¬ stkexp(i); i -:= 1; end; stknul(i) := true; end; /* symb = eop */ end; /* ¬ stkexp(top) */ else do; /* stkexp(top) */ if stknull(top) then do; stkpos(top) +:= 1; stknul(top) := false; end; else top -:= 1; end; /* stkexp(top) */ end; /* while top >= 1 */ is_null := stknul(0); done: end NULLABLE; The interesting marked production is stored in 'stack(1)', the information whether the marked portion is nullable is accumulated in the dummy element 'stack(0)'. Marked nonterminals are now expanded, and if they are nullable, the marker is moved right by one position, until eventually the end of the production will be reached. In this case, the "father"-symbol is nullable, too. If 'symbol' is in FIRST1 of the marked portion, an origin has been found, and the algorithm is terminated at once.
84
4.3.2 Processing of Conflicts ------------------------------ The following algorithm performs steps (1), (2) and (3) mentioned in the introduction to section 4.3: SELECT_PAIR: proc(state); /* select and process a pair of conflicting items */ dcl (real_la1, real_la2) ∈ BOOL ,(ORIGIN1, ORIGIN2) proc; do while ∃ item1 ∈ staite(state) (∃ item2 ∈ staite(state) & itesym(item1) = itesym(item2) & (itepro(item1) ≠ itepro(item2) | itepos(item1) ≠ itepos(item2)) & iteact(item2) = reduce ); symb := itesym(item1); if iteact(item1) = shift then nodes1 := {(state,itepro(item1),itepos(item1) ,shift,Ø,Ø)}; real_la1 := true; end; /* shift-reduce conflict */ else do; nodes1 := {(state,itepro(item1),itepos(item1) ,error,Ø,Ø)}; real_la1 := false; LANE_GRAPH(symb,nodes1:=*,ORIGIN1); end; if real_la1 then do; nodes2 := {(state,itepro(item2),itepos(item2) ,error,Ø,Ø)}; real_la2 := false; LANE_GRAPH(symb,nodes2:=*,ORIGIN2); if ¬ real_la2 then staite(state) \:= {item2}; else goto done; end; /* real_la1 */ else staite(state) \:= {item1}; end; /* while ∃ item ... */ conflict_set \:= {state}; done: end SELECT_PAIR; We take a pair of conflicting items, and for the reduce-item(s) we build the graph(s) of lanes by the aid of LANE_GRAPH. If this algorithm finds any origin for the lookahead of reduce_item 'i', it sets 'real_lai' to 'true' and calls algorithm ORIGINi for i=1,2. If no origin is found, the lookahead is deleted, and another conflicting pair is processed.
85
4.3.3 Splitting of States -------------------------- Finally we have to attack the remaining steps (4) and (5) of our general procedure. In example 4.2-1, it was easy to see how state 8 must be splitted to remove the conflicts. But how can this be seen in the general case? First we note that the presence of any conflict indicates that both conditions of Pager's strong compatibility (definition 4.2.4-2) must have been violated during the construction process. The presence of shared descendants (condition (2)) is a property of the core of the states, and since the set of cores depends only on the LR(0) parsing table, any LR(k) table for the grammar under concern will have states where some scanned strings share a descendant. The only possibility we have is to avoid the violation of condition (1): the context-sets must be made disjoint. If we imagine the states and the edges of the graphs of lanes passing through them, we have to look for bifurcations as shown in figure 4.3.3-1. --------------------------------------------------------------- x ... statec -----------+ | x ↓ ... statea ------> stateb ------> ... ------> root --------------------------------------------------------------- Figure 4.3.3-1: Bifurcation before Splitting While 'stateb' contains items which propagate the conflicting lookahead to both reductions, 'statea' and 'statec' contain only items which propagate it to only one of the reductions. In this case, the x-successor of 'statec' is not strongly compatible with the x-successor of 'statea', and therefore 'stateb' (and all states between 'stateb' and the root) should be splitted. The resulting situation is shown in figure 4.3.3-2. --------------------------------------------------------------- x ... statea ------> stateb1 -----> ... ------> root1 x ... statea ------> stateb2 -----> ... ------> root2 --------------------------------------------------------------- Figure 4.3.3-2: Situation after Splitting
86 The general case of course is not as simple as suggested by the figures. For a single conflict we have to look simultaneously at all pairs of lanes, and the conflict is removeable only if the pairs always bifurcate at some states. Moreover, the lanes may have cycles. The existence of bifurcations for all lanes is a necessary condition for a grammar to be LR(1): Lemma 4.3.3-3: Let TLA be a parsing table constructed by LALR(1) from a grammar G. If TLA contains a reduce-reduce conflict under symbol 'symb' in state 'root', and if there is a pair of lanes which do not separate (as indicated by figure 4.3.3-1), then G is not LR(1). Proof: Assume G were LR(1). Since the lanes do not separate, there must be a state 'statec' common to both lanes which is origin of at least one of the lanes. This state would have two configurations [A1 = x1 @ l1 y1, c1] and [A2 = x2 @ l2 y2, c2], where, for example, symb ∈ FIRST1(y1) & symb ∈ FIRST1(y2 c2). By the argument of the proof of lemma 4.2.2-1 it is clear that, even in the parsing table T constructed from G by Knuth's LR(1) method, there must be a state 'stated' containing two configurations [A1 = x1 @ l1 y1, d1] and [A2 = x2 @ l2 y2, d2], where symb ∈ FIRST1(y2 d2). But this state ensures that the conflict of 'root' also exists in some state of T. The "bifurcation states" 'stateb' are easily detected during the construction of the graph of lanes. Whenever LANE_GRAPH reaches an origin of a lane, it calls one of the algorithms ORIGIN1, ORIGIN2. ORIGIN1 is called when a lane for the first reduction is traced. It simply marks the node for the origin with action 'shift', while all successors of this node are marked with 'reduce'.
87 ORIGIN1: proc(origin); /* process an origin of the lookahead for the 1st reduction */ dcl (source,target) c nodes1; real_la1 := true; nodact(origin) := shift; source := nodsuc(origin); do while source ≠ Ø target := Ø for ∀ node ∈ source do; hidden_set u:= {nodsta(node)}; if nodact(node) = error then do; nodact(node) := reduce; target u:= nodsuc(node); end; end; source := target; end; end ORIGIN1; Symbol 'symb' is really a lookahead for the 1st reduction. The node for the origin is marked with 'shift', while all its successors are marked with 'reduce'. ORIGIN2 detects the bifurcation states while tracing a lane forwards (from the origin to the root), and it prepares the splitting process to be done by algorithm CONFLICT, below. ORIGIN2: proc(origin); /* process an origin of the lookahead for the 2nd reduction */ dcl (source,target) c nodes2; real_la2 := true; if ¬ ∃ node1 ∈ nodes1 (nodsta(node1) = nodsta(origin) & nodact(node1) ≠ error) then do; /* will the path join later? */ source := {origin}; do while source ≠ Ø target := Ø for ∀ nodea ∈ source do; for ∀ nodeb ∈ nodsuc(nodea) do; if ∃ node1 ∈ nodes1 (nodsta(node1) = nodsta(nodeb) & nodact(node1) ≠ error) then do; if nodact(node1) = reduce then /* look for an origin for the 1st reduction */ SEPARATE(nodea,nodeb); else removeable := false; end; else /* ∀ node1 (nodact(node1) = error) */ target u:= {nodeb}; end; end; source := target; end; end; else removeable := false; end ORIGIN2;
88 Symbol 'symb' is really a lookahead for the 2nd reduction. If we have found an originator for the second reduction, we will have to look whether the path from this node joins a path for the first reduction before the root (there the paths will always join). There are three cases: (1) The second originator already lies on a path for the first reduction - the conflict cannot be resolved. (2) The two paths join at 'node2', and there lies no originator for the first reduction between 'node2' and the root - the state corresponding to 'node2' can be splitted. (3) As for (2), but there lies an originator for the first reduction between 'node2' and the root - the conflict cannot be resolved. The latter two cases are distinguished by the following subalgorithm. SEPARATE: proc(nodea,nodeb); /* separate two incompatible nodes */ dcl (source,target) c nodes2; source := {nodeb}; hidden_set u:= source; do while source ≠ Ø /* look for an origin for the 1st reduction */ target := Ø for ∀ node ∈ source do; if ¬ ∃ node1 ∈ nodes1 (nodsta(node1) = nodsta(node) & nodact(node1) = shift) then target u:= nodsuc(node); else do; /* origin found */ removeable := false; goto done; end; end; source := target; hidden_set u:= source; end; statea := nodsta(nodea); stateb := nodsta(nodeb); for ∀ item ∈ staite(statea) (item = (*,*,*,shift,stateb)) do; iteact(item) := newsh; end; successors_set u:= {statea}; done: end SEPARATE; All states which must be splitted are collected in 'hidden_set'. If the lanes are separable, 'removeable' is not set to 'false', and the deletion of the edge from 'statea' to 'stateb' is prepared: 'statea' is noted in 'successors_set', and the invalid shift-actions are marked with 'newsh'.
89 When computing the new successors it must be avoided that the lanes join again, i.e. that the conflict reappears. Therefore a set 'sic_set' is introduced which provides information on the states which are strongly incompatible (cf. definition 4.2.4-2): (x,y) ∈ sic_set means that a conflict would reoccur if 'y' is chosen as successor of 'x'. This set is maintained and used in a modified version of algorithm COMPATIBLE called by SUCCESSORS during the splitting phase (cf. section 3.4.1): COMPATIBLE: proc(statea,statex,statey:=*); /* Pager's strong compatibility criterion */ dcl set c STA; for ? (*,prod1:=*,pos1:=*,*,*) ∈ staite(statex) (pos1 >= 2); do; set := pronuc(prod1,pos1); for ∀ item ∈ staite(statex) (item = (*,prod:=*,pos:=*,*,*) & pos >= 2 & prod ≠ prod1 & pos ≠ pos1) do; set n:= pronuc(prod,pos); end; end; set \:= {state | state ∈ set & stanum(state) ≠ stanum(statex)}; | for ∀ state ∈ set n hidden_set do; | sic_set u:= {(statea,state)}; | end; | set \:= {state | (statea,state) ∈ sic_set}; if set = Ø then do; for ∀ (*,prod:=*,pos:=*,*,*) ∈ staite(statex) do; pronuc(prod,pos) u:= {statex}; end; statey := statex; end; else for ? state ∈ set do; statey := state; end; end COMPATIBLE; First all states with the same core as 'statex' are collected in 'set' (in the non-LALR case there may be more than one such state). The successor of 'statea' will be not strongly compatible with any of these states which is in 'hidden_set', since (1) it shares a descendant (because of the conflict) and (2) it shares a context (the conflicting lookahead) with some configuration in the states. This information is stored, and another successor is chosen.
90 After all these preparations we are now ready for CONFLICT, the algorithm which tries to remove (NQ-) LALR(1)-conflicts by splitting of some states. CONFLICT: proc; dcl removeable ∈ BOOL, hidden_set c STA ,sic_set c STA x STA; for ∀ state ∈ conflict_set do; next_pair: removeable := true; hidden_set := Ø successors_set := Ø SELECT_PAIR(state); if state ∈ conflict_set then do; if removeable then do; for ∀ statec ∈ successors_set do; for ∀ itemc ∈ staite(statec) (iteact(item) = newsh) do; EDGEDEL(statec,stateb); end; end; /* ∀ statec */ LAGAR; SUCCESSORS; LAGET; LAPUT; end; /* removeable */ else do; /* ¬ removeable */ for ∀ statec ∈ successors_set do; for ∀ itemc ∈ staite(statec) (item = (*,*,*,newsh,stateb:=*)) do; iteact(itemc) := shift; end; end; /* ∀ statec */ hidden_set := Ø successors_set := Ø goto done; end; /* ¬ removeable */ goto next_pair; end; /* if state ∈ conflict_set */ end; /* for ∀ state ... */ done: sic_set := Ø end CONFLICT; The condition after the call of SELECT_PAIR is necessary because this algorithm deletes 'state' from 'conflict_set' iff 'state' contains no or only NQLALR(1)-conflicts. If the condition is 'true', an LALR(1)-conflicts was found, and if it is 'removeable', the conflicting lookahead will be deleted for the second reduction by LAGAR as a consequence of 'EDGEDEL(statec,stateb)'. If the conflict is not 'removeable', the preparations for splitting are undone, and CONFLICT terminates immediately without inspection of other, perhaps removeable conflicts. It seems appropriate finally to show the correct working of the algorithm in more detail:
91 Theorem 4.3.3-4: Applied to a parsing table T3 constructed from a current grammar G by the algorithms of section 3, (1) algorithm CONFLICT terminates and (2) produces a conflict-free parsing table T4 if G is LR(1), and in this case (3) the number of states in T4 is minimal as long as productions are not deleted by REORGANIZE. Proof - General Remarks: The property of all conflicts in a table T3 to be 'removeable' is not affected by the splitting process: splitting will duplicate lanes, but no new lookaheads are introduced. If 'removeable' turns out to be 'false' after some splitting steps, this could have been seen when SELECT_PAIR were applied to the corresponding conflict first. It should be clear that CONFLICT really removes all conflicts in a particular 'state'. The conflicts also cannot be reintroduced in 'state', because the modified version of COMPATIBLE will not allow conflict-generating lanes to lead again to 'state'. The lanes provide complete information on contexts, therefore we may again consider configurations instead of items. In the following 'CFS(state)' is used to denote the set of configurations of 'state'. Every splitting of a state 'state4' results in two states 'state41', 'state42' such that CORE(state41) = CORE(state42) = CORE(state4), and, for example, state41 ∈ hidden_set & state42 ¬∈ hidden_set. The new states each lost the configuration(s) which is (are) responsible for the lookahead of the conflicting reductions: CFS(state41) c≠ CFS(state4) & CFS(state42) c≠ CFS(state4) & CFS(state41) ≠ CFS(state42) Furthermore, for i=1,2 there are states 'statei' in the parsing table T constructed from G by LR(1) with CORE(statei) = CORE(state4i) & CFS (statei) c CFS (state4i), i=1,2 (otherwise a parser using T4 would announce 'error' for some w ∈ L(G)).
92 Proof of (1): The algorithm terminates at once if 'removeable' is set to 'false'. Therefore we may now assume that 'removeable' is always 'true'. The number of states with different configuration sets is finite for a finite grammar. Therefore an infinite execution of the loop over 'conflict_set' would create at least two different states with identical configuration sets. These states must be strongly incompatible because COMPATIBLE did not identify them, but then the two states must have different configuration sets. This contradiction yields (1). Proof of (2): From lemma 4.3.3-3 we know that an LR(1) grammar ensures bifurcation-points for all pairs of lanes, i.e. 'removeable' is always 'true'. Therefore CONFLICT will be terminated by exhaustion of 'conflict_set', but that implies a conflict- free resulting parsing table. Proof of (3): COMPATIBLE only creates a new state if the existing states are all stronly incompatible with the desired successor. Therefore the resulting table will not contain any different, but strongly compatible states, and (3) follows from Pager [1977 PGM]. Deletion of productions could, however, remove the cause for an incompatibility, and the algorithms will not reidentify the states in this case - this is the reason for the restriction in (3).
93
5 Implementation ----------------- The algorithms in sections 3 and 4 were expanded to PL/1 procedures in a rather straightforward way. The sets and mappings were represented by arrays or, if the deletion of elements is necessary, by linked (and sometimes ordered) lists. All arrays were confined to one dimension because of storage problems. For example, marked productions were represented by a single integer. The avoidance of dimensions >= 2 perhaps influenced the design of the entire parser constructor. The quantors '∀' and '∃' were expanded to WHILE-loops over some set. The operations 'u:=' and '\:=' result in a processing of two ordered linked lists. Deletion of productions (mainly algorithms PRODEL and ITEMMARK) and the splitting of states (algorithm SEPARATE) are not yet implemented. Table 5-1 gives an impression of the size of various parts of the implemented system. +-----------------------------------------------------------+ | 1: section 3, NQLALR(1) 2008 | | 2: section 4, LALR(1) 415 | | 3: scanning, test-printout etc. 1748 | +-----------------------------------------------------------+ | 4: 1+2+3, running incremental constructor 4171 | | | | 5: transformator constructor 1087 | | 6: parser, transformator, output 2431 | +-----------------------------------------------------------+ | 7: 4+5+6, running EXTRA system 7689 | +-----------------------------------------------------------+ Table 5-1: Number of PL/1 source lines for components of the EXTRA system The constructor was initially tested with a set of some twenty small "artificial" grammars from the literature which contain a number of peculiarities (empty right sides, left and right recursion, NQLALR(1)-, LALR(1)-, LR(1)-conflicts etc.). A major claim of this work is the feasibility of the incremental construction in practical applications. Therefore the constructor was also used to construct parsing tables from ten "natural" grammars of small to medium size. The grammars are too long to be reproduced in this work. As the names indicate, most of the grammars describe programming languages or subsets thereof. Two criteria for the selection of a grammar were unambiguity and the top- down property explained below. Table 5-2 gives several figures counted from these grammars:
94 sym number of symbols in the grammar (terminals and nonterminals) pro number of productions (right sides of rules) gram length of the grammar, i.e. sum(∀ p ∈ prods) (prolng(p) + 1) sta0 number of states of the LR(0) (and hence SLR(k), NQLALR(k), LALR(k)) parsing table ite number of items (including reduce-items for nonterminal lookaheads) +-----------+---------------------------------------+ | | sym | pro | gram | sta0 | ite | +-----------+---------------------------------------+ | BASIC | 175 158 418 234 4979 | | FORTRAN77 | 239 305 964 513 3488 | | MIDICAP | 286 310 887 459 8935 | | MINCOBOL | 327 273 991 586 5705 | | MINPL1 | 233 185 658 400 5343 | | PAL | 110 81 267 157 2903 | | PLM | 138 131 404 219 3572 | | SPL | 145 135 395 220 3356 | | WIRTH | 42 26 78 43 231 | | XPL | 121 104 324 182 2524 | +-----------+---------------------------------------+ Table 5-2: Size of the grammars and their parsing tables Finally the constructor was instrumented to measure the times used for the construction. The grammars were all processed twice. First, all productions were inserted en-bloc, and the table was reorganized only once. Second, the table was reorganized every time after the insertion of a single production. The latter measurement is fair only if the grammar is inserted "top-down", i.e. if the left side of every production just inserted is already reacheable from the start-symbol. Otherwise the insertion of the production would be delayed, and later greater portions of the grammar would be inserted en-bloc. The productions of all tested grammars are ordered such that this property is fulfilled.
95 Table 5-3 below shows the following figures: pro number of productions inserted t∀ time for one reorganization after the insertion of all productions t1 time for 'pro' reorganizations after the insertion of every single production t1/pro average time for the insertion of one production t1/t∀ "fragmentation factor" All times are given in CPU-seconds on a Siemens 7.748 computer which is nearly half as fast as an IBM 370/158. The times include initialization of the system, reading of the grammar, print-out and so on. +-----------+-------+-------+-------+------+-------+ | | pro | t∀ | t1 |t1/pro| t1/t∀ | +-----------+-------+-------+-------+------+-------+ | BASIC | 158 30.0 242.5 1.5 8.1 | | FORTRAN77 | 305 33.8 396.2 1.3 11.7 | | MIDICAP | 310 86.5 751.0 2.4 8.7 | | MINCOBOL | 273 27.9 339.1 1.2 12.2 | | MINPL1 | 185 72.3 335.4 1.8 4.6 | | PAL | 81 66.6 288.3 3.6 4.4 | | PLM | 131 34.9 260.0 2.0 7.4 | | SPL | 135 35.1 225.9 1.7 6.4 | | WIRTH | 26 4.8 36.6 1.4 7.6 | | XPL | 104 27.7 182.2 1.8 6.6 | +-----------+--------------------------------------+ Table 5-3: Construction times
96
6 Discussion ------------- A new way has been shown for the construction of LR(1) parsing tables. In contrary to the usual methods the parsing table is not simply constructed, but it results from the modification of an existing table. The modification process is smooth, i.e. small changes in the grammar cause only small changes in the parsing table. Because of this property the constructor is well suited for a translator system with syntactical extension facilities. The incremental approach should also save much construction time during the stepwise development of a large LR(1) grammar. Incremental construction has been done previously by Vidart [1974] and Heindel and Roberto [1975], but for a less general class of grammars, namely LL(1). LR(1) today seems to be the largest class of grammars which can be parsed efficiently, and an incremental constructor for this class was described in this work - as to my knowledge for the first time. The measurements reported in section 5 indicate that the implemented version of the algorithms indeed has the postulated steadiness property. The average time taken for the insertion of a single additional production is rather independent of the size of the grammar, and even if the incremental development is carried to extremes (i.e. reorganization after every single production), construction time increases only by a factor of 5 to 13. The times for an ordinary en-bloc construction are also quite competitive with those of several other LALR(1) constructors in use at the University of Dortmund. Section 3 is based on the NQLALR(1) method for the computation of lookaheads. Though I recognized the NQLALR(1)-specific problems in June 1979, DeRemer and Pennello [1979] were the first who documented them. In the paper they present another, correct LALR(1) method, and perhaps this method could be used in an incremental constructor, too. In contrary to the global relations used by DeRemer and Pennello the basic step of the NQLALR method is a relatively local operation, so that the effect of deletions and insertions may remain under control. This is one of the reasons why I adhered to NQLALR. The second reason is that the algorithms for NQLALR are relatively simple: it is not necessary to store any contexts with shift- configurations resp. -items, and the rather complicated function FIRST1 is avoided. The last reason is that the NQLALR problem, if it ever occurs in a grammar, can be solved by a correct tracing of the particular lookahead. Basically the tracing process is that of Pager [1977 LTA], but I ignored his efficiency considerations, because
97 they again rely on global properties of the lanes. I assume that in general NQLALR(1)-conflicts are rare in natural, unambiguous grammars, and therefore the tracing of only a few lanes should be necessary. Lane-tracing invents to do the last step - state splitting in order to achieve general LR(1). This step is also mentioned by Pager [1977 LTA], but there the algorithm of Pager [1977 PGM] is used, and only a few details are given. Algorithm CONFLICT incorporates the splitting step in the incremental construction, and its termination and correctness is proved at the end of section 4. Since several parts of the algorithms in section 4 are automatically skipped if they are not necessary, the attribute "incremental" may also be applied to the way in which the constructor exploits more and more general classes of grammars. This design of a general LR(1) parser constructor seems to be another, new result of this work. Sometimes the reader may have been astonished at the particular realization of an algorithm. Of course there may be other and perhaps much more efficient possibilities. When making suggestions for alterations and improvements the reader should note, however, the strong mutual connections between all data structures. For example, successors cannot be changed before all invalid lookaheads have been deleted. New lookaheads cannot be determined before all successors are recomputed. Some peculiarities are caused by the desire to avoid mappings with two or more arguments, because those would result in storage and retrieval problems for large grammars. The amount of detail in the two algorithmic sections might be another source of critics. I had the following reasons to choose this particular form of the presentation: (1) I showed the feasibility of the incremental approach by measurements obtained with an actual implementation, therefore the description should not be too far away from that implementation. (2) This work should serve as a guide for other implemen- tations; in my opinion other descriptions in the literature (e.g. LaLonde [1971], Pager [1977 LTA], Vidart [1974]) are too imprecise for this purpose. (3) Other possible implementors should get an impression of the size of the problem. Nevertheless I thank the reader for the patience he showed until here.
98
List of Algorithms ------------------ Algorithm defined in section --------- ------------------ ALLOC 3.4.1 BACKWARD 3.2 CLOSURE 3.3 COMPATIBLE 3.4, 4.3.3 CONFLICT 4.3.3 DELTA 3.1 EMITTERS 3.2 EXTRA 3 FORWARD 3.2 FREE 3.4.1 GARBAGE 3.4.3 INIT 3.1 ITEMINS 3.3.2 ITEMMARK 3.3.1 LADEL 3.2 LAGAR 3.2 LAGET 3.5 LAINS 3.5 LANE_GRAPH 4.3.1 LAPUT 3.5 NODINS 4.3.1 NULLABLE 4.3.1 ORIGIN1 4.3.3 ORIGIN2 4.3.3 PARSE 3.1 PRODEL 3.3.1 PROEDIT 3.3 PROINS 3.3.2 RECEIVERS 3.2 REORGANIZE 3 SELECT_PAIR 4.3.2 SEPARATE 4.3.3
99 References ---------- Aho, A.V., and Ullman, J. [1972] The theory of parsing, translation, and compiling. Two volumes. Prentice Hall, Princeton, N.Y. (1972) Aho, A.V., Johnson, S.C., and Ullman, J. [1975] Deterministic parsing of ambiguous grammars. Comm. ACM 18, 8 (Aug. 1975) 441-452 Aho, A.V., and Ullman, J. [1977] Principles of compiler design. Addison-Wesley, London (1977) Anderson, T., Eve, J., and Horning, J.J. [1973] Efficient LR(1) parsers. Acta Inf. 2 (1973) 12-39 [ANSI X3.23-1974] American national standard programming language COBOL. American National Standards Institute, New York (1978) Backus, J. [1978] Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Comm. ACM 21, 8 (Aug. 1978) 613-641 Brown, P.J. [1967] The ML/1 macro processor. Comm. ACM 10, 10 (Oct. 1967) 618-623 Brown, P.J. [1974] Macro processors. John Wiley & Sons, London (1974) Celentano, A. [1978] Incremental LR parsers. Acta Inf. 10 (1978) 307-321 Cheatham, T.E. [1966] The introduction of definitional facilities into higher level programming languages. Proc. AFIPS Fall Joint Computer Conference, Vol. 29 (1966) 623-637 Christensen, C., and Shaw, C.J. (Eds.) [1969] Proceedings of the extensible languages symposium. ACM SIGPLAN Notices 4, 8 (Aug. 1969) Cole, A.J. [1976] Macro processors. Cambridge University Press, Cambridge (1976) DeRemer, F.L. [1969] Practical translators for LR(k) languages. Ph.D. Thesis, Electrical Engineering Department, M.I.T., Cambridge, Massachusetts (1969) DeRemer, F.L. [1971]
100 Simple LR(k) grammars. Comm. ACM 14, 7 (July 1971) 453-460 DeRemer, F.L. [1974] Transformational grammars. In: Compiler Construction - An Advanced Course. Springer Lecture Notes in Computer Science, Vol. 21, Springer-Verlag, Berlin Heidelberg New York (1974) 121-145 DeRemer, F.L., and Pennello, T.J. [1979] Efficient computation of LALR(1) look-ahead sets. Proc. SIGPLAN Symposium on Compiler Construction, Denver, Colorado, 1979. ACM SIGPLAN Notices 14, 8 (Aug. 1979) 176-187 Errata: ACM SIGPLAN Notices 15, 2 (Feb. 1980) 29 Earley, J. [1970] An efficient context-free parsing algorithm. Comm. ACM 13, 2 (Feb. 1970) 94-102 Fischer, G. [1979] EXTRA - ein erweiterbarer Übersetzer (In German: EXTRA - an extensible translator). Proceedings GI - 9. Jahrestagung, Bonn Oct. 1979. Informatik Fachberichte 19, Springer Verlag, Berlin Heidelberg New York (1979) 240-251 Griswold, R.E., Poage, J.F., and Polonsky, I.P. [1971] The SNOBOL4 programming language. Prentice Hall, Inc., Englewood Cliffs (1971) Harrison, M.A., and Havel, I.M. [1974] On the parsing of deterministic languages. JACM 21, 4 (Oct. 1974) 525-548 Heindel, L.E., Roberto, J.T. [1975] LANG-PAK - an interactive language design system. American Elsevier Publ. Company, Inc. New York, London, Amsterdam (1975) [IBM 1973] OS PL/I Checkout Compiler: TSO User's Guide. IBM Form No. SC33-0033 Knuth, D.E. [1965] On the translation of languages from left to right. Inf. Contr. 8 (Oct. 1965) 607-639 Knuth, D.E. [1971] Top-down syntax analysis. Acta Inf. 1 (1971) 79-110 Korenjak, A.J. [1969] A practical method for constructing LR(k) processors. Comm. ACM 12, 11 (Nov. 1969) 613-623 LaLonde, W.R. [1971] An efficient LALR parser generator. Technical Report CSRG-2, Department of Computer Science, University of Toronto, Canada (1971)
101 LaLonde, W.R., Lee, E.S., and Horning, J.J. [1971] An LALR(k) parser generator. Proc. IFIP Congress 71. North-Holland Publ. Company, Amsterdam (1972) 513-518 Leavenworth, B.M. [1966] Syntax macros and extended translation. Comm. ACM 9, 11 (Nov. 1966) 790-793 Lewis, P.M. II and Stearns, R.E. [1968] Syntax directed transductions. Journ. ACM 15, 3 (July 1968) 465-488 McKeeman, W.M., Horning, J.J., and Wortman, D.B. [1970] A compiler generator. Prentice Hall, Inc., Englewood Cliffs (1970) Metzner, J.R. [1979] A graded bibliography on macro systems and extensible languages. ACM SIGPLAN Notices 14, 1 (Jan. 1979) 57-68 Mickunas, M.D., Lancaster, R.L., and Schneider, V.B. [1976] Transforming LR(k) grammars to LR(1), SLR(1) and (1,1) bounded right context grammars. Journ. ACM 23, 3 (July 1976) 511-533 Mooers, C.N., and Deutsch, L.P. [1965] TRAC, a text handling language. Proc. 20th ACM National Conference (1965) 229-246 Pager, D. [1977 LTA] The lane tracing algorithm and ways of enhancing its efficiency. Inf. Sci. 12, 1 (1977) 19-42 Pager, D. [1977 PGM] A practical general method for constructing LR(k) parsers. Acta Inf. 7 (1977) 249-268 Rüthing, H., and Schönhold, M. [1980] Syntax-gesteuerte Übersetzung mit Hilfe von LR(1)-Analysatoren. (In German: Syntax-directed translation with the aid of LR(1)-analyzers.) Diplomarbeit, Department of Computer Science, University of Dortmund, (Feb. 1980) Schneider, H.J. [1975] Syntax-directed description of incremental compilers. Lecture Notes in Computer Science, Vol. 26, Springer-Verlag, Berlin Heidelberg New York (1975) 192-201 Schuman, S. (Ed.) [1971] Proceedings of the international symposium on extensible languages. ACM SIGPLAN Notices 6, 12 (Dec. 1971) Strachey, C. [1965] A general purpose macro processor. Comp. Journal 8 (1965) 225-241
102 Vidart, J. [1974] Extensions syntaxiques dans une contexte LL(1) (In French: Syntactical extensions in a LL(1)-context). These pour obtenir le grade de docteur du troisieme cycle, specialite: informatique. Universite de Grenoble, France (1974) Waite, W.M. [1973] Implementing software for non-numeric applications. Prentice Hall, Englewood Cliffs (1973) Wegbreit, B. [1970] Studies in extensible programming languages. Ph.D. Dissertation, Harvard University, Cambridge, Massachusetts (1970) Wirth, N. [1977] What can we do about the unnecessary diversity of notations for syntactic definitions? Comm. ACM 20, 11 (Nov. 1977) 822