Cocos: "Compiler Construction Set", part 3 (implementation approach)
In a couple of
previous
posts
I described an idea for a new parser generator (or, better, framework).
To sum it up, it has the following goals:
- Allowing the creation of truly modular grammars, where reusing a grammar inside another has a clear result, with no risk of conflicts.
- Being able to exploit this modularity also at run time (which of course means compiler run time, but for dynamic languages could mean program run time), to allow extensible languages.
- Also clearly (and safely) separate the language definitions from the actions to be taken when parsing, so that the same grammar can be easily reused in different tools working on the same language.
- Have a good performance.
- Being intuitive and easy to use.
- Have very good support for error reporting and recovery.
- In general, being so good that it leaves you no excuses for using a different tool :-)
- Target just the .NET world (yes, it is a restriction, but I'm fine with that).
Given the above goals, I'll apply the following implementation strategies:
-
Make so that each grammar, when "ready", becomes an assembly, and is usable as such. This is extremely important for modularity, and has an interesting implication:
-
the syntax used to specify grammars is irrelevant, what really matters is that each grammar is a software component that respects the rules of the framework. Therefore, I don't really care about inventing or using a [meta]language to express grammars, I only want to specify the framework rules. Quite likely, any reasonable .NET language (C#, Boo, VB...) will be able to produce assemblies that are valid grammars with minimal effort on the developer side, and that's OK as a starting point.
Therefore, a "compiler compiler tool" will not be needed in the first prototype.
-
As a side point, the so called "semantic actions" (the code that must be executed when the grammar rules match) should typically be put in an assembly separated from the grammar one, to allow easy (and proper) reuse of the grammar in a different context. Therefore the framework should clearly specify how the interface between grammars and actions works.
-
I am settling on PEGs as the theoretical foundation of the framework because it seems to me that they are the only formalism that fulfills the requirements of simplicity, power and most of all modularity of the grammars. This, of course, raises the issue of the performance problem introduced by backtracking...
-
...which I plan to address in the following ways:
-
Typically, PEG proponents advocate (and like!) the fact that with PEGs you can merge the scanner and the parser. I disagree with this, not because it cannot be done, but because IMHO it hurts more than how it helps. AFAIK, historically the lexer is a performance critical part of a compiler, because it wrestles with each and every character of the input stream. Now, consider what happens when a PEG backtracks: it has to reparse all the input section already processed by the failing rule, which also means re-tokenize it again, one character at a time... at this point it is clear why one resorts to memoization: all those token matches are saved in the memoized state. What I guess is that if we keep the traditional layering of parsers, PEGs will naturally perform better, because the scanner will do very small backtracks, and then the parser will work at the token level, so its backtracks will be much cheaper.
Mind you, it's not that Cocos should enforce this separation! It's simply that it should be a good practice.
Moreover, keeping these layers separate should also keep the parsers more modular: for instance, in a language with a preprocessor of some sort (also like the C# one), it should be natural to have a grammar for the preprocessor, and on top of that the typical scanner and parser, with the three pieces working as separated (and reusable) modules in the compilation pipeline.
-
The "separation of layers" described above should have another consequence related to performance. Remember that scanners are typically done with
regular grammars,
which are much simpler (and more efficient) than full blown PEGs.
Yet, if you look at a
regular expression,
you can almost "read it as a PEG": the two formalisms "look" similar on the surface when the described language is regular.
And of course regular languages are a subset of what PEGs can recognize. Moreover, regular expressions are composable: joining them in any way (either with alternation or in sequence) still gives a pattern with no ambiguities. Therefore I plan to have support for regexs in Cocos, either explicitly or "automatically" (in the sense that the parser generator, if it detects that a rule is in fact a regex, will produce a
FSM
to match it instead of using the backtracking PEG logic).
-
Another way to reduce backtracking is to perform a sort of "automatic refactoring" of alternative rules when they start with the same sequence: it can happen that the language designer does not want to do the refactoring himself (maybe because that common sequence is useless as a standalone language construct, or maybe because it needs different actions), but doing it internally will
avoid useless backtracking.
-
Yet another thing that could be done is allowing the parser generator to recognize more "optimizable rules", and produce code that deals with them accordingly (for instance with a machine with a stack of states). If necessary, these kind of strategies can be added later.
-
Finally, I plan to allow arbitrary "syntactic actions" inside rules, which means arbitrary code that can dictate if the rule succeeds or fails. The rationale is that Cocos will not limit the programmer in any way: the parsers it produces are recursive descent, just like the typical hand made ones, it simply wants to be a tool to take all the "cruft" out of the parser, but the idea is that the developer is simply writing code to match a language: if "custom" code is needed, so be it. In this sense, the performance problem is put in the hands of the developer (where it belongs, IMHO): if one designs a language that absolutely needs unlimited lookahead, every solution will need backtracking and perform poorly, there's nothing that a tool can do to avoid it. So, Cocos will provide ways of writing clean and modular parsers, which should perform more or less like hand written recursive descent parsers, and that's it.
The above was about dealing with PEG's backtracking illness, now let's move on with the implementation ideas...
-
The framework should be generic, in the sense that its parsers will be able to operate on "symbols" of arbitrary types. This means the Unicode alphabet, or the ASCII one (why not sometimes?), or arbitrary tokens (when using a separate lexer), or anything that supports the basic requirement of being able to match what the grammar is looking for (so, any user defined type). This will allow Cocos to be used also for tree pattern matching, mostly like ANTLR does. This kind of generality should be easy to obtain basing the framework on generic types and interfaces.
-
Then, the actions to be executed when a rule matches should be specifiable at will (of course!), but there should also be some support for the most common cases. For instance, if one wants to build an AST it should be possible to have it typed (likely using some kind of factory), but it should also be possible to have an untyped AST "out of the box" with little or no effort.
Another implementation issue is: it's OK that grammars should be encapsulated in assembly files, but how should the grammar be represented in practice?
In this sense, we have a couple of major alternatives. Since each grammar rule is, in the end, a function that can try to match an "input stream", we could directly represent rules with methods.
With this approach, each rule would be the code that matches the language construct it describes. While this approach is interesting, it has a couple of major drawbacks: actually emitting that code is not easy before we have the framework in place, and (most important) examining rules is really cumbersome (one should parse the CIL instruction stream).
Another approach is to have methods that return data structures that in turn represent the rules.
A project that choose this approach is Irony, and I am planning to do something similar (one of the main differences being that Irony is not based on PEGs, so its grammars lack the composability I'd require).
Since PEGs do not require state tables, in the first implementation Cocos would not even require a "compiler-compilation" phase: a simple algorithm that interprets the rule structure will do the job. Only later, when we want to refine the system, we could write an assembly processor that optimizes rules into code (or maybe builds the FSM in the "regular language" case).
If each rule is an object we could run into problems when initializing recursive rules (where the object graph has cycles), but the problem can be solved using lazy initialization, maybe with some
help from reflection (in a way similar to what
Newspeak does).
The final issue that needs "sketching" is the interface between the rules and the actions to execute when they match. But, since each grammar rule will be an object, the easiest thing seems an event based interface, so that each rule can easily notify a match (or maybe also other events).
An event based approach could also have the side effect of providing debugging hooks for free: initially we could just implement a simple set of tracing actions, and later an application that loads a grammar and instruments it with tracing events that can potentially stop at certain conditions, and then lets the user inspect the parser state (maybe visually).
Given all the above, Cocos is taking shape, at least on paper...
now it's time to do it in practice!
Next week is the second Novell "hack week", so I'll spend it coding, and see what happens :-)