Massimiliano Mantione's Blog

RSS

../../massi.rss2

Email

massi@ximian.com

Docs:

I'm working on...

04 Feb 2008 (Permalink)

Cocos: "Compiler Construction Set", part 2 (state of the art)

So, let's dive into technology for modular compiler front ends...
...only to bang our head against the bottom of an empty pool!

Let's face it, there is still nothing like that really usable in production at the moment.

The language that comes nearer to the extensibility idea is likely Boo, and I absolutely love it, but it is extensible within the syntax already recognized by its parser, while Cocos has the broader goal of providing a truly extensible parser.

And by the way, languages of the Lisp and Scheme family do not qualify: they surely operate at the highest possible level of abstraction, but in my eyes they do so at the expense of not being "languages": they are a serialized form of the parse tree of a (hypothetical) human readable language. I mean, all those parenthesis are there to make so that code is data, and data can be handled as code, which is fine, but IMHO it makes writing code less practical. It is very elegant at the theoretical level, but I really dislike it in practice: I don't want the clutter of all those parenthesis when a machine could (and should!) do the bookkeeping for me. I want languages that bend to my needs, handling the code representation I want, whatever it is, because in practice syntactic sugar is good (both for readability and "wrist friendliness" when writing).

So, how do we build an extensible and modular parser?
Well, how do we build scanners and parsers in the first place? With parser generators!

Modern production-ready parser generators typically start from the formal description of a context-free grammar, and then try to derive a parser for it. The parser can be LL, LR, LALR, or a variant of those, but what matters is that the process is backwards: the programmer does not specify the parsing algorithm, but the abstract definition of a set of symbol sequences for which the parser generator tries to produce a parsing algorithm.

There can be ambiguities, and they must typically be handled in ad-hoc ways (or, when possible, eliminated refactoring the language description, which typically makes it uglier). Moreover (which is worse) language descriptions are not naturally composable: you cannot freely combine them and expect to obtain a parser for the "composite language" out of the box ("shift-reduce conflict" should ring a bell here!). And this kills our goal of having a generator for modular parsers.
This, by the way, rules out wonderful frameworks like ANTLR (which I really like, but as far as I know has no notion of "extendable parser").

A different formalism to describe languages is a "parsing expression grammar", or PEG. On the surface a PEG looks like a language description in BNF form, but its meaning is totally different: it is not "backwards" (the abstract description of a set of symbol sequences from which a tool will derive a parsing program), it is the direct description of the parsing program. Thanks to this, PEGs are composable without ambiguities, and are also very intuitive when you write them.

This, of course, comes with a price: they need to backtrack when a "rule" fails, and this makes the matching time exponential, which is bad. There is a technique (memoization) to cope with this issue, but IMHO it is a bit cumbersome, and has a very high memory cost, so I do not consider it the solution to make PEGs usable in all contexts.

This does not mean that all hopes are lost! PEGs look so promising (and are so nice) that a lot of projects are using them. There are various parser frameworks for programming languages like Ruby, Haskell and others (a good list is here). Some of these projects use memoization, others just accept the performance penalty, but none that I know of has found an effective way to solve the issue (and neither has goals as broad as Cocos). One interesting project is Katahdin, which like Cocos aims at having a parser that can be changed dynamically. But probably the most interesting is Newspeak, which has also been presented at Lang .NET 2008.

In fact, I've been thinking about Cocos for far too long, framing the issue, solving design problems, gathering ideas... now I'm seeing that "the time is right", there are several projects moving in this direction, so something good should come out in the end. Newspeak in particular seems to be very well thought out (of course, IMHO).

In the next post I'll focus on what should make Cocos unique in this landscape (otherwise it would be pointless doing it!), and how I plan to actually implement it.

All entries
This is a personal web page. Things said here do not represent the position of my employer.