Massimiliano Mantione's Blog

RSS

massi.rss2

Email

massi@ximian.com

Docs:

I'm working on...

14 Feb 2008 (Permalink)

Hack week interlude!

On Monday and Tuesday I have been implementing this crazy idea, but then I got stuck in a bad design problem.
The idea is a long term thing anyway... I have never thought to be able to do something related to compiler and language design this week, I only wanted to have a flexible parsing framework, which eventually will be used in compilers.
But even just that has proved being a bit hard. On Wednesday I was tempted to throw in the towel, and do something different (like helping with the regex engine reimplementation, which likely will be the hack week hit in the Mono project).

Then, I've been fascinated by this post by uber-hacker lupus.
It's obviously a hack, but a clever and cute one.

And it has something in common with the idea of extending compilers... so, I took a one day break to see if it was possible doing it in a different way, without changing the mcs C# grammar, but instead modifying mcs to accept arbitrary "extensions" that put into it code that is executed at compile time, and can in turn emit arbitrary code inside methods.

Thanks to Marek Safar, that was ready to help me when I could not understand what mcs was doing... here is the code I can now write:

using System; using Mono.CSharp.Extensions; #pragma compiler extensions enable namespace Test { public class Program { public static void Main () { IL.Ldstr ("Hello, world!"); IL.Call ("System.Console", "WriteLine", "System.String"); } } }

Of course, if one really wanted to insist in using IL syntax, he could write an extension that allows code like this:

using System; using Mono.CSharp.Extensions; #pragma compiler extensions enable namespace Test { public class Program { public static void Main () { IL.Emit (@" ldstr ""Hello, world!"" call void class [mscorlib]System.Console::WriteLine(string) "); } } }

The point is, once you have hacked mcs to allow extensions, you can go wild and write your own :-)
And before anyone tells me: yes, Lisp and Scheme macros are better at this.
The mcs patch is minimal: the diff is 178 lines, with exactly 124 lines added, and more than one third of the patch deals with putting in place the "#pragma" to activate the detection of extensions (if you don't enable the pragma mcs behaves normally).
Then, each extension has its own implementation, in an assembly clearly separated from the compiler, and which (of course!) is only needed at compile time and not at run time. For instance, the "inline IL" compiler extension looks like this:

using System; using System.Reflection; using System.Reflection.Emit; using Mono.CSharp; #pragma compiler extensions enable namespace Mono.CSharp.Extensions { public class IL { [CompileTimeInvocationAttribute("EmitLdstr")] public static extern void Ldstr (string s); public static void EmitLdstr (CompileTimeInvocation invocation, EmitContext ec) { ec.ig.Emit (OpCodes.Ldstr, ArgumentAsStringLiteral (invocation.Arguments [0])); } [CompileTimeInvocationAttribute("EmitCall")] public static extern void Call (string t, string m, string a); public static void EmitCall (CompileTimeInvocation invocation, EmitContext ec) { Type type = ArgumentAsType (invocation.Arguments [0]); string methodName = ArgumentAsStringLiteral (invocation.Arguments [1]); Type[] argumentTypes = ArgumentAsTypeArray (invocation.Arguments [2]); MethodInfo method = type.GetMethod (methodName, argumentTypes); ec.ig.Emit (OpCodes.Call, method); } ................ The idea is the following: each call to a method marked "CompileTimeInvocationAttribute" will not be emitted. Instead, the code of the method named in the attribute will be executed inside mcs in the "Emit" pass, and it has the responsibility of emitting the needed code.

Now, to be fair: this is a hack, and a bad one as well: extensions can use the internal mcs API, so changing mcs can break them. In practice, one is writing code that technically should be placed inside the compiler, but he is allowed to do it in a separated assembly.
Anyway, it works well, it's been fun, and I've learned a lot doing so!

And Cocos? Well, it's not dead at all. Taking the break allowed me to solve the design issue, and it's shaping up well. It still has the potential of being a good parsing framework. Particularly, it should make it very easy to re-parse only fragments of the program when they change (think when you are in an editor inside an IDE). And I am succeeding in keeping the semantic actions separated from the grammar, so that the same grammar can be reused in different tools. Only, it will not work by Friday :-(
So, more hacking ahead :-)

09 Feb 2008 (Permalink)

Cocos: "Compiler Construction Set", part 3 (implementation approach)

In a couple of previous posts I described an idea for a new parser generator (or, better, framework).
To sum it up, it has the following goals:

Given the above goals, I'll apply the following implementation strategies:

Another implementation issue is: it's OK that grammars should be encapsulated in assembly files, but how should the grammar be represented in practice?
In this sense, we have a couple of major alternatives. Since each grammar rule is, in the end, a function that can try to match an "input stream", we could directly represent rules with methods. With this approach, each rule would be the code that matches the language construct it describes. While this approach is interesting, it has a couple of major drawbacks: actually emitting that code is not easy before we have the framework in place, and (most important) examining rules is really cumbersome (one should parse the CIL instruction stream).
Another approach is to have methods that return data structures that in turn represent the rules. A project that choose this approach is Irony, and I am planning to do something similar (one of the main differences being that Irony is not based on PEGs, so its grammars lack the composability I'd require).
Since PEGs do not require state tables, in the first implementation Cocos would not even require a "compiler-compilation" phase: a simple algorithm that interprets the rule structure will do the job. Only later, when we want to refine the system, we could write an assembly processor that optimizes rules into code (or maybe builds the FSM in the "regular language" case).

If each rule is an object we could run into problems when initializing recursive rules (where the object graph has cycles), but the problem can be solved using lazy initialization, maybe with some help from reflection (in a way similar to what Newspeak does).

The final issue that needs "sketching" is the interface between the rules and the actions to execute when they match. But, since each grammar rule will be an object, the easiest thing seems an event based interface, so that each rule can easily notify a match (or maybe also other events).
An event based approach could also have the side effect of providing debugging hooks for free: initially we could just implement a simple set of tracing actions, and later an application that loads a grammar and instruments it with tracing events that can potentially stop at certain conditions, and then lets the user inspect the parser state (maybe visually).

Given all the above, Cocos is taking shape, at least on paper... now it's time to do it in practice!
Next week is the second Novell "hack week", so I'll spend it coding, and see what happens :-)

04 Feb 2008 (Permalink)

Cocos: "Compiler Construction Set", part 2 (state of the art)

So, let's dive into technology for modular compiler front ends...
...only to bang our head against the bottom of an empty pool!

Let's face it, there is still nothing like that really usable in production at the moment.

The language that comes nearer to the extensibility idea is likely Boo, and I absolutely love it, but it is extensible within the syntax already recognized by its parser, while Cocos has the broader goal of providing a truly extensible parser.

And by the way, languages of the Lisp and Scheme family do not qualify: they surely operate at the highest possible level of abstraction, but in my eyes they do so at the expense of not being "languages": they are a serialized form of the parse tree of a (hypothetical) human readable language. I mean, all those parenthesis are there to make so that code is data, and data can be handled as code, which is fine, but IMHO it makes writing code less practical. It is very elegant at the theoretical level, but I really dislike it in practice: I don't want the clutter of all those parenthesis when a machine could (and should!) do the bookkeeping for me. I want languages that bend to my needs, handling the code representation I want, whatever it is, because in practice syntactic sugar is good (both for readability and "wrist friendliness" when writing).

So, how do we build an extensible and modular parser?
Well, how do we build scanners and parsers in the first place? With parser generators!

Modern production-ready parser generators typically start from the formal description of a context-free grammar, and then try to derive a parser for it. The parser can be LL, LR, LALR, or a variant of those, but what matters is that the process is backwards: the programmer does not specify the parsing algorithm, but the abstract definition of a set of symbol sequences for which the parser generator tries to produce a parsing algorithm.

There can be ambiguities, and they must typically be handled in ad-hoc ways (or, when possible, eliminated refactoring the language description, which typically makes it uglier). Moreover (which is worse) language descriptions are not naturally composable: you cannot freely combine them and expect to obtain a parser for the "composite language" out of the box ("shift-reduce conflict" should ring a bell here!). And this kills our goal of having a generator for modular parsers.
This, by the way, rules out wonderful frameworks like ANTLR (which I really like, but as far as I know has no notion of "extendable parser").

A different formalism to describe languages is a "parsing expression grammar", or PEG. On the surface a PEG looks like a language description in BNF form, but its meaning is totally different: it is not "backwards" (the abstract description of a set of symbol sequences from which a tool will derive a parsing program), it is the direct description of the parsing program. Thanks to this, PEGs are composable without ambiguities, and are also very intuitive when you write them.

This, of course, comes with a price: they need to backtrack when a "rule" fails, and this makes the matching time exponential, which is bad. There is a technique (memoization) to cope with this issue, but IMHO it is a bit cumbersome, and has a very high memory cost, so I do not consider it the solution to make PEGs usable in all contexts.

This does not mean that all hopes are lost! PEGs look so promising (and are so nice) that a lot of projects are using them. There are various parser frameworks for programming languages like Ruby, Haskell and others (a good list is here). Some of these projects use memoization, others just accept the performance penalty, but none that I know of has found an effective way to solve the issue (and neither has goals as broad as Cocos). One interesting project is Katahdin, which like Cocos aims at having a parser that can be changed dynamically. But probably the most interesting is Newspeak, which has also been presented at Lang .NET 2008.

In fact, I've been thinking about Cocos for far too long, framing the issue, solving design problems, gathering ideas... now I'm seeing that "the time is right", there are several projects moving in this direction, so something good should come out in the end. Newspeak in particular seems to be very well thought out (of course, IMHO).

In the next post I'll focus on what should make Cocos unique in this landscape (otherwise it would be pointless doing it!), and how I plan to actually implement it.

03 Feb 2008 (Permalink)

Cocos: "Compiler Construction Set", part 1 (framing the problem)

In this post I'll be talking about a project I'd like to start in the near future (quite likely the next Novel "hack week"), which, like most free software projects, will come into existence because I have an itch to scratch. It is a very experimental project, and while I think my ideas are sound and useful I don't really know how it will go in the end. Nevertheless, this itch needs to be scratched!

Particularly, I have several gripes with modern compiler technologies (and programming languages).

Starting with languages, there's far too many of them. Not that this is a bad thing in itself: there are good reasons for it, and I understand them well. I also happen to use different languages in different contexts myself, and don't regret it at all.
What I dislike is the difficulty we have integrating them.
In this sense, the .NET framework is a big step in the right direction, because with its language independence it allows you to seamlessly mix and match components written in different languages. All you need is a compiler for your favourite language that can emit a .NET assembly.

And here comes the first gripe about compiler technologies: writing compilers is still needlessly hard. Of course a full blown compiler is a complex piece of software, and I don't expect this inherent complexity to magically go away. But let's face it, most programming languages share a lot of basic constructs, and it's a shame that efery time we write a compiler we typically redo everything from scratch.

My next gripe with programming languages integration is that until now it is typically done without mixing them in the same source file. And when it's done that way, only one compiler drives the compilation, and just embeds strings that will be evaluated at run time by different compilers (like when we embed SQL). This misses completely the integration between languages in a unified context, with proper type and semantic checks.

Now, LINQ is an answer to this problem in the SQL case. But let's look at this answer: it took a lot of effort to have LINQ in the first place, requiring major compiler updates, and it solves the problem only for one domain: data access and filtration. So, my next gripe with compiler technology is that I would like a truly extensible compiler, so that if I feel the need to write "from some-source select some-access-code where some-predicate" I can do it right away, extending the language myself.
Or if I want to put there an XML literal, the XML will be parsed at compile time, its structure will be checked against the needs of the code that will work on it, and no string with the full XML will be embedded in the resulting executable at all.
Or, more generally, if I need a small DSL (Domain Specific Language) I can implement it with minimal effort, then write snippets of my DSL directly inside my code, and everything will just work. These language extensions should then be distributed and handled just like libraries, in the sense of software components: simply having the required assembly file available at compile time should be enough for the compiler to use the extension.

This way the compiler would become truly modular. And this modularity should be exploited also in another direction: to avoid rewriting the same compiler all over again. Nowadays, if you want a full blown C# development environment, you start with a C# compiler. Then you need an IDE, which also must be able to (partially) understand the language, so you write a (possibly half assed) compiler for the IDE's needs. And then, when you need a debugger, it needs to understand language expressions, so you write another one. And if you need one more source code analysis tool... you are back to square one.
All of these (partial) compilers typically share nothing at all. Since they have different needs, it is perfectly OK for them to be different pieces of software, but it seems strange to me that they cannot reuse anything: at the very least the front ends should be a common module.
But this, with current compiler technology, is a dream: it's non impossible, but it would be so hard that we just don't do it.

This is why I am focusing on compiler technology and not programming languages: I know that the current babel of languages is not going away anytime soon, even only because different people have different syntactic tastes. However, I have the suspect that the current babel is made much worse by the fact that we are not able to integrate the different languages properly, and most of all because when we have a language we cannot adapt it to our real needs, so we jump to a different one that (not being properly integrated with the rest) increases the mess.

So, to sum it up, my main gripe with compiler technology is that compilers are not modular enough, which decomposes into:

If we stick to the .NET world, the situation for back ends is actually not that bad. For the static case, Cecil is a very good start for the low level part, and Milo should be perfect when it's ready. And for the dynamic case, the DLR seems there to help us. Therefore, I will concentrate my gripes on compiler front end technology.

On the next blog post...

03 Jan 2008 (Permalink)

Comparing profiling APIs

Implementation troubles

Lately I've had some troubles implementing heap-shot-like functionaliy in the new profiler. I wanted to have it working before committing the code, but as you'll see this will take some time, so I will commit it earlier...

The reasons for the port are, essentially, performance (heap-shot uses the Mono runtime API to walk every field of every object, and takes-releases a lock at every allocation), and having a 64-bit port of heap-shot.

Integrating it in the new profiler would allow me to reuse all the event logging and file writing machinery that is already in place there, which is fairly efficient (or at least is supposed to be so!). However, all this machinery had some problem when applied to heap profiling. The logging profiler maintains a global hash table of mappings from MonoClass* values to numeric unique IDs. The table is used so that the external file reader does not depend on raw pointer values, and also because writing IDs take less space than writing large values. This table, however, is not kept up to date all the time (which would require taking a lock at every profiler event). Instead, each time it is necessary to flush the events to disk, the event buffers are scanned and the table is completed (and flushed!) just before writing the event block.

This is perfect for the logging profiler, but causes lots of bad interactions when working inside the GC profiling event handlers. In this context, all the threads registered with the runtime are stopped. Therefore, if the GC hook tries to update the table, it has to take the profile lock, but it cannot because one of the stopped threads could have taken it (and, being stopped, it's not going to release it until the collection stops...).

In the last week, trying to fix all the profiler bugs before committing, I've been dealing with various kinds of deadlocks, or troubles because the table was not complete. I managed to implement the MonoObject* buffers between the allocation and GC hooks in a completely lock-free way, and split the snapshot and file writing jobs so that the snapshot (which must happen inside the GC hook) does not strictly need the mapping table.
However, I overlooked a basic fact: I implemented the object scanning in the heap snapshot using per-class bitmaps that describe the places where references are in the object (for efficiency), and I decided to store those bitmaps in the class-ID mapping table (because it seemed reasonable: it is a table where the MonoClass* is the key, it seemed stupid keeping two such tables inside the profiler!).

Of course, this conflicts with the assumption that the GC hooks cannot use that table... of course there are workarounds for this, but after putting so may workarounds in place, I started thinking that maybe the profiler could get some more help from the runtime. After all, it is this separation of the profiler from the runtime that is causing the issue, so maybe extending the profiling API could give us a more elegant solution.

In this context, it makes sense to see what the other two major VMs out there (the Sun JVM and the Microsoft .NET CLR) are doing. So (again, on Paolo's suggestion) I examined their profiling APIs.

The Microsoft CLR profiling API

Online resources for MS .NET profiling API are here and here. But the real reference is the "Profiling.doc" file distributed with the .NET SDK.

Comparing this profiling API to the Mono one, the first thing we we see is that they are very similar. In the bulk of the load-unload (and method enter-exit) events, there are just minor differences:

There is, however, a feature which would be really handy for me in implementing a proper heap (and CG) profiler: they have "runtime suspension" callbacks. In practice, a profiler is notified of the moment when the GC will suspend all the threads which are registered with the runtime: it is guaranteed that none of those threads will execute managed code after that callback. Of course, also the "reverse" callback is present (for when the suspend action is done).
This would be really useful for my heap-shot implementation: the suspend hook would make sure that no thread still has the profiler lock before the GC starts, and when the suspend is done that the mapping table is ready. This way the GC hook could safely use the table to access the bitmaps that describe where each object has reference values.

But the major difference related to heap profiling is the fact that the .NET CLR actually already implements a heap-shot equivalent. And, thinking about it, this really makes sense: in my implementation I am in fact duplicating some work that the GC already does, like:

And it's not over: they also report all the roots they find on the stacks (and registers), which our heap shot will have a very hard time doing, while the GC already does it anyway.

So, we should decide what to do:

The Java profiling API

The online reference for the JVM profiling API is here. However, there has been a transition to the "tooling" API, JVMTI, so I will only consider this last one.

Comparing the JVMTI profiling API to the Mono one, we see many more differences, because the whole approach is different.

First of all, there are no fast callbacks for method enter/exit and things like that. Instead, the profiler can instrument the application bytecode directly, and the profiler hooks can be implemented in managed code, to avoid the transition costs and eventually inline the callbacks if they are small enough.

Then, there is the (complex) concept of "JVMTI context", how it must be used, and especially how references must be managed. And the JVMTI is meant to implement several kind of tools, including debuggers, so the API is naturally more comprehensive than the Mono profiler API.

For the purpose of this comparison, it suffice to say that the heap analysis functions are really complete, and also look relatively heavy, with many callbacks and the possibility to arbitrarily tag individual objects and classes and use the tags as filters for the callbacks.

The JVMTI reference explicitly states that they presumed that a "batch" approach (working with arrays of references) would have been more efficient (in the sense of having a better throughput), but tests proved that it's not the case, so they resorted to the more flexible callback approach.

In the end...

We should decide what to do in our profiler :-)

19 Nov 2007 (Permalink)

Mono and oprofile profiler news

Lately I've been working on the Mono profiler.
Most of the work has been on implementing a new profiler, a prototype implementation can be found here (but beware, that patch no longer works in current svn).

To sum it up, this is not a revolution because the profiling API stays the same, it's just a reimplementation of the current default profiler.

This is motivated by a few issues we have with the current profiler:

The main differences of this new profiler:

These days I'm working on a managed implementation for the decoding library (which processes the files produced by the profiler), I will show it in a BOF at the Mono Summit 2007.

But the real topic of this post is another profiler related work: we can finally integrate oprofile with Mono!
Specifically, managed method names will show up in the opreport output, as shown in this screenshot of the oprofile-viewer GUI: OProfile Viewer

One nice aspect of oprofile is that it works automatically, without requiring instrumentation of the application code. And it can profile the whole system, including the kernel, so one can have a global view of how the application spends its time. Moreover, it can also profile things like cache misses, branch predictors behavior, and in general everything that's measured by the CPU performance counters.

The only thing is needs to provide a meaningful output is that the elf files of the profiled programs provide symbol names for their functions... which obviously is not true for jitted code.
However, there is interest is using oprofile also for jitted code, so to cope with this, the oprofile developers put in place a solution: generate fake elf files for the anonymous memory regions where jitted code resides ("anonymous" because they are not related to any file on disk), so that the oprofile reporting tools will find the symbols they need in these files, and work unmodified. Of course, these symbol names must come from the JIT itself. For this purpose, oprofile now exports a very simple API, and a library that contains the functions that the JIT must invoke to emit the symbols. Their work was mostly oriented to Java VMs, but the discussion on the oprofile mailing list kept it VM-neutral.

With time, this code in oprofile has matured, and at the beginning of October they created a CVS branch that tracks these efforts, so that now it has been very easy to fetch it and make it work with Mono. There were rough edges, but thanks to Philippe Elie and Maynard Johnson on the oprofile IRC channel we sorted them out quickly, so now it works!
The only downside is that call graphs are not supported for now. I won't enter in the details, but it will take a while to have them working (and BTW, they will likely work more for Mono than for Java, because due to the mixed JIT-interpreted nature of the JVM they don't really have complete stacks that oprofile can examine).

So, the oprofile code is already available in their "JIT_SUPPORT" cvs branch; I will likely commit the Mono code in the following days or during-after the summit... but for the truly impatient, here it is!

As you can see, it's really, really easy... enjoy :-)

29 Oct 2007 (Permalink)

Unforeseen absence

During last weeks my son's health got really bad. He had been suffering due to his adenoids in the last year, but we wanted to avoid surgery if possible...
The situation got better during summer, but now it is not bearable anymore. Among other things, during the night at times he simply cannot breath at all... so the doctors (and we) decided that he'll have surgery urgently, to take his adenoids and tonsils off.

For reasons described here (sorry, it is in Italian, it essentially describes a new and better way to perform that surgery) we will go here, which is about 200 Km from where we live.

Therefore, I will simply disappear for a few days... and hopefully this will solve my son's problems!

Updated November 1st, 2007
We are back from the hospital, and everything went well.
For the first time since more than one year, Michele's oxygen saturation is at 99% most of the time, and absolutely never below 95% (before he had frequent apnea periods when the saturation dropped below 70%). The convalescence is rather painful for him, which is not a nice thing to watch, but it is really worth it... and this time we know that in a week everything will be over!

10 May 2007 (Permalink)

Trimming VTables in the Mono runtime

As everybody knows, a "VTable" (short for "virtual table") is a data structure used to lookup methods when they are called using dynamic binding (which in Java, C++ and .NET is called doing "virtual calls").
In C++ they are generated by the compiler, while in Java and .NET they are managed by the runtime.

Whet is less known is that Java and .NET, allowing multiple inheritance from interfaces (in the sense that every class can implement several interfaces), make building VTables a bit tricky.

The problem is not listing all the methods which a class implements and can be called with a CALLVIRT opcode, which is easy. The real issue is that the CALLVIRT opcode, as implemented in the JIT, must find the required method fast: nobody wants method call operations to be inherently slow.

It's here that the multiple inheritance issue shows up: typically, a VTable is just an array of function pointers, and each method has its slot. The CALLVIRT opcode must produce a call instruction that picks the right slot. If the this pointer is typed (statically, at compile time) as a class, things are easy, because the slot positions are kept constant along the inheritance hierarchy (just like the layout of fields), thanks to the single inheritance between classes.
For instance, if method M of class C has slot 3, it will have slot 3 in every class derived from C, so the CALLVIRT opcode just needs to access the VTable at that slot and the correct method will be called.

Now, consider interfaces: method M in interface I could be implemented at slot S1 in class C1, slot S2 in class C2... in practice, it could be in a different slot in every class that implements I. Therefore, the CALLVIRT opcode must use some sort of lookup mechanism to find the right slot for any given instance of a class that implements I.

In this paper the issue is explored in detail, and several solutions are explained. In any case, the problem is the tradeoff between time and space: you can have very fast VTable access (constant time, like array accesses), or you can have small VTables, but you cannot have both.

The solution used in Mono until now is conceptually similar to the "Selector Indexed Tables" described in the paper (even if very different in practice). For each interface we generate a unique numeric ID at runtime (let's call it UIID). In the VTable, beside the space for the class methods, we reserve one "region" for each implemented interface, where the pointers to the interface methods are replicated; inside each of these regions the pointers have exactly the same order as the method declarations in the interface itself. Then we have, beside the proper VTable, a table that maps, for each UIID, the offset of the corresponding "interface region" inside the VTable (we call this table "interface_offsets").

This way, the CALLVIRT opcode must access interface_offsets using the UIID to find the right VTable region (simple array access), displace it with a constant corresponding to the method slot in the interface (a simple addition), and the method is found.

The downside is that "interface_offsets" is an array, and it must be as large as the highest UIID of the interfaces implemented by the class (and is then typically full of "holes", unused slots corresponding to UIIDs of unimplemented interfaces).

As Miguel noted, the use of generic types makes the number of interfaces in the runtime exceptionally high. Every class which implements even just one interface is then forced to have a very large "interface_offsets" table, only because its interface happens to have a high UIID. The more interfaces are generated instantiating generic types, the higher the UIIDs... In an application like Banshee, the runtime ends up allocating one Mbyte of RAM only for tables which are mostly empty, to support the execution of one Mbyte of jitted code.

And the situation is even worse than that!
Actually, the runtime maintains two kinds of VTables: one for "metadata handling", inside MonoClass, which is an array of MonoMethod pointers, and another for the JIT, which is MonoVTable and which contains the actual pointers to the machine code (and which for this reason is usually specific to each application domain). Needless to say, both tables need their related "interface offset mapping" table, even if the "mono_stats" only keep track of the memory allocated for the second kind of VTables (the "proper" ones, MonoVTable). So interface offset tables waste their space twice.

To put a remedy to this, we decided to implement IMT (Interface Method Tables), as described in the paper linked above. Briefly, IMT replaces "interface_offsets" with a small fixed size hash table. Note that the key hash functions are computed at JIT time, and not at run time, so method calls, in the absence of collisions, are roughly just as fast as with the current implementation.
Of course the tricky part is handling collisions. About this, one must realize that the IMT contains pointers to code, not data, and it is used as a jump table by the JIT. So, while an hash table, as a data structure, handles collisions with lists, the IMT handles them with thunks of code which have the capability of resolving the collision, and which the runtime must generate when it fills up the IMT slots.

In a preparation for implementing this, in current SVN the "interface_offsets" field is gone from MonoClass (which means, the "metadata handling" virtual tables do not waste memory anymore). It has been replaced by a pair of compact arrays (which have the same function), and a bitmap used for "ininst" checks by the JIT.

The next step is actually implementing IMT in MonoVTable, which I'll do in the following weeks...

09 May 2007 (Permalink)

I acknowledge that I have stayed...

...without blogging technical content far too long!

This is going to change soon (I mean, these days... crap, seeing Miguel sort of forced to explain what I'm doing because I am not blogging is really shameful).
Meanwhile, I announce a couple of Mono events here in Italy: I'll be present in OpenLabs on May 21st, and Novell organizes a Mono Day on May 23rd.

Both events can be a good occasion to meet some friend!

My next blog entry will be about IMT, and then I'll explain what's going on with the new regalloc...

12 Feb 2007 (Permalink)

This period shows no sign of stopping...

...but I should be over it :-)

This blog entry is mostly for all my friends that sow me "disappear" suddenly, from IRC and the web in general.

To say it flatly, I've been a bit "burn out".
I had too many things to face, all in a row, I did some mistake and they all summed up, so that in the end I lost a lot of energy...
I't pointless making a list of what happened since September (it would be too long, really!), but it involves:

Of course I know there are guys in situations much worse than mine.
Anyway, the major symptom of my loss of energy was that my working performance gradually dropped to zero. Which means, I tried to get things done, but had no results.

I finally decided to have a period of pure vacation, and generally take care of myself and get some rest.
The vacation helped a lot, in spite of the fact that, again, both my son and my wife were ill, so I didn't rest that much. But what helped me the most, ironically, is that I got ill last Friday! Whoever knows me also knows that I seldom feel bad (in the last fifteen years or so it happened three times, and before I hardly remember the details...), but this time it has been useful: mostly thanks to my wife, I spent three days literally sleeping, resting, and doing nothing at all!

So, now I feel better... today I am still officially in "resting mode", but I turned on the notebook anyway, and started working off my mail backlog.
The only sore point is that my son is still ill, and yet nobody understands what he really has. From time to time he seems to heal, the remedies seem to work, but then his situation regularly gets back to bad. This will keep us busy, and obviously a little worried, but at the very least I feel better :-)

01 Dec 2006 (Permalink)

These things are part of life.

Today, in the early morning (2am), my mother passed away.
We already had a sort of miracle before, but these things can only be postponed, not avoided.

Anyway, given that we cannot decide when these things happen, we should care about the how, which is IMHO much more important.
In this sense, I cannot imagine how it could have been better: my mother has been serene until her last breath, she was conscious, she could recognize us, and we had the chance to stay near her all these days and nights, here in Sicily. And, even more important, she was serene knowing what was happening to her, and she just accepted that as a fact of life. She really passed away "in God's Grace", which is marvelous: we could never ask for more.

Of course, faith helps: it helped her, and it helps me and my wife. And praying gives that kind of peace that you generally cannot get anywhere else. Also, I know I'll meet her again... it's not that I "believe" it, I know it just as I "know" that I have two hands, or that the sun will rise tomorrow. And she already gave us signs of her presence.

In the end, I want to thank everybody that stayed close to us. It's good feeling that we are not alone.

We will give her the last farewell tomorrow, and her body will rest here, in the village where she was born.
We will likely come back home (I mean near Milan) at the end of the following week, all together (me, my wife, my son and my father).

25 May 2006 (Permalink)

Chasing elusive JIT bugs...

%*@#!
Back from Sicily, I debugged the "last" issue related to deadce, but as soon as I committed the fix and re-enabled the five options described here, more amd64 specific issues surfaced.

The problem is that I am not able to reproduce the bugs in any way on the 64 bit machine I have access to.
I really need the bug reporters to help me in finding the problem.

But, since instructions on how to isolate the source of JIT problems without having JIT knowledge could be interesting to many, instead of just mailing them privately I'll publish them here (and then copy them to the wiki, too).
The only knowledge that is assumed is that you can rebuild mono from source, you have some C programming knowledge so you know how #define works, and you can search the code with grep.

So, assume you have a test case (even a whole program) that always fails with a specific JIT option (or combination of options), and always works without that option.

The first thing to do, when a JIT option causes trouble, is identifying one single method that is compiled incorrectly when the "bad" option is enabled, and correctly when it is disabled. This method could be the top one in the stack trace in case of crash, but often it is not so, and anyway to avoid wasting time debugging correct code it would be nice to have a way to tell for sure if you have really found the right method.

For this purpose, I often instrument the JIT so that it applies certain options only to single methods, or anyway to a subset of the program's methods (by the way, to get a list of the compiled methods it is enough to execute the test giving one "-v" flag to "mono", and grepping for " emitted at ").

Inside the mini directory, you can grep for macros like "MONO_APPLY_DEADCE_TO_SINGLE_METHOD", "MONO_APPLY_TREE_MOVER_TO_SINGLE_METHOD", "MONO_APPLY_SSAPRE_TO_SINGLE_METHOD", "MONO_INLINE_CALLED_LIMITED_METHODS", and "MONO_INLINE_CALLER_LIMITED_METHODS" (these last two only since r61101). Each of them, when true, enables code that compares the current method full name with what's found in a specific environment variable, and applies the option only if the strings match.

The macros with "single" expext an exact match, while the ones with "limited" have a "starts with" behavior, which is more useful when searching "manually" (I sometimes use perl scripts to automate the search for the failing method), but anyway it is really easy to change this behavior.

So, suppose that "-O=inline" causes a problem. You could set "MONO_INLINE_CALLER_LIMITED_METHODS" to 1, rebuild, and start running the test setting the environment variable "MONO_INLINE_CALLER_METHOD_NAME_LIMIT" initially to "System", to see if the failing method is in the System namespace, and then refining the search.
Keeping a sorted list of all the methods that were jitted when the test passed can be useful to direct the search. Note that I said "when the test passed", and not "failed", because the failure could have cut the logs and made the list incomplete...

At some point, you'll eventually find one single method that changes the test result. Then, the best thing you can do is running the test with lots of logs enabled (ideally put the "-v" option five times on the command line, and enable all the relevant debugging macros like "DEBUG_DEADCE" or "DEBUG_ALIAS_ANALYSIS").

Be warned that the resulting logs can be huge!
There are two basic ways to produce smaller logs: just compile the failing method with "mono --compile ...", or put an "if" at the beginning of "mini_method_compile" that sets "cfg->verbose_level" to 5 if the method name matches (at that point you can just hard code the string in the source!).
I prefer the second method because it does not alter the test run (so that if the error was dependent on something else, like having a class constructor already executed, you are in the same condition, otherwise you could end up chasing ghosts).

When you have those logs, please send them to me, they are generally enough to understand what went wrong :-)

19 May 2006 (Permalink)

Back from Sicily.

As somebody knows, I've made a trip to Sicily lately... and I'm back today. It's not been exactly a "vacation"... basically, my mother is badly ill (a tumor, the doctors clearly said that she will not recover, and explained us how she will likely die, they just don't know when).
She was born in Sicily, and wanted to see her brother, her sister, and many other persons there... so she and my father made a trip there, and I followed them with my wife and my son.

The result has been... probably some of the happiest days we ever had together! And the nice thing is that she also feels much better.

Now, how much better is difficult to say... certainly we've all seen her doing things we never thought she could do again.
I do believe in miracles, but before saying something is really a miracle I must have hard facts supporting it. Next Tuesday we will have the results of some medical exams she had done a few days ago, then we will see what's really going on there...

Until then, I'm just really happy of the time we had.
Even only this is already a miracle!

13 Apr 2006 (Permalink)

Chasing a ghost, and Microsoft JIT inlining behavior

Finally, the tree mover (and all the optimizations described here) is in svn, but that optimization set is not enabled by default.

The problem (which showed up as soon as I committed the patch) is that inlining breaks several pieces of code, most notably MonoDevelop.

All the issues are related to calls to MethodBase.GetCurrentMethod(), Assembly.GetExecutingAssembly(), Assembly.GetCallingAssembly() and any of the StackFrame constructors (thanks to lupus for this list!), and are related to the fact that inlining a method alters the chain of stack frames.

To fix this in a reliable way I wrote a small program to use as regression test, you can find it here. It must be split in two assemblies otherwise most of the tests don't make sense, but it is really easy to understand.
It also immediately shows the problems: with Mono from current svn, executing with "-O=-inline" prints five "PASSED" messages, which turn into five "FAILED" ones with "-O=inline".

So far, so good... Initially it seemed that a good solution would be checking if any of those methods was called, and if it was, refuse to inline the caller. However, after a few trials and errors, I found out that the proper fix is inlining only leaf methods (I mean, methods which don't call any other method), with the exception of internally generated wrappers, which can be inlined safely.
The rationale is the following: suppose you have a caller method CM, in assembly A, and a small called method SM, in assembly B. SM, then, calls another large method, LM (which for simplicity we think will never be inlined), which in turn calls for instance Assembly.GetCallingAssembly(). Now, if SM is inlined, it looks like if LM is called from CM (from assembly A), and not from SM (from assembly B): this makes so that the return value of Assembly.GetCallingAssembly() inside LM changes.
Note that inside SM there is nothing hinting for this, just a call to a method (LM). Finally note that this is true also if LM is a constructor, which means that also methods which call constructors cannot be inlined (which means any method that creates objects!).

To be fair, there is another reasoning we can follow which allows us to inline more methods: it is not the fact that a method merely contains a call or new CIL opcode that should prevent inlining. Instead, it is the fact that in the jitted code there's a call to potentially unknown code (either methods or constructors), which in turn could depend on the "correctness" of the chain of stack frames. This way of reasoning would allow us to inline nested calls, as long as none of them ends up calling a not-inlined method (or an inlined wrapper, because it will in the end call another piece of unknown code!).

But, regardless of the strategy we'll choose, fixing this is not hard, and in fact I did it quickly.
But even when in the regression test program I got all tests passing, the MonoDevelop splash screen was still not appearing, and scary warnings were logged every time I had inline enabled... To make a long story short, I wasted way too much time chasing a ghost: I was convinced that inline was doing something wrong (the problem was there only when inline was enabled), and I could not find what it was.
Then, trying to reproduce the problem with a smaller program, I got it: inline was doing nothing wrong, it was just exposing to the JIT the load of a static field a bit earlier. This static field was the SplashScreemForm singleton in MonoDevelop, and since the class is "beforefieldinit", the JIT initialized the field on the spot. SplashScreemForm is a Gtk.Window, so it got created before the GTK# initialization, and found itself without a display... bad thing.
However, the proper fix is just adding an empty static constructor to SplashScreemForm, so that it gets the strict field init semantics. The last line logged by the sample program I linked above shows exactly this kind of behavior.
And anybody making a GTK# Window a singleton should take note to remember to remove the "beforefieldinit" from the class, and access the singleton only after the GTK# Application has been initialized (I did the same mistake trying to reproduce the bug!).

So, in the end we'll have a JIT that always respects the structure of the call stack, but before you write code that needs this, think twice: the Microsoft runtime is not so strict ;-)
If you execute the test program on FX 1.1 or 2.0, you'll see that both of the Assembly.GetCallingAssembly() related tests fail (thanks to Sebastien for running the tests for me). The code pattern reproduced in the second one (in constructors) is sometimes used to load images from resources when building graphical objects, so code like this could work on Mono, and be broken on Windows...

27 Mar 2006 (Permalink)

Almost Easter, and still moving trees around...

Lately I've been mostly working on the "tree mover", which is described in this post by Miguel.
As you can see, it is not an optimization by itself, it is a code transformation necessary to help our current register allocator, and in fact the plan is not to use it alone, but to help inlining. In practice, to get the full benefits five more optimizations must be enabled starting from the current default in the Mono JIT:

As you can see, all these passes are meant to work together in a "pipeline". After a lot of hammering on them, I have seen concrete results on some real and complex benchmark, like XMLMark.
Working on this benchmark, first of all I modified the ini files, to be able to see exactly what the JIT was affecting; so I created five ini files, one for each XMLMark functionality (each functionality gets weight 100, and all the others stay at 0):

I also had to set the number of agents to 1 to get reliable results (the multithreaded nature of XMLMark was getting in the way, and after all I was interested just in the effects of JIT settings).

Then, I tested with a lot of option combinations:

Options
LabelOptions
ALLall
ALL-NO-SSAall,-ssa,-ssapre,-abcrem
FASTconsprop,copyprop,treeprop,deadce,inline
NO-DEADconsprop,copyprop,treeprop,-deadce,inline
NO-INLINE-DEADconsprop,copyprop,treeprop,-deadce,-inline
NO-INLINEconsprop,copyprop,treeprop,deadce,-inline
NO-INLINE-TREEconsprop,copyprop,-treeprop,deadce,-inline
NO-PROP-consprop,-copyprop,-treeprop,deadce,inline
NO-PROP-INLINE-consprop,-copyprop,-treeprop,deadce,-inline
NO-TREE-DEADconsprop,copyprop,-treeprop,-deadce,inline
NO-TREEconsprop,copyprop,-treeprop,deadce,inline
NOTHING-consprop,-copyprop,-treeprop,-deadce,-inline

Of these combinations, "NOTHING" is the current JIT default, "FAST" is the full set I'm testing (which ideally will become the default), "ALL-NO-SSA" means "everything the JIT can do without using SSA", and all the "NO-something" are the "FAST" set without something, to see if all the options are really necessary to get the best results.

These are the numbers I got:

XMLMark results, in transactions per seconds
XMLMarkDOM-SELDOM-MODDOM-STRDOM-SERSAX-SEL
ALL53.4050.20 50.9554.6015.40
ALL-NO-SSA53.4050.50 51.1055.7515.25
FAST53.7550.85 51.4556.0015.95
NO-DEAD52.5049.85 50.0554.1013.15
NO-INLINE-DEAD47.8545.45 45.1049.2013.10
NO-INLINE50.1547.05 46.2551.2015.45
NO-INLINE-TREE50.8547.55 47.8552.1515.35
NO-PROP53.0550.55 50.4554.7015.45
NO-PROP-INLINE49.6046.90 47.1050.9515.25
NO-TREE-DEAD51.9049.20 49.9053.9015.75
NO-TREE52.7049.90 50.2554.9015.75
NO-TREE52.7049.90 50.2554.9015.75
NOTHING51.0547.70 47.8551.7515.00

Here are the same data in graphical form:

XMLMarl results chart

To make some sense of it, here's a chart of the gains against the "NOTHING" options set, as percentages:

XMLMarl gains chart

Some easy observations:

The (small) downside of these gains is that they cost about 25% more JIT time (this is the cumulative overhead of the "FAST" set). This is clearly visible in programs which exercise a lot of code but are very fast, like a mcs bootstrap: in this case the "FAST" set loses 1.33% (it takes 3.86 seconds instead of 3.81). However, uning a custom mcs driver that performs a second compilation (a sort of "hot run", because all the code is already jitted), and making sure that also the "cold run" had all the source files in the system buffer cache, shows that the "FAST" set gains 1.85% against the current default. This means that longer compilations should not suffer at all from the JIT overhead.

Finally, these results are on x86... on amd64 I noticed the need for further tuning in the linear regalloc spill costs: the gains are still there, but in some cases they are smaller. And also other benchmarks need some more tuning.

Anyway, this code should hit svn "real soon now", and be part of Mono 1.2 (so that nobody should worry about the overhead of wrapping fields with properties anymore!) .

Next in my todo list is getting rid of the SSAPRE issues we have...

13 Mar 2006 (Permalink)

So you want to tune the JIT... but where are the knobs?

Certain optimizations are always a win, so, if they don't take too much JIT time, they should be always turned on. Examples are constant folding, using intrinsic functions, reorganizing branches, or also dead code elimination (again, if the JIT time is small, which for now in Mono means without using SSA).

Other optimizations, however, are more problematic... sometimes they are good, other times they don't optimize at all: they make worse code. In this case, some tuning is needed: the JIT must decide where the optimization is useful, and where it is useless or even harmful.

A typical example of this in Mono is SSAPRE, which looks good on paper (it eliminates redundancies), but actually makes a tradeoff between the cost of space (to store precomputed values) and computations, and this tradeoff is not always obvious.
So far I've seen the following places in the JIT where this kind of tuning is important:

So we can (very roughly) consider each of the parameters described above as a knob we can use to "tune" the JIT: changing the threshold used to take a decision, the JIT behavior is altered.
And obviously there should be an optimal tuning point...

But things are not so simple.
First of all, it can happen that one knob influences another. For instance, more inlining can increase the pressure on the regalloc, which means that the behavior of the regalloc knob will react differently. Then, not all CPU architectures can react to the tuning in the same way. One of my latest posts to mono-devel-list is an example of both things: I changed the way in which the spill cost takes into account the BB nesting level from "(1 + (bb->nesting * 2))" to "(1 << (bb->nesting << 1))", and got a 25% speedup in Scimark2. But I cried victory too son: I also made it 2% worse on x86... a less aggressive setting like "(1 * (bb->nesting << 1))" is more a compromise (only 10% more on amd64, and not worse on x86), but I still have to look at the generated code better and find the optimal point (or maybe decide that there is not an architecture independent optimal point).
Now, this can be reasonable, because on amd64 the JIT uses SSE instructions for floating point operations, where the regalloc works "properly", while on x86 the FPU is used in the traditional "stack based" way.

The point of all the above it that exploring this "tuning space" can be a long (and sometimes tedious) process, often driven more by "sense of smell" than anything else.
But "sense of smell" is not scientific, and the "boring" side of the equation makes it hard to gather results in a rigorous way. So I wrote a small script to automate the process, which can be found here together with some benchmark.

Mind you, it is rough, my perl-fu is poor, some paths are hard coded, real men use sed to modify files in place instead of stupid perl loops... but it gets the job done. It is meant to be used directly from the "mini" directory in the JIT, and produces lots of nice result files that can then be examined quickly.
And all the boring part, edit source file, save, build, execute benchmarks n times, save results with a meaningful name, repeat, do the same on a different machine... well, it's gone, freeing my time for something better!

So, if you are interested in JIT tuning, and have a machine with some spare cycles to run benchmarks, please give a look to that script. It is fairly easy to describe your own set of benchmark runs in an array of perl hashes, and it will modify the JIT source, rebuild and run all the benchmarks for you in sequence.

And of you have interesting results, let me know!

In the next weeks I'll do the following:

Of course, if nothing more important shows up!

21 Nov 2005 (Permalink)

Spreading the word

Some news... November 26th is national Linux Day. I'll be giving two talks on Mono, here and here.

These are the slides I'll use: SXI (OpenOffice), PDF and text (this one freely redistributable). Not that I'll be able to do the full presentation in the time I'll have, but I prefer to have all the slides anyway.

Moreover, an article I wrote will be published on this Italian magazine on December 1st. It will take eight full pages, and is also the subject of a fancy cover:

Cover showing a monkey helping a penguin out of a Windows prison

The icing an the cake will be two Suse 10 DVDs (x86 32 and 64 bits), and a large selection of Mono packages on an additional DVD...

Plus, more articles about Mono will follow in the next months.

16 Nov 2005 (Permalink)

Fastpath way to get rid of the dead

I've not been blogging for ages... last time I did, I was starting to work on HSSA.

Now, HSSA is progressing well, it already works in my tree, and I have a functioning dead code elimination (deadce) pass implemented that is based on it. I was testing and benchmarking this "HSSA deadce", comparing it to the existing "SSA deadce" we already have in the JIT, when I realized that I also had a third deadce implementation, which didn't use SSA at all. I'll take Patrik Torstensson's suggestion, and call it "fastpath" deadce, because the JIT path that doesn't use SSA is the fast one.

If someone wants to laugh at the "famous last words", he can have a look at my previous blog post, where I mentioned exactly this deadce pass. Anyway this time, since I was testing deadce effectiveness, it seemed a nice thing to include it in the benchmarks...

The results have been at least interesting: fastpath deadce is actually fairly effective, and it also doesn't have the JIT time overhead of the SSA version.

This table lists how long it takes to the JIT to compile "mscorlib.dll". The times are measured with "time mono -O=options --compile-all mscorlib.dll".

Time to JIT compile "mscorlib.dll", in seconds
Options-all-all,ssa (default)ssadeadcessa,deadce
JIT time1.051.35 1.321.551.451.6

From the first two columns we see that just computing SSA has about a 30% overhead with respect to a compilation with no optimizations at all.

Then we see that the overhead of fastpath deadce on a "default" compilation is almost 10%, while ssa deadce has 21% (and note that in practice, since ssa does not work with aliasing, the new fastpath deadce is applied to more methods).

So, the JIT overhead is reasonably small.

And finally, some lovely benchmark... I tried SciMark, and here are the results:

SciMark results (MFlops)
Options (default) deadce ssa
deadce
consprop
copyprop
deadce
consprop
copyprop
deadce
inline
inline deadce
inline
Composite Score 205.92 207.50 210.61 207.72 209.82 213.63 214.52
FFT 111.42 110.91 124.74 101.83 113.39 123.55 123.41
SOR 394.40 400.97 400.03 405.99 402.71 402.78 405.82
Monte Carlo 4.91 5.16 4.23 5.13 4.85 4.69 5.16
Sparse MatMult 167.77 160.82 172.33 169.99 173.70 173.47 172.30
LU 351.12 359.62 351.71 355.69 354.47 363.67 365.89

Well, we can see that deadce is doing its job! There are no miracles, but it's not bad anyway.

Another consideration: SSA deadce is still more effective than the fastpath one, likely because SSA allows for more precision in understanding the effect that statements have (but I still have to verify this, it is also possible that I missed something in the fastpath implementation). Anyway, the point of this benchmark run is that fastpath deadce has far less JIT overhead, so that we could consider it for inclusion in the default optimization set, something for which SSA does not qualify (at least for now). Of course SSA still has its use: more advanced things (ABCREM and redundant check removal in general, PRE, and in the future register promotion and escape analysis) are better done with [H]SSA. But we can see that a simple deadce pass can be effectively done without it.

And by the way, I spent a lot of time trying to be sure that those numbers are accurate. I had to kill evolution, gaim, beagle, the wireless card, and carefully monitor system load otherwise the results were erratic (just of a few percent, but this is what we're looking at). I then executed the benchmark several (>10) times for every case and took the best score for each (with the idea that it is the one where other things interfered less), and anyway all the best scores were almost identical. And of course the patch passes all standard tests (a full Mono rebuild with the optimization turned on by default, and all regression tests).

One final strange result... look at this table:

SciMark results (MFlops)
[exe compiled with a MS .NET 2.0 csc]
Options (default) consprop
copyprop
deadce
Composite Score 136.65 137.52
FFT 52.60 52.29
SOR 404.41 405.48
Monte Carlo 4.61 4.50
Sparse MatMult 140.03 143.80
LU 81.61 81.52

Well, deadce still has some effect, but... the overall performance is just awful!

Looking better, we see that FFT and LU (and to some extent Sparse MatMult) are really badly affected, and bring the score down. I already have some small test program that isolates them (csc 2.0 was also able to reveal a very subtle bug in HSSA, on SciMark LU), so I'll investigate further. The sources and binaries of the benchmark are here ("SciMark.exe" is the original executable), and anyway I got the original package here.

There is still work to do... I'd have to check the exact effect this has on inlining looking at the code produced compiling small specific benchmarks, and I'd also have to see if we can lift the restriction imposed by catch/finally clauses (the last patches by Patrik that fix the CFG should help here). This would be nice, because it would mean that this optimization would apply to all methods! For instance, Ben warned me that every "foreach" statement introduces a "finally" clause, which disables deadce and everything that needs SSA in the JIT.

Finally, if anybody wants to look at the source, it has been posted here.

12 Apr 2005 (Permalink)

Dealing with aliasing

Recently I've been working on alias analysis (also known as "points to" analysis).

Very briefly, this means understanding the behavior of pointers in the compiled code so that various optimizations can be performed also in the presence of indirect load and store operations (where "indirect" means "performed using pointers").

In the beginning, I started working on this because I wanted to implement a dead store elimination pass without using SSA. The idea was that it would have helped when we perform inlining, because we create one local variable for each argument of the inlined method. Quite often those variables could be eliminated by a dead store elimination pass if copy propagation has been done, but for now we do deadce only using SSA. It is easy to implement deadce without SSA if you have liveness information, and we always compute liveness to run the linear register allocator, so I decided to base the deadce pass on that.

I found out that our liveness code does not really take aliasing into account; this is enough anyway for how the regalloc uses liveness, but my deadce pass was causing crashes all over the place. So I begun investigating alias analysis.

The problem is that alias analysis can mean a lot of different things. There are a lot of possible approaches to the problem, and the analysis can be more or less accurate. After a while (and a few messages on the mono-devel list) we settled on the initial goal of having correct liveness information for local variables. This would allow me to do my deadce pass, and would also be enough to enable SSA also when the address of a local variable has been taken (now SSA is totally disabled in those cases). On the other hand, this also means that we explicitly do not care about pointers that point outside local variables.

I then implemented this kind of analysis, used it in a deadce pass, and posted a patch. It then turned out that my deadce didn't help so much for inlining. It was in any case better than the SSA based deadce because it had much faster compilation times (no SSA overhead), but it wasn't so urgent to commit it...

Meanwhile, I turned back to SSAPRE, because it just wasn't working right. I committed a patch that at least seems to limit the damages (the code produced for the example in my previous blog entry looks OK this time), but the bottom line is that SSAPRE, as is, is just not very effective.

Of course one way to improve the situation would be adding GVN capabilities, but looking at the code in the benchmarks the real issue seems to be that indirect operations are not handled at all. Since object fields are generally accessed through a pointer to the base of the object (plus the appropriate offset), this means that all expressions containing a field access are simply ignored by SSAPRE.

So we are back to aliasing, but this time taking into account all pointers, not just those referring to local variables. And this kind of analysis is much more complex... This is another reason why I still haven't committed the current aliasing patch: it is likely that I'll do changes to its data structures, and I if possible I'd like not to work on two different implementations at the same time.

One of the trickiest things in this kind of "full" alias analysis is giving it an SSA representation. In the end, SSA in its natural form gives unique versions to local variables, not to arbitrary memory locations. However, usually SSA based optimizations expect to work on "versioned" values... One interesting solution to this problem is an extended SSA form called HSSA. This involves several steps:

As usual, the devil is in the details.

Ignore the fact that I still don't fully understand all the consequences of throwing GVN into the mix. Let's also forget that probably representing virtual variables in Mini will be non trivial (for us a variable is just an entry in the cfg->vars array, identified by its index, but the way in which virtual variables should be identified is a bit tricky). Anyway, the real issue is the first point: how do we know where to introduce χ and μ operations?

The simple answer is: performing alias analysis!

Yes, because actually the HSSA paper shows how to fully use the results of this kind of analysis, making it work safely and effectively with SSA. How to obtain those information is a different thing. In my previous patch I actually avoided most of the problem, because I considered only local variables. Since a pointer to a local variable can be created only explicitly inside the method, it is fairly easy to track the usage of those pointers. Moreover, falling back to "all variables for which I lost track of their address are mutually aliased" is not that bad, because the situation is not frequent.

On the other hand, pointers to locations outside the method local storage are totally out of control. Since we cannot perform interprocedural analysis, we must handle all method calls as black boxes, and we cannot assume anything on input arguments as well. In this situation the most obvious technique to use is "type based" alias analysis. This basically means the following: since we know our execution environment is strongly typed, we assume that all pointers whose value is unknown can alias all pointers of the same (or compatible) type.

The problem is, this works well in Java, because in that environment the only "pointers" are references to objects. But in CIL the situation is different: not only we have object references, we also have managed pointers to "simple" memory locations (typically value types), that can be either local variables or "parts" of other objects.

To better understand what I mean, have a look at the following program:

using System; using System.Runtime.InteropServices; public struct Data { public int i; public byte b; } [StructLayout(LayoutKind.Explicit)] public struct Union { [FieldOffset(0)] public int i; [FieldOffset(0)] public byte b; } public class Test { public static void WorkData (int v, ref Data d, ref byte b) { b = 0; d.i = v; d.b = (byte) v; d.i += b; } public static void WorkUnion (int v, ref int i, ref byte b) { i = v; b = (byte) v; i += b; } public static void Main () { Data d = new Data (); Union u = new Union (); WorkData (1, ref d, ref d.b); WorkUnion (1, ref u.i, ref u.b); } }

Of course it does nothing useful, but it shows a couple of things:

As you can see, CIL is a very interesting language when dealing with aliasing: the fact that it is strongly typed allows some assumptions on pointer values, but the possibility of referencing "raw" value types gives a lot of complications...

I'll spend the following days doing two things: refining the conditions to perform type based alias analysis on CIL, and understanding the implications of the introduction of GVN into HSSA. This thing looks particularly promising to me... it should in the end give our JIT the same optimization capabilities of other mature compilers.

Of course, ideas and suggestions are always welcome... please send them to the mono-devel list!

16 Mar 2005 (Permalink)

Improving SSAPRE

One of the optimizations I've implemented lately is SSAPRE.

The problem is, this sweet beast works, and works correctly, but sometimes the code it produces is slower than the original. Needless to say, this is a Bad ThingTM. And it is also a bit embarassing...

Recently lupus sent me a benchmark that showed this behavior and was easy to understand, so I decided to have a deeper look at the issue (previously I wanted to wait until I had added GVN capabilities to SSAPRE, but now I see it's better to at least understand what's really going on).

The "meat" of the benchmark is the following method:

public void Advance(double dt) { double dx, dy, dz, distance, mag; for (int i=0; i < bodies.Length; i++) { for (int j=i+1; j < bodies.Length; j++) { dx = bodies[i].x - bodies[j].x; dy = bodies[i].y - bodies[j].y; dz = bodies[i].z - bodies[j].z; distance = Math.Sqrt(dx*dx + dy*dy + dz*dz); mag = dt / (distance * distance * distance); bodies[i].vx -= dx * bodies[j].mass * mag; bodies[i].vy -= dy * bodies[j].mass * mag; bodies[i].vz -= dz * bodies[j].mass * mag; bodies[j].vx += dx * bodies[i].mass * mag; bodies[j].vy += dy * bodies[i].mass * mag; bodies[j].vz += dz * bodies[i].mass * mag; } } foreach (Body body in bodies) { body.x += dt * body.vx; body.y += dt * body.vy; body.z += dt * body.vz; } }

I verified that if I apply SSAPRE only to this method, all things being equal the benchmark is almost 10% slower, so it's worth understanding what happens.

The issue is that looking only at the code transformations doesn't help at all, because they are all correct, and they all eliminate redundancies. It is a known thing that SSAPRE can increase the pressure on the register allocator, so I ended up verifying everything, included the resulting machine code (on x86) with and without the optimization.

One of the first things I found looking at the code is that more local variables are used. Without SSAPRE, the prologue contains "sub $0x28,%esp", and with it this becomes "sub $0x34,%esp", so 12 more bytes are used by SSAPRE for temporary variables. So yes, there is more register pressure, but just seeing it this way does not justify a 10% loss in performance.

Looking at the actual differences in the generated code is really interesting. SSAPRE gets many things right... the most obvious is the access to "bodies", which is an instance field (Advance is an instance method). At the beginning, there is a "mov 0x8(%ebp),%edi" instruction, which puts the "this" parameter in edi. This is fine, but with SSAPRE it gets better: a temporary is created to hold the address of "bodies"; "bodies" is the only instance field that is used in the method, so the "this" variable is dead, and is eliminated. This promotes the address of "bodies" to edi. In practice, a single "add $0x8,%edi" does the trick, and from then on, all accesses to "bodies" are like "mov (%edi),%eax" with SSAPRE (they were like "mov 0x8(%edi),%eax" without it). This happens 21 times.

Then there is the implicit access to bodies.Length in the "foreach" loop, which SSAPRE correctly recognizes as a loop invariant.

Another subtle transformation introduced by SSAPRE is the following: every time a "-=" or "+=" operator is used, the address of the affected location must be computed twice: one to perform the operation, and another to store the result. To make a concrete example, this means that without SSAPRE the last line of the foreach loop ("body.z += dt * body.vz") starts with a "fldl 0x18(%ebx)" and ends with a "fstpl 0x18(%ebx)". On the other hand, with SSAPRE, first of all the sequence "mov %edi,%eax; add $0x18,%eax" is inserted. Then the two other instructions should become "fldl (%eax)" and "fstpl (%eax)". Not such a big win, but not so bad after all. At least, not enough to justify a 10% loss in performance.

The problem is that things do not go that way. The line without SSAPRE is compiled like this:

fldl 0x18(%ebx) fldl 0xc(%ebp) fldl 0x30(%ebx) fmulp %st,%st(1) faddp %st,%st(1) fstpl 0x18(%ebx) While with SSAPRE it becomes like this: mov %edi,%eax add $0x18,%eax mov %eax,0xffffffc4(%ebp) mov %eax,%ecx fldl (%ecx) fldl 0xc(%ebp) fldl 0x30(%edi) fmulp %st,%st(1) faddp %st,%st(1) fstpl (%eax) Let's analyze what happens.

The sequence that is actually inserted is "mov %edi,%eax; add $0x18,%eax; mov %eax,0xffffffc4(%ebp); mov %eax,%ecx", and the two key instructions then become "fldl (%ecx)" and "fstpl (%eax)" (note the use of ecx in the load and eax in the store). What's puzzling is that the sequence between the load and the store operates entirely on the FP stack ("fldl 0xc(%ebp); fldl 0x30(%edi); fmulp %st,%st(1); faddp %st,%st(1)", and the same with ebx instead of edi without SSAPRE), so it should not affect the precalculation of 0x18(%edi). Moreover, eax and ecx are only used in that load and store, and the location 0xffffffc4(%ebp) is assigned to, but it's dead. In the IR representation (before the linearization) there's nothing that would induce this code, except for the fact that the temporary introduced by SSAPRE to hold the address "0x18(%edi)" is in fact allocated in 0xffffffc4(%ebp)" after the linear scan regalloc has run. So this seems to me some artifact introduced by the local regalloc... This happens in all the three lines of the foreach loop, exactly in the same way, and similar things happen in the six lines with a "-=" or a "+=" in the loop above.

The last disaster happens in the inner loop (the one on the "j" index). I think you can easily see in the code that index "i" is incremented twice: one in the outer loop, to increment the counter, and another at the head of the inner loop, to initialize the "j" index. SSAPRE (being picky with redundancies) doesn't like this, so it has a nice idea: it introduces a temporary variable at the head of the inner loop, to store the result of "i + 1". Then, at the end of each iteration of the outer loop, instead of incrementing "i" it reads the value from the temporary. Note that the added temporary is live across the whole inner loop, but it is (obviously) never used there, because its value is "i + 1" and not "i"; it is only used at the end of each outer loop's iteration. Of course this is not such a big win, but it should not be so bad, either...

The twist is that the global rgalloc gets confused. Without SSAPRE, it allocates "i" in esi, and "j" in edi. With SSAPRE, "i" stays in esi, but the stupid SSAPRE temporary for "i + 1" goes in edi, and "j" goes in 0xfffffff0(%ebp)! So, every time "j" is used in the inner loop, 0xfffffff0(%ebp) is used instead of simply %edi...

Put this all together, and indeed you get an almost 10% degradation in performance.

The problem is, how to get out of this?

There are two kind of solutions:

It is likely that both things should be done.

In the example above, all the redundancies that cause troubles are of this kind: they are a simple arithmetic operation (an add or a sub), they introduce one temporary to eliminate just one computation, and the operation that is eliminated wasn't a loop invariant. So, they introduce one temporary, but they don't gain that much. Chances are that, if the operands of the expression are both in registers and the temporary is on the stack, the code produced by SSAPRE would naturally be always slower.

The problem is that when SSAPRE runs it has no idea of the choices that the regalloc will do, so this cannot really be fine tuned without introducing a sort of cooperation between the PRE pass and the regalloc, which is not a short term goal IMHO. So, the only reasonable heuristic I can imagine is the one I described above. It makes loop optimizations a prerequisite for SSAPRE (otherwise it cannot tell when an expression is a loop invariant, for SSAPRE all partial redundancies are equal), but it should not be that hard to to.

Anyway, IMHO also some work on the regalloc would be necessary (I know it will be rewritten). In fact, applying the above limitation to SSAPRE in this example would be helpful only because the register pressure would be reduced. It seems to me that all the transformations performed by SSAPRE were useful in principle, and that it is mostly a bad choice of register variables that makes the code so bad. However, I'm not such an assembly expert, and I'm not saying this to defend SSAPRE, so please if some of the transformations described above was harmful in itself just let me know with a message in mono-devel.

Moreover, there is another kind of problem that must be addressed: there are a lot of obvious redundancies that SSAPRE simply doesn't see!

For instance, look at all the accesses to the "bodies" array. There are just two indexes, "i" and "j", how is it that SSAPRE cannot factor them out?

The problem is that "bodies" is not a local variable. It is a field of the "this" argument, and (to put it simply) fields are not "in SSA form". By this I mean two things:

This kind of thing is addressed by a full alias analysis (or "points to" analysis... the two terms are often synonymous, but in this case "points to" seems more appropriate).

In practice, this means tracking where pointers point to, and how they are used. In the above example, it is obvious that "bodies" is never assigned, and there are no method calls that could change it, so the array is always the same. This, besides enhancing SSAPRE (and therefore putting even more pressure on the register allocator), would also allow ABCREM to operate on the array.

I have already implemented a simple alias analysis pass, and it works (I plan to blog about it shortly). However, it is reduced in scope, as it only tracks pointers to local variables. In the (near) future I should extend this to work on fields as well.

To sum it up, four kind of things should be done to enhance SSAPRE:

Note that the first two things would prevent performance degradations, while the last two are proper enhancements.

This is all... please send comments and suggestions to mono-devel!

14 Oct 2004 (Permalink)

Memories...

Preparing for the trip to Boston, I've been thinking about my presence in the mono team, and how it happened.

I've officially been a monkey for a bit less than half a year now, but actually it all started exactly one year ago...

But, recalling memories, it started way before. I clearly remember May 7th, 2003: I went to Budapest for a three day meeting, with a coworker (that now took my place as supervisor in my previous team in Siemens). On the plane we chatted about our jobs, and our desires, and I told him plainly that the best thing that could happen me would be to be work for Ximian in the mono team. I like free software, I like studying the structure of programming languages, and the problems of building compilers, and that project was exciting for me. Only, the combination of my Siemens job and my family was too demanding to leave me the time to become a contributor...

How is it that I remember it was May 7th, 2003? The previous night, before going to bed, my wife had the feeling that something was going wrong with the baby (she was pregnant, 7th month), and/or something could go wrong with my trip. We even considered canceling my meeting but then she said "No, don't worry, it was just a feeling after all". And, seriously, I would have had a hard time canceling such a meeting without a very valid reason.

The following day I went to Budapest, and in the evening both my wife and my son narrowly escaped death at the local hospital, where she decided to go "just to make a check". The doctors had to take the baby out with a Caesarean delivery, and it is quite likely that if my wife had not been there exactly at that time, both of them would not be here now.

I could not rebook the flight, and had to enjoy the whole meeting and get home three days later.

Another "interesting" date... September 3rd, 2003: even if I knew I had no time to contribute to mono, and my current job was based on Java and not .NET, I bought Don Box's book, to start learning. How is it I remember the date? It was our 5th wedding anniversary, we were strolling near the cinema, and I sow the book in a store... I consider it my wife's gift for the anniversary :-)

Roughly one month later, I was looking at the code in mini, and browsing monologue...

The number of coincidences in this story is simply too high.

15 Sep 2004 (Permalink)

ADSL

Well, the title says it all... for the very first time, I'm blogging from home, through my newly activated ADSL line :-)

This will also mean that from now on I will be more present on IRC (well, actually "more present" sounds ridiculous, as any amount of presence will make me more present).

The downside of this is that today I lost some hour setting up things (all the thinks I needed, modem, filters, cables, a couple of ethernet over power line adaptors, and a wireless card have been stored in various places around the house for months).

I've also been hit by the problem of kernel 2.6 using IPV6 packets by default (at least for DNS, I have not understood completely what's going on there), and reverting to IPV4 after 20 seconds... At first it seemed like the connection was terribly slow. Then, after various tests assessing it at more than 550 kbps (it is an 800/320 line), I noticed the problem was a 20 seconds lag when opening pages in new sites... I googled for it, and got the solution:

echo alias net-pf-10 off >> /etc/modprobe.conf

Now the only thing that still doesn't work is the wireless card. I have solved a kernel/driver/firmware download issue (my Fedora core lacked the file for the card firmware). Now the driver seems to be working (I also see nice flashing leds on the card), but still does not "talk" to the access point, go figure :-(

I'm going to test moving the laptop near the access point this evening, and if this does not work... boh!

14 Sep 2004 (Permalink)

SSAPRE

SSAPRE, SSAPRE, everywhere,
All other things just shrink.
SSAPRE, SSAPRE, everywhere,
nothing more I can think!

I've been without blogging for a long while... (in practice, since MovableType disappeared from primates.ximian.com after the license change).

In this period I'm implementing this SSAPRE thing, which is "SSA based Partial Redundancy Elimination".

Partial Redundancy Elimination (or PRE) is an optimization that (guess what?) tries to remove redundant computations. It achieves this by saving the result of "not redundant" evaluations of expressions into appositely created temporary variables, so that "redundant" evaluations can be replaced by a load from the appropriate variable.

Of course, on register starved architectures (x86) a temporary could cost more than the evaluation itself... PRE guarantees that the live range of the introduced variables is the minimal possible, but the added pressure on the register allocator can be an issue.

The nice thing about PRE is that it not only removes "full" redundancies, but also "partial" ones. A full redundancy is easy to spot, and straightforward to handle, like in the following example (in every example here, the "expression" is "a + b"):

int FullRedundancy1 (int a, int b) { int v1 = a + b; int v2 = a + b; return v1 + v2; }

PRE would transform it like this:

int FullRedundancy1 (int a, int b) { int t = a + b; int v1 = t; int v2 = t; return v1 + v2; }

Of course, either a copy propagation pass or a register allocator smart enough to remove unneeded variables would be necessary afterwords.

Another example of full redundancy is the following:

int FullRedundancy2 (int a, int b) { int v1; if (a >= 0) { v1 = a + b; // BB1 } else { a = -a; // BB2 v1 = a + b; } int v2 = a + b; // BB3 return v1 + v2; }

Here the two expressions in BB1 and BB2 are *not* the same thing (a is modified in BB2), but both are redundant with the expression in BB3, so the code can be transformed like this:

int FullRedundancy2 (int a, int b) { int v1; int t; if (a >= 0) { t = a + b; // BB1 v1 = t; } else { a = -a; // BB2 t = a + b; v1 = t; } int v2 = t; // BB3 return v1 + v2; }

Note that there are still two occurrences of the expression, while it can be easily seen that one (at the beginning of BB3) would suffice. This, however, is not a redundancy for PRE, because there is no path in the CFG where the expression is evaluated twice. Maybe this other kind of redundancy (which affects code size, and not the computations that are actually performed) would be eliminated by code hoisting, but I should check it; anyway, it is not a PRE related thing.

An example of partial redundancy, on the other hand, is the following:

int PartialRedundancy (int a, int b) { int v1; if (a >= 0) { v1 = a + b; // BB1 } else { v1 = 0; // BB2 } int v2 = a + b; // BB3 return v1 + v2; }

The redundancy is partial because the expression is computed more than once along some path in the CFG, not all paths. In fact, on the path BB1 - BB3 the expression is computed twice, but on the path BB2 - BB3 it is computed only once. In this case, PRE must insert new occurrences of the expression in order to obtain a full redundancy, and then use temporary variables as before. Adding a computation in BB2 would do the job.

One nice thing about PRE is that loop invariants can be seen as partial redundancies. The idea is that you can get into the loop from two locations: from before the loop (at the 1st iteration), and from inside the loop itself (at any other iteration). If there is a computation inside the loop that is in fact a loop invariant, PRE will spot this, and will handle the BB before the loop as a place where to insert a new computation to get a full redundancy. At this point, the computation inside the loop would be replaced by an use of the temporary stored before the loop, effectively performing "loop invariant code motion".

Now, this is what PRE does to the code.

But how does it work?

In "classic" solutions, PRE is formulated as a data flow analysis problem. The Muchnick provides a detailed description of the algorithm in this way (it is by far the most complex problem of this kind in the whole book). The point is that this algorithm does not exploit the SSA form. In fact, it has to perform all that amount of data flow analysis exactly because it does not take advantage of the work already done if you have put the program into SSA form.

The SSAPRE algorithm, on the other hand, is designed with SSA in mind. It fully exploits the properties of SSA variables, it also explicitly reuses some data structures that must have been computed when building the SSA form, and takes great care to output its code in that form already (which means that the temporaries it introduces are already "versioned", with all the phi variables correctly placed).

The main concept used in this algorithm is the "Factored Redundancy Graph" (or FRG in short). Basically, for each given expression, this graph shows all its occurrences in the program, and redundant ones are linked to their "representative occurrence" (the 1st occurrence met in a CFG traversal). The central observation is that the FRG is "factored" because each expression occurrence has exactly one representative occurrence, in the same way as the SSA form is a "factored" use-definition graph (each use is related to exactly one definition). And in fact building the FRG is much like building an SSA representation, with PHI nodes and all. The really interesting point is that the FRG for a given expression has exactly the same "shape" of the use-definition graph for the temporary var that must be introduced to remove the redundancy, and this is why SSAPRE can easily emit its output code in correct SSA form.

One particular characteristic of the SSAPRE algorithm is that it is "sparse", in the sense that it operates on expressions individually, looking only at the specific