Massimiliano Mantione's Blog Massimiliano Mantione's Blog http://primates.ximian.com/~massi/all.html Massimiliano Mantione massi@ximian.com Fri, 05 Sep 2008 07:34:41 GMT http://backend.userland.com/rss lb# 04 Sep 2008 <h3 class="title"> Hack week fun with Mono on games! </h3> <p> This hack week I wanted to demo the use of Mono to program the logic of a game. <br/> This is an area where Mono can be really useful, and we want to increase its adoption even more (we also had a booth at the GDC <a href="http://tirania.org/blog/archive/2008/Feb-26.html">last year</a>), so a good technical demo can be useful. </p> <p> First of all I had to select a game to work on. There were a few prerequisites I had: <ul> <li> The game has to be freely redistributable, so that anybody can see the demo without needing to buy the game. So, being Free Software and multiplatform are good things in this sense. </li> <li> It must be a "real world" example, not just a toy, ideally a real commercial game: we'd like to show that Mono is useful in these kinds of projects. </li> <li> Doing the work must be feasible in one week, or anyway doable with reasonable effort: it is a <i>demo</i> I want to do, not spend one year working on a game. </li> <li> Comparing Mono to a different mainstream game scripting engine in the same context would also be nice. </li> </ul> </p> <p> As you can see, many prerequisites conflict with each other. One way to reconcile the "real world" thing and the "redistributable" one would be to choose commercial games that have been released as Free Software, like many ID Software ones. <br/> In this sense, any of the Quake 3 derivatives would be really nice. They also are conceived for being "modded", so doing some extension in C# is definitely doable. Unfortunately, the amount of work required to do (or port) a game mod to C# is (IMHO) too much. They tend to weight in at 100k lines of code, which is a bit too much for the demo I had in mind. Also the AI of a single bot (fighter) can easily exceed 20k lines of code, and what's worse is that there's no clear interface for that in the game, so drawing the line between Mono and the rest of the engine would be an additional work. </p> <p> In the end I selected the <a href="http://torcs.sourceforge.net/">TORCS</a> racing simulator. <br/> It is not a commercial game, but its level of realism is adequate for the demo: the drivers must take into account all sort of parameters to race effectively, including basic aerodynamics, ground effect, spoiler angles, tire friction and temperature, engine response, fuel load... <br/> Moreover, there is a clear API to write the AI of drivers as plugins in separated loadable modules. </p> <p> So, this is what I did: I took the TORCS driver API, wrote a <a href="http://www.swig.org/">SWIG</a> interface file for it, and used it to produce a C# wrapper. I then selected the best of the TORCS drivers, which seemed to be one of the "berniew" ones by Bernhard Wymann, with the intention to port it to C#. However, it was "too good", with 2700 real lines of fairly complex c++ code (reported by sloccount), so I selected the "next good" driver seeing how well the others raced against berniew. It seemed to me that "inferno" (apparently an evolution of "tanhoj") was a very good one, so I decided to port that: with only 800 lines of code it raced almost as well as berniew. </p> <p> The porting was easy, and now I have a working TORCS driver implemented in C#. <br/> I then sow a subtle error, similar to a difference of precision in the calculations (but likely just a porting bug), which causes stupid driving mistakes in the bot, I'm going to debug that soon. </p> <p> The next planned steps were to reimplement the same driver in Lua, and benchmark that comparing it with the C# and C++ implementations, measuring both execution time and memory usage, and also to comment on the "implementation experience" (C# vs Lua). </p> <p> However, in the process I found out that SWIG, as a tool for interfacing C# with native code, does the job but has serious deficiencies. <br/> One is that it does not work with arrays (at least the version I have, 1.3.35). <br/> Another one is that it produces <i>lots</i> of native code, and invokes it practically for everything, so even a simple field access becomes a call to a native C function (which involves a managed-umnanaged transition and all). I really believe that a wrapper generator tailored specifically for the .NET world could use the CIL much more effectively (the SWIG C# backend is almost a copy of the Java one...). <br/> Finally, SWIG encourages the use of "interface files" distinct from the native C header files of the API it is working on. While this is understandable, I think a more direct approach would be more practical. </p> <p> My impression is that the Windows crowd does not feel the pain because when they need to interoperate with native code they have managed c++. At first I just dismissed it as a "language abomination", something taking the already complex C++ and adding all sort of quirks and different ways of doing things on top of it. But now, thinking better, it is the <i>perfect</i> tool to interoperate managed and unmanaged code, and in the free software world we lack something with that power. </p> <p> I am not advocating that we implement managed C++ right away... I'm only saying that we are still using primitive tools to do this kind of job, and quite often we just say that "doing wrappers by hand is the best approach, anyway it's easy", because none of the tools we have is good enough. </p> <p> But this (and the rest of my activity on TORCS, and Mono inside games) could be the subject of my next blog post :-) </p> http://primates.ximian.com/~massi/all.html#9%2f4%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#9%2f4%2f2008+1%3a55%3a00+PM Thu, 04 Sep 2008 13:55:00 GMT 14 Feb 2008 <h3 class="title"> Hack week interlude! </h3> <p> On Monday and Tuesday I have been implementing <a href="http://primates.ximian.com/~massi/blog/archive/2008/Feb-03.html">this</a> <a href="http://primates.ximian.com/~massi/blog/archive/2008/Feb-04.html">crazy</a> <a href="http://primates.ximian.com/~massi/blog/archive/2008/Feb-04.html">idea</a>, but then I got stuck in a bad design problem. <br/> The idea is a long term thing anyway... I have never thought to be able to do something related to compiler and language design this week, I only wanted to have a flexible parsing framework, which <i>eventually</i> will be used in compilers. <br/> But even just that has proved being a bit hard. On Wednesday I was tempted to throw in the towel, and do something different (like helping with the regex engine reimplementation, which likely will be <i>the</i> hack week hit in the Mono project). </p> <p> Then, I've been fascinated by <a href="http://www.advogato.org/person/lupus/diary/25.html">this</a> post by uber-hacker <a href="http://www.advogato.org/person/lupus">lupus</a>. <br/> It's obviously a hack, but a clever and cute one. </p> <p> And it has something in common with the idea of extending compilers... so, I took a one day break to see if it was possible doing it in a different way, without changing the mcs C# grammar, but instead modifying mcs to accept arbitrary "extensions" that put into it code that is executed at compile time, and can in turn emit arbitrary code inside methods. </p> <p> Thanks to <a href="http://mareksafar.blogspot.com">Marek Safar</a>, that was ready to help me when I could not understand what mcs was doing... here is the code I can now write: <xmp class="code-csharp">using System; using Mono.CSharp.Extensions; #pragma compiler extensions enable namespace Test { public class Program { public static void Main () { IL.Ldstr ("Hello, world!"); IL.Call ("System.Console", "WriteLine", "System.String"); } } } </xmp> </p> <p> Of course, if one really wanted to insist in using IL syntax, he could write an extension that allows code like this: <xmp class="code-csharp">using System; using Mono.CSharp.Extensions; #pragma compiler extensions enable namespace Test { public class Program { public static void Main () { IL.Emit (@" ldstr ""Hello, world!"" call void class [mscorlib]System.Console::WriteLine(string) "); } } } </xmp> </p> <p> The point is, once you have hacked mcs to allow extensions, you can go wild and write your own :-) <br/> And <i>before</i> anyone tells me: <u>yes</u>, Lisp and Scheme macros are better at this. <br/> The mcs patch is minimal: the diff is 178 lines, with exactly 124 lines added, and more than one third of the patch deals with putting in place the "#pragma" to activate the detection of extensions (if you don't enable the pragma mcs behaves normally). <br/> Then, each extension has its own implementation, in an assembly clearly separated from the compiler, and which (of course!) is only needed at compile time and not at run time. For instance, the "inline IL" compiler extension looks like this: <xmp>using System; using System.Reflection; using System.Reflection.Emit; using Mono.CSharp; #pragma compiler extensions enable namespace Mono.CSharp.Extensions { public class IL { [CompileTimeInvocationAttribute("EmitLdstr")] public static extern void Ldstr (string s); public static void EmitLdstr (CompileTimeInvocation invocation, EmitContext ec) { ec.ig.Emit (OpCodes.Ldstr, ArgumentAsStringLiteral (invocation.Arguments [0])); } [CompileTimeInvocationAttribute("EmitCall")] public static extern void Call (string t, string m, string a); public static void EmitCall (CompileTimeInvocation invocation, EmitContext ec) { Type type = ArgumentAsType (invocation.Arguments [0]); string methodName = ArgumentAsStringLiteral (invocation.Arguments [1]); Type[] argumentTypes = ArgumentAsTypeArray (invocation.Arguments [2]); MethodInfo method = type.GetMethod (methodName, argumentTypes); ec.ig.Emit (OpCodes.Call, method); } ................ </xmp> The idea is the following: each call to a method marked "CompileTimeInvocationAttribute" will not be emitted. Instead, the code of the method named in the attribute will be executed inside mcs in the "Emit" pass, and it has the responsibility of emitting the needed code. </p> <p> Now, to be fair: this is a hack, and a bad one as well: extensions can use the internal mcs API, so changing mcs can break them. In practice, one is writing code that technically should be placed inside the compiler, but he is allowed to do it in a separated assembly. <br/> Anyway, it works well, it's been fun, and I've learned a lot doing so! </p> <p> And Cocos? Well, it's not dead at all. Taking the break allowed me to solve the design issue, and it's shaping up well. It still has the potential of being a good parsing framework. Particularly, it should make it very easy to re-parse only fragments of the program when they change (think when you are in an editor inside an IDE). And I am succeeding in keeping the semantic actions separated from the grammar, so that the same grammar can be reused in different tools. Only, it will not work by Friday :-( <br/> So, more hacking ahead :-) </p> http://primates.ximian.com/~massi/all.html#2%2f14%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#2%2f14%2f2008+1%3a55%3a00+PM Thu, 14 Feb 2008 13:55:00 GMT 09 Feb 2008 <h3 class="title"> Cocos: "Compiler Construction Set", part 3 (implementation approach) </h3> <p> In a couple of <a href="http://primates.ximian.com/~massi/blog/archive/2008/Feb-03.html">previous</a> <a href="http://primates.ximian.com/~massi/blog/archive/2008/Feb-04.html">posts</a> I described an idea for a new parser generator (or, better, framework). <br/> To sum it up, it has the following goals: <ul> <li>Allowing the creation of truly modular grammars, where reusing a grammar inside another has a clear result, with no risk of conflicts.</li> <li>Being able to exploit this modularity also at run time (which of course means <i>compiler</i> run time, but for dynamic languages could mean program run time), to allow extensible languages.</li> <li>Also clearly (and safely) separate the language definitions from the actions to be taken when parsing, so that the same grammar can be easily reused in different tools working on the same language.</li> <li>Have a good performance.</li> <li>Being intuitive and easy to use.</li> <li>Have <i>very</i> good support for error reporting and recovery.</li> <li>In general, being so good that it leaves you no excuses for using a different tool :-)</li> <li>Target just the .NET world (yes, it is a restriction, but I'm fine with that).</li> </ul> </p> <p> Given the above goals, I'll apply the following implementation strategies: <ul> <li> Make so that each grammar, when "ready", becomes an assembly, and is usable as such. This is extremely important for modularity, and has an interesting implication: </li> <li> the syntax used to specify grammars is <i>irrelevant</i>, what really matters is that each grammar is a software component that respects the rules of the framework. Therefore, I don't really care about inventing or using a [meta]language to express grammars, I only want to specify the framework rules. Quite likely, any reasonable .NET language (C#, Boo, VB...) will be able to produce assemblies that are valid grammars with minimal effort on the developer side, and that's OK as a starting point. </br> Therefore, a "compiler compiler tool" will not be needed in the first prototype. </li> <li> As a side point, the so called "semantic actions" (the code that must be executed when the grammar rules match) should typically be put in an assembly <i>separated</i> from the grammar one, to allow easy (and proper) reuse of the grammar in a different context. Therefore the framework should clearly specify how the interface between grammars and actions works. </li> <li> I am settling on PEGs as the theoretical foundation of the framework because it seems to me that they are the only formalism that fulfills the requirements of simplicity, power and most of all modularity of the grammars. This, of course, raises the issue of the performance problem introduced by backtracking... </li> <li> ...which I plan to address in the following ways: <ul> <li> Typically, PEG proponents advocate (and like!) the fact that with PEGs you can merge the scanner and the parser. I disagree with this, not because it cannot be done, but because IMHO it hurts more than how it helps. AFAIK, historically the lexer is a performance critical part of a compiler, because it wrestles with each and every character of the input stream. Now, consider what happens when a PEG backtracks: it has to reparse all the input section already processed by the failing rule, which also means re-tokenize it again, one character at a time... at this point it is clear why one resorts to memoization: all those token matches are saved in the memoized state. What I guess is that if we keep the traditional layering of parsers, PEGs will naturally perform better, because the scanner will do very small backtracks, and then the parser will work at the token level, so its backtracks will be much cheaper. </br> Mind you, it's not that Cocos should <i>enforce</i> this separation! It's simply that it should be a good practice. Moreover, keeping these layers separate should also keep the parsers more modular: for instance, in a language with a preprocessor of some sort (also like the C# one), it should be natural to have a grammar for the preprocessor, and on top of that the typical scanner and parser, with the three pieces working as separated (and reusable) modules in the compilation pipeline. </li> <li> The "separation of layers" described above should have another consequence related to performance. Remember that scanners are typically done with <a href="http://en.wikipedia.org/wiki/Regular_grammar">regular grammars</a>, which are much simpler (and more efficient) than full blown PEGs. Yet, if you look at a <a href="">regular expression</a>, you can almost "read it as a PEG": the two formalisms "look" similar on the surface when the described language is <a href="http://en.wikipedia.org/wiki/Regular_language">regular</a>. And of course regular languages are a subset of what PEGs can recognize. Moreover, regular expressions are composable: joining them in any way (either with alternation or in sequence) still gives a pattern with no ambiguities. Therefore I plan to have support for regexs in Cocos, either explicitly or "automatically" (in the sense that the parser generator, if it detects that a rule is in fact a regex, will produce a <a href="http://en.wikipedia.org/wiki/Finite_state_machine">FSM</a> to match it instead of using the backtracking PEG logic). </li> <li> Another way to reduce backtracking is to perform a sort of "automatic refactoring" of alternative rules when they start with the same sequence: it can happen that the language designer does not want to do the refactoring himself (maybe because that common sequence is useless as a standalone language construct, or maybe because it needs different actions), but doing it internally will avoid useless backtracking. </li> <li> Yet another thing that could be done is allowing the parser generator to recognize more "optimizable rules", and produce code that deals with them accordingly (for instance with a machine with a stack of states). If necessary, these kind of strategies can be added later. </li> <li> Finally, I plan to allow arbitrary "syntactic actions" inside rules, which means arbitrary code that can dictate if the rule succeeds or fails. The rationale is that Cocos will not <i>limit</i> the programmer in any way: the parsers it produces are recursive descent, just like the typical hand made ones, it simply wants to be a tool to take all the "cruft" out of the parser, but the idea is that the developer is simply writing code to match a language: if "custom" code is needed, so be it. In this sense, the performance problem is put in the hands of the developer (where it belongs, IMHO): if one designs a language that absolutely needs unlimited lookahead, every solution will need backtracking and perform poorly, there's nothing that a tool can do to avoid it. So, Cocos will provide ways of writing clean and modular parsers, which should perform more or less like hand written recursive descent parsers, and that's it. </li> </ul> The above was about dealing with PEG's backtracking illness, now let's move on with the implementation ideas... </li> <li> The framework should be <i>generic</i>, in the sense that its parsers will be able to operate on "symbols" of arbitrary types. This means the Unicode alphabet, or the ASCII one (why not sometimes?), or arbitrary tokens (when using a separate lexer), or anything that supports the basic requirement of being able to match what the grammar is looking for (so, any user defined type). This will allow Cocos to be used also for tree pattern matching, mostly like ANTLR does. This kind of generality should be easy to obtain basing the framework on generic types and interfaces. </li> <li> Then, the actions to be executed when a rule matches should be specifiable at will (of course!), but there should also be some support for the most common cases. For instance, if one wants to build an AST it should be possible to have it typed (likely using some kind of factory), but it should also be possible to have an untyped AST "out of the box" with little or no effort. </li> </ul> </p> <a href=""></a> <p> Another implementation issue is: it's OK that grammars should be encapsulated in assembly files, but how should the grammar be represented in practice? </br> In this sense, we have a couple of major alternatives. Since each grammar rule is, in the end, a function that can try to match an "input stream", we could directly represent rules with methods. With this approach, each rule would be the <i>code</i> that matches the language construct it describes. While this approach is interesting, it has a couple of major drawbacks: actually emitting that code is not easy before we have the framework in place, and (most important) examining rules is really cumbersome (one should parse the CIL instruction stream). </br> Another approach is to have methods that return data structures that in turn represent the rules. A project that choose this approach is <a href="http://www.codeplex.com/irony">Irony</a>, and I am planning to do something similar (one of the main differences being that Irony is not based on PEGs, so its grammars lack the composability I'd require). </br> Since PEGs do not require state tables, in the first implementation Cocos would not even require a "compiler-compilation" phase: a simple algorithm that <i>interprets</i> the rule structure will do the job. Only later, when we want to refine the system, we could write an assembly processor that optimizes rules into code (or maybe builds the FSM in the "regular language" case). </p> <p> If each rule is an object we could run into problems when initializing recursive rules (where the object graph has cycles), but the problem can be solved using lazy initialization, maybe with some help from reflection (in a way similar to what <a href="http://bracha.org/executableGrammars.pdf">Newspeak</a> does). </p> <p> The final issue that needs "sketching" is the interface between the rules and the actions to execute when they match. But, since each grammar rule will be an object, the easiest thing seems an event based interface, so that each rule can easily notify a match (or maybe also other events). </br> An event based approach could also have the side effect of providing debugging hooks for free: initially we could just implement a simple set of tracing actions, and later an application that loads a grammar and instruments it with tracing events that can potentially stop at certain conditions, and then lets the user inspect the parser state (maybe visually). </p> <p> Given all the above, Cocos is taking shape, at least on paper... now it's time to do it in practice! </br> Next week is the second Novell "hack week", so I'll spend it coding, and see what happens :-) </p> http://primates.ximian.com/~massi/all.html#2%2f9%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#2%2f9%2f2008+1%3a55%3a00+PM Sat, 09 Feb 2008 13:55:00 GMT 04 Feb 2008 <h3 class="title"> Cocos: "Compiler Construction Set", part 2 (state of the art) </h3> <p> So, let's dive into technology for modular compiler front ends... <br/> ...only to bang our head against the bottom of an empty pool! </p> <p> Let's face it, there is still nothing like that really usable in production at the moment. </p> <p> The language that comes nearer to the extensibility idea is likely <a href="http://boo.codehaus.org">Boo</a>, and I absolutely love it, but it is extensible within the syntax already recognized by its parser, while Cocos has the broader goal of providing a truly extensible parser. </p> <p> And by the way, languages of the Lisp and Scheme family do not qualify: they surely operate at the <a href="http://www.paulgraham.com/avg.html">highest possible level of abstraction</a>, but in my eyes they do so at the expense of not being "languages": they are a serialized form of the parse tree of a (hypothetical) human readable language. I mean, all those parenthesis are there to make so that code is data, and data can be handled as code, which is fine, but IMHO it makes writing code less practical. It is very elegant at the theoretical level, but I really dislike it in practice: I don't want the clutter of all those parenthesis when a machine could (and should!) do the bookkeeping for me. I want languages that bend to <i>my</i> needs, handling the code representation <i>I</i> want, whatever it is, because in practice syntactic sugar is <i>good</i> (both for readability and "wrist friendliness" when writing). </p> <p> So, how do we build an extensible and modular parser? <br/> Well, how do we build scanners and parsers in the first place? With parser generators! </p> <p> Modern production-ready parser generators typically start from the formal description of a <a href="http://en.wikipedia.org/wiki/Context-free_grammar">context-free grammar</a>, and then try to derive a parser for it. The parser can be <a href="http://en.wikipedia.org/wiki/LL_parser">LL</a>, <a href="http://en.wikipedia.org/wiki/LR_parser">LR</a>, <a href="http://en.wikipedia.org/wiki/LALR_parser">LALR</a>, or a variant of those, but what matters is that the process is <i>backwards</i>: the programmer does not specify the parsing algorithm, but the abstract definition of a set of symbol sequences for which the parser generator <i>tries</i> to produce a parsing algorithm. </p> <p> There can be ambiguities, and they must typically be handled in ad-hoc ways (or, when possible, eliminated refactoring the language description, which typically makes it uglier). Moreover (which is worse) language descriptions are not naturally composable: you cannot freely combine them and expect to obtain a parser for the "composite language" out of the box ("shift-reduce conflict" should ring a bell here!). And this kills our goal of having a generator for modular parsers. <br/> This, by the way, rules out <i>wonderful</i> frameworks like <a href="http://www.antlr.org">ANTLR</a> (which I really like, but as far as I know has no notion of "extendable parser"). </p> <p> A different formalism to describe languages is a "parsing expression grammar", or <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">PEG</a>. On the surface a PEG looks like a language description in <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">BNF</a> form, but its meaning is totally different: it is not "backwards" (the abstract description of a set of symbol sequences from which a tool will derive a parsing program), it is the <i>direct</i> description of the parsing program. Thanks to this, PEGs are composable without ambiguities, and are also very intuitive when you write them. </p> <p> This, of course, comes with a price: they need to backtrack when a "rule" fails, and this makes the matching time exponential, which is <i>bad</i>. There is a technique (<a href="http://en.wikipedia.org/wiki/Memoization">memoization</a>) to cope with this issue, but IMHO it is a bit cumbersome, and has a very high memory cost, so I do not consider it <i>the</i> solution to make PEGs usable in all contexts. </p> <p> This does not mean that all hopes are lost! PEGs look so promising (and are so <i>nice</i>) that a lot of projects are using them. There are various parser frameworks for programming languages like Ruby, Haskell and others (a good list is <a href="http://pdos.csail.mit.edu/~baford/packrat">here</a>). Some of these projects use memoization, others just accept the performance penalty, but none that I know of has found an effective way to solve the issue (and neither has goals as broad as Cocos). One interesting project is <a href="http://www.chrisseaton.com/katahdin">Katahdin</a>, which like Cocos aims at having a parser that can be changed dynamically. But probably the most interesting is <a href="http://bracha.org/executableGrammars.pdf">Newspeak</a>, which has also been presented at <a href="http://www.langnetsymposium.com">Lang .NET 2008</a>. </p> <p> In fact, I've been thinking about Cocos for far too long, framing the issue, solving design problems, gathering ideas... now I'm seeing that "the time is right", there are several projects moving in this direction, so something good should come out in the end. Newspeak in particular seems to be very well thought out (of course, IMHO). </p> <p> In the next post I'll focus on what should make Cocos unique in this landscape (otherwise it would be pointless doing it!), and how I plan to actually implement it. </p> http://primates.ximian.com/~massi/all.html#2%2f4%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#2%2f4%2f2008+1%3a55%3a00+PM Mon, 04 Feb 2008 13:55:00 GMT 03 Feb 2008 <h3 class="title"> Cocos: "Compiler Construction Set", part 1 (framing the problem) </h3> <p> In this post I'll be talking about a project I'd like to start in the near future (quite likely the next Novel "hack week"), which, like most free software projects, will come into existence because I have an itch to scratch. It is a very experimental project, and while I think my ideas are sound and useful I don't really know how it will go in the end. Nevertheless, this itch needs to be scratched! </p> <p> Particularly, I have several gripes with modern compiler technologies (and programming languages). </p> <p> Starting with languages, there's far too many of them. Not that this is a bad thing in itself: there are good reasons for it, and I understand them well. I also happen to use different languages in different contexts myself, and don't regret it at all. </br> What I dislike is the difficulty we have <i>integrating</i> them. </br> In this sense, the .NET framework is a big step in the right direction, because with its language independence it allows you to seamlessly mix and match components written in different languages. All you need is a compiler for your favourite language that can emit a .NET assembly. </p> <p> And here comes the first gripe about compiler technologies: writing compilers is still needlessly hard. Of course a full blown compiler is a complex piece of software, and I don't expect this inherent complexity to magically go away. But let's face it, most programming languages share a lot of basic constructs, and it's a shame that efery time we write a compiler we typically redo everything from scratch. </p> <p> My next gripe with programming languages integration is that until now it is typically done without mixing them <i>in the same source file</i>. And when it's done that way, only <i>one</i> compiler drives the compilation, and just embeds strings that will be evaluated at run time by different compilers (like when we embed SQL). This misses completely the integration between languages in a <i>unified</i> context, with proper type and semantic checks. </p> <p> Now, LINQ is an answer to this problem in the SQL case. But let's look at this answer: it took a lot of effort to have LINQ in the first place, requiring major compiler updates, and it solves the problem only for <i>one</i> domain: data access and filtration. So, my next gripe with compiler technology is that I would like a truly extensible compiler, so that if I feel the need to write "from some-source select some-access-code where some-predicate" I can do it right away, extending the language myself. </br> Or if I want to put there an XML literal, the XML will be parsed at compile time, its structure will be checked against the needs of the code that will work on it, and no string with the full XML will be embedded in the resulting executable at all. </br> Or, more generally, if I need a small DSL (Domain Specific Language) I can implement it with minimal effort, then write snippets of my DSL directly inside my code, and everything will just work. These language extensions should then be distributed and handled just like libraries, in the sense of software components: simply having the required assembly file available at compile time should be enough for the compiler to use the extension. </p> <p> This way the compiler would become truly modular. And this modularity should be exploited also in another direction: to avoid rewriting the <i>same</i> compiler all over again. Nowadays, if you want a full blown C# development environment, you start with a C# compiler. Then you need an IDE, which also must be able to (partially) understand the language, so you write a (possibly half assed) compiler for the IDE's needs. And then, when you need a debugger, it needs to understand language expressions, so you write another one. And if you need one more source code analysis tool... you are back to square one. </br> All of these (partial) compilers typically share nothing at all. Since they have different needs, it is perfectly OK for them to be different pieces of software, but it seems strange to me that they cannot reuse <i>anything</i>: at the very least the front ends should be a common module. </br> But this, with current compiler technology, is a dream: it's non impossible, but it would be so hard that we just don't do it. </p> <p> This is why I am focusing on compiler technology and not programming languages: I know that the current babel of languages is not going away anytime soon, even only because different people have different syntactic tastes. However, I have the suspect that the current babel is made much worse by the fact that we are not able to integrate the different languages properly, and most of all because when we have a language we cannot adapt it to our real needs, so we jump to a different one that (not being properly integrated with the rest) increases the mess. </p> <p> So, to sum it up, my main gripe with compiler technology is that compilers are not modular enough, which decomposes into: <ul> <li>compiler <i>front ends</i> are not modular, and</li> <li>compiler <i>back ends</i> are not reusable.</li> </ul> </p> <p> If we stick to the .NET world, the situation for back ends is actually not that bad. For the static case, <a href="http://www.mono-project.com/Cecil">Cecil</a> is a very good start for the low level part, and <a href="http://www.ohloh.net/projects/5319?p=Milo">Milo</a> should be perfect when it's ready. And for the dynamic case, the <a href="http://blogs.msdn.com/hugunin/archive/2007/04/30/a-dynamic-language-runtime-dlr.aspx">DLR</a> seems there to help us. Therefore, I will concentrate my gripes on compiler front end technology. </p> <p> On the next blog post... </p> http://primates.ximian.com/~massi/all.html#2%2f3%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#2%2f3%2f2008+1%3a55%3a00+PM Sun, 03 Feb 2008 13:55:00 GMT 03 Jan 2008 <h3 class="title">Comparing profiling APIs</h3> <h4>Implementation troubles</h4> <p> Lately I've had some troubles implementing <a href="http://www.mono-project.com/HeapShot">heap-shot</a>-like functionaliy in the <a href="http://primates.ximian.com/~massi/blog/archive/2007/Nov-19.html">new profiler</a>. I wanted to have it working before committing the code, but as you'll see this will take some time, so I will commit it earlier... </p> <p> The reasons for the port are, essentially, performance (heap-shot uses the Mono runtime API to walk every field of every object, and takes-releases a lock at every allocation), and having a 64-bit port of heap-shot. </p> <p> Integrating it in the new profiler would allow me to reuse all the event logging and file writing machinery that is already in place there, which is fairly efficient (or at least is supposed to be so!). However, all this machinery had some problem when applied to heap profiling. The logging profiler maintains a global hash table of mappings from <tt>MonoClass*</tt> values to numeric unique IDs. The table is used so that the external file reader does not depend on raw pointer values, and also because writing IDs take less space than writing large values. This table, however, is not kept up to date all the time (which would require taking a lock at every profiler event). Instead, each time it is necessary to flush the events to disk, the event buffers are scanned and the table is completed (and flushed!) just before writing the event block. </p> <p> This is perfect for the logging profiler, but causes lots of bad interactions when working inside the GC profiling event handlers. In this context, all the threads registered with the runtime are stopped. Therefore, if the GC hook tries to update the table, it has to take the profile lock, but it cannot because one of the stopped threads could have taken it (and, being stopped, it's not going to release it until the collection stops...). </p> <p> In the last week, trying to fix all the profiler bugs before committing, I've been dealing with various kinds of deadlocks, or troubles because the table was not complete. I managed to implement the <tt>MonoObject*</tt> buffers between the allocation and GC hooks in a completely lock-free way, and split the snapshot and file writing jobs so that the snapshot (which <i>must</i> happen inside the GC hook) does not strictly need the mapping table. <br/> However, I overlooked a basic fact: I implemented the object scanning in the heap snapshot using per-class bitmaps that describe the places where references are in the object (for efficiency), and I decided to store those bitmaps in the class-ID mapping table (because it seemed reasonable: it is a table where the <tt>MonoClass*</tt> is the key, it seemed stupid keeping <i>two</i> such tables inside the profiler!). </p> <p> Of course, this conflicts with the assumption that the GC hooks cannot use that table... of course there are workarounds for this, but after putting so may workarounds in place, I started thinking that maybe the profiler could get some more help from the runtime. After all, it is this separation of the profiler from the runtime that is causing the issue, so maybe extending the profiling API could give us a more elegant solution. </p> <p> In this context, it makes sense to see what the other two major VMs out there (the Sun JVM and the Microsoft .NET CLR) are doing. So (again, on <a href="http://www.advogato.org/person/lupus">Paolo</a>'s suggestion) I examined their profiling APIs. </p> <h4>The Microsoft CLR profiling API</h4> <p> Online resources for MS .NET profiling API are <a href="http://www.blong.com/Conferences/DCon2003/Internals/Profiling.htm">here</a> and <a href="http://msdn.microsoft.com/msdnmag/issues/01/12/hood">here</a>. But the real reference is the "Profiling.doc" file distributed with the .NET SDK. </p> <p> Comparing this profiling API to the Mono one, the first thing we we see is that they are <i>very</i> similar. In the bulk of the load-unload (and method enter-exit) events, there are just minor differences: <ul> <li>There is an event (<tt>ModuleAttachedToAssembly</tt>) to explicitly see when an image is associated with an assembly.</li> <li>There are a couple of events (<tt>JITCachedFunctionSearchStarted</tt> and <tt>JITCachedFunctionSearchFinished</tt>) that measure the time spent looking for suitable method native code for a call, but AFAIK these are not relevant to the current Mono implementation.</li> <li>There is a <tt>JITInlining</tt> event, which can also veto if inlining will actually happen on a case by case basis.</li> <li>There are remoting specific callbacks.</li> <li>There are also COM specific callbacks.</li> <li>Exception reporting is more detailed.</li> <li>They have several functions to inspect the state of things (objects, classes, domains...), or for instance to force a garbage collection, but we have the same in the Mono runtime API.</li> </ul> </p> <p> There is, however, a feature which would be really handy for me in implementing a proper heap (and CG) profiler: they have "runtime suspension" callbacks. In practice, a profiler is notified of the moment when the GC will suspend all the threads which are registered with the runtime: it is guaranteed that none of those threads will execute managed code after that callback. Of course, also the "reverse" callback is present (for when the suspend action is done). <br/> This would be really useful for my heap-shot implementation: the suspend hook would make sure that no thread still has the profiler lock before the GC starts, and when the suspend is done that the mapping table is ready. This way the GC hook could safely use the table to access the bitmaps that describe where each object has reference values. </p> <p> But the major difference related to heap profiling is the fact that the .NET CLR actually already implements a heap-shot equivalent. And, thinking about it, this really makes sense: in my implementation I am in fact duplicating some work that the GC already does, like: <ul> <li>Keeping track of all the heap objects, and iterate through them one by one: doing this in our profiler means storing <i>all</i> the <tt>MonoObject*</tt> values of allocated objects in appropriate buffers, and process them inside the GC callback, which even if done in a lock free way is anyway a overhead.</li> <li>For each object, scan all the references it contains, and to do it in an efficient way, precompute (and store!) a bitmap that describes the object structure for each class.</li> <li>Also scan all static fields for references...</li> </ul> And it's not over: they also report all the roots they find on the stacks (and registers), which our heap shot will have a very hard time doing, while the GC already does it anyway. </p> <p> So, we should decide what to do: <ul> <li>We can go on with the current API and use workarounds like a second table (managed inside the GC hook), or an auxiliary table that the writer thread will then integrate into the main one.</li> <li>We can improve the API a bit, so that the GC hook will be sure to have the table ready whan it needs it (because it will have more awareness and control on the thread suspend action).</li> <li>Or we could overhaul the GC profiling API, and implement heap-shot basically inside the GC...</li> </ul> </p> <h4>The Java profiling API</h4> <p> The online reference for the JVM profiling API is <a href="http://java.sun.com/j2se/1.5.0/docs/guide/jvmpi/jvmpi.html">here</a>. However, there has been a <a href="http://java.sun.com/developer/technicalArticles/Programming/jvmpitransition">transition</a> to the "tooling" API, <a href="http://java.sun.com/javase/6/docs/platform/jvmti/jvmti.html">JVMTI</a>, so I will only consider this last one. </p> <p> Comparing the JVMTI profiling API to the Mono one, we see many more differences, because the whole approach is different. </p> <p> First of all, there are no fast callbacks for method enter/exit and things like that. Instead, the profiler can instrument the application bytecode directly, and the profiler hooks can be implemented in managed code, to avoid the transition costs and eventually inline the callbacks if they are small enough. </p> <p> Then, there is the (complex) concept of "JVMTI context", how it must be used, and especially how references must be managed. And the JVMTI is meant to implement several kind of tools, including debuggers, so the API is naturally more comprehensive than the Mono profiler API. </p> <p> For the purpose of this comparison, it suffice to say that the heap analysis functions are really complete, and also look relatively heavy, with many callbacks and the possibility to arbitrarily tag individual objects and classes and use the tags as filters for the callbacks. </p> <p> The JVMTI reference explicitly states that they presumed that a "batch" approach (working with arrays of references) would have been more efficient (in the sense of having a better throughput), but tests proved that it's not the case, so they resorted to the more flexible callback approach. </p> <h4>In the end...</h4> <p> We should decide what to do in our profiler :-) </p> http://primates.ximian.com/~massi/all.html#1%2f3%2f2008+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#1%2f3%2f2008+1%3a55%3a00+PM Thu, 03 Jan 2008 13:55:00 GMT 19 Nov 2007 <h3 class="title">Mono and oprofile profiler news</h3> <p> Lately I've been working on the Mono profiler. <br/> Most of the work has been on implementing a new profiler, a prototype implementation can be found <a href="http://lists.ximian.com/pipermail/mono-devel-list/2007-October/025196.html">here</a> (but beware, that patch no longer works in current svn). </p> <p> To sum it up, this is not a revolution because the profiling API stays the same, it's just a reimplementation of the current default profiler. </p> <p> This is motivated by a few issues we have with the current profiler: <ul> <li> It stores all events in memory, and the events still point to the internal data structures (mainly MonoMethod* and MonoClass*) so that it cannot really work when dynamic methods are freed and-or appdomains are unloaded. For now a couple of gross hacks are in place (basically, if "mono_profiler_get_events () != 0" methods are never freed), but we should get rid of them. </li> <li> Since it stores everything in memory, its memory usage grows with time (in the statistical case we just allocate a fixed buffer in the default profiler, enough for most simple tests but not for a long running application). </li> <li> It processes all the data at the end of the process execution, so these things are not feasible-easy:<ul> <li>getting profiling data while the process is still running, and</li> <li>having the data presented better by different front ends.</li></ul> </li> </ul> <p> The main differences of this new profiler: <ul> <li> It does not "process" the profiling events, it just logs them into a file. They are as packed as possible (a few bytes per event). </li> <li> Therefore: <ul><li>it can "survive" when data structures are freed, and</li> <li>one can write front ends that examine the log and show the data without being tied to the Mono runtime and without having to wait for the process to end,</li> <li>but... it will produce <i>large</i> log files when running for a long time in non-statistical mode</li></ul> </li> </ul> </p> <p> These days I'm working on a managed implementation for the decoding library (which processes the files produced by the profiler), I will show it in a BOF at the <a href="http://www.mono-project.com/MonoSummit2007">Mono Summit 2007</a>. </p> <p> But the real topic of this post is another profiler related work: we can finally integrate <a href="http://oprofile.sourceforge.net">oprofile</a> with Mono! <br/> Specifically, managed method names will show up in the opreport output, as shown in this screenshot of the oprofile-viewer GUI: <img src="http://primates.ximian.com/~massi/blog/archive/2007/Nov-19/OProfile-Viewer.png" alt="OProfile Viewer" align="middle"/> </p> <p> One nice aspect of oprofile is that it works automatically, without requiring instrumentation of the application code. And it can profile the whole <i>system</i>, including the kernel, so one can have a global view of how the application spends its time. Moreover, it can also profile things like cache misses, branch predictors behavior, and in general everything that's measured by the CPU performance counters. </p> <p> The only thing is needs to provide a meaningful output is that the elf files of the profiled programs provide symbol names for their functions... which obviously is not true for jitted code. <br/> However, there <i>is</i> interest is using oprofile also for jitted code, so to cope with this, the oprofile developers put in place a solution: generate fake elf files for the anonymous memory regions where jitted code resides ("anonymous" because they are not related to any file on disk), so that the oprofile reporting tools will find the symbols they need in these files, and work unmodified. Of course, these symbol names must come from the JIT itself. For this purpose, oprofile now exports a very simple API, and a library that contains the functions that the JIT must invoke to emit the symbols. Their work was mostly oriented to Java VMs, but the discussion on the oprofile mailing list kept it VM-neutral. </p> <p> With time, this code in oprofile has matured, and at the beginning of October they created a CVS branch that tracks these efforts, so that now it has been very easy to fetch it and make it work with Mono. There were rough edges, but thanks to Philippe Elie and Maynard Johnson on the oprofile IRC channel we sorted them out quickly, so now it works! <br/> The only downside is that call graphs are not supported for now. I won't enter in the details, but it will take a while to have them working (and BTW, they will likely work more for Mono than for Java, because due to the mixed JIT-interpreted nature of the JVM they don't really have complete stacks that oprofile can examine). </p> <p> So, the oprofile code is already available in their "JIT_SUPPORT" cvs branch; I will likely commit the Mono code in the following days or during-after the summit... but for the truly impatient, <a href="http://primates.ximian.com/~massi/blog/archive/2007/Nov-19/mono-profiler-oprofile.c">here</a> it is! </p> <p> As you can see, it's really, <i>really</i> easy... enjoy :-) </p> http://primates.ximian.com/~massi/all.html#11%2f19%2f2007+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#11%2f19%2f2007+1%3a55%3a00+PM Mon, 19 Nov 2007 13:55:00 GMT 29 Oct 2007 <h3 class="title">Unforeseen absence</h3> <p> During last weeks my son's health got really bad. He had been suffering due to his adenoids in the last year, but we wanted to avoid surgery if possible... <br/> The situation got better during summer, but now it is not bearable anymore. Among other things, during the night at times he simply cannot breath at all... so the doctors (and we) decided that he'll have surgery urgently, to take his adenoids and tonsils off. </p> <p> For reasons described <a href="http://italiasalute.leonardo.it/news.asp?ID=6900">here</a> (sorry, it is in Italian, it essentially describes a new and better way to perform that surgery) we will go <a href="http://www.gaslini.org">here</a>, which is about 200 Km from where we live. </p> <p> Therefore, I will simply disappear for a few days... and <i>hopefully</i> this will solve my son's problems! </p> <p> <b>Updated November 1<sup>st</sup>, 2007</b> <br/> We are back from the hospital, and everything went well. <br/> For the first time since more than one year, Michele's oxygen saturation is at 99% most of the time, and absolutely <i>never</i> below 95% (before he had frequent apnea periods when the saturation dropped below 70%). The convalescence is rather painful for him, which is not a nice thing to watch, but it is really worth it... and this time we know that in a week everything will be over! </p> http://primates.ximian.com/~massi/all.html#10%2f29%2f2007+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#10%2f29%2f2007+1%3a55%3a00+PM Mon, 29 Oct 2007 13:55:00 GMT 10 May 2007 <h3 class="title">Trimming VTables in the Mono runtime</h3> <p> As everybody knows, a "<i>VTable</i>" (short for "<i>virtual table</i>") is a data structure used to lookup methods when they are called using dynamic binding (which in Java, C++ and .NET is called doing "<i>virtual calls</i>"). <br/> In C++ they are generated by the compiler, while in Java and .NET they are managed by the runtime. </p> <p> Whet is less known is that Java and .NET, allowing multiple inheritance from interfaces (in the sense that every class can <i>implement</i> several interfaces), make building VTables a bit tricky. </p> <p> The problem is not listing all the methods which a class implements and can be called with a CALLVIRT opcode, which is easy. The real issue is that the CALLVIRT opcode, as implemented in the JIT, must find the required method <i>fast</i>: nobody wants method call operations to be inherently slow. </p> <p> It's here that the multiple inheritance issue shows up: typically, a VTable is just an array of function pointers, and each method has its slot. The CALLVIRT opcode must produce a call instruction that picks the right slot. If the <i>this</i> pointer is typed (statically, at compile time) as a class, things are easy, because the slot positions are kept constant along the inheritance hierarchy (just like the layout of fields), thanks to the single inheritance between classes. <br/> For instance, if method M of class C has slot 3, it will have slot 3 in every class derived from C, so the CALLVIRT opcode just needs to access the VTable at that slot and the correct method will be called. </p> <p> Now, consider interfaces: method M in interface I could be implemented at slot S1 in class C1, slot S2 in class C2... in practice, it could be in a different slot in every class that implements I. Therefore, the CALLVIRT opcode must use some sort of lookup mechanism to find the right slot for any given instance of a class that implements I. </p> <p> In <a href="http://www.research.ibm.com/people/d/dgrove/papers/oopsla01.html">this</a> paper the issue is explored in detail, and several solutions are explained. In any case, the problem is the tradeoff between time and space: you can have very fast VTable access (constant time, like array accesses), or you can have small VTables, but you cannot have both. </p> <p> The solution used in Mono until now is conceptually similar to the "Selector Indexed Tables" described in the paper (even if very different in practice). For each interface we generate a unique numeric ID at runtime (let's call it UIID). In the VTable, beside the space for the class methods, we reserve one "region" for each implemented interface, where the pointers to the interface methods are replicated; inside each of these regions the pointers have exactly the same order as the method declarations in the interface itself. Then we have, beside the proper VTable, a table that maps, for each UIID, the offset of the corresponding "interface region" inside the VTable (we call this table "interface_offsets"). </p> <p> This way, the CALLVIRT opcode must access interface_offsets using the UIID to find the right VTable region (simple array access), displace it with a constant corresponding to the method slot in the interface (a simple addition), and the method is found. </p> <p> The downside is that "interface_offsets" is an array, and it must be as large as the highest UIID of the interfaces implemented by the class (and is then typically full of "holes", unused slots corresponding to UIIDs of unimplemented interfaces). </p> <p> As Miguel noted, the use of generic types makes the number of interfaces in the runtime exceptionally high. Every class which implements even just one interface is then forced to have a very large "interface_offsets" table, only because its interface happens to have a high UIID. The more interfaces are generated instantiating generic types, the higher the UIIDs... In an application like Banshee, the runtime ends up allocating one Mbyte of RAM only for tables which are mostly empty, to support the execution of one Mbyte of jitted code. </p> <p> And the situation is even worse than that! <br/> Actually, the runtime maintains <i>two</i> kinds of VTables: one for "metadata handling", inside MonoClass, which is an array of MonoMethod pointers, and another for the JIT, which is MonoVTable and which contains the actual pointers to the machine code (and which for this reason is usually specific to each application domain). Needless to say, <i>both</i> tables need their related "interface offset mapping" table, even if the "mono_stats" only keep track of the memory allocated for the second kind of VTables (the "proper" ones, MonoVTable). So interface offset tables waste their space twice. </p> <p> To put a remedy to this, we decided to implement IMT (Interface Method Tables), as described in the paper linked above. Briefly, IMT replaces "interface_offsets" with a small fixed size hash table. Note that the key hash functions are computed at <i>JIT</i> time, and not at <i>run</i> time, so method calls, in the absence of collisions, are roughly just as fast as with the current implementation. <br/> Of course the tricky part is handling collisions. About this, one must realize that the IMT contains pointers to code, not data, and it is used as a jump table by the JIT. So, while an hash table, as a data structure, handles collisions with <i>lists</i>, the IMT handles them with <i>thunks of code</i> which have the capability of resolving the collision, and which the runtime must generate when it fills up the IMT slots. </p> <p> In a preparation for implementing this, in current SVN the "interface_offsets" field is gone from MonoClass (which means, the "metadata handling" virtual tables do not waste memory anymore). It has been replaced by a pair of compact arrays (which have the same function), and a bitmap used for "ininst" checks by the JIT. </p> <p> The next step is actually implementing IMT in MonoVTable, which I'll do in the following weeks... </p> http://primates.ximian.com/~massi/all.html#5%2f10%2f2007+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#5%2f10%2f2007+1%3a55%3a00+PM Thu, 10 May 2007 13:55:00 GMT 09 May 2007 <h3 class="title">I acknowledge that I have stayed...</h3> <p> ...without blogging <i>technical</i> content far too long! </p> <p> This is going to change <u><i>soon</i></u> (I mean, these days... crap, seeing Miguel sort of <a href="http://article.gmane.org/gmane.comp.gnome.mono.devel/23347">forced to explain</a> what I'm doing because I am not blogging is really shameful). <br/> Meanwhile, I announce a couple of Mono events here in Italy: I'll be present in <a href="http://www.openlabs.it">OpenLabs</a> on <a href="http://www.openlabs.it/?q=node/223">May 21st</a>, and Novell organizes a <a href="http://www.novell.com/global/italia/eventi/monoday.html">Mono Day</a> on May 23rd. </p> <p> Both events can be a good occasion to meet some friend! </p> <p> My next blog entry will be about <a href="http://www.research.ibm.com/people/d/dgrove/papers/oopsla01.html">IMT</a>, and then I'll explain what's going on with the new regalloc... </p> http://primates.ximian.com/~massi/all.html#5%2f9%2f2007+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#5%2f9%2f2007+1%3a55%3a00+PM Wed, 09 May 2007 13:55:00 GMT 12 Feb 2007 <h3 class="title">This period shows no sign of stopping...</h3> <p> <u>...but <i>I</i> should be over it</u> :-) </p> <p> This blog entry is mostly for all my friends that sow me "disappear" suddenly, from IRC and the web in general. </p> <p> To say it flatly, I've been a bit "burn out". <br/> I had too many things to face, all in a row, I did some mistake and they all summed up, so that in the end I lost a lot of energy... <br/> I't pointless making a list of what happened since September (it would be too long, really!), but it involves: <ul> <li> A long series of illnesses of my son: he has been able to go to kindergarten only four weeks in all these months, the rest of the time he was mostly ill at home, otherwise ill at the hospital (not just days, but weeks at a time)... anybody who has kids knows what kind of energy drain this can be. </li> <li> Some work-related stress, plus the fact that, while I was in Boston, <i>both</i> my wife and my son got ill (she had pneumonia), and they had nobody that could help them (all four of our parents were travelling!), so I was not really in a relaxed mood... </li> <li> Also, my mother died on <a href="http://primates.ximian.com/~massi/blog/archive/2006/Dec-01.html">December 1st</a>, and while I took it well, it has not been relaxing, either... </li> </ul> </p> <p> Of course I <i>know</i> there are guys in situations much worse than mine. <br/> Anyway, the major symptom of my loss of energy was that my working performance gradually dropped to zero. Which means, I tried to get things done, but had no results. </p> <p> I finally decided to have a period of <i>pure</i> vacation, and generally take care of myself and get some rest. <br/> The vacation helped a lot, in spite of the fact that, again, both my son and my wife were ill, so I didn't rest that much. But what helped me the most, ironically, is that <i>I</i> got ill last Friday! Whoever knows me also knows that I seldom feel bad (in the last fifteen years or so it happened three times, and before I hardly remember the details...), but this time it has been useful: mostly thanks to my wife, I spent three days literally sleeping, resting, and doing nothing at all! </p> <p> So, now I feel better... today I am still officially in "resting mode", but I turned on the notebook anyway, and started working off my mail backlog. <br/> The only sore point is that my son is still ill, and yet <u>nobody</u> understands what he <i>really</i> has. From time to time he seems to heal, the remedies seem to work, but then his situation regularly gets back to bad. This will keep us busy, and obviously a little worried, but at the very least <i>I</i> feel better :-) </p> http://primates.ximian.com/~massi/all.html#2%2f12%2f2007+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#2%2f12%2f2007+1%3a55%3a00+PM Mon, 12 Feb 2007 13:55:00 GMT 01 Dec 2006 <h3 class="title">These things are part of life.</h3> <p> Today, in the early morning (2am), my mother passed away. <br/> We already had a sort of miracle <a href="http://primates.ximian.com/~massi/blog/archive/2006/May-19.html">before</a>, but these things can only be postponed, not avoided. </p> <p> Anyway, given that we cannot decide <i>when</i> these things happen, we should care about the <i><u>how</u></i>, which is IMHO much more important. <br/> In this sense, I cannot imagine how it could have been better: my mother has been serene until her last breath, she was conscious, she could recognize us, and we had the chance to stay near her all these days and nights, here in Sicily. And, even more important, she was serene <i>knowing</i> what was happening to her, and she just accepted that as a fact of life. She really passed away "in God's Grace", which is marvelous: we could never ask for more. </p> <p> Of course, faith helps: it helped her, and it helps me and my wife. And praying gives that kind of peace that you generally cannot get anywhere else. Also, I know I'll meet her again... it's not that I "believe" it, I know it just as I "know" that I have two hands, or that the sun will rise tomorrow. And she already gave us signs of her presence. </p> <p> In the end, I want to thank everybody that stayed close to us. It's good feeling that we are not alone. </p> <p> We will give her the last farewell tomorrow, and her body will rest here, in the village where she was born. <br/> We will likely come back home (I mean near Milan) at the end of the following week, all together (me, my wife, my son and my father). </p> http://primates.ximian.com/~massi/all.html#12%2f1%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#12%2f1%2f2006+1%3a55%3a00+PM Fri, 01 Dec 2006 13:55:00 GMT 25 May 2006 <h3 class="title">Chasing elusive JIT bugs...</h3> <p> %*@#! <br/> Back from Sicily, I debugged the "last" issue related to deadce, but as soon as I committed the fix and re-enabled the five options described <a href="http://primates.ximian.com/~massi/blog/archive/2006/Mar-27.html">here</a>, more amd64 specific issues surfaced. </p> <p> The problem is that I am <i>not</i> able to reproduce the bugs in any way on the 64 bit machine I have access to. <br/> I really need the bug reporters to help me in finding the problem. </p> <p> But, since instructions on how to isolate the source of JIT problems without having JIT knowledge could be interesting to many, instead of just mailing them privately I'll publish them here (and then copy them to the wiki, too). <br/> The only knowledge that is assumed is that you can rebuild mono from source, you have some C programming knowledge so you know how <tt>#define</tt> works, and you can search the code with grep. </p> <p> So, assume you have a test case (even a whole program) that always fails with a specific JIT option (or combination of options), and always works without that option. </p> <p> The first thing to do, when a JIT option causes trouble, is identifying one <i>single</i> method that is compiled incorrectly when the "bad" option is enabled, and correctly when it is disabled. This method could be the top one in the stack trace in case of crash, but often it is not so, and anyway to avoid wasting time debugging correct code it would be nice to have a way to tell for sure if you have really found the right method. </p> <p> For this purpose, I often instrument the JIT so that it applies certain options only to single methods, or anyway to a subset of the program's methods (by the way, to get a list of the compiled methods it is enough to execute the test giving one <tt>"-v"</tt> flag to <tt>"mono"</tt>, and grepping for <tt>" emitted at "</tt>). </p> <p> Inside the mini directory, you can grep for macros like <tt>"MONO_APPLY_DEADCE_TO_SINGLE_METHOD"</tt>, <tt>"MONO_APPLY_TREE_MOVER_TO_SINGLE_METHOD"</tt>, <tt>"MONO_APPLY_SSAPRE_TO_SINGLE_METHOD"</tt>, <tt>"MONO_INLINE_CALLED_LIMITED_METHODS"</tt>, and <tt>"MONO_INLINE_CALLER_LIMITED_METHODS"</tt> (these last two only since r61101). Each of them, when true, enables code that compares the current method full name with what's found in a specific environment variable, and applies the option only if the strings match. </p> <p> The macros with "single" expext an exact match, while the ones with "limited" have a "starts with" behavior, which is more useful when searching "manually" (I sometimes use perl scripts to automate the search for the failing method), but anyway it is really easy to change this behavior. </p> <p> So, suppose that <tt>"-O=inline"</tt> causes a problem. You could set <tt>"MONO_INLINE_CALLER_LIMITED_METHODS"</tt> to 1, rebuild, and start running the test setting the environment variable <tt>"MONO_INLINE_CALLER_METHOD_NAME_LIMIT"</tt> initially to "System", to see if the failing method is in the System namespace, and then refining the search. <br/> Keeping a <i>sorted</i> list of all the methods that were jitted when the test passed can be useful to direct the search. Note that I said "when the test passed", and not "failed", because the failure could have cut the logs and made the list incomplete... </p> <p> At some point, you'll eventually find one single method that changes the test result. Then, the best thing you can do is running the test with lots of logs enabled (ideally put the <tt>"-v"</tt> option five times on the command line, and enable all the relevant debugging macros like <tt>"DEBUG_DEADCE"</tt> or <tt>"DEBUG_ALIAS_ANALYSIS"</tt>). </p> <p> Be warned that the resulting logs can be huge! <br/> There are two basic ways to produce smaller logs: just compile the failing method with <tt>"mono --compile ..."</tt>, or put an "if" at the beginning of <tt>"mini_method_compile"</tt> that sets <tt>"cfg->verbose_level"</tt> to 5 if the method name matches (at that point you can just hard code the string in the source!). <br/> I prefer the second method because it does not alter the test run (so that if the error was dependent on something else, like having a class constructor already executed, you are in the same condition, otherwise you could end up chasing ghosts). </p> <p> When you have those logs, please send them to me, they are generally enough to understand what went wrong :-) </p> http://primates.ximian.com/~massi/all.html#5%2f25%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#5%2f25%2f2006+1%3a55%3a00+PM Thu, 25 May 2006 13:55:00 GMT 19 May 2006 <h3 class="title">Back from Sicily.</h3> <p> As somebody knows, I've made a trip to Sicily lately... and I'm back today. It's not been exactly a "vacation"... basically, my mother is badly ill (a tumor, the doctors clearly said that she will not recover, and explained us how she will likely die, they just don't know when). <br/> She was born in Sicily, and wanted to see her brother, her sister, and many other persons there... so she and my father made a trip there, and I followed them with my wife and my son. </p> <p> The result has been... probably some of the happiest days we ever had together! And the nice thing is that she also feels much better. </p> <p> Now, <i>how much</i> better is difficult to say... certainly we've all seen her doing things we never thought she could do again. <br/> I <i>do</i> believe in miracles, but before saying something is really a miracle I must have hard facts supporting it. Next Tuesday we will have the results of some medical exams she had done a few days ago, then we will see what's really going on there... </p> <p> Until then, I'm just really happy of the time we had. <br/> Even only this is already a miracle! </p> http://primates.ximian.com/~massi/all.html#5%2f19%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#5%2f19%2f2006+1%3a55%3a00+PM Fri, 19 May 2006 13:55:00 GMT 13 Apr 2006 <h3 class="title">Chasing a ghost, and Microsoft JIT inlining behavior</h3> <p> Finally, the tree mover (and all the optimizations described <a href="http://primates.ximian.com/~massi/blog/archive/2006/Mar-27.html">here</a>) is in svn, but that optimization set is not enabled by default. </p> <p> The problem (which showed up as soon as I committed the patch) is that inlining breaks several pieces of code, most notably <a href="http://www.monodevelop.com/Main_Page">MonoDevelop</a>. </p> <p> All the issues are related to calls to <tt>MethodBase.GetCurrentMethod()</tt>, <tt>Assembly.GetExecutingAssembly()</tt>, <tt>Assembly.GetCallingAssembly()</tt> and any of the <tt>StackFrame</tt> constructors (thanks to <a href="http://www.advogato.org/person/lupus/">lupus</a> for this list!), and are related to the fact that inlining a method alters the chain of stack frames. </p> <p> To fix this in a reliable way I wrote a small program to use as regression test, you can find it <a href="http://primates.ximian.com/~massi/blog/archive/2006/Apr-13/">here</a>. It <i>must</i> be split in two assemblies otherwise most of the tests don't make sense, but it is really easy to understand. <br/> It also immediately shows the problems: with Mono from current svn, executing with "<tt>-O=-inline</tt>" prints five "<tt>PASSED</tt>" messages, which turn into five "<tt>FAILED</tt>" ones with "<tt>-O=inline</tt>". </p> <p> So far, so good... Initially it seemed that a good solution would be checking if any of those methods was called, and if it was, refuse to inline the caller. However, after a few trials and errors, I found out that the proper fix is inlining only <i>leaf</i> methods (I mean, methods which don't call any other method), with the exception of internally generated wrappers, which can be inlined safely. <br/> The rationale is the following: suppose you have a caller method CM, in assembly A, and a small called method SM, in assembly B. SM, then, calls another large method, LM (which for simplicity we think will never be inlined), which in turn calls for instance <tt>Assembly.GetCallingAssembly()</tt>. Now, if SM is inlined, it looks like if LM is called from CM (from assembly A), and not from SM (from assembly B): this makes so that the return value of <tt>Assembly.GetCallingAssembly()</tt> inside LM changes. <br/> Note that inside SM there is nothing hinting for this, just a call to a method (LM). Finally note that this is true also if LM is a constructor, which means that also methods which call constructors cannot be inlined (which means any method that creates objects!). </p> <p> To be fair, there is another reasoning we can follow which allows us to inline more methods: it is not the fact that a method merely <i>contains</i> a <tt>call</tt> or <tt>new</tt> CIL opcode that should prevent inlining. Instead, it is the fact that in the jitted code there's a call to potentially unknown code (either methods or constructors), which in turn could depend on the "correctness" of the chain of stack frames. This way of reasoning would allow us to inline nested calls, as long as none of them ends up calling a not-inlined method (or an inlined wrapper, because it <i>will</i> in the end call another piece of unknown code!). </p> <p> But, regardless of the strategy we'll choose, fixing this is not hard, and in fact I did it quickly. <br/> But even when in the regression test program I got all tests passing, the MonoDevelop splash screen was still not appearing, and scary warnings were logged every time I had inline enabled... To make a long story short, I wasted way too much time chasing a ghost: I was convinced that inline was doing something wrong (the problem was there <i>only</i> when inline was enabled), and I could not find what it was. <br/> Then, trying to reproduce the problem with a smaller program, I got it: inline was doing nothing wrong, it was just exposing to the JIT the load of a static field a bit earlier. This static field was the SplashScreemForm singleton in MonoDevelop, and since the class is "beforefieldinit", the JIT initialized the field on the spot. SplashScreemForm is a Gtk.Window, so it got created before the GTK# initialization, and found itself without a display... bad thing. <br/> However, the proper fix is just adding an empty static constructor to SplashScreemForm, so that it gets the strict field init semantics. The last line logged by the sample program I linked above shows exactly this kind of behavior. <br/> And anybody making a GTK# Window a singleton should take note to remember to remove the "beforefieldinit" from the class, and access the singleton only <i>after</i> the GTK# Application has been initialized (I did the same mistake trying to reproduce the bug!). </p> <p> So, in the end we'll have a JIT that always respects the structure of the call stack, but before you write code that needs this, think twice: the Microsoft runtime is not so strict ;-) <br/> If you execute the test program on FX 1.1 or 2.0, you'll see that both of the <tt>Assembly.GetCallingAssembly()</tt> related tests fail (thanks to <a href="http://pages.infinit.net/ctech/poupou.html">Sebastien</a> for running the tests for me). The code pattern reproduced in the second one (in constructors) is sometimes used to load images from resources when building graphical objects, so code like this could work on Mono, and be broken on Windows... </p> http://primates.ximian.com/~massi/all.html#4%2f13%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#4%2f13%2f2006+1%3a55%3a00+PM Thu, 13 Apr 2006 13:55:00 GMT 27 Mar 2006 <h3 class="title">Almost Easter, and still moving trees around...</h3> <p> Lately I've been mostly working on the "<i>tree mover</i>", which is described in <a href="http://tirania.org/blog/archive/2006/Jan-24.html">this</a> post by Miguel. <br/> As you can see, it is not an optimization by itself, it is a code transformation necessary to help our current register allocator, and in fact the plan is not to use it alone, but to help inlining. In practice, to get the full benefits <i>five</i> more optimizations must be enabled starting from the current default in the Mono JIT: <ul> <li><u>Inlining</u>: What this does is obvious, it replaces method calls with the actual method code, "expanded inline". It has been present in the JIT for years, but never enabled because it generally produced worse code. I tuned it to be applied only for small and extremely simple (branchless) methods, but it still introduces many temporary variables... </li> <li><u>Copy propagation</u>: This is the first step in eliminating unneeded variables: if a value is copied from A to B, then each time B is used copy propagation modifies the code to use A instead (of course, only until either A or B is assigned again!). This typically has the effect of producing <i>dead</i> variables (which are assigned but never used). </li> <li><u>Constant propagation</u>: Similar to copy propagation, but applied to constant values. </li> <li><u>Tree propagation</u>: This is a code transformation that eliminates some variables completely from the JIT internal code representation. With the current register allocator, this means that their values will not be seen by the global regalloc, and will be handled directly by the local one. </li> <li><u>Dead code elimination</u>: <a href="http://primates.ximian.com/~massi/blog/archive/2005/Nov-16.html">This</a> optimization eliminates dead assignments (most of them have been produced by the previous passes). </li> </ul> </p> <p> As you can see, all these passes are meant to work together in a "pipeline". After a lot of hammering on them, I have seen concrete results on some real and complex benchmark, like XMLMark. <br> Working on this benchmark, first of all I modified the ini files, to be able to see exactly what the JIT was affecting; so I created five ini files, one for each XMLMark functionality (each functionality gets weight 100, and all the others stay at 0): <ul> <li><u>DOM-SEL</u>: DOM parser, element selection. <li><u>DOM-MOD</u>: DOM parser, content modification. <li><u>DOM-STR</u>: DOM parser, structure modification. <li><u>DOM-SER</u>: DOM parser, serialization. <li><u>SAX-SEL</u>: SAX (pull) parser, element selection. </li> </ul> I also had to set the number of agents to 1 to get reliable results (the multithreaded nature of XMLMark was getting in the way, and after all I was interested just in the effects of JIT settings). </p> <p> Then, I tested with a lot of option combinations: </p> <TABLE border="1"> <CAPTION><EM>Options</EM></CAPTION> <TR> <TH>Label</TH><TH>Options</TH> </TR> <TR><TD align="left">ALL</TD><TD align="left">all</TD></TR> <TR><TD align="left">ALL-NO-SSA</TD><TD align="left">all,-ssa,-ssapre,-abcrem</TD></TR> <TR><TD align="left">FAST</TD><TD align="left">consprop,copyprop,treeprop,deadce,inline</TD></TR> <TR><TD align="left">NO-DEAD</TD><TD align="left">consprop,copyprop,treeprop,-deadce,inline</TD></TR> <TR><TD align="left">NO-INLINE-DEAD</TD><TD align="left">consprop,copyprop,treeprop,-deadce,-inline</TD></TR> <TR><TD align="left">NO-INLINE</TD><TD align="left">consprop,copyprop,treeprop,deadce,-inline</TD></TR> <TR><TD align="left">NO-INLINE-TREE</TD><TD align="left">consprop,copyprop,-treeprop,deadce,-inline</TD></TR> <TR><TD align="left">NO-PROP</TD><TD align="left">-consprop,-copyprop,-treeprop,deadce,inline</TD></TR> <TR><TD align="left">NO-PROP-INLINE</TD><TD align="left">-consprop,-copyprop,-treeprop,deadce,-inline</TD></TR> <TR><TD align="left">NO-TREE-DEAD</TD><TD align="left">consprop,copyprop,-treeprop,-deadce,inline</TD></TR> <TR><TD align="left">NO-TREE</TD><TD align="left">consprop,copyprop,-treeprop,deadce,inline</TD></TR> <TR><TD align="left">NOTHING</TD><TD align="left">-consprop,-copyprop,-treeprop,-deadce,-inline</TD></TR> </TABLE> <p> Of these combinations, "NOTHING" is the current JIT default, "FAST" is the full set I'm testing (which ideally will become the default), "ALL-NO-SSA" means "everything the JIT can do without using SSA", and all the "NO-<i>something</i>" are the "FAST" set without something, to see if all the options are really necessary to get the best results. </p> <p> These are the numbers I got: </p> <TABLE border="1"> <CAPTION><EM>XMLMark results, in transactions per seconds</EM></CAPTION> <TR> <TH>XMLMark</TH><TH>DOM-SEL</TH><TH>DOM-MOD</TH><TH>DOM-STR</TH><TH>DOM-SER</TH><TH>SAX-SEL</TH> </TR> <TR> <TH align="left">ALL</TH><TD align="right">53.40</TD><TD align="right">50.20</TD> <TD align="right">50.95</TD><TD align="right">54.60</TD><TD align="right">15.40</TD> </TR> <TR> <TR> <TH align="left">ALL-NO-SSA</TH><TD align="right">53.40</TD><TD align="right">50.50</TD> <TD align="right">51.10</TD><TD align="right">55.75</TD><TD align="right">15.25</TD> </TR> <TR> <TH align="left">FAST</TH><TD align="right">53.75</TD><TD align="right">50.85</TD> <TD align="right">51.45</TD><TD align="right">56.00</TD><TD align="right">15.95</TD> </TR> <TR> <TH align="left">NO-DEAD</TH><TD align="right">52.50</TD><TD align="right">49.85</TD> <TD align="right">50.05</TD><TD align="right">54.10</TD><TD align="right">13.15</TD> </TR> <TR> <TH align="left">NO-INLINE-DEAD</TH><TD align="right">47.85</TD><TD align="right">45.45</TD> <TD align="right">45.10</TD><TD align="right">49.20</TD><TD align="right">13.10</TD> </TR> <TR> <TH align="left">NO-INLINE</TH><TD align="right">50.15</TD><TD align="right">47.05</TD> <TD align="right">46.25</TD><TD align="right">51.20</TD><TD align="right">15.45</TD> </TR> <TR> <TH align="left">NO-INLINE-TREE</TH><TD align="right">50.85</TD><TD align="right">47.55</TD> <TD align="right">47.85</TD><TD align="right">52.15</TD><TD align="right">15.35</TD> </TR> <TR> <TH align="left">NO-PROP</TH><TD align="right">53.05</TD><TD align="right">50.55</TD> <TD align="right">50.45</TD><TD align="right">54.70</TD><TD align="right">15.45</TD> </TR> <TR> <TH align="left">NO-PROP-INLINE</TH><TD align="right">49.60</TD><TD align="right">46.90</TD> <TD align="right">47.10</TD><TD align="right">50.95</TD><TD align="right">15.25</TD> </TR> <TR> <TH align="left">NO-TREE-DEAD</TH><TD align="right">51.90</TD><TD align="right">49.20</TD> <TD align="right">49.90</TD><TD align="right">53.90</TD><TD align="right">15.75</TD> </TR> <TR> <TH align="left">NO-TREE</TH><TD align="right">52.70</TD><TD align="right">49.90</TD> <TD align="right">50.25</TD><TD align="right">54.90</TD><TD align="right">15.75</TD> </TR> <TR> <TH align="left">NO-TREE</TH><TD align="right">52.70</TD><TD align="right">49.90</TD> <TD align="right">50.25</TD><TD align="right">54.90</TD><TD align="right">15.75</TD> </TR> <TR> <TH align="left">NOTHING</TH><TD align="right">51.05</TD><TD align="right">47.70</TD> <TD align="right">47.85</TD><TD align="right">51.75</TD><TD align="right">15.00</TD> </TR> </TABLE> <p> Here are the same data in graphical form: </p> <p> <img src="http://primates.ximian.com/~massi/blog/archive/2006/Mar-27/XMLMark-results.gif" alt="XMLMarl results chart" align="middle"/> </p> <p> To make some sense of it, here's a chart of the gains against the "NOTHING" options set, as percentages: </p> <p> <img src="http://primates.ximian.com/~massi/blog/archive/2006/Mar-27/XMLMark-gains.gif" alt="XMLMarl gains chart" align="middle"/> </p> <p> Some easy observations: <ul> <li> The "FAST" set is always the best, in every benchmark functionality. </li> <li> Eliminating <tt>deadce</tt> and/or <tt>inline</tt> almost always results in a heavy performance <i>loss</i>. This happens because all the "propagation" options are not "complete" optimizations, rather they prepare the code for the final <tt>deadce</tt> pass which eliminates useless stores. And also <tt>inline</tt>, by itself, eliminates the overhead of method calls but produces messy code that <i>must</i> be cleaned up by the other passes. </li> <li> Anyway, the "NO-PROP" set is clearly worse than the "FAST" one. This is another strong hint that all the options in the "FAST" set should work together (because neither <tt>deadce</tt> nor <tt>inline</tt> can, alone, give strong speedups). </li> <li> Finally, also the "NO-TREE" set is always worse than the "FAST" one, showing that the current regalloc really benefits from the elimination of variables from the internal code representation performed by <tt>treeprop</tt>. </li> </ul> </p> <p> The (small) downside of these gains is that they cost about 25% more JIT time (this is the cumulative overhead of the "FAST" set). This is clearly visible in programs which exercise a lot of code but are very fast, like a mcs bootstrap: in this case the "FAST" set <i>loses</i> 1.33% (it takes 3.86 seconds instead of 3.81). However, uning a custom mcs driver that performs a second compilation (a sort of "hot run", because all the code is already jitted), and making sure that also the "cold run" had all the source files in the system buffer cache, shows that the "FAST" set <i>gains</i> 1.85% against the current default. This means that longer compilations should not suffer at all from the JIT overhead. </p> <p> Finally, these results are on x86... on amd64 I noticed the need for further tuning in the linear regalloc spill costs: the gains are still there, but in some cases they are smaller. And also other benchmarks need some more tuning. </p> <p> Anyway, this code should hit svn "real soon now", and be part of Mono 1.2 (so that nobody should worry about the overhead of wrapping fields with properties anymore!) . </p> <p> Next in my todo list is getting rid of the SSAPRE issues we have... </p> http://primates.ximian.com/~massi/all.html#3%2f27%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#3%2f27%2f2006+1%3a55%3a00+PM Mon, 27 Mar 2006 13:55:00 GMT 13 Mar 2006 <h3 class="title">So you want to tune the JIT... but where are the knobs?</h3> <p> Certain optimizations are always a win, so, if they don't take too much JIT time, they should be always turned on. Examples are constant folding, using intrinsic functions, reorganizing branches, or also dead code elimination (again, if the JIT time is small, which for now in Mono means without using SSA). </p> <p> Other optimizations, however, are more problematic... sometimes they are good, other times they don't optimize at all: they make worse code. In this case, some <i>tuning</i> is needed: the JIT must decide where the optimization is useful, and where it is useless or even harmful. </p> <p> A typical example of this in Mono is SSAPRE, which looks good on paper (it eliminates redundancies), but actually makes a tradeoff between the cost of space (to store precomputed values) and computations, and this tradeoff is not always obvious. <br/> So far I've seen the following places in the JIT where this kind of tuning is important: <ul> <li> <u>Inlining</u>: obviously sometimes it's good and sometimes it's bad, so the JIT must weight the "cost" of inining a method, and see where it is more than the cost of the call. </li> <li> <u>SSAPRE</u>: here the decision must be taken for each redundant expression that can be eliminated, weighting the cost of the computation vs the added pressure on the regalloc. </li> <li> <u>Regalloc</u>: in the global regalloc the JIT considers the "spill cost" of each variable, and gives precedence to variables with higher cost. </li> </ul> </p> <p> So we can (very roughly) consider each of the parameters described above as a knob we can use to "tune" the JIT: changing the threshold used to take a decision, the JIT behavior is altered. <br/> And obviously there <i>should</i> be an optimal tuning point... </p> <p> But things are not so simple. <br/> First of all, it can happen that one knob influences another. For instance, more inlining can increase the pressure on the regalloc, which means that the behavior of the regalloc knob will react differently. Then, not all CPU architectures can react to the tuning in the same way. One of my latest posts to mono-devel-list is an example of both things: I changed the way in which the spill cost takes into account the BB nesting level from "<tt>(1 + (bb->nesting * 2))</tt>" to "<tt>(1 << (bb->nesting << 1))</tt>", and got a 25% speedup in Scimark2. But I cried victory too son: I also made it 2% <i>worse</i> on x86... a less aggressive setting like "<tt>(1 * (bb->nesting << 1))</tt>" is more a compromise (only 10% more on amd64, and not worse on x86), but I still have to look at the generated code better and find the optimal point (or maybe decide that there is not an architecture independent optimal point). <br/> Now, this can be reasonable, because on amd64 the JIT uses SSE instructions for floating point operations, where the regalloc works "properly", while on x86 the FPU is used in the traditional "stack based" way. </p> <p> The point of all the above it that exploring this "tuning space" can be a long (and sometimes tedious) process, often driven more by "sense of smell" than anything else. <br/> But "sense of smell" is not scientific, and the "boring" side of the equation makes it hard to gather results in a rigorous way. So I wrote a small script to automate the process, which can be found <a href="http://primates.ximian.com/~massi/BENCHMARKS/benchmark.pl">here</a> together with some benchmark. </p> <p> Mind you, it is rough, my perl-fu is poor, some paths are hard coded, real men use <tt>sed</tt> to modify files in place instead of stupid perl loops... but it gets the job done. It is meant to be used directly from the "mini" directory in the JIT, and produces lots of nice result files that can then be examined quickly. <br/> And all the <i>boring</i> part, edit source file, save, build, execute benchmarks <i>n</i> times, save results with a meaningful name, repeat, do the same on a different machine... well, it's gone, freeing my time for something better! </p> <p> So, if you are interested in JIT tuning, and have a machine with some spare cycles to run benchmarks, please give a look to that script. It is fairly easy to describe your own set of benchmark runs in an array of perl hashes, and it will modify the JIT source, rebuild and run all the benchmarks for you in sequence. </p> <p> And of you have interesting results, let me know! </p> <p> In the next weeks I'll do the following: <ul> <li> Take a break from JIT tuning, and fix the correctness of the <tt>klass</tt> field inside <tt>MonoInst</tt> (needed both for type based alias analysis and for the new garbage collector). </li> <li> Tune also SSAPRE in a decent way. </li> <li> Try to improve SciMark2 again, but this time simply in a "profiler driven" way: run the profiler, look at the assembly cod the hottest methods, and see what should be done... </li> </ul> </p> <p> Of course, if nothing more important shows up! </p> http://primates.ximian.com/~massi/all.html#3%2f13%2f2006+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#3%2f13%2f2006+1%3a55%3a00+PM Mon, 13 Mar 2006 13:55:00 GMT 21 Nov 2005 <h3 class="title">Spreading the word</h3> <p> Some news... November 26<sup>th</sup> is national <a href="http://www.linux.it/LinuxDay/">Linux Day</a>. I'll be giving two talks on Mono, <a href="http://www.openlabs.it/linuxday2005/">here</a> and <a href="http://www.linuxdaymilano.it/">here</a>. </p> <p> These are the slides I'll use: <a href="http://primates.ximian.com/~massi/MonoLinuxDay2005.sxi">SXI (OpenOffice)</a>, <a href="http://primates.ximian.com/~massi/MonoLinuxDay2005.pdf">PDF</a> and <a href="http://primates.ximian.com/~massi/MonoLinuxDay2005.txt">text</a> (this one freely redistributable). Not that I'll be able to do the <em>full</em> presentation in the time I'll have, but I prefer to have all the slides anyway. </p> <p> Moreover, an article I wrote will be published on <a href="http://www.oltrelinux.com/">this</a> Italian magazine on December 1<sup>st</sup>. It will take eight full pages, and is also the subject of a fancy cover: </p> <p> <img src="http://www.oltrelinux.com/immagini/numero_49/copertina_49.gif" alt="Cover showing a monkey helping a penguin out of a Windows prison" align="middle"/> </p> <p> The icing an the cake will be two Suse 10 DVDs (x86 32 and 64 bits), and a large selection of Mono packages on an additional DVD... </p> <p> Plus, more articles about Mono will follow in the next months. </p> http://primates.ximian.com/~massi/all.html#11%2f21%2f2005+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#11%2f21%2f2005+1%3a55%3a00+PM Mon, 21 Nov 2005 13:55:00 GMT 16 Nov 2005 <h3 class="title">Fastpath way to get rid of the dead</h3> <p> I've not been blogging for ages... last time I did, I was starting to work on <a href="http://primates.ximian.com/~massi/blog/archive/2005/Apr-12.html">HSSA</a>. </p> <p> Now, HSSA is progressing well, it already works in my tree, and I have a functioning dead code elimination (deadce) pass implemented that is based on it. I was testing and benchmarking this "HSSA deadce", comparing it to the existing "SSA deadce" we already have in the JIT, when I realized that I also had a <i>third</i> deadce implementation, which didn't use SSA at all. I'll take Patrik Torstensson's suggestion, and call it "fastpath" deadce, because the JIT path that doesn't use SSA is the fast one. </p> <p> If someone wants to laugh at the "famous last words", he can have a look at my previous blog post, where I mentioned exactly this deadce pass. Anyway this time, since I was testing deadce effectiveness, it seemed a nice thing to include it in the benchmarks... </p> <p> The results have been at least interesting: fastpath deadce is actually fairly effective, and it also doesn't have the JIT time overhead of the SSA version. </p> <p> This table lists how long it takes to the JIT to compile "mscorlib.dll". The times are measured with "<tt>time mono -O=</tt><i>options</i><tt> --compile-all mscorlib.dll</tt>". </p> <TABLE border="1"> <CAPTION><EM>Time to JIT compile "mscorlib.dll", in seconds</EM></CAPTION> <TR> <TH>Options</TH><TH>-all</TH><TH>-all,ssa</TH> <TH>(default)</TH><TH>ssa</TH><TH>deadce</TH><TH>ssa,deadce</TH> </TR> <TR> <TH>JIT time</TH><TD align="right">1.05</TD><TD align="right">1.35</TD> <TD align="right">1.32</TD><TD align="right">1.55</TD><TD align="right">1.45</TD><TD align="right">1.6</TD> </TR> </TABLE> <p> From the first two columns we see that just computing SSA has about a 30% overhead with respect to a compilation with no optimizations at all. </p> <p> Then we see that the overhead of fastpath deadce on a "default" compilation is almost 10%, while ssa deadce has 21% (and note that in practice, since ssa does not work with aliasing, the new fastpath deadce is applied to more methods). </p> <p> So, the JIT overhead is reasonably small. </p> <p> And finally, some lovely benchmark... I tried SciMark, and here are the results: </p> <TABLE border="1"> <CAPTION><EM>SciMark results (MFlops)</EM></CAPTION> <TR> <TH>Options</TH> <TH>(default)</TH> <TH>deadce</TH> <TH>ssa<br/>deadce</TH> <TH>consprop<br/>copyprop<br/>deadce</TH> <TH>consprop<br/>copyprop<br/>deadce<br/>inline</TH> <TH>inline</TH> <TH>deadce<br/>inline</TH> </TR> <TR> <TH align="left">Composite Score</TH> <TD align="right">205.92</TD> <TD align="right">207.50</TD> <TD align="right">210.61</TD> <TD align="right">207.72</TD> <TD align="right">209.82</TD> <TD align="right">213.63</TD> <TD align="right">214.52</TD> </TR> <TR> <TH align="left">FFT</TH> <TD align="right">111.42</TD> <TD align="right">110.91</TD> <TD align="right">124.74</TD> <TD align="right">101.83</TD> <TD align="right">113.39</TD> <TD align="right">123.55</TD> <TD align="right">123.41</TD> </TR> <TR> <TH align="left">SOR</TH> <TD align="right">394.40</TD> <TD align="right">400.97</TD> <TD align="right">400.03</TD> <TD align="right">405.99</TD> <TD align="right">402.71</TD> <TD align="right">402.78</TD> <TD align="right">405.82</TD> </TR> <TR> <TH align="left">Monte Carlo</TH> <TD align="right">4.91</TD> <TD align="right">5.16</TD> <TD align="right">4.23</TD> <TD align="right">5.13</TD> <TD align="right">4.85</TD> <TD align="right">4.69</TD> <TD align="right">5.16</TD> </TR> <TR> <TH align="left">Sparse MatMult</TH> <TD align="right">167.77</TD> <TD align="right">160.82</TD> <TD align="right">172.33</TD> <TD align="right">169.99</TD> <TD align="right">173.70</TD> <TD align="right">173.47</TD> <TD align="right">172.30</TD> </TR> <TR> <TH align="left">LU</TH> <TD align="right">351.12</TD> <TD align="right">359.62</TD> <TD align="right">351.71</TD> <TD align="right">355.69</TD> <TD align="right">354.47</TD> <TD align="right">363.67</TD> <TD align="right">365.89</TD> </TR> </TABLE> <p> Well, we can see that deadce is doing its job! There are no <i>miracles</i>, but it's not bad anyway. </p> <p> Another consideration: SSA deadce is still more effective than the fastpath one, likely because SSA allows for more precision in understanding the effect that statements have (but I still have to verify this, it is also possible that I missed something in the fastpath implementation). Anyway, the point of this benchmark run is that fastpath deadce has far less JIT overhead, so that we could consider it for inclusion in the default optimization set, something for which SSA does not qualify (at least for now). Of course SSA still has its use: more advanced things (ABCREM and redundant check removal in general, PRE, and in the future register promotion and escape analysis) are better done with [H]SSA. But we can see that a simple deadce pass can be effectively done without it. </p> <p> And by the way, I spent a <em>lot</em> of time trying to be sure that those numbers are accurate. I had to kill evolution, gaim, beagle, the wireless card, and carefully monitor system load otherwise the results were erratic (just of a few percent, but this is what we're looking at). I then executed the benchmark several (>10) times for every case and took the <em>best</em> score for each (with the idea that it is the one where other things interfered less), and anyway all the best scores were almost identical. And of course the patch passes all standard tests (a full Mono rebuild with the optimization turned on by default, and all regression tests). </p> <p> One final strange result... look at this table: </p> <TABLE border="1"> <CAPTION><EM>SciMark results (MFlops)<br/>[exe compiled with a MS .NET 2.0 csc]</EM></CAPTION> <TR> <TH>Options</TH> <TH>(default)</TH> <TH>consprop<br/>copyprop<br/>deadce</TH> </TR> <TR> <TH align="left">Composite Score</TH> <TD align="right">136.65</TD> <TD align="right">137.52</TD> </TR> <TR> <TH align="left">FFT</TH> <TD align="right">52.60</TD> <TD align="right">52.29</TD> </TR> <TR> <TH align="left">SOR</TH> <TD align="right">404.41</TD> <TD align="right">405.48</TD> </TR> <TR> <TH align="left">Monte Carlo</TH> <TD align="right">4.61</TD> <TD align="right">4.50</TD> </TR> <TR> <TH align="left">Sparse MatMult</TH> <TD align="right">140.03</TD> <TD align="right">143.80</TD> </TR> <TR> <TH align="left">LU</TH> <TD align="right">81.61</TD> <TD align="right">81.52</TD> </TR> </TABLE> <p> Well, deadce still has some effect, but... the overall performance is just awful! </p> <p> Looking better, we see that FFT and LU (and to some extent Sparse MatMult) are really badly affected, and bring the score down. I already have some small test program that isolates them (csc 2.0 was also able to reveal a very subtle bug in HSSA, on SciMark LU), so I'll investigate further. The sources and binaries of the benchmark are <a href="http://primates.ximian.com/~massi/blog/files/SciMark-mcs-csc.tar.gz">here</a> ("SciMark.exe" is the original executable), and anyway I got the original package <a href="http://www.cs.ucsb.edu/~ckrintz/racelab/PhxCSBenchmarks/">here</a>. </p> <p> There is still work to do... I'd have to check the exact effect this has on inlining looking at the code produced compiling small specific benchmarks, and I'd also have to see if we can lift the restriction imposed by catch/finally clauses (the last patches by Patrik that fix the CFG should help here). This would be nice, because it would mean that this optimization would apply to <em>all</em> methods! For instance, Ben warned me that every "<tt>foreach</tt>" statement introduces a "<tt>finally</tt>" clause, which disables deadce and everything that needs SSA in the JIT. </p> <p> Finally, if anybody wants to look at the source, it has been posted <a href="http://lists.ximian.com/archives/public/mono-devel-list/2005-November/015765.html">here</a>. </p> http://primates.ximian.com/~massi/all.html#11%2f16%2f2005+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#11%2f16%2f2005+1%3a55%3a00+PM Wed, 16 Nov 2005 13:55:00 GMT 12 Apr 2005 <h3 class="title">Dealing with aliasing</h3> <p> Recently I've been working on <i>alias analysis</i> (also known as "<i>points to</i>" analysis). </p> <p> Very briefly, this means understanding the behavior of pointers in the compiled code so that various optimizations can be performed also in the presence of indirect load and store operations (where "indirect" means "performed using pointers"). </p> <p> In the beginning, I started working on this because I wanted to implement a dead store elimination pass without using SSA. The idea was that it would have helped when we perform inlining, because we create one local variable for each argument of the inlined method. Quite often those variables could be eliminated by a dead store elimination pass if copy propagation has been done, but for now we do deadce only using SSA. It is easy to implement deadce without SSA if you have liveness information, and we always compute liveness to run the linear register allocator, so I decided to base the deadce pass on that. </p> <p> I found out that our liveness code does not really take aliasing into account; this is enough anyway for how the regalloc uses liveness, but my deadce pass was causing crashes all over the place. So I begun investigating alias analysis. </p> <p> The problem is that alias analysis can mean a lot of different things. There are a lot of possible approaches to the problem, and the analysis can be more or less accurate. After a while (and a few messages on the mono-devel list) we settled on the initial goal of having correct liveness information for <i>local</i> variables. This would allow me to do my deadce pass, and would also be enough to enable SSA also when the address of a local variable has been taken (now SSA is totally disabled in those cases). On the other hand, this also means that we explicitly do not care about pointers that point <i>outside</i> local variables. </p> <p> I then implemented this kind of analysis, used it in a deadce pass, and posted a <a href="http://lists.ximian.com/archives/public/mono-devel-list/2005-March/010791.html">patch</a>. It then turned out that my deadce didn't help so much for inlining. It was in any case better than the SSA based deadce because it had much faster compilation times (no SSA overhead), but it wasn't so urgent to commit it... </p> <p> Meanwhile, I turned <a href="http://primates.ximian.com/~massi/blog/archive/2005/Mar-16.html">back</a> to <a href="http://primates.ximian.com/~massi/blog/archive/2004/Sep-14.html">SSAPRE</a>, because it just wasn't working right. I committed a <a href="http://lists.ximian.com/archives/public/mono-patches/2005-April/056703.html">patch</a> that at least seems to limit the damages (the code produced for the example in my previous blog entry looks OK this time), but the bottom line is that SSAPRE, as is, is just not very effective. </p> <p> Of course one way to improve the situation would be adding GVN capabilities, but looking at the code in the benchmarks the real issue seems to be that indirect operations are not handled at all. Since object fields are generally accessed through a pointer to the base of the object (plus the appropriate offset), this means that <i>all</i> expressions containing a field access are simply ignored by SSAPRE. </p> <p> So we are back to aliasing, but this time taking into account all pointers, not just those referring to local variables. And this kind of analysis is much more complex... This is another reason why I still haven't committed the current aliasing patch: it is likely that I'll do changes to its data structures, and I if possible I'd like not to work on two different implementations at the same time. </p> <p> One of the trickiest things in this kind of "full" alias analysis is giving it an SSA representation. In the end, SSA in its natural form gives unique versions to local variables, not to arbitrary memory locations. However, usually SSA based optimizations expect to work on "versioned" values... One interesting solution to this problem is an extended SSA form called <a href="http://citeseer.ist.psu.edu/chow96effective.html">HSSA</a>. This involves several steps: <ul> <li> First of all, indirect operations must be recognized, and explicitly represented in the method code. The paper introduces two new operators in the IR: "&chi;" to represent indirect stores, and "&mu;" for indirect loads, and explains how they must be introduced in the code. These are in fact dummy operators (&chi; is in fact very similar to the "<TT>OP_DUMMY_STORE</TT>" we already have), and must be discarded when destroying the SSA form. </li> <li> Then, a sort of trick (which the paper calls "<i>zero versioning</i>") can be applied to avoid an explosion of SSA versions. Briefly, this exploits the fact that variables receiving their value from a &chi; operation have in fact an unknown value, so it is pointless giving them a unique SSA version: a dummy "zero version" is used for all of them (and their &phi; variables as well). This helps in keeping the SSA form reasonably small. </li> <li> At this point "<i>virtual variables</i>" are introduced. Again, briefly, they are a way to make external memory locations look like program variables, so that SSA versioning naturally applies to them. </li> <li> Finally, GVN (yes, Global Value Numbering!) is applied to the whole SSA form, obtaining the "Hashed SSA form" or HSSA. This has several effects, some of which I still don't completely understand. Particularly, I still cannot see the difference between GVN-SSA and applying SSAPRE to HSSA... it is worth noting that HSSA and SSAPRE have been developed by the same persons in SGI, and in fact in the SSAPRE paper they state that they run their benchmarks on a compiler using HSSA internally. So, maybe in retrospect there's no wonder that SSAPRE it is not so effective in Mono: I used it on a plain and simple SSA form, without GVN and without dealing with indirect load/store operations. I'll try to understand better the full implications of going HSSA in the following days. </li> </ul> </p> <p> As usual, the devil is in the details. </p> <p> Ignore the fact that I still don't fully understand all the consequences of throwing GVN into the mix. Let's also forget that probably representing virtual variables in Mini will be non trivial (for us a variable is just an entry in the <TT>cfg->vars</TT> array, identified by its index, but the way in which virtual variables should be identified is a bit tricky). Anyway, the real issue is the first point: how do we know where to introduce &chi; and &mu; operations? </p> <p> The simple answer is: performing alias analysis! </p> <p> Yes, because actually the HSSA paper shows how to fully <i>use</i> the results of this kind of analysis, making it work safely and effectively with SSA. How to <i>obtain</i> those information is a different thing. In my previous patch I actually avoided most of the problem, because I considered only local variables. Since a pointer to a local variable can be created only <i>explicitly</i> inside the method, it is fairly easy to track the usage of those pointers. Moreover, falling back to "all variables for which I lost track of their address are mutually aliased" is not that bad, because the situation is not frequent. </p> <p> On the other hand, pointers to locations outside the method local storage are totally out of control. Since we cannot perform interprocedural analysis, we must handle all method calls as black boxes, and we cannot assume anything on input arguments as well. In this situation the most obvious technique to use is "type based" alias analysis. This basically means the following: since we know our execution environment is strongly typed, we assume that all pointers whose value is unknown can alias all pointers of the same (or compatible) type. </p> <p> The problem is, this works well in Java, because in that environment the only "pointers" are references to objects. But in CIL the situation is different: not only we have object references, we also have managed pointers to "simple" memory locations (typically value types), that can be either local variables or "parts" of other objects. </p> <p> To better understand what I mean, have a look at the following program: </p> <xmp class="code-csharp">using System; using System.Runtime.InteropServices; public struct Data { public int i; public byte b; } [StructLayout(LayoutKind.Explicit)] public struct Union { [FieldOffset(0)] public int i; [FieldOffset(0)] public byte b; } public class Test { public static void WorkData (int v, ref Data d, ref byte b) { b = 0; d.i = v; d.b = (byte) v; d.i += b; } public static void WorkUnion (int v, ref int i, ref byte b) { i = v; b = (byte) v; i += b; } public static void Main () { Data d = new Data (); Union u = new Union (); WorkData (1, ref d, ref d.b); WorkUnion (1, ref u.i, ref u.b); } } </xmp> <p> Of course it does nothing useful, but it shows a couple of things: <ul> <li> Inside <TT>WorkData</TT>, that it is possible to invoke the method so that <TT>d</TT> and <TT>b</TT> partially overlap. This means (roughly speaking) that <i>every</i> pointer of a given type can alias any other pointer of a "compatible type" when the "origin" of both pointers is ultimately unknown. In this example, the two pointers are <TT>b</TT> and <TT>d.b</TT>: the origin of <TT>d.b</TT> is unknown because it points inside <TT>d</TT>, and the origin of <TT>d</TT> is unknown. At first sight, it could seem that <TT>b</TT> and <TT>d</TT> cannot have anything in common (one points to a "compound", the example would work the same way with an object), and the other is a primitive type, nevertheless they ultimately alias each other. </li> <li> Inside <TT>WorkUnion</TT>, that <TT>i</TT> and <TT>b</TT> can alias each other, even if their types are not compatible! Note that when compiling <TT>WorkUnion</TT> we have absolutely no clue that it could be called in that way, so even if we decided to analyze all the layouts of all involved types we would not get any help: <TT>i</TT> and <TT>b</TT> can overlap, period. This is a wicked example, but its consequences on alias analysis cannot be ignored: when compiling methods like <TT>WorkUnion</TT> <TT>i</TT> and <TT>b</TT> must be considered mutually aliased. </li> </ul> </p> <p> As you can see, CIL is a very interesting language when dealing with aliasing: the fact that it is strongly typed allows <i>some</i> assumptions on pointer values, but the possibility of referencing "raw" value types gives a lot of complications... </p> <p> I'll spend the following days doing two things: refining the conditions to perform type based alias analysis on CIL, and understanding the implications of the introduction of GVN into HSSA. This thing looks particularly promising to me... it should in the end give our JIT the same optimization capabilities of other mature compilers. </p> <p> Of course, ideas and suggestions are always welcome... please send them to the mono-devel list! </p> http://primates.ximian.com/~massi/all.html#4%2f12%2f2005+1%3a55%3a00+PM Massimiliano Mantione (massi@ximian.com) http://primates.ximian.com/~massi/all.html#4%2f12%2f2005+1%3a55%3a00+PM Tue, 12 Apr 2005 13:55:00 GMT 16 Mar 2005 <h3 class="title">Improving SSAPRE</h3> <p> One of the optimizations I've implemented lately is <a href="http://primates.ximian.com/~massi/blog/archive/2004/Sep-14.html">SSAPRE</a>. </p> <p> The problem is, this sweet beast works, and works correctly, but sometimes the code it produces is <i>slower</i> than the original. Needless to say, this is a Bad Thing<SMALL><SUP>TM</SUP></SMALL>. And it is also a bit embarassing... </p> <p> Recently <a href="http://www.advogato.org/person/lupus/">lupus</a> sent me a benchmark that showed this behavior <i>and</i> was easy to understand, so I decided to have a deeper look at the issue (previously I wanted to wait until I had added GVN capabilities to SSAPRE, but now I see it's better to at least understand what's really going on). </p> <p> The "meat" of the benchmark is the following method: </p> <xmp class="code-csharp">public void Advance(double dt) { double dx, dy, dz, distance, mag; for (int i=0; i < bodies.Length; i++) { for (int j=i+1; j < bodies.Length; j++) { dx = bodies[i].x - bodies[j].x; dy = bodies[i].y - bodies[j].y; dz = bodies[i].z - bodies[j].z; distance = Math.Sqrt(dx*dx + dy*dy + dz*dz); mag = dt / (distance * distance * distance); bodies[i].vx -= dx * bodies[j].mass * mag; bodies[i].vy -= dy * bodies[j].mass * mag; bodies[i].vz -= dz * bodies[j].mass * mag; bodies[j].vx += dx * bodies[i].mass * mag; bodies[j].vy += dy * bodies[i].mass * mag; bodies[j].vz += dz * bodies[i].mass * mag; } } foreach (Body body in bodies) { body.x += dt * body.vx; body.y += dt * body.vy; body.z += dt * body.vz; } } </xmp> <p> I verified that if I apply SSAPRE <i>only</i> to this method, all things being equal the benchmark is almost 10% slower, so it's worth understanding what happens. </p> <p> The issue is that looking only at the code transformations doesn't help at all, because they are all correct, and they all eliminate redundancies. It is a known thing that SSAPRE can increase the pressure on the register allocator, so I ended up verifying <i>everything</i>, included the resulting machine code (on x86) with and without the optimization. </p> <p> One of the first things I found looking at the code is that more local variables are used. Without SSAPRE, the prologue contains "<TT>sub $0x28,%esp</TT>", and with it this becomes "<TT>sub $0x34,%esp</TT>", so 12 more bytes are used by SSAPRE for temporary variables. So yes, there is more register pressure, but just seeing it this way does not justify a 10% loss in performance. </p> <p> Looking at the actual differences in the generated code is really interesting. SSAPRE gets many things right... the most obvious is the access to "bodies", which is an instance field (Advance is an instance method). At the beginning, there is a "<TT>mov 0x8(%ebp),%edi</TT>" instruction, which puts the "this" parameter in <TT>edi</TT>. This is fine, but with SSAPRE it gets better: a temporary is created to hold the address of "bodies"; "bodies" is the only instance field that is used in the method, so the "this" variable is dead, and is eliminated. This promotes the address of "bodies" to <TT>edi</TT>. In practice, a single "<TT>add $0x8,%edi</TT>" does the trick, and from then on, all accesses to "bodies" are like "<TT>mov (%edi),%eax</TT>" with SSAPRE (they were like "<TT>mov 0x8(%edi),%eax</TT>" without it). This happens 21 times. </p> <p> Then there is the implicit access to bodies.Length in the "foreach" loop, which SSAPRE correctly recognizes as a loop invariant. </p> <p> Another subtle transformation introduced by SSAPRE is the following: every time a "-=" or "+=" operator is