One of the optimizations I've implemented lately is SSAPRE.
The problem is, this sweet beast works, and works correctly, but sometimes the code it produces is slower than the original. Needless to say, this is a Bad ThingTM. And it is also a bit embarassing...
Recently lupus sent me a benchmark that showed this behavior and was easy to understand, so I decided to have a deeper look at the issue (previously I wanted to wait until I had added GVN capabilities to SSAPRE, but now I see it's better to at least understand what's really going on).
The "meat" of the benchmark is the following method:
I verified that if I apply SSAPRE only to this method, all things being equal the benchmark is almost 10% slower, so it's worth understanding what happens.
The issue is that looking only at the code transformations doesn't help at all, because they are all correct, and they all eliminate redundancies. It is a known thing that SSAPRE can increase the pressure on the register allocator, so I ended up verifying everything, included the resulting machine code (on x86) with and without the optimization.
One of the first things I found looking at the code is that more local variables are used. Without SSAPRE, the prologue contains "sub $0x28,%esp", and with it this becomes "sub $0x34,%esp", so 12 more bytes are used by SSAPRE for temporary variables. So yes, there is more register pressure, but just seeing it this way does not justify a 10% loss in performance.
Looking at the actual differences in the generated code is really interesting. SSAPRE gets many things right... the most obvious is the access to "bodies", which is an instance field (Advance is an instance method). At the beginning, there is a "mov 0x8(%ebp),%edi" instruction, which puts the "this" parameter in edi. This is fine, but with SSAPRE it gets better: a temporary is created to hold the address of "bodies"; "bodies" is the only instance field that is used in the method, so the "this" variable is dead, and is eliminated. This promotes the address of "bodies" to edi. In practice, a single "add $0x8,%edi" does the trick, and from then on, all accesses to "bodies" are like "mov (%edi),%eax" with SSAPRE (they were like "mov 0x8(%edi),%eax" without it). This happens 21 times.
Then there is the implicit access to bodies.Length in the "foreach" loop, which SSAPRE correctly recognizes as a loop invariant.
Another subtle transformation introduced by SSAPRE is the following: every time a "-=" or "+=" operator is used, the address of the affected location must be computed twice: one to perform the operation, and another to store the result. To make a concrete example, this means that without SSAPRE the last line of the foreach loop ("body.z += dt * body.vz") starts with a "fldl 0x18(%ebx)" and ends with a "fstpl 0x18(%ebx)". On the other hand, with SSAPRE, first of all the sequence "mov %edi,%eax; add $0x18,%eax" is inserted. Then the two other instructions should become "fldl (%eax)" and "fstpl (%eax)". Not such a big win, but not so bad after all. At least, not enough to justify a 10% loss in performance.
The problem is that things do not go that way. The line without SSAPRE is compiled like this:
The sequence that is actually inserted is "mov %edi,%eax; add $0x18,%eax; mov %eax,0xffffffc4(%ebp); mov %eax,%ecx", and the two key instructions then become "fldl (%ecx)" and "fstpl (%eax)" (note the use of ecx in the load and eax in the store). What's puzzling is that the sequence between the load and the store operates entirely on the FP stack ("fldl 0xc(%ebp); fldl 0x30(%edi); fmulp %st,%st(1); faddp %st,%st(1)", and the same with ebx instead of edi without SSAPRE), so it should not affect the precalculation of 0x18(%edi). Moreover, eax and ecx are only used in that load and store, and the location 0xffffffc4(%ebp) is assigned to, but it's dead. In the IR representation (before the linearization) there's nothing that would induce this code, except for the fact that the temporary introduced by SSAPRE to hold the address "0x18(%edi)" is in fact allocated in 0xffffffc4(%ebp)" after the linear scan regalloc has run. So this seems to me some artifact introduced by the local regalloc... This happens in all the three lines of the foreach loop, exactly in the same way, and similar things happen in the six lines with a "-=" or a "+=" in the loop above.
The last disaster happens in the inner loop (the one on the "j" index). I think you can easily see in the code that index "i" is incremented twice: one in the outer loop, to increment the counter, and another at the head of the inner loop, to initialize the "j" index. SSAPRE (being picky with redundancies) doesn't like this, so it has a nice idea: it introduces a temporary variable at the head of the inner loop, to store the result of "i + 1". Then, at the end of each iteration of the outer loop, instead of incrementing "i" it reads the value from the temporary. Note that the added temporary is live across the whole inner loop, but it is (obviously) never used there, because its value is "i + 1" and not "i"; it is only used at the end of each outer loop's iteration. Of course this is not such a big win, but it should not be so bad, either...
The twist is that the global rgalloc gets confused. Without SSAPRE, it allocates "i" in esi, and "j" in edi. With SSAPRE, "i" stays in esi, but the stupid SSAPRE temporary for "i + 1" goes in edi, and "j" goes in 0xfffffff0(%ebp)! So, every time "j" is used in the inner loop, 0xfffffff0(%ebp) is used instead of simply %edi...
Put this all together, and indeed you get an almost 10% degradation in performance.
The problem is, how to get out of this?
There are two kind of solutions:
It is likely that both things should be done.
In the example above, all the redundancies that cause troubles are of this kind: they are a simple arithmetic operation (an add or a sub), they introduce one temporary to eliminate just one computation, and the operation that is eliminated wasn't a loop invariant. So, they introduce one temporary, but they don't gain that much. Chances are that, if the operands of the expression are both in registers and the temporary is on the stack, the code produced by SSAPRE would naturally be always slower.
The problem is that when SSAPRE runs it has no idea of the choices that the regalloc will do, so this cannot really be fine tuned without introducing a sort of cooperation between the PRE pass and the regalloc, which is not a short term goal IMHO. So, the only reasonable heuristic I can imagine is the one I described above. It makes loop optimizations a prerequisite for SSAPRE (otherwise it cannot tell when an expression is a loop invariant, for SSAPRE all partial redundancies are equal), but it should not be that hard to to.
Anyway, IMHO also some work on the regalloc would be necessary (I know it will be rewritten). In fact, applying the above limitation to SSAPRE in this example would be helpful only because the register pressure would be reduced. It seems to me that all the transformations performed by SSAPRE were useful in principle, and that it is mostly a bad choice of register variables that makes the code so bad. However, I'm not such an assembly expert, and I'm not saying this to defend SSAPRE, so please if some of the transformations described above was harmful in itself just let me know with a message in mono-devel.
Moreover, there is another kind of problem that must be addressed: there are a lot of obvious redundancies that SSAPRE simply doesn't see!
For instance, look at all the accesses to the "bodies" array. There are just two indexes, "i" and "j", how is it that SSAPRE cannot factor them out?
The problem is that "bodies" is not a local variable. It is a field of the "this" argument, and (to put it simply) fields are not "in SSA form". By this I mean two things:
In practice, this means tracking where pointers point to, and how they are used. In the above example, it is obvious that "bodies" is never assigned, and there are no method calls that could change it, so the array is always the same. This, besides enhancing SSAPRE (and therefore putting even more pressure on the register allocator), would also allow ABCREM to operate on the array.
I have already implemented a simple alias analysis pass, and it works (I plan to blog about it shortly). However, it is reduced in scope, as it only tracks pointers to local variables. In the (near) future I should extend this to work on fields as well.
To sum it up, four kind of things should be done to enhance SSAPRE:
This is all... please send comments and suggestions to mono-devel!