Massimiliano Mantione's Blog

RSS

../../massi.rss2

Email

massi@ximian.com

Docs:

I'm working on...

16 Nov 2005 (Permalink)

Fastpath way to get rid of the dead

I've not been blogging for ages... last time I did, I was starting to work on HSSA.

Now, HSSA is progressing well, it already works in my tree, and I have a functioning dead code elimination (deadce) pass implemented that is based on it. I was testing and benchmarking this "HSSA deadce", comparing it to the existing "SSA deadce" we already have in the JIT, when I realized that I also had a third deadce implementation, which didn't use SSA at all. I'll take Patrik Torstensson's suggestion, and call it "fastpath" deadce, because the JIT path that doesn't use SSA is the fast one.

If someone wants to laugh at the "famous last words", he can have a look at my previous blog post, where I mentioned exactly this deadce pass. Anyway this time, since I was testing deadce effectiveness, it seemed a nice thing to include it in the benchmarks...

The results have been at least interesting: fastpath deadce is actually fairly effective, and it also doesn't have the JIT time overhead of the SSA version.

This table lists how long it takes to the JIT to compile "mscorlib.dll". The times are measured with "time mono -O=options --compile-all mscorlib.dll".

Time to JIT compile "mscorlib.dll", in seconds
Options-all-all,ssa (default)ssadeadcessa,deadce
JIT time1.051.35 1.321.551.451.6

From the first two columns we see that just computing SSA has about a 30% overhead with respect to a compilation with no optimizations at all.

Then we see that the overhead of fastpath deadce on a "default" compilation is almost 10%, while ssa deadce has 21% (and note that in practice, since ssa does not work with aliasing, the new fastpath deadce is applied to more methods).

So, the JIT overhead is reasonably small.

And finally, some lovely benchmark... I tried SciMark, and here are the results:

SciMark results (MFlops)
Options (default) deadce ssa
deadce
consprop
copyprop
deadce
consprop
copyprop
deadce
inline
inline deadce
inline
Composite Score 205.92 207.50 210.61 207.72 209.82 213.63 214.52
FFT 111.42 110.91 124.74 101.83 113.39 123.55 123.41
SOR 394.40 400.97 400.03 405.99 402.71 402.78 405.82
Monte Carlo 4.91 5.16 4.23 5.13 4.85 4.69 5.16
Sparse MatMult 167.77 160.82 172.33 169.99 173.70 173.47 172.30
LU 351.12 359.62 351.71 355.69 354.47 363.67 365.89

Well, we can see that deadce is doing its job! There are no miracles, but it's not bad anyway.

Another consideration: SSA deadce is still more effective than the fastpath one, likely because SSA allows for more precision in understanding the effect that statements have (but I still have to verify this, it is also possible that I missed something in the fastpath implementation). Anyway, the point of this benchmark run is that fastpath deadce has far less JIT overhead, so that we could consider it for inclusion in the default optimization set, something for which SSA does not qualify (at least for now). Of course SSA still has its use: more advanced things (ABCREM and redundant check removal in general, PRE, and in the future register promotion and escape analysis) are better done with [H]SSA. But we can see that a simple deadce pass can be effectively done without it.

And by the way, I spent a lot of time trying to be sure that those numbers are accurate. I had to kill evolution, gaim, beagle, the wireless card, and carefully monitor system load otherwise the results were erratic (just of a few percent, but this is what we're looking at). I then executed the benchmark several (>10) times for every case and took the best score for each (with the idea that it is the one where other things interfered less), and anyway all the best scores were almost identical. And of course the patch passes all standard tests (a full Mono rebuild with the optimization turned on by default, and all regression tests).

One final strange result... look at this table:

SciMark results (MFlops)
[exe compiled with a MS .NET 2.0 csc]
Options (default) consprop
copyprop
deadce
Composite Score 136.65 137.52
FFT 52.60 52.29
SOR 404.41 405.48
Monte Carlo 4.61 4.50
Sparse MatMult 140.03 143.80
LU 81.61 81.52

Well, deadce still has some effect, but... the overall performance is just awful!

Looking better, we see that FFT and LU (and to some extent Sparse MatMult) are really badly affected, and bring the score down. I already have some small test program that isolates them (csc 2.0 was also able to reveal a very subtle bug in HSSA, on SciMark LU), so I'll investigate further. The sources and binaries of the benchmark are here ("SciMark.exe" is the original executable), and anyway I got the original package here.

There is still work to do... I'd have to check the exact effect this has on inlining looking at the code produced compiling small specific benchmarks, and I'd also have to see if we can lift the restriction imposed by catch/finally clauses (the last patches by Patrik that fix the CFG should help here). This would be nice, because it would mean that this optimization would apply to all methods! For instance, Ben warned me that every "foreach" statement introduces a "finally" clause, which disables deadce and everything that needs SSA in the JIT.

Finally, if anybody wants to look at the source, it has been posted here.

All entries
This is a personal web page. Things said here do not represent the position of my employer.