Massimiliano Mantione's Blog

Email

massi@ximian.com

13 Mar 2006

So you want to tune the JIT... but where are the knobs?

Certain optimizations are always a win, so, as long as they don't take too much JIT time, they should always be turned on. Examples are constant folding, using intrinsic functions, reorganizing branches, and dead code elimination (again, if the JIT time is small, which for now in Mono means without using SSA).

Other optimizations, however, are more problematic... sometimes they help, other times they don't optimize anything and actually produce worse code. In this case, some tuning is needed: the JIT must decide where the optimization is useful, and where it is useless or even harmful.

A typical example of this in Mono is SSAPRE, which looks good on paper (it eliminates redundancies), but actually trades space (to store precomputed values) against computation, and this tradeoff is not always obvious.
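To make the tradeoff concrete, here is a hand-written sketch of what partial redundancy elimination does in general. It is not Mono's SSAPRE code, and the function names are made up for illustration.

```python
def before(a, b, flag):
    """a * b is computed on the 'flag' path and then again after the merge."""
    if flag:
        x = a * b + 1
    else:
        x = 0
    y = a * b + 2  # redundant whenever flag was true
    return x + y


def after(a, b, flag):
    """Same result, but a * b is computed exactly once on every path."""
    if flag:
        t = a * b
        x = t + 1
    else:
        t = a * b  # inserted so that t is available on both paths
        x = 0
    y = t + 2  # reuses t: the price is that t stays live across the merge
    return x + y
```

The second version saves a multiplication on one path, but `t` now needs a register or a stack slot for longer, which is exactly the space cost mentioned above.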
So far I've seen the following places in the JIT where this kind of tuning is important:

So we can (very roughly) consider each of the parameters described above as a knob we can use to "tune" the JIT: by changing the threshold used to make a decision, we alter the JIT's behavior.
And obviously there should be an optimal tuning point...

But things are not so simple.
First of all, it can happen that one knob influences another. For instance, more inlining can increase the pressure on the regalloc, which means that the regalloc knob will react differently. Then, not all CPU architectures react to the tuning in the same way. One of my latest posts to mono-devel-list is an example of both things: I changed the way in which the spill cost takes into account the BB nesting level from "(1 + (bb->nesting * 2))" to "(1 << (bb->nesting << 1))", and got a 25% speedup in Scimark2. But I cried victory too soon: I also made it 2% worse on x86... a less aggressive setting like "(1 * (bb->nesting << 1))" is more of a compromise (only 10% faster on amd64, and no worse on x86), but I still have to look at the generated code more closely and find the optimal point (or maybe decide that there is no architecture independent optimal point).
Now, this can be reasonable, because on amd64 the JIT uses SSE instructions for floating point operations, where the regalloc works "properly", while on x86 the FPU is used in the traditional "stack based" way.
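To see how differently the three expressions quoted above weigh nested blocks, here is a small sketch that evaluates them for a few nesting levels. The expressions are transcribed from C into Python (for small non-negative integers the operators behave the same); the names `WEIGHTS` and `weight_table` are only for this illustration.

```python
# The three spill-cost weights quoted above, keyed by their C spelling.
WEIGHTS = {
    "(1 + (bb->nesting * 2))": lambda nesting: 1 + (nesting * 2),
    "(1 << (bb->nesting << 1))": lambda nesting: 1 << (nesting << 1),
    "(1 * (bb->nesting << 1))": lambda nesting: 1 * (nesting << 1),
}


def weight_table(max_nesting=3):
    """Map each expression to its weights for nesting levels 0..max_nesting."""
    return {
        expr: [fn(n) for n in range(max_nesting + 1)]
        for expr, fn in WEIGHTS.items()
    }


if __name__ == "__main__":
    for expr, weights in weight_table().items():
        print(f"{expr:28} {weights}")
```

For nesting levels 0 to 3 the first expression gives 1, 3, 5, 7, the second gives 1, 4, 16, 64, and the third, as written, gives 0, 2, 4, 6. The shift-based weight grows by a factor of four per level, so a spill inside a deeply nested loop looks far more expensive to the allocator than it does with the linear weights.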

The point of all the above is that exploring this "tuning space" can be a long (and sometimes tedious) process, often driven more by "sense of smell" than anything else.
But "sense of smell" is not scientific, and the "boring" side of the equation makes it hard to gather results in a rigorous way. So I wrote a small script to automate the process, which can be found here together with some benchmarks.

Mind you, it is rough, my perl-fu is poor, some paths are hard coded, real men use sed to modify files in place instead of stupid perl loops... but it gets the job done. It is meant to be used directly from the "mini" directory in the JIT, and produces lots of nice result files that can then be examined quickly.
And all the boring part, edit source file, save, build, execute benchmarks n times, save results with a meaningful name, repeat, do the same on a different machine... well, it's gone, freeing my time for something better!

So, if you are interested in JIT tuning, and have a machine with some spare cycles to run benchmarks, please take a look at that script. It is fairly easy to describe your own set of benchmark runs in an array of perl hashes, and it will modify the JIT source, rebuild and run all the benchmarks for you in sequence.
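For readers who want the shape of that loop without reading the script, here is a rough sketch in Python. It is not a port of the Perl script: the run descriptions, function names, and result file layout are all assumptions made for illustration, and the source path, build command, and benchmark command are left for the caller to supply.

```python
import subprocess
import time
from pathlib import Path

# Expression currently in the source, and the variants to try (hypothetical list).
BASELINE = "(1 + (bb->nesting * 2))"
RUNS = [
    {"name": "baseline", "replacement": BASELINE},
    {"name": "shift", "replacement": "(1 << (bb->nesting << 1))"},
]


def patch_source(text, baseline, replacement):
    """Return text with the baseline expression swapped for the replacement."""
    if baseline not in text:
        raise ValueError("baseline expression not found in source")
    return text.replace(baseline, replacement)


def time_command(command, repeats):
    """Run command `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def tune(source_path, build_command, bench_command, runs, repeats=3, out_dir="results"):
    """For each run: patch the source, rebuild, benchmark, save the timings."""
    source = Path(source_path)
    original = source.read_text()
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    try:
        for run in runs:
            source.write_text(patch_source(original, BASELINE, run["replacement"]))
            subprocess.run(build_command, check=True)
            timings = time_command(bench_command, repeats)
            lines = "\n".join(f"{t:.3f}" for t in timings)
            (out / f"{run['name']}.txt").write_text(lines + "\n")
    finally:
        source.write_text(original)  # always restore the unmodified source
```

A call would look like `tune(path_to_source, ["make"], benchmark_command, RUNS)`, where the path and the benchmark command depend on your setup. Restoring the original file in the `finally` clause means a failed build does not leave the tree in a patched state.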

And if you have interesting results, let me know!

In the next weeks I'll do the following:

Of course, if nothing more important shows up!

This is a personal web page. Things said here do not represent the position of my employer.