Certain optimizations are always a win, so, if they don't take too much JIT time, they should always be turned on. Examples are constant folding, using intrinsic functions, reorganizing branches, and dead code elimination (again, if the JIT time is small, which for now in Mono means without using SSA).
Other optimizations, however, are more problematic... sometimes they help, other times they don't optimize at all and actually produce worse code. In this case, some tuning is needed: the JIT must decide where the optimization is useful, and where it is useless or even harmful.
A typical example of this in Mono is SSAPRE, which looks good on paper
(it eliminates redundancies), but actually makes a tradeoff between the
cost of space (to store precomputed values) and computations, and this
tradeoff is not always obvious.
So far I've seen the following places in the JIT where this kind of
tuning is important:
So we can (very roughly) consider each of the parameters described above
as a knob we can use to "tune" the JIT: changing the threshold used to
make a decision alters the JIT's behavior.
And obviously there should be an optimal tuning point...
But things are not so simple.
First of all, it can happen that one knob influences another.
For instance, more inlining can increase the pressure on the regalloc,
which means that the regalloc knob will react differently.
Then, not all CPU architectures react to the tuning in the same way.
One of my latest posts to mono-devel-list is an example of both things:
I changed the way in which the spill cost takes into account the BB
nesting level from "(1 + (bb->nesting * 2))" to
"(1 << (bb->nesting << 1))", and got a 25% speedup in Scimark2 on amd64.
But I cried victory too soon: I also made it 2% worse on x86...
A less aggressive setting like "(1 * (bb->nesting << 1))" is
more of a compromise (only 10% more on amd64, and not worse on x86), but I
still have to look at the generated code better and find the optimal
point (or maybe decide that there is not an architecture independent
optimal point).
Now, this can be reasonable, because on amd64 the JIT uses SSE instructions
for floating point operations, where the regalloc works "properly", while
on x86 the FPU is used in the traditional "stack based" way.
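To see how differently the three weightings quoted above grow with loop nesting, here is a minimal C sketch. Only the three expressions come from this post; the function names and the plain int parameter (standing in for bb->nesting) are made up for illustration, and this is not the actual regalloc code.

```c
#include <assert.h>

/* Hypothetical helpers: "nesting" stands in for bb->nesting. */

/* Original weighting, (1 + (bb->nesting * 2)): grows linearly. */
int spill_weight_linear(int nesting)
{
    assert(nesting >= 0);
    return 1 + (nesting * 2);
}

/* Aggressive weighting, (1 << (bb->nesting << 1)): equals 4^nesting. */
int spill_weight_exponential(int nesting)
{
    /* Keep the shift small enough to fit in a 32-bit int. */
    assert(nesting >= 0 && nesting < 15);
    return 1 << (nesting << 1);
}

/* Compromise weighting exactly as quoted, (1 * (bb->nesting << 1)). */
int spill_weight_compromise(int nesting)
{
    assert(nesting >= 0);
    return 1 * (nesting << 1);
}
```

For nesting levels 0 to 3 the linear form gives 1, 3, 5, 7, the shifted form gives 1, 4, 16, 64, and the compromise as quoted gives 0, 2, 4, 6 (so, as written, it assigns no weight at all outside loops).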
The point of all the above is that exploring this "tuning space" can be a
long (and sometimes tedious) process, often driven more by "sense of smell"
than anything else.
But "sense of smell" is not scientific, and the "boring" side of the
equation makes it hard to gather results in a rigorous way.
So I wrote a small script to automate the process, which can be found
here
together with some benchmarks.
Mind you, it is rough, my perl-fu is poor, some paths are hard coded, real
men use sed to modify files in place instead of stupid perl
loops... but it gets the job done. It is meant to be used directly from the
"mini" directory in the JIT, and produces lots of nice result files that
can then be examined quickly.
And all the boring part (edit source file, save, build, execute
benchmarks n times, save results with a meaningful name, repeat,
do the same on a different machine)... well, it's gone, freeing my time
for something better!
So, if you are interested in JIT tuning, and have a machine with some spare cycles to run benchmarks, please take a look at that script. It is fairly easy to describe your own set of benchmark runs in an array of perl hashes, and it will modify the JIT source, rebuild and run all the benchmarks for you in sequence.
And if you have interesting results, let me know!
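The script itself is perl, but the idea of describing runs as data can be sketched in C as a table of experiments plus a loop over it. Everything here is hypothetical (the struct, its fields, the labels, the file naming scheme), and it only prints the plan instead of patching, building or running anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One experiment: hypothetical fields, playing the role of one hash. */
struct tuning_run {
    const char *label;       /* goes into the result file names */
    const char *weight_expr; /* expression to put into the JIT source */
    int repetitions;         /* how many times each benchmark is executed */
};

const struct tuning_run runs[] = {
    { "linear",      "(1 + (bb->nesting * 2))",   5 },
    { "exponential", "(1 << (bb->nesting << 1))", 5 },
    { "compromise",  "(1 * (bb->nesting << 1))",  5 },
};
#define N_RUNS (sizeof runs / sizeof runs[0])

/* Writes e.g. "scimark2-linear-run3.txt" into buf; returns snprintf's result. */
int result_file_name(char *buf, size_t size, const char *benchmark,
                     const struct tuning_run *run, int iteration)
{
    assert(buf != NULL && benchmark != NULL && run != NULL);
    return snprintf(buf, size, "%s-%s-run%d.txt",
                    benchmark, run->label, iteration);
}

/* Prints what a driver would do, in order, without doing any of it. */
void print_plan(FILE *out, const char *const *benchmarks, size_t n_benchmarks)
{
    char name[256];
    for (size_t i = 0; i < N_RUNS; i++) {
        fprintf(out, "set spill weight to %s, rebuild\n", runs[i].weight_expr);
        for (size_t b = 0; b < n_benchmarks; b++) {
            for (int k = 1; k <= runs[i].repetitions; k++) {
                result_file_name(name, sizeof name, benchmarks[b], &runs[i], k);
                fprintf(out, "  run %s, save as %s\n", benchmarks[b], name);
            }
        }
    }
}
```

Calling print_plan(stdout, names, count) with a list of benchmark names prints one rebuild line per table entry, followed by the repeated runs and the file each result would be saved to. Adding an experiment is then just one more row in the table.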
In the next weeks I'll do the following:
Of course, if nothing more important shows up!