Lately I've been mostly working on the "tree mover", which is
described in this post by Miguel.
As you can see, it is not an optimization by itself; it is a code transformation
necessary to help our current register allocator, and in fact the plan is not
to use it alone, but to help inlining. In practice, to get the full benefits,
five more optimizations must be enabled on top of the current
default in the Mono JIT:
As you can see, all these passes are meant to work together in a "pipeline".
After a lot of hammering on them, I have seen concrete results on a real
and complex benchmark, XMLMark.
Working on this benchmark, I first modified the ini files to be able to
see exactly what the JIT was affecting: I created five ini files, one for each
XMLMark functionality (in each file one functionality gets weight 100, and all
the others stay at 0):
Then, I tested with a lot of option combinations:
| Label | Options |
|---|---|
| ALL | all |
| ALL-NO-SSA | all,-ssa,-ssapre,-abcrem |
| FAST | consprop,copyprop,treeprop,deadce,inline |
| NO-DEAD | consprop,copyprop,treeprop,-deadce,inline |
| NO-INLINE-DEAD | consprop,copyprop,treeprop,-deadce,-inline |
| NO-INLINE | consprop,copyprop,treeprop,deadce,-inline |
| NO-INLINE-TREE | consprop,copyprop,-treeprop,deadce,-inline |
| NO-PROP | -consprop,-copyprop,-treeprop,deadce,inline |
| NO-PROP-INLINE | -consprop,-copyprop,-treeprop,deadce,-inline |
| NO-TREE-DEAD | consprop,copyprop,-treeprop,-deadce,inline |
| NO-TREE | consprop,copyprop,-treeprop,deadce,inline |
| NOTHING | -consprop,-copyprop,-treeprop,-deadce,-inline |
Of these combinations, "NOTHING" is the current JIT default, "FAST" is the full set I'm testing (which ideally will become the default), "ALL-NO-SSA" means "everything the JIT can do without using SSA", and each "NO-something" set is the "FAST" set without something, to see whether all the options are really necessary to get the best results.
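To try one of these sets, the option string from the table can be handed to the runtime through its `-O=` (`--optimize=`) flag. Here is a minimal sketch in Python that only builds the command lines. The executable name `XMLMark.exe` is a hypothetical placeholder, not taken from the benchmark's actual setup, and the ini file argument is left out because its exact form is not shown here.

```python
# Option sets copied from the table above (label -> JIT option string).
OPTION_SETS = {
    "ALL": "all",
    "ALL-NO-SSA": "all,-ssa,-ssapre,-abcrem",
    "FAST": "consprop,copyprop,treeprop,deadce,inline",
    "NO-DEAD": "consprop,copyprop,treeprop,-deadce,inline",
    "NO-INLINE-DEAD": "consprop,copyprop,treeprop,-deadce,-inline",
    "NO-INLINE": "consprop,copyprop,treeprop,deadce,-inline",
    "NO-INLINE-TREE": "consprop,copyprop,-treeprop,deadce,-inline",
    "NO-PROP": "-consprop,-copyprop,-treeprop,deadce,inline",
    "NO-PROP-INLINE": "-consprop,-copyprop,-treeprop,deadce,-inline",
    "NO-TREE-DEAD": "consprop,copyprop,-treeprop,-deadce,inline",
    "NO-TREE": "consprop,copyprop,-treeprop,deadce,inline",
    "NOTHING": "-consprop,-copyprop,-treeprop,-deadce,-inline",
}


def mono_command(options, program="XMLMark.exe", mono="mono"):
    """Build the argument list for one run.

    `program` is a placeholder name (an assumption), not the real benchmark path.
    """
    return [mono, "-O=" + options, program]


if __name__ == "__main__":
    # Print one command line per labelled option set.
    for label, options in OPTION_SETS.items():
        print(label.ljust(16), " ".join(mono_command(options)))
```

Returning an argument list instead of a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.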
These are the numbers I got:
| XMLMark | DOM-SEL | DOM-MOD | DOM-STR | DOM-SER | SAX-SEL |
|---|---|---|---|---|---|
| ALL | 53.40 | 50.20 | 50.95 | 54.60 | 15.40 |
| ALL-NO-SSA | 53.40 | 50.50 | 51.10 | 55.75 | 15.25 |
| FAST | 53.75 | 50.85 | 51.45 | 56.00 | 15.95 |
| NO-DEAD | 52.50 | 49.85 | 50.05 | 54.10 | 13.15 |
| NO-INLINE-DEAD | 47.85 | 45.45 | 45.10 | 49.20 | 13.10 |
| NO-INLINE | 50.15 | 47.05 | 46.25 | 51.20 | 15.45 |
| NO-INLINE-TREE | 50.85 | 47.55 | 47.85 | 52.15 | 15.35 |
| NO-PROP | 53.05 | 50.55 | 50.45 | 54.70 | 15.45 |
| NO-PROP-INLINE | 49.60 | 46.90 | 47.10 | 50.95 | 15.25 |
| NO-TREE-DEAD | 51.90 | 49.20 | 49.90 | 53.90 | 15.75 |
| NO-TREE | 52.70 | 49.90 | 50.25 | 54.90 | 15.75 |
| NOTHING | 51.05 | 47.70 | 47.85 | 51.75 | 15.00 |
Here are the same data in graphical form:
To make some sense of it, here's a chart of the gains against the "NOTHING" options set, as percentages:
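The percentages follow directly from the table, so they can be recomputed with a few lines of code. This is a minimal sketch in Python, assuming that higher XMLMark scores are better. The function and variable names are illustrative, and only a subset of the rows is included for brevity.

```python
# Scores copied from the results table (subset of rows).
COLUMNS = ["DOM-SEL", "DOM-MOD", "DOM-STR", "DOM-SER", "SAX-SEL"]
SCORES = {
    "FAST":      [53.75, 50.85, 51.45, 56.00, 15.95],
    "NO-DEAD":   [52.50, 49.85, 50.05, 54.10, 13.15],
    "NO-INLINE": [50.15, 47.05, 46.25, 51.20, 15.45],
    "NOTHING":   [51.05, 47.70, 47.85, 51.75, 15.00],
}


def gains_vs_baseline(scores, baseline="NOTHING"):
    """Return the percentage gain of every option set over the baseline row."""
    base = scores[baseline]
    return {
        label: [round((value - ref) / ref * 100.0, 2) for value, ref in zip(row, base)]
        for label, row in scores.items()
        if label != baseline
    }


GAINS = gains_vs_baseline(SCORES)

if __name__ == "__main__":
    # One line per option set, one percentage per XMLMark functionality.
    print(" " * 10 + "".join(name.rjust(10) for name in COLUMNS))
    for label, row in GAINS.items():
        print(label.ljust(10) + "".join(f"{gain:+10.2f}" for gain in row))
```

A negative value means the option set scored below the "NOTHING" baseline for that functionality.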
Some easy observations:
The (small) downside of these gains is that they cost about 25% more JIT time (this is the cumulative overhead of the "FAST" set). This is clearly visible in programs which exercise a lot of code but run very quickly, like an mcs bootstrap: in this case the "FAST" set loses 1.33% (it takes 3.86 seconds instead of 3.81). However, using a custom mcs driver that performs a second compilation (a sort of "hot run", because all the code is already jitted), and making sure that the "cold run" also had all the source files in the system buffer cache, shows that the "FAST" set gains 1.85% against the current default. This means that longer compilations should not suffer at all from the JIT overhead.
Finally, these results are on x86. On amd64 I noticed the need for further tuning in the linear regalloc spill costs: the gains are still there, but in some cases they are smaller. Other benchmarks also need some more tuning.
Anyway, this code should hit svn "real soon now", and be part of Mono 1.2 (so that nobody should worry about the overhead of wrapping fields with properties anymore!).
Next in my todo list is getting rid of the SSAPRE issues we have...