Massimiliano Mantione's Blog

Email

massi@ximian.com

03 Jan 2008 (Permalink)

Comparing profiling APIs

Implementation troubles

Lately I've had some trouble implementing heap-shot-like functionality in the new profiler. I wanted to have it working before committing the code, but as you'll see this will take some time, so I will commit it earlier...

The reasons for the port are, essentially, performance (heap-shot uses the Mono runtime API to walk every field of every object, and takes and releases a lock at every allocation) and having a 64-bit port of heap-shot.

Integrating it in the new profiler would allow me to reuse all the event logging and file writing machinery that is already in place there, which is fairly efficient (or at least is supposed to be so!). However, all this machinery had some problems when applied to heap profiling. The logging profiler maintains a global hash table of mappings from MonoClass* values to numeric unique IDs. The table is used so that the external file reader does not depend on raw pointer values, and also because writing IDs takes less space than writing large values. This table, however, is not kept up to date all the time (which would require taking a lock at every profiler event). Instead, each time it is necessary to flush the events to disk, the event buffers are scanned and the table is completed (and flushed!) just before writing the event block.
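To make the "complete the table at flush time" idea concrete, here is a minimal sketch in C. It is not the profiler's actual code: ExampleClass stands in for MonoClass, and ClassIdTable, class_id_lookup_or_add and flush_events are hypothetical names invented for this illustration. The point it tries to show is that events carry raw pointers while they are recorded, and IDs are assigned only when a buffer is flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MonoClass: only its address matters here. */
typedef struct ExampleClass { const char *name; } ExampleClass;

#define TABLE_SIZE 64 /* power of two, fixed size to keep the sketch short */

/* Hypothetical mapping table: class pointer -> small numeric ID. */
typedef struct {
    const ExampleClass *keys[TABLE_SIZE];
    uint32_t ids[TABLE_SIZE];
    uint32_t next_id; /* also the number of entries stored so far */
} ClassIdTable;

/* Return the ID of klass, assigning a fresh one the first time it is seen.
 * *is_new is set to 1 when the mapping was created by this call. */
static uint32_t class_id_lookup_or_add(ClassIdTable *t,
                                       const ExampleClass *klass,
                                       int *is_new)
{
    size_t slot = ((uintptr_t)klass >> 3) & (TABLE_SIZE - 1);
    assert(klass != NULL);
    /* Linear probing; one slot always stays empty, so this terminates. */
    while (t->keys[slot] != NULL && t->keys[slot] != klass)
        slot = (slot + 1) & (TABLE_SIZE - 1);
    *is_new = (t->keys[slot] == NULL);
    if (*is_new) {
        assert(t->next_id + 1 < TABLE_SIZE); /* the sketch never grows */
        t->keys[slot] = klass;
        t->ids[slot] = ++t->next_id;
    }
    return t->ids[slot];
}

/* Flush one event buffer: scan the raw pointers recorded in the events,
 * complete the table, and produce IDs in place of pointers. Returns how
 * many mappings were created (a real writer would emit those first). */
static size_t flush_events(ClassIdTable *t,
                           const ExampleClass *const *events,
                           size_t count,
                           uint32_t *out_ids)
{
    size_t new_mappings = 0;
    for (size_t i = 0; i < count; i++) {
        int is_new;
        out_ids[i] = class_id_lookup_or_add(t, events[i], &is_new);
        new_mappings += (size_t)is_new;
    }
    return new_mappings;
}
```

In this sketch only flush_events touches the table, so the hot path that records events never needs the table lock; the cost is that the table is incomplete between flushes.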

This is perfect for the logging profiler, but causes lots of bad interactions when working inside the GC profiling event handlers. In this context, all the threads registered with the runtime are stopped. Therefore, if the GC hook tries to update the table, it has to take the profiler lock, but it cannot because one of the stopped threads could have taken it (and, being stopped, it's not going to release it until the collection stops...).

In the last week, trying to fix all the profiler bugs before committing, I've been dealing with various kinds of deadlocks, or troubles because the table was not complete. I managed to implement the MonoObject* buffers between the allocation and GC hooks in a completely lock-free way, and split the snapshot and file writing jobs so that the snapshot (which must happen inside the GC hook) does not strictly need the mapping table.
However, I overlooked a basic fact: I implemented the object scanning in the heap snapshot using per-class bitmaps that describe the places where references are in the object (for efficiency), and I decided to store those bitmaps in the class-ID mapping table (because it seemed reasonable: it is a table where the MonoClass* is the key, it seemed stupid keeping two such tables inside the profiler!).
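As an illustration of the per-class bitmap idea, here is a small sketch in C. It assumes a simplified object model where an instance is an array of pointer-sized slots and a class has at most 64 of them; ClassLayout, ReferenceVisitor, scan_object and remember_last are hypothetical names, not the profiler's real data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-class descriptor: bit i set means that the i-th
 * pointer-sized slot of an instance holds an object reference. */
typedef struct {
    uint64_t ref_bitmap;
    size_t slot_count; /* instance size in slots, at most 64 here */
} ClassLayout;

/* Called once for every non-null reference found in an object. */
typedef void (*ReferenceVisitor)(void *referenced, void *user_data);

/* Scan one object using its class bitmap instead of walking field
 * metadata. Returns the number of references that were visited. */
static size_t scan_object(const ClassLayout *layout,
                          void *const *obj_slots,
                          ReferenceVisitor visit,
                          void *user_data)
{
    size_t found = 0;
    assert(layout->slot_count <= 64);
    for (size_t i = 0; i < layout->slot_count; i++) {
        if ((layout->ref_bitmap >> i) & 1u) {
            void *ref = obj_slots[i];
            if (ref != NULL) { /* null references are skipped */
                visit(ref, user_data);
                found++;
            }
        }
    }
    return found;
}

/* Example visitor: remembers the last reference it was given. */
static void remember_last(void *referenced, void *user_data)
{
    *(void **)user_data = referenced;
}
```

The scan itself needs only the bitmap and the object, which is why it is tempting to store the bitmap next to the class ID; the catch is that looking the bitmap up then goes through the same table the GC hook is not supposed to touch.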

Of course, this conflicts with the assumption that the GC hooks cannot use that table... There are workarounds for this, but after putting so many workarounds in place, I started thinking that maybe the profiler could get some more help from the runtime. After all, it is this separation of the profiler from the runtime that is causing the issue, so maybe extending the profiling API could give us a more elegant solution.

In this context, it makes sense to see what the other two major VMs out there (the Sun JVM and the Microsoft .NET CLR) are doing. So (again, on Paolo's suggestion) I examined their profiling APIs.

The Microsoft CLR profiling API

Online resources for the MS .NET profiling API are here and here. But the real reference is the "Profiling.doc" file distributed with the .NET SDK.

Comparing this profiling API to the Mono one, the first thing we see is that they are very similar. In the bulk of the load-unload (and method enter-exit) events, there are just minor differences:

There is, however, a feature which would be really handy for me in implementing a proper heap (and CG) profiler: they have "runtime suspension" callbacks. In practice, a profiler is notified of the moment when the GC will suspend all the threads which are registered with the runtime: it is guaranteed that none of those threads will execute managed code after that callback. Of course, the "reverse" callback is also present (for when the suspend action is done).
This would be really useful for my heap-shot implementation: the suspend hook would make sure that no thread still has the profiler lock before the GC starts, and when the suspend is done that the mapping table is ready. This way the GC hook could safely use the table to access the bitmaps that describe where each object has reference values.
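A rough sketch of how such hooks could be used, in C with POSIX threads. Everything here is hypothetical: ExampleProfiler and the on_* functions are invented for the illustration and are not Mono or CLR API, the "table" is reduced to two counters, and the sketch assumes that the suspend and resume notifications arrive on the same thread (so that thread can hold the lock across the collection).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical profiler state, with the mapping table reduced to counters. */
typedef struct {
    pthread_mutex_t lock;  /* guards the table during normal operation */
    int pending_classes;   /* classes seen but not yet in the table */
    int mapped_classes;    /* classes already in the table */
    bool world_stopped;    /* true between the suspend and resume hooks */
} ExampleProfiler;

static void example_profiler_init(ExampleProfiler *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->pending_classes = 0;
    p->mapped_classes = 0;
    p->world_stopped = false;
}

/* Normal profiler event, called from any running thread. */
static void record_class_seen(ExampleProfiler *p)
{
    pthread_mutex_lock(&p->lock);
    p->pending_classes++;
    pthread_mutex_unlock(&p->lock);
}

/* Hypothetical "threads are about to be suspended" hook. Threads are
 * still running here, so blocking on the lock cannot deadlock. */
static void on_suspend_start(ExampleProfiler *p)
{
    pthread_mutex_lock(&p->lock); /* waits for any current holder */
    p->mapped_classes += p->pending_classes; /* complete the table */
    p->pending_classes = 0;
    p->world_stopped = true;
    /* The lock stays held, so no suspended thread can own it. */
}

/* GC hook: runs with the world stopped and never takes the lock. */
static int on_gc_event(const ExampleProfiler *p)
{
    assert(p->world_stopped);
    return p->mapped_classes; /* the table is complete and stable */
}

/* Hypothetical "threads have been resumed" hook. */
static void on_resume(ExampleProfiler *p)
{
    p->world_stopped = false;
    pthread_mutex_unlock(&p->lock);
}
```

A thread that tries to record an event after on_suspend_start simply blocks on the lock until on_resume, so it can be suspended while waiting without ever holding the lock itself.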

But the major difference related to heap profiling is the fact that the .NET CLR actually already implements a heap-shot equivalent. And, thinking about it, this really makes sense: in my implementation I am in fact duplicating some work that the GC already does, like:

And it's not over: they also report all the roots they find on the stacks (and registers), which our heap shot will have a very hard time doing, while the GC already does it anyway.

So, we should decide what to do:

The Java profiling API

The online reference for the JVM profiling API is here. However, there has been a transition to the "tooling" API, JVMTI, so I will only consider the latter.

Comparing the JVMTI profiling API to the Mono one, we see many more differences, because the whole approach is different.

First of all, there are no fast callbacks for method enter/exit and things like that. Instead, the profiler can instrument the application bytecode directly, and the profiler hooks can be implemented in managed code, to avoid the transition costs and possibly inline the callbacks if they are small enough.

Then, there is the (complex) concept of "JVMTI context", how it must be used, and especially how references must be managed. And the JVMTI is meant to implement several kinds of tools, including debuggers, so the API is naturally more comprehensive than the Mono profiler API.

For the purpose of this comparison, it suffices to say that the heap analysis functions are really complete, and also look relatively heavy, with many callbacks and the possibility to arbitrarily tag individual objects and classes and use the tags as filters for the callbacks.

The JVMTI reference explicitly states that they presumed that a "batch" approach (working with arrays of references) would be more efficient (in the sense of having better throughput), but tests proved that this is not the case, so they resorted to the more flexible callback approach.

In the end...

We should decide what to do in our profiler :-)

This is a personal web page. Things said here do not represent the position of my employer.