Alex Rønne Petersen profiler

This is the first in a series of posts I’ll be writing on the work we’ve been doing to improve the stability of Mono’s log profiler. All improvements detailed in these blog posts are included in Mono 4.6.0, featuring version 1.0 of the profiler. Refer to the release notes for the full list of changes and fixes.

The problem we’ll be looking at today is a crash that arose when running the profiler in sampling mode (i.e. mono --profile=log:sample foo.exe) together with the SGen garbage collector.

The Problem

SGen uses so-called managed allocators to perform very fast allocations from the nursery (generation 0). These managed allocators are generated by the Mono runtime and contain specialized code for allocating particular kinds of types (small objects, strings, arrays, etc.). One important invariant that managed allocators rely on is that a garbage collection absolutely cannot proceed if any suspended thread was executing a managed allocator at the time it was suspended. This allows the managed allocators to do their thing without taking the global GC lock, which is a huge performance win for multithreaded programs. Should this invariant be broken, however, the state of the managed heap is essentially undefined, and all sorts of bad things will happen when the GC proceeds to scan and sweep.

Unfortunately, the way that sampling works caused this invariant to be broken. When the profiler is running in sampling mode, it periodically sends out a signal (e.g. SIGPROF) whose signal handler will collect a managed stack trace for the target thread and write it as an event to the log file. There’s nothing actually wrong with this. However, the way that SGen checks whether a thread is currently executing a managed allocator is as follows (simplified a bit from the actual source code):

static gboolean
is_ip_in_managed_allocator (MonoDomain *domain, gpointer ip)
{
    /*
     * ip is the instruction pointer of the thread, as obtained by the STW
     * machinery when it temporarily suspends the thread.
     */

    MonoJitInfo *ji = mono_jit_info_table_find_internal (domain, ip, FALSE, FALSE);

    if (!ji)
        return FALSE;

    MonoMethod *m = mono_jit_info_get_method (ji);

    return sgen_is_managed_allocator (m);
}

To understand why this code is problematic, we must first take a look at how signal delivery works on POSIX systems. By default, a signal can be delivered while another signal is still being handled. For example, if your program has two separate signal handlers for SIGUSR1 and SIGUSR2, both of which simply do while (1); to spin forever, and you send SIGUSR1 followed by SIGUSR2 to your program, you’ll see a stack trace looking something like this:

(gdb) bt
#0  0x0000000000400564 in sigusr2_signal_handler ()
#1  <signal handler called>
#2  0x0000000000400553 in sigusr1_signal_handler ()
#3  <signal handler called>
#4  0x00000000004005d3 in main ()

Unsurprisingly, if we print the instruction pointer, we’ll find that it’s pointing into the most recently called signal handler:

(gdb) p $pc
$1 = (void (*)()) 0x400564 <sigusr2_signal_handler+15>
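
For reference, a minimal sketch of the program producing the traces above might look like this (installed with default flags, so each handler only blocks its own signal while it runs):

#include <signal.h>
#include <unistd.h>

static void
sigusr1_signal_handler (int signum)
{
    while (1);
}

static void
sigusr2_signal_handler (int signum)
{
    while (1);
}

int
main (void)
{
    signal (SIGUSR1, sigusr1_signal_handler);
    signal (SIGUSR2, sigusr2_signal_handler);

    /* Wait for signals forever. */
    while (1)
        pause ();
}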

This matters because SGen uses a signal of its own to suspend threads in its STW machinery. So that means we have two signals in play: The sampling signal used by the profiler, and the suspend signal used by SGen. Both signals can arrive at any point, including while the other is being handled. In an allocation-heavy program, we could very easily see a stack looking something like this:

#0 suspend_signal_handler ()
#1 <signal handler called>
#2 profiler_signal_handler ()
#3 <signal handler called>
#4 AllocSmall ()
...

(Here, AllocSmall is the managed allocator.)

Under normal (non-profiling) circumstances, it would look like this:

#0 suspend_signal_handler ()
#1 <signal handler called>
#2 AllocSmall ()
...

Mono installs signal handlers using the sigaction function with the SA_SIGINFO flag. This means that the signal handler will receive a bunch of information in its second and third arguments. One piece of that information is the instruction pointer of the thread before the signal handler was invoked. This is the instruction pointer that is passed to the is_ip_in_managed_allocator function. So when we’re profiling, we pass an instruction pointer that points into the middle of profiler_signal_handler, while in the normal case, we pass an instruction pointer that points into AllocSmall as expected.
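
Here's a hedged sketch of what that looks like on Linux/x86-64 (the handler and function names are made up for illustration; the register name and context layout are platform-specific, REG_RIP being the glibc name on x86-64):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

static void
example_handler (int signum, siginfo_t *info, void *context)
{
    ucontext_t *ctx = (ucontext_t *) context;

    /* The thread's instruction pointer from just before the signal
     * arrived; a value like this is what ends up being passed to
     * is_ip_in_managed_allocator (). */
    void *ip = (void *) (uintptr_t) ctx->uc_mcontext.gregs [REG_RIP];

    (void) ip;
}

static void
install_example_handler (int signum)
{
    struct sigaction sa;

    memset (&sa, 0, sizeof (sa));
    sa.sa_sigaction = example_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset (&sa.sa_mask);
    sigaction (signum, &sa, NULL);
}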

So now the problem is clear: In the normal case, SGen detects that the thread is executing a managed allocator, and therefore waits for it to finish executing before truly suspending the thread. But in the profiling case, SGen thinks that the thread is just executing any arbitrary, unimportant code that it can go right ahead and suspend, even though we know for a fact (from inspecting the stack in GDB) that it is actually in the middle of executing a managed allocator.

When this situation arises, SGen will go right ahead and perform a full garbage collection, even though it’ll see an inconsistent managed heap as a result of the managed allocator being in the middle of modifying the heap. Entire books could be written about the incorrect behavior that could result from this under different circumstances, but long story short, your program would crash in random ways.

Potential Solutions

At first glance, this didn’t seem like a hard problem to solve. My initial thought was to set up the signal handler for SIGPROF in such a way that the GC suspend signal would be blocked while the handler is executing. Thing is, this would’ve actually fixed the problem, but not in a general way. After all, who’s to say that some other signal couldn’t come along and trigger this problem? For example, a programmer might P/Invoke some native code that sets up a signal handler for some arbitrary purpose. The user would then want to send that signal to the Mono process and not have things break left and right. We can’t reasonably ask every Mono user to block the GC suspend signal in any signal handlers they might set up, directly or indirectly. Even if we could, it isn’t a good idea, as we really want the GC suspend signal to arrive in a timely fashion so that the STW process doesn’t take too long (resulting in long pause times).
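
For illustration, the rejected approach would have amounted to one extra line when installing the SIGPROF handler (GC_SUSPEND_SIGNAL is a stand-in name here; the actual suspend signal is a runtime-internal detail):

static void
install_profiler_handler (void)
{
    struct sigaction sa;

    memset (&sa, 0, sizeof (sa));
    sa.sa_sigaction = profiler_signal_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset (&sa.sa_mask);
    /* The rejected idea: hold back the GC suspend signal while the
     * profiler handler runs. */
    sigaddset (&sa.sa_mask, GC_SUSPEND_SIGNAL);
    sigaction (SIGPROF, &sa, NULL);
}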

OK, so that idea wasn’t really gonna work. Another idea that we considered was to unwind the stack of the suspended thread to see if any stack frame matches a managed allocator. This would solve the problem and would work for any kind of signal that might arrive before a GC suspend signal. Unfortunately, unwinding the stack of a suspended thread is not particularly easy, and it can’t be done in a portable way at all. Also, stack unwinding is not exactly a cheap affair, and the last thing we want is to make the STW machinery slower - nobody wants longer pause times.

Finally, there was the nuclear option: Disable managed allocators completely when running the profiler in sampling mode. Sure, it would make this problem go away, but as with the first idea, SGen would still be susceptible to this problem with other kinds of signals when running normally. More importantly, this would result in misleading profiles since the program would spend more time in the native SGen allocation functions than it would have in the managed allocators when not profiling.

None of the above are really viable solutions.

The Fix

Conveniently, SGen has this feature called critical regions. They’re not quite what you might think - they have nothing to do with mutexes. Let’s take a look at how SGen allocates a System.String in the native allocation functions:

MonoString *
mono_gc_alloc_string (MonoVTable *vtable, size_t size, gint32 len)
{
    TLAB_ACCESS_INIT;

    ENTER_CRITICAL_REGION;

    MonoString *str = sgen_try_alloc_obj_nolock (vtable, size);

    if (str)
        str->length = len;

    EXIT_CRITICAL_REGION;

    if (!str) {
        LOCK_GC;

        str = sgen_alloc_obj_nolock (vtable, size);

        if (str)
            str->length = len;

        UNLOCK_GC;
    }

    return str;
}

The actual allocation logic (in sgen_try_alloc_obj_nolock and sgen_alloc_obj_nolock) is unimportant. The important bits are TLAB_ACCESS_INIT, ENTER_CRITICAL_REGION, and EXIT_CRITICAL_REGION. These macros are defined as follows:

#define TLAB_ACCESS_INIT SgenThreadInfo *__thread_info__ = mono_native_tls_get_value (thread_info_key)
#define IN_CRITICAL_REGION (__thread_info__->client_info.in_critical_region)
#define ENTER_CRITICAL_REGION do { mono_atomic_store_acquire (&IN_CRITICAL_REGION, 1); } while (0)
#define EXIT_CRITICAL_REGION do { mono_atomic_store_release (&IN_CRITICAL_REGION, 0); } while (0)

As you can see, they simply set a flag on the current thread for the duration of the attempted allocation. If this flag is set, the STW machinery will refrain from suspending the thread, just as it does when the thread's instruction pointer falls within the code ranges of the managed allocators.
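
On the suspend side, the check then becomes a simple flag read; roughly (a sketch, using the field from the macros above):

static gboolean
is_thread_in_critical_region (SgenThreadInfo *info)
{
    /* Set by ENTER_CRITICAL_REGION, cleared by EXIT_CRITICAL_REGION;
     * if it's set, the STW code lets the thread run until it leaves
     * the region instead of suspending it on the spot. */
    return info->client_info.in_critical_region;
}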

So the fix to this whole problem is actually very simple: we set up a critical region in the managed allocator, just like the native SGen functions do. That is, we wrap all the code we emit in the managed allocator like so:

static MonoMethod *
create_allocator (int atype, ManagedAllocatorVariant variant)
{
    // ... snip ...

    MonoMethodBuilder *mb = mono_mb_new (mono_defaults.object_class, name, MONO_WRAPPER_ALLOC);
    int thread_var;

    // ... snip ...

    EMIT_TLS_ACCESS_VAR (mb, thread_var);

    EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
    mono_mb_emit_byte (mb, CEE_LDC_I4_1);
    mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
    mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
    mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_NONE);

    // ... snip: allocation logic ...

    EMIT_TLS_ACCESS_IN_CRITICAL_REGION_ADDR (mb, thread_var);
    mono_mb_emit_byte (mb, CEE_LDC_I4_0);
    mono_mb_emit_byte (mb, MONO_CUSTOM_PREFIX);
    mono_mb_emit_byte (mb, CEE_MONO_ATOMIC_STORE_I4);
    mono_mb_emit_i4 (mb, MONO_MEMORY_BARRIER_REL);

    // ... snip ...

    MonoMethod *m = mono_mb_create (mb, csig, 8, info);
    mono_mb_free (mb);

    return m;
}

With that done, SGen will correctly detect that a managed allocator is executing no matter how many signal handlers may be nested on the thread. Checking the in_critical_region variable also happens to be quite a bit cheaper than looking up JIT info for the managed allocators.

Performance Implications

I ran this small program before and after the changes:

using System;

class Program {
    public static object o;

    static void Main ()
    {
        for (var i = 0; i < 100000000; i++)
            o = new object ();
    }
}

Before:

real    0m0.625s
user    0m0.652s
sys     0m0.032s

After:

real    0m0.883s
user    0m0.948s
sys     0m0.012s

So we’re observing about a 40% slowdown from this change on a microbenchmark that does nothing but allocate. The slowdown comes from the managed allocators now having to do two atomic stores, one of which also carries with it a memory barrier with release semantics.

So on one hand, the slowdown is not great. But on the other hand, there is no point in being fast if Mono is crashy as a result. To put things in perspective, this program is still way slower without managed allocators (MONO_GC_DEBUG=no-managed-allocator):

real    0m7.678s
user    0m8.529s
sys     0m0.024s

Managed allocators are still around 770% faster. The 40% slowdown doesn’t seem like much when you consider these numbers, especially when it’s for the purpose of fixing a crash.

More detailed benchmark results from the Xamarin benchmarking infrastructure can be found here.

Conclusion

By setting up a critical region in SGen's managed allocator methods, we've fixed a number of random crashes that would occur when using sample profiling together with the SGen GC. A roughly 40% slowdown was observed on one microbenchmark; however, managed allocators are still around 770% faster than SGen's native allocation functions, so it's a small price to pay for a more reliable Mono.


In the next post, we’ll take a look at an issue that plagued profiler users on OS X: A crash when sending sampling signals to threads that hadn’t yet finished basic initialization in libc, leading to broken TLS access. This issue forced us to rewrite the sampling infrastructure in the runtime.


Jon Purdy performance

In Mono 4.4.0, we improved the performance of GC handles by changing to a lock-free implementation. In this post we’ll take a look at the original implementation and its limitations, and how the new implementation solved these problems.

Background

For those unfamiliar, the GCHandle API provides access to the low-level GC handle primitive used to implement several types of “handles” to managed objects:

  • Normal handles prevent an object from being collected, even if no managed references to the object exist.

  • Pinned handles prevent an object from being moved in memory.

  • Weak references reference an object without preventing it from being collected.
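
From native code, the same three kinds are exposed through Mono's embedding API; a quick sketch (obj is a MonoObject * you already hold):

#include <mono/metadata/object.h>

void
handle_kinds_demo (MonoObject *obj)
{
    guint32 normal = mono_gchandle_new (obj, FALSE);         /* keeps obj alive */
    guint32 pinned = mono_gchandle_new (obj, TRUE);          /* also prevents moving */
    guint32 weak   = mono_gchandle_new_weakref (obj, FALSE); /* doesn't keep obj alive */

    /* For a weak handle, the target becomes NULL once the object
     * has been collected. */
    MonoObject *target = mono_gchandle_get_target (weak);
    (void) target;

    mono_gchandle_free (normal);
    mono_gchandle_free (pinned);
    mono_gchandle_free (weak);
}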

In addition to programmers’ regular use of handles, Mono uses them in its implementation of the Monitor class, the basis of synchronization using Monitor.Enter or the C♯ lock statement. As such, it’s important that accesses to GC handles be as fast as possible.

Original Implementation

A GCHandle object consists of a type and an index, packed together into a 32-bit unsigned integer. To get the value of a handle, say with WeakReference.Target, we first look up the array of handle data corresponding to its type, then look up the value at the given index; in pseudocode:

(type, index) = unpack(handle)
value = handles[type][index]
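
In C, with a hypothetical bit layout (three low bits for the type, the rest for the index; the real layout is an internal detail), packing and unpacking would look like:

static guint32
pack_handle (guint32 type, guint32 index)
{
    return (index << 3) | type;
}

static void
unpack_handle (guint32 handle, guint32 *type, guint32 *index)
{
    *type  = handle & 0x7;
    *index = handle >> 3;
}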

The original implementation of GC handles was based on a bitmap allocator. For each handle type, it stored a bitmap indicating the available slots for allocating new handles, and an array of pointers to their target objects:

bitmap   = 11010…
pointers = [0xPPPPPPPP, 0xPPPPPPPP, NULL, 0xPPPPPPPP, NULL, …]

There’s an interesting constraint, though: when we unload an AppDomain, we want to be able to free all of the weak references that point to objects in that domain, because we know they’ll never be accessed again.

But if the weak reference has expired, we can’t tell what domain it came from, because we no longer have an object to look at! So for weak references, we kept a parallel array of domain pointers:

domains  = [0xDDDDDDDD, 0xDDDDDDDD, NULL, 0xDDDDDDDD, NULL, …]

Unfortunately, this implementation was wasteful in a few ways:

  • To synchronize access to the handle allocator, we would lock a mutex on every access to a GC handle—GCHandle.Alloc, GCHandle.Target, and so on. This was especially expensive on OS X, where pthread_mutex_lock can be very costly.
  • For correctness, the domain-pointer array always had to be allocated for weak references, spending memory even though it was unused most of the time.

  • For historical reasons related to Mono’s support of the Boehm GC, much of this information was duplicated in a separate hash table, wasting even more memory.

New Implementation

After removing the redundant hash table, the first step toward a new lock-free implementation was to use one array to store the information from the three arrays of the previous implementation. We did so using tagged pointers: because objects are aligned to multiples of 8 bytes, the lower 3 bits of any object reference are guaranteed to be zero—so we can store extra information in those bits.

We ended up with a single array of slots in the following bit format:

PPPPPPPP…0VX

Where PPPP… are pointer bits, V is the “valid” flag, and X is the “occupied” flag, packed together with bitwise OR:

slot = pointer | valid_bit | occupied_bit
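
In C, tagging and untagging might look like this (a sketch with assumed names; GCObject stands in for the managed object type):

#define SLOT_OCCUPIED ((gsize) 1 << 0)  /* the X bit */
#define SLOT_VALID    ((gsize) 1 << 1)  /* the V bit */

static gpointer
tag (GCObject *obj)
{
    return (gpointer) ((gsize) obj | SLOT_VALID | SLOT_OCCUPIED);
}

static GCObject *
untag (gpointer slot)
{
    /* Objects are 8-byte aligned, so the low 3 bits are ours. */
    return (GCObject *) ((gsize) slot & ~(gsize) 0x7);
}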

If the “occupied” flag is clear, the slot is free to be claimed by GCHandle.Alloc. To allocate a handle, we use a CAS (“compare and swap”, also known as Interlocked.CompareExchange) to replace a null slot with a tagged pointer, where the “occupied” and “valid” flags are set:

cas(slot, tag(pointer), NULL)

00000000…000
     ↓
PPPPPPPP…011

If the CAS succeeds, we now own a valid handle. If it fails, it means that another thread happened to be allocating a handle at the same time, so we just try the next free slot until we can successfully claim one. Unless you have many threads allocating many handles, allocating a handle will almost always succeed on the first try, without waiting to take a lock. And setting the target of a handle, with WeakReference.Target for example, works similarly.
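
A sketch of that claim loop in C, reusing the tag helper from above and assuming a CAS primitive with mono_atomic_cas_ptr (dest, new_value, comparand) semantics that returns the previous value:

static int
try_alloc_handle (volatile gpointer *slots, int capacity, GCObject *obj)
{
    gpointer tagged = tag (obj);
    int i;

    for (i = 0; i < capacity; ++i) {
        if (slots [i])
            continue; /* occupied, skip */
        if (mono_atomic_cas_ptr (&slots [i], tagged, NULL) == NULL)
            return i; /* we claimed this slot */
        /* Lost the race to another thread; keep scanning. */
    }

    return -1; /* no free slot; the caller must grow the table */
}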

As for AppDomain unloading, we can observe that we only need to store a domain pointer for expired weak references. If the reference hasn’t expired, then we have a valid object, and we can inspect it to find out which domain it came from.

Therefore, when a weak reference expires, all we have to do is clear the “valid” flag and replace the object pointer with a domain pointer:

PPPPPPPP…011
     ↓
DDDDDDDD…001
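
This transition happens while the collector is clearing dead weak references, with the world stopped, so a plain store is enough here; in sketch form, reusing the bit names from above:

static void
expire_weak_slot (volatile gpointer *slot, MonoDomain *domain)
{
    /* Keep the occupied bit, drop the valid bit, and stash the
     * domain pointer where the object pointer used to be. */
    *slot = (gpointer) ((gsize) domain | SLOT_OCCUPIED);
}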

So these are the possible states of a slot:

Occupied?   Valid?   Description
1           1        This slot is occupied and points to an object.
1           0        This slot is occupied, but its object is expired, so it points to an AppDomain.
0           0        This slot is free and null.

Now that we have our representation of slots, how do we grow the handle array when we run out of slots? Because the original implementation locked the handle array, it was safe to simply allocate a new array, copy the old contents into it, and store the pointer to the new array. But without a lock, this wouldn’t be thread-safe! For example:

  • Thread 1 sees that the handle array needs to grow.
  • Thread 1 allocates a new handle array and copies its current contents.
  • Thread 2 changes the Target property of some weak reference.
  • Thread 1 stores the pointer to the new handle array, discarding Thread 2’s change.

To solve this, instead of a single handle array, we use a handle table consisting of an array of buckets, each twice the size of the last:

[0] → xxxxxxxx
[1] → xxxxxxxxxxxxxxxx
[2] → xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] → NULL
…

Now we can grow the table in a thread-safe way:

  • Thread 1 sees that the handle table needs to grow.
  • Thread 1 optimistically allocates a new empty bucket.
  • Thread 2 changes the Target property of some weak reference in an existing bucket.
  • Thread 1 uses a compare-and-swap to store the new bucket, leaving Thread 2’s change intact.

If the CAS fails, then another thread has already allocated a new bucket, so we can just free the extra one we allocated, because it won’t contain any data yet. And this will only happen if the handle table is highly contended, which is rare.
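
Roughly, in C (a sketch with assumed names; mono_atomic_cas_ptr stands for a CAS returning the previous value, g_malloc0 for zeroed allocation, and buckets [i] holds 2^i times the base bucket size):

#define MAX_BUCKETS 32

static volatile gpointer buckets [MAX_BUCKETS];

static gboolean
try_install_bucket (int index, size_t base_size)
{
    size_t slots = base_size << index;
    gpointer fresh = g_malloc0 (slots * sizeof (gpointer));

    if (mono_atomic_cas_ptr (&buckets [index], fresh, NULL) != NULL) {
        /* Another thread published a bucket first. Ours was never
         * visible to anyone and is still empty, so freeing it is safe. */
        g_free (fresh);
        return FALSE;
    }

    return TRUE;
}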

Performance Comparisons

sgen-weakref-stress, in the Mono runtime test suite, is a microbenchmark that allocates weak references from many threads.

Before this change was implemented, these were the average timings over 5 runs:

real    0m2.441s
user    0m1.591s
sys     0m0.959s

After the change:

real    0m0.358s
user    0m0.406s
sys     0m0.063s

Cool! We got about an 85% improvement.

Let’s look at monitor-stress, which stress-tests Monitor operations using the C♯ lock statement. Before the change, average of 5 runs:

real    0m2.714s
user    0m6.963s
sys     0m0.244s

Now, with the change:

real    0m2.681s
user    0m6.783s
sys     0m0.242s

It looks like these measurements are within error bounds, so we can’t claim any more than a modest improvement of 1–2%. The numbers are similar for our macrobenchmarks of roslyn and fsharp. On the bright side, we haven’t introduced any regressions.

Conclusions

Converting the GC handle code to a lock-free implementation let us delete a sizable chunk of old code, save memory, and dramatically improve the performance of weak references by avoiding expensive locking.

This optimization didn’t improve the performance of Monitor much; in the future, we’ll talk about how another optimization, thin locks, gave us much greater improvements to locking performance.


Miguel de Icaza releases

At Microsoft Build today, we announced that we are re-releasing Mono under the MIT license and have contributed it to the .NET Foundation. This is major news for Mono developers and contributors, and I am incredibly excited about the opportunities that this will create for the Mono project, and for other projects that will be able to benefit from this.

Mono Runtime Released under MIT License

While Mono’s class libraries have always been available under the MIT license, the Mono runtime was dual-licensed. Most developers could run their apps on Windows, Linux or Mac OS X on the LGPL version of the runtime, but we also offered Mono’s runtime under commercial terms for scenarios where the LGPL was not suitable.

Moving the Mono runtime to the MIT license removes barriers to the adoption of C# and .NET in a large number of scenarios, including embedded applications and uses such as embedding Mono as a scripting engine in game engines or other applications.

Open Sourcing Proprietary Mono Extensions

Over the past 5 years, Xamarin has developed a number of proprietary extensions to Mono, including:

  • ARM64 port of the Mono runtime
  • Workarounds for bugs in some ARM chips
  • Use of Apple’s CommonCrypto to implement the crypto classes in the .NET API
  • Integration with X509 certificates on Apple platforms
  • Support for “Native Types” on Apple platforms
  • Generic Value Type Sharing
  • Offset tool to maintain the cross compiler

These have been integrated with the main Mono codebase, contributed along with Mono to the .NET Foundation, and are being released under the MIT license today.


Miguel de Icaza releases

Mono 4.2 is out in the stable channel

Check out our release notes for details about what is new in Mono 4.2.1.

This is the second Mono release that integrates large portions of code from Microsoft’s open sourced .NET code.

This release was made up of over 2,338 commits since Mono 4.0.0 and as usual, it is our best release so far. Enjoy!