Jon Purdy performance

In Mono 4.4.0, we improved the performance of GC handles by changing to a lock-free implementation. In this post we’ll take a look at the original implementation and its limitations, and how the new implementation solved these problems.

Background

For those unfamiliar, the GCHandle API provides access to the low-level GC handle primitive used to implement several types of “handles” to managed objects:

  • Normal handles prevent an object from being collected, even if no managed references to the object exist.

  • Pinned handles prevent an object from being moved in memory.

  • Weak references reference an object without preventing it from being collected.

In addition to programmers’ regular use of handles, Mono uses them in its implementation of the Monitor class, the basis of synchronization using Monitor.Enter or the C♯ lock statement. As such, it’s important that accesses to GC handles be as fast as possible.

Original Implementation

A GCHandle object consists of a type and an index, packed together into a 32-bit unsigned integer. To get the value of a handle, say with WeakReference.Target, we first look up the array of handle data corresponding to its type, then look up the value at the given index; in pseudocode:

(type, index) = unpack(handle)
value = handles[type][index]

The original implementation of GC handles was based on a bitmap allocator. For each handle type, it stored a bitmap indicating the available slots for allocating new handles, and an array of pointers to their target objects:

bitmap   = 11010…
pointers = [0xPPPPPPPP, 0xPPPPPPPP, NULL, 0xPPPPPPPP, NULL, …]

There’s an interesting constraint, though: when we unload an AppDomain, we want to be able to free all of the weak references that point to objects in that domain, because we know they’ll never be accessed again.

But if the weak reference has expired, we can’t tell what domain it came from, because we no longer have an object to look at! So for weak references, we kept a parallel array of domain pointers:

domains  = [0xDDDDDDDD, 0xDDDDDDDD, NULL, 0xDDDDDDDD, NULL, …]

Unfortunately, this implementation was wasteful in a few ways:

  • To synchronize access to the handle allocator, we would lock a mutex on every access to a GC handle—GCHandle.Alloc, GCHandle.Target, and so on. This was especially expensive on OS X, where pthread_mutex_lock can be very costly.
  • For correctness, the domain-pointer array always had to be allocated for weak references, spending memory even though it was unused most of the time.

  • For historical reasons related to Mono’s support of the Boehm GC, much of this information was duplicated in a separate hash table, wasting even more memory.

New Implementation

After removing the redundant hash table, the first step toward a new lock-free implementation was to use one array to store the information from the three arrays of the previous implementation. We did so using tagged pointers: because objects are aligned to multiples of 8 bytes, the lower 3 bits of any object reference are guaranteed to be zero—so we can store extra information in those bits.

We ended up with a single array of slots in the following bit format:

PPPPPPPP…0VX

Where PPPP… are pointer bits, V is the “valid” flag, and X is the “occupied” flag, packed together with bitwise OR:

slot = pointer | valid_bit | occupied_bit

If the “occupied” flag is clear, the slot is free to be claimed by GCHandle.Alloc. To allocate a handle, we use a CAS (“compare and swap”, also known as Interlocked.CompareExchange) to replace a null slot with a tagged pointer, where the “occupied” and “valid” flags are set:

cas(slot, tag(pointer), NULL)

00000000…000
     ↓
PPPPPPPP…011

If the CAS succeeds, we now own a valid handle. If it fails, it means that another thread happened to be allocating a handle at the same time, so we just try the next free slot until we can successfully claim one. Unless you have many threads allocating many handles, allocating a handle will almost always succeed on the first try, without waiting to take a lock. And setting the target of a handle, with WeakReference.Target for example, works similarly.

As for AppDomain unloading, we can observe that we only need to store a domain pointer for expired weak references. If the reference hasn’t expired, then we have a valid object, and we can inspect it to find out which domain it came from.

Therefore, when a weak reference expires, all we have to do is clear the “valid” flag and replace the object pointer with a domain pointer:

PPPPPPPP…011
     ↓
DDDDDDDD…001

So these are the possible states of a slot:

Occupied? Valid? Description
1 1 This slot is occupied and points to an object.
1 0 This slot is occupied, but its object is expired, so it points to an AppDomain.
0 0 This slot is free and null.

Now that we have our representation of slots, how do we grow the handle array when we run out of slots? Because the original implementation locked the handle array, it was safe to simply allocate a new array, copy the old contents into it, and store the pointer to the new array. But without a lock, this wouldn’t be thread-safe! For example:

  • Thread 1 sees that the handle array needs to grow.
  • Thread 1 allocates a new handle array and copies its current contents.
  • Thread 2 changes the Target property of some weak reference.
  • Thread 1 stores the pointer to the new handle array, discarding Thread 2’s change.

To solve this, instead of a single handle array, we use a handle table consisting of an array of buckets, each twice the size of the last:

[0] → xxxxxxxx
[1] → xxxxxxxxxxxxxxxx
[2] → xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[3] → NULL
…

Now we can grow the table in a thread-safe way:

  • Thread 1 sees that the handle table needs to grow.
  • Thread 1 optimistically allocates a new empty bucket.
  • Thread 2 changes the Target property of some weak reference in an existing bucket.
  • Thread 1 uses a compare-and-swap to store the new bucket, leaving Thread 2’s change intact.

If the CAS fails, then another thread has already allocated a new bucket, so we can just free the extra one we allocated, because it won’t contain any data yet. And this will only happen if the handle table is highly contended, which is rare.

Performance Comparisons

sgen-weakref-stress, in the Mono runtime test suite, is a microbenchmark that allocates weak references from many threads.

Before this change was implemented, these were the average timings over 5 runs:

real    0m2.441s
user    0m1.591s
sys     0m0.959s

After the change:

real    0m0.358s
user    0m0.406s
sys     0m0.063s

Cool! We got about an 80% improvement.

Let’s look at monitor-stress, which stress-tests Monitor operations using the C♯ lock statement. Before the change, average of 5 runs:

real    0m2.714s
user    0m6.963s
sys     0m0.244s

Now, with the change:

real    0m2.681s
user    0m6.783s
sys     0m0.242s

It looks like these measurements are within error bounds, so we can’t claim any more than a modest improvement of 1–2%. The numbers are similar for our macrobenchmarks of roslyn and fsharp. On the bright side, we haven’t introduced any regressions.

Conclusions

Converting the GC handle code to a lock-free implementation let us delete a sizable chunk of old code, save memory, and dramatically improve the performance of weak references by avoiding expensive locking.

This optimization didn’t improve the performance of Monitor much; in the future, we’ll talk about how another optimization, thin locks, gave us much greater improvements to locking performance.


Miguel de Icaza releases

At Microsoft Build today, we announced that we are re-releasing Mono under the MIT license and have contributed it to the .NET Foundation. These are major news for Mono developers and contributors, and I am incredibly excited about the opportunities that this will create for the Mono project, and for other projects that will be able to benefit from this.

Mono Runtime Released under MIT License

While Mono’s class libraries have always been available under the MIT license, the Mono runtime was dual-licensed. Most developers could run their apps on Windows, Linux or Mac OS X on the LGPL version of the runtime, but we also offered Mono’s runtime under commercial terms for scenarios where the LGPL was not suitable.

Moving the Mono runtime to the MIT license removes barriers to the adoption of C# and .NET in a large number of scenarios, embedded applications, including embedding Mono as a scripting engine in game engines or other applications.

Open Sourcing Proprietary Mono Extensions

Over the past 5 years, Xamarin has developed a number of proprietary extensions to Mono, including:

  • ARM64 port of the Mono runtime
  • Workarounds for bugs in some ARM chips
  • Use of Apple’s CommonCrypto to implement the crypto classes in the .NET API
  • Integration with X509 certificates on Apple platforms
  • Support for “Native Types” on Apple platforms
  • Generic Value Type Sharing
  • Offset tool to maintain the cross compiler

These have been integrated with the main Mono codebase, contributed along with Mono to the .NET Foundation, and are being released under the MIT license today.


Miguel de Icaza releases
Mono 4.2 is out in the [alpha channel](/download/alpha).

Mono 4.2 is out in the stable channel

Check out our release notes for details about what is new on Mono 4.2.1.

This is the second Mono release that integrates large portions of code from Microsoft’s open sourced .NET code.

This release was made up of over 2,338 commits since Mono 4.0.0 and as usual, it is our best release so far. Enjoy!



Miguel de Icaza releases

Mono 4.0 is out.

Check out our release notes for details about what is new on Mono 4.0.

This is the first Mono release that contains code from Microsoft’s open sourced .NET code. We are only getting started with this work. We are swiftly moving ahead in mono/master much more code that is being replaced and ported.

This version also is the first one to ship with C# 6.0 enabled by default. Learn all about C# 6.0 in only eight minutes on this presentation