Jo Shields linux, plans

Up to now, Linux packages on mono-project.com have come in two flavours – RPM built for CentOS 7 (and RHEL 7), and .deb built for Debian 7. These were universal packages, intended to work on the named distributions and anything newer.

Except that’s not entirely true.

Firstly, there have been “compatibility repositories” users need to add to deal with ABI changes in libtiff, libjpeg, and Apache since Debian 7. Then there are the packages for ARM64 and PPC64el – neither of those architectures is available in Debian 7, so they’re published in the Debian 7 repo but actually built on Debian 8.

A large part of the reason for this is a limitation in our package publishing pipeline – apt only allows one copy of a given version–architecture combination in the repository at once, so I can’t have, say, 4.8.0.520-0xamarin1 for AMD64 built on both Debian 7 and Ubuntu 16.04.

We’ve been working hard on a new package build/publish pipeline, based on Jenkins Pipeline, which can properly support multiple distributions. This new packaging system also resolves longstanding issues such as “can’t really build anything except Mono” and “Architecture: All packages still get built on Jo’s laptop, with no public build logs”.

Mono repo pipeline

So, here’s the old build matrix:

Distribution | Architectures
Debian 7 | ARM hard float, ARM soft float, ARM64 (actually Debian 8), AMD64, i386, PPC64el (actually Debian 8)
CentOS 7 | AMD64

And here’s the new one:

Distribution | Architectures
Debian 7 | ARM hard float (v7), ARM soft float, AMD64, i386
Debian 8 | ARM hard float (v7), ARM soft float, ARM64, AMD64, i386, PPC64el
Raspbian 8 | ARM hard float (v6)
Ubuntu 12.04 (*) | ARM hard float (v7), AMD64, i386
Ubuntu 14.04 | ARM hard float (v7), ARM64, AMD64, i386, PPC64el
Ubuntu 16.04 | ARM hard float (v7), ARM64, AMD64, i386, PPC64el
CentOS 6 | AMD64, i386
CentOS 7 | AMD64

The compatibility repositories will no longer be needed on recent Ubuntu or Debian – just use the right repository for your system. If your distribution isn’t listed… sorry, but we need to draw a line somewhere on support, and the distributions listed here are based on heavy analysis of our web server logs and bug reports.

You’ll want to change your package manager repositories to reflect your system more accurately once Mono 5.0 is published (Update: Mono 5.0 is now live; please check out the new instructions on the download page).

We’re debating some kind of automated handling of this, but I’m loath to touch users’ sources.list without their knowledge. CentOS builds are going to be late – I’ve been doing all my prototyping against the Debian builds, as I have better command of the tooling. Hopefully no worse than a week or two.

(*) This is mainly for Travis CI support, which still uses Ubuntu 12.04 for now (despite it being EOL).



Miguel de Icaza runtime

While Mono has had support for SIMD instructions in the form of the Mono.SIMD API, that API was limited to x86 platforms.

.NET introduced the System.Numerics.Vectors API, which sports a more general design that adapts to the SIMD registers available on different platforms.

The master branch of Mono now treats the various Vector operations as runtime intrinsics, so they are hardware accelerated. They are supported both by the Mono JIT compiler on x86/x64 platforms and via LLVM’s optimizing compiler on x86-64 and every other Mono/LLVM supported platform.

We would love to see you try it and share your experience with us.


Miguel de Icaza plans

Since the first release of the .NET open source code, the Mono project has been integrating the published Reference Source code into Mono.

We have been using the Reference Source code instead of CoreFX, as Mono implements a larger surface area than what is exposed there – Mono implements the .NET Desktop API.

Integrating the code is sometimes easy: we just replace the code in Mono with the reference source code. When the code is not exactly portable, we need to make it portable, either writing the missing pieces or integrating work we had already done in Mono with the code that existed in .NET. There are some cases where porting the code is just too complicated, and we have not been able to do the work.

We keep track of the major items in Trello.

Originally, we forked the reference source code and kept a branch with our code, but it was too large an external dependency, and it was also a module that was quite static. So recently, we started copying the code that we needed directly into Mono.

While this has worked fine for a while, the .NET Reference Source is only updated when a major .NET release takes place, and it tracks the version of .NET that ships along with Windows. This means that we are missing out on many of the great optimizations and improvements that are happening as part of .NET Core.

The optimizations and fine-tuning typically take place in two places: work that goes into mscorlib.dll is maintained in the coreclr module, while the higher-level frameworks are maintained in the corefx module.

A New Approach

Many of the APIs that were originally removed from CoreFX are now being added back, so we can start to consider switching away from referencesource and into CoreFX.

After discussing with the .NET Core team, we came up with a better approach for the long-term maintenance of the shared code.

The .NET team has now set up a new repository, the corert module, where the cleaned-up and optimized version of mscorlib.dll will live.

What we will do is submodule both CoreRT and CoreFX and replace our manually copied code from the Reference Source with code from CoreRT and CoreFX.

The twist is that for scenarios where Mono’s API surface is larger, we will contribute changes back to CoreRT/CoreFX where we either add support for the larger API surface, or we make the API pluggable (likely with a tasteful use of the partial modifier).

One open issue is that Mono has historically used a single set of framework libraries (like mscorlib.dll, System.dll, etc.) that work across Linux, MacOS, Unix and Windows; they dynamically detect how to behave based on the platform. This is useful in scenarios where you want to bootstrap work on one platform by using another, as the framework libraries are identical.

CoreFX takes a different approach: the libraries are tied to a particular platform flavor.

This means that some of the work we will have to do will involve either adjusting the CoreFX code to work the way Mono works, or giving up on our tradition of having the same assemblies work across all platforms.


Rodrigo Kumpera and Bernhard Urban runtime

When someone says multi-core, we unconsciously think SMP. That worked out well for us until recently, when ARM announced big.LITTLE. ARM’s big.LITTLE architecture is the first mass-produced AMP (asymmetric multiprocessing) architecture, and as we’ll see next, it raises the bar for how hard multi-core programming is.

A tale of an impossible bug

It all started with a bug report against a phone with such a CPU, the Exynos chipset used on Samsung phones in Europe. Apps created with our software were dying with SIGILL at completely random places. Nothing could reasonably explain what was happening, and the crashes were on perfectly valid instructions. This immediately made us suspect bad instruction cache flushing.

After reviewing all the JIT code around cache flushing, we were sure that we were calling __clear_cache properly. That led us to look at how other virtual machines and compilers do cache flushing on ARM64, and we found out about some related errata on the Cortex A53. ARM’s description of those issues is both cryptic and vague, but we tried the workaround anyway. No luck there.

Next we went with the other usual suspects. A lying signal handler? Nope. Funky userspace CPU emulation? No. Broken libc implementation? Nice try. Faulty hardware? We reproduced it on multiple devices. Bad luck or karma? Yes!

Some of us could not sleep with such an amazing puzzle in front of us and kept staring at memory dumps around failure sites. And there was this funny thing: the fault address was always on the third or fourth line of the memory dumps.

hexdump

This was our only clue, and there are no coincidences when it comes to this sort of byzantine bug. Our memory dumps showed 16 bytes per line, and the faulting address always ended somewhere in 0x40–0x7f or 0xc0–0xff. We aligned the memory dumps to help verify whether the code allocator was doing something funky:

$ grep SIGILL *.log
custom_01.log:E/mono           (13964): SIGILL at ip=0x0000007f4f15e8d0
custom_02.log:E/mono           (13088): SIGILL at ip=0x0000007f8ff76cc0
custom_03.log:E/mono           (12824): SIGILL at ip=0x0000007f68e93c70
custom_04.log:E/mono           (12876): SIGILL at ip=0x0000007f4b3d55f0
custom_05.log:E/mono           (13008): SIGILL at ip=0x0000007f8df1e8d0
custom_06.log:E/mono           (14093): SIGILL at ip=0x0000007f6c21edf0
[...]

With that we came to our first good hypothesis: Bad cache flushing was happening only on the upper 64 bytes of every 128-byte block. Those numbers, if you deal with low level programming, immediately remind you of cache line sizes. And that is where it all started to make sense.
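To make the pattern concrete, here is a small sketch (the addresses are simply the ones from the grep output above, nothing more) that masks each faulting ip with 0x7f to get its offset within its 128-byte block – every sample lands in the upper half, 0x40–0x7f:

#include <stdio.h>
#include <inttypes.h>
#include <stdint.h>

int main (void)
{
    /* Faulting instruction pointers copied from the SIGILL log lines above. */
    uint64_t ips[] = {
        0x0000007f4f15e8d0, 0x0000007f8ff76cc0, 0x0000007f68e93c70,
        0x0000007f4b3d55f0, 0x0000007f8df1e8d0, 0x0000007f6c21edf0,
    };

    for (size_t i = 0; i < sizeof ips / sizeof ips[0]; i++)
        /* ip & 0x7f is the offset within the enclosing 128-byte block. */
        printf ("ip=0x%016" PRIx64 " offset=0x%02" PRIx64 "\n",
                ips[i], ips[i] & 0x7f);

    return 0;
}

Running it prints offsets 0x50, 0x40, 0x70, 0x70, 0x50 and 0x70 – all within 0x40–0x7f.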

Here is a pseudo version of how libgcc does cache flushing on arm64:

void __clear_cache (char *address, size_t size)
{
    /* Ask the CPU for its cache line size once, and remember it for
       all later calls. */
    static int cache_line_size = 0;
    if (!cache_line_size)
        cache_line_size = get_current_cpu_cache_line_size ();

    /* Flush one cache line at a time, stepping by the remembered size. */
    for (size_t i = 0; i < size; i += cache_line_size)
        flush_cache_line (address + i);
}

In the above pseudo-code get_current_cpu_cache_line_size is a CPU instruction that returns the line size of its caches, and flush_cache_line flushes the cache line that contains the supplied address.
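On arm64 the relevant line size comes from the CTR_EL0 (cache type) register: its low four bits, IminLine, hold the minimum instruction cache line size encoded as log2 of the number of 4-byte words. A minimal sketch of such a helper – reusing the placeholder name from the pseudo-code above, and not the actual libgcc or Mono source – could look like this:

#include <stdint.h>
#include <stddef.h>

/* Sketch: ask the CPU we are currently running on for its minimum
   instruction cache line size, via the CTR_EL0 register.  IminLine
   (bits [3:0]) is log2 of the line size expressed in 4-byte words. */
static size_t
get_current_cpu_cache_line_size (void)
{
    uint64_t ctr;
    __asm__ volatile ("mrs %0, ctr_el0" : "=r" (ctr));
    return (size_t) 4 << (ctr & 0xf);
}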

At that point we were using our own version of this function, so we instrumented it to print the cache line size as returned by the CPU and, lo and behold, it printed both 128 and 64. We double-checked that this was indeed the case. So we went to that particular CPU’s manual, and it turns out that the big core has a 128-byte cache line, but the LITTLE core’s instruction cache line is only 64 bytes.

So what was happening is that __clear_cache would be called first on a big core and cache 128 as the instruction cache line size. Later it would be called on one of the LITTLE cores and would skip every other cache line when flushing. It doesn’t get simpler than that. We removed the caching and it all worked.

Summary

Some ARM big.LITTLE CPUs can have cores with different cache line sizes, and pretty much no code out there is ready to deal with this, since it assumes all cores are symmetrical.

Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line size on every invocation is not enough for user space code: a process can get scheduled onto a different CPU in the middle of executing __clear_cache, so the cache line size it read at the start might no longer be valid. Therefore, we have to try to figure out a global minimum of the cache line sizes across all CPUs. Here is our fix for Mono: Pull Request. Other projects have already adopted our fix as well: Dolphin and PPSSPP.
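A minimal sketch of that idea – reusing the placeholder helpers from the pseudo-code above, and not the actual patch from the pull request – keeps a global minimum of every line size observed so far and always flushes with that stride, so migrating to a core with smaller lines can only make us flush more lines, never skip any:

#include <stdint.h>
#include <stddef.h>

/* Placeholders, as in the libgcc pseudo-code earlier in this post. */
size_t get_current_cpu_cache_line_size (void);
void flush_cache_line (char *address);

void
clear_cache_amp_safe (char *address, size_t size)
{
    /* Smallest line size seen on any core so far; a production version
       would update this atomically and add the required barriers. */
    static size_t min_line_size = SIZE_MAX;

    size_t line_size = get_current_cpu_cache_line_size ();
    if (line_size < min_line_size)
        min_line_size = line_size;

    /* Align the start down to a line boundary, then flush every line
       that overlaps [address, address + size). */
    uintptr_t start = (uintptr_t) address & ~((uintptr_t) min_line_size - 1);
    uintptr_t end = (uintptr_t) address + size;

    for (uintptr_t p = start; p < end; p += min_line_size)
        flush_cache_line ((char *) p);
}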