Ahead of Time Compilation (AOT)
Mono Ahead Of Time Compiler
The Ahead of Time compilation feature in Mono allows Mono to precompile assemblies to minimize JIT time, reduce memory usage at runtime and increase code sharing across multiple running Mono applications.
To precompile an assembly use the following command:
mono --aot -O=all assembly.exe
The --aot flag instructs Mono to ahead-of-time compile your assembly, while the -O=all flag instructs Mono to use all the available optimizations.
Besides code, the AOT file also contains cached metadata information which allows the runtime to avoid certain computations at runtime, like the computation of generic vtables. This reduces both startup time and memory usage. It is possible to create an AOT image which contains only this cached information and no code by using the ‘metadata-only’ option during compilation:
mono --aot=metadata-only assembly.exe
This works even on platforms where AOT is not normally supported.
The code generated for Ahead-of-Time compiled images is position-independent. This allows the same precompiled image to be shared by multiple applications without each having its own copy. ELF shared libraries work the same way: the code produced can be relocated to any address.
Position-independent code has a performance cost, but compiler bootstraps are still faster with Ahead-of-Time compiled images than with JIT-compiled code, especially with all the new optimizations provided by the Mono engine.
The AOT File Format
We use the native object format of the platform. That way it is possible to reuse existing tools like as/ld and the dynamic loader. On ELF platforms, the AOT compiler can generate an ELF .so file directly. On other platforms, it generates an assembly (.s) file which is then assembled and linked by as/ld into a shared library.
The precompiled image is stored in a file next to the original assembly. Its name is the assembly's file name with the platform's native shared library extension appended (on Linux, “.so”).
For example: basic.exe -> basic.exe.so; corlib.dll -> corlib.dll.so
There is one global symbol in each AOT image named ‘mono_aot_file_info’. This points to a MonoAotFileInfo structure which contains pointers to all the AOT data structures. In the latter parts of this document, fields of this structure are referenced using info-><FIELD NAME>.
Binary data other than code is stored in one giant blob. Data items inside the blob can be found using several tables called ‘XXX_offsets’, like ‘method_info_offsets’. These tables contain offsets into the blob, stored in a compact format using differential encoding plus an index.
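The idea of an offset table stored with differential encoding plus an index can be illustrated with a small Python sketch. This is not Mono's on-disk layout. The names encode_offsets, lookup_offset and GROUP_SIZE are hypothetical, and the sketch assumes one absolute index entry per fixed-size group of offsets, with the remaining entries stored as differences from their predecessor.

```python
GROUP_SIZE = 4  # hypothetical: one absolute index entry per 4 offsets


def encode_offsets(offsets):
    """Split offsets into an index of absolute values and a list of deltas."""
    index, deltas = [], []
    prev = 0
    for i, off in enumerate(offsets):
        if i % GROUP_SIZE == 0:
            index.append(off)   # absolute offset at the start of each group
            deltas.append(0)    # group start carries no delta
        else:
            deltas.append(off - prev)
        prev = off
    return index, deltas


def lookup_offset(index, deltas, i):
    """Recover offset i by walking only the deltas of its own group."""
    group = i // GROUP_SIZE
    start = group * GROUP_SIZE
    return index[group] + sum(deltas[start + 1 : i + 1])
```

The index bounds the cost of a lookup to one group, while the deltas stay small and therefore compress well under a variable-length integer encoding.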
Source file structure
The AOT infrastructure is split into two files, aot-compiler.c and aot-runtime.c. aot-compiler.c contains the AOT compiler which is invoked by --aot, while aot-runtime.c contains the runtime support needed for loading code and other things from the AOT files. The file image-writer.c contains the ELF writer/ASM writer code.
AOT compilation consists of the following stages:
- collecting the methods to be compiled.
- compiling them using the JIT.
- emitting the JITted code and other information.
- emitting the output file either directly, or by executing the system assembler/linker.
There are two kinds of methods handled by AOT:
- Normal methods are methods from the METHODDEF table.
- ‘Extra’ methods are either runtime generated methods (wrappers) or methods of inflated generic classes/inflated generic methods.
Each method is identified by a method index. For normal methods, this is equivalent to its index in the METHOD metadata table. For extra methods, it is an arbitrary number.
Compiled code is created by invoking the JIT, requesting it to create AOT code instead of normal code. This is done by the compile_method () function. The output of the JIT is compiled code and a set of patches (relocations). Each relocation specifies an offset inside the compiled code, and a runtime object whose address is accessed at that offset. Patches are described by a MonoJumpInfo structure. From the perspective of the AOT compiler, there are two kinds of patches:
- calls, which require an entry in the PLT.
- everything else, which requires an entry in the GOT.
How patches are handled is described in the next section. After all the methods are compiled, they are emitted into the output file into a byte array called ‘methods’. Each piece of compiled code is identified by the local symbol .Lm_<method index>. While compiled code is emitted, all the locations which have an associated patch are rewritten using a platform-specific process so the final generated code refers to the PLT and GOT entries belonging to the patches. This is done by the emit_and_reloc_code () function. The compiled code array can be accessed using the ‘methods’ global symbol.
Before a piece of AOTed code can be used, the GOT entries used by it must be filled out with the addresses of runtime objects. Those objects are identified by MonoJumpInfo structures. These structures are saved in a serialized form in the AOT file, so the AOT loader can reconstruct them. The serialization is done by the encode_patch () function, while the deserialization is done by the decode_patch_info () function.
Every method has an associated method info blob stored inside the global blob. This contains all the information required to load the method at runtime:
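The split between the two kinds of patches can be sketched in Python as follows. The Patch class, its field names, the kind strings and assign_slots are hypothetical stand-ins for MonoJumpInfo and the compiler's bookkeeping, and the sketch assumes that patches with the same target share one table entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    code_offset: int  # where in the compiled code the address is used
    kind: str         # hypothetical kinds: "call", "vtable", "const", ...
    target: str       # name of the runtime object being referenced


def assign_slots(patches):
    """Give call patches a PLT entry and every other patch a GOT entry."""
    plt, got = {}, {}
    for patch in patches:
        table = plt if patch.kind == "call" else got
        # Assumption: identical targets reuse the entry allocated first.
        table.setdefault(patch.target, len(table))
    return plt, got
```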
- the first got entry used by the method.
- the number of got entries used by the method.
- the indexes of the got entries used by the method.
Each GOT entry is described by a serialized description stored in the global blob. The ‘got_info_offsets’ table maps got offsets to the offsets of their description.
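A round trip of the three method info fields listed above might look like the following Python sketch. The function names are hypothetical, and the fixed-width little-endian integers are an assumption made for readability, not necessarily the encoding Mono uses.

```python
import struct


def encode_method_info(got_indexes):
    """Serialize: first GOT entry, number of GOT entries, then each index."""
    first = got_indexes[0] if got_indexes else 0
    count = len(got_indexes)
    return struct.pack(f"<II{count}I", first, count, *got_indexes)


def decode_method_info(blob):
    """Inverse of encode_method_info; returns (first, count, indexes)."""
    first, count = struct.unpack_from("<II", blob, 0)
    indexes = list(struct.unpack_from(f"<{count}I", blob, 8))
    return first, count, indexes
```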
The Procedure Linkage Table (PLT)
Our PLT is similar to the ELF PLT. It is used to handle calls between methods. If method A needs to call method B, then an entry is allocated in the PLT for method B, and A calls that entry instead of B directly. This is useful because in some cases the runtime needs to do some processing the first time B is called. The processing includes:
- if B is in another assembly, then it needs to be looked up, then JITted or the corresponding AOT code needs to be found.
- if B is in the same assembly, but has got slots, then the got slots need to be initialized.
If none of these cases is true, then the PLT is not used, and the call is made directly to the native code of the target method. A PLT entry is usually implemented by a jump through a GOT entry. These entries are initially filled with the address of a trampoline so the runtime can get control. After the native code of the called method is created or found, the jump table entry is changed to point to the native code. All PLT entries also embed an integer offset after the jump which indexes into the ‘plt_info’ table, which stores the information required to find the called method. The PLT is emitted by the emit_plt () function.
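The lazy binding behavior of a PLT entry can be modeled in a few lines of Python. This is only a behavioral sketch: the Plt class, its resolver callback and the resolve_count counter are hypothetical, and Python callables stand in for native code addresses.

```python
class Plt:
    def __init__(self, resolver):
        self._resolver = resolver  # hypothetical: finds or compiles the target
        self.slots = {}            # GOT slots that the PLT entries jump through
        self.resolve_count = 0

    def add_entry(self, name):
        # Each slot starts out pointing at a trampoline.
        self.slots[name] = self._make_trampoline(name)

    def _make_trampoline(self, name):
        def trampoline(*args):
            self.resolve_count += 1
            target = self._resolver(name)
            self.slots[name] = target  # patch the slot to the real code
            return target(*args)
        return trampoline

    def call(self, name, *args):
        # A call through the PLT is a jump through the slot.
        return self.slots[name](*args)
```

Only the first call to an entry goes through the trampoline. Later calls reach the target directly because the slot has been overwritten.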
Each compiled method has some additional info generated by the JIT, usable for debugging (IL offset-native offset maps) and exception handling (saved registers, native offsets of try/catch clauses). These are stored in the blob, and the ‘ex_info_offsets’ table can be used to find them.
When the runtime loads a class, it needs to compute a variety of information which is not readily available in the metadata, like the instance size, the vtable, whether the class has a finalizer/type initializer etc. Computing this information requires a lot of time, causes the loading of lots of metadata, and usually involves the creation of many runtime data structures (MonoMethod/MonoMethodSignature etc), which are long-lived and usually persist for the lifetime of the app. To avoid this, we compute the required information at AOT compilation time, and save it into the AOT image, into an array called ‘class_info’. The runtime can query this information using the mono_aot_get_cached_class_info () function, and if the information is available, it can avoid computing it.
To speed up mono_class_from_name (), a hash table mapping class names to class indexes is constructed and saved in the AOT file, pointed to by the symbol ‘class_name_table’.
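A precomputed name table of this kind could be sketched in Python as below. The hash function, the bucket layout and the function names are assumptions chosen for illustration and do not describe the format of ‘class_name_table’.

```python
def name_hash(name):
    """Deterministic string hash (hypothetical; not the one Mono uses)."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


def build_class_name_table(names, size):
    """Built once at AOT compile time: buckets of (name, class index)."""
    buckets = [[] for _ in range(size)]
    for class_index, name in enumerate(names):
        buckets[name_hash(name) % size].append((name, class_index))
    return buckets


def find_class_index(buckets, name):
    """Runtime lookup: scan a single bucket instead of all class names."""
    for candidate, class_index in buckets[name_hash(name) % len(buckets)]:
        if candidate == name:
            return class_index
    return None
```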
Things saved into the AOT file which are not covered elsewhere:
info->assembly_guid: A copy of the assembly GUID. When loading an AOT image, this GUID must match the GUID of the assembly for the AOT image to be usable.
info->version: The version of the AOT file format. This is checked against the MONO_AOT_FILE_VERSION constant in mini.h before an AOT image is loaded. The version number must be incremented when an incompatible change is made to the AOT file format.
info->image_table: A list of assemblies referenced by this AOT module.
info->plt: The Procedure Linkage Table.
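The two load-time checks described for info->version and info->assembly_guid amount to the following Python sketch. The AotFileInfo class, the is_usable function and the value of EXPECTED_AOT_FILE_VERSION are placeholders. The real constant is MONO_AOT_FILE_VERSION and its value is not reproduced here.

```python
from dataclasses import dataclass

EXPECTED_AOT_FILE_VERSION = 1  # placeholder value, not the real constant


@dataclass
class AotFileInfo:
    version: int
    assembly_guid: str


def is_usable(info, assembly_guid):
    """An AOT image is used only if both the format version and GUID match."""
    if info.version != EXPECTED_AOT_FILE_VERSION:
        return False  # image was produced for a different file format
    return info.assembly_guid == assembly_guid
```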
It is possible to use LLVM in AOT mode. This is implemented by compiling methods using LLVM instead of the JIT, saving the resulting LLVM bitcode into an LLVM .bc file, compiling it using LLVM tools into a .s file, then appending our own AOT data structures to that file.
Full AOT mode
Some platforms like the iPhone prohibit JITted code, using technical and/or legal means. This is a significant problem for the Mono runtime, since it generates a lot of code dynamically, using either the JIT or more low-level code generation macros. To solve this, the AOT compiler is able to function in full-aot or aot-only mode, where it generates and saves all the necessary code in the AOT image, so at runtime, no code needs to be generated. There are two kinds of code which need to be considered:
- wrapper methods, that is methods whose IL is generated dynamically by the runtime. They are handled by generating them in the add_wrappers () function, then emitting them as ‘extra’ methods.
- trampolines and other small hand generated pieces of code. They are handled in an ad-hoc way in the emit_trampolines () function.
Emitting assembly/object code
The output emission functionality is in the file image-writer.c. It can either emit assembly code (.s), or it can produce a shared image directly. The latter is only supported on x86/amd64 ELF. The emission of debug information is in the file dwarfwriter.c.
Using AOT code is a trade-off which might lead to higher or lower performance, depending on a lot of circumstances. Some of these are:
- AOT code needs to be loaded from disk before being used, so cold startup of an application using AOT code MIGHT be slower than using JITted code. Warm startup (when the code is already in the machine's cache) should be faster. Also, JITting code takes time, and the JIT compiler also needs to load additional metadata for the method from the disk, so startup can be faster even in the cold startup case.
- AOT code is usually compiled with all optimizations turned on, while JITted code is usually compiled with default optimizations, so the generated code in the AOT case could be faster.
- JITted code can directly access runtime data structures and helper functions, while AOT code needs to go through an indirection (the GOT) to access them, so it will be slower and somewhat bigger as well.
- When JITting code, the JIT compiler needs to load a lot of metadata about methods and types into memory.
- JITted code has better locality, meaning that if method A calls method B, then the native code for A and B is usually quite close in memory, leading to better cache behavior and thus improved performance. In contrast, the native code of methods inside the AOT file is in a somewhat random order.
Generated native code needs to reference various runtime structures/functions whose address is only known at run time. JITted code can simply embed the address into the native code, but AOT code needs to do an indirection. This indirection is done through a table called the Global Offset Table (GOT), which is similar to the GOT in the ELF spec. When the runtime saves the AOT image, it saves some information for each method describing the GOT entries used by that method. When loading a method from an AOT image, the runtime will fill out the GOT entries needed by the method.
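Filling GOT entries when a method is loaded can be sketched in Python like this. The load_method function and the resolve callback are hypothetical, the GOT is modeled as a list whose empty slots are None, and the sketch assumes slots already filled for an earlier method are left alone.

```python
def load_method(got, got_descriptions, method_got_indexes, resolve):
    """Fill every GOT slot the method uses that is still empty.

    Returns the list of slots filled by this call.
    """
    filled = []
    for slot in method_got_indexes:
        if got[slot] is None:
            # Turn the saved description into a runtime address.
            got[slot] = resolve(got_descriptions[slot])
            filled.append(slot)
    return filled
```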
Computing the address of the GOT
Methods which need to access the GOT first need to compute its address. On the x86 it is done by code like this:
call <IP + 5>
pop ebx
add <OFFSET TO GOT>, ebx
<save got addr to a register>
The variable representing the GOT is stored in cfg->got_var. It is always allocated to a global register to prevent some problems with branches + basic blocks.
Referencing GOT entries
Any time the native code needs to access some other runtime structure/function (i.e. any time the backend calls mono_add_patch_info ()), the code pointed to by the patch needs to load the value from the GOT. For example, instead of:
call <ABSOLUTE ADDR>
it needs to do:
call *<OFFSET>(<GOT REG>)
Here, the <OFFSET> can be 0, since it will be fixed up by the AOT compiler.
For more examples on the changes required, see
svn diff -r 37739:38213 mini-x86.c
Back end functionality
Loading information from the GOT is done by the OP_AOTCONST opcode. Since the opcode implementation needs to reference the GOT symbol, which is not available during JITting, the backend should emit some placeholder code in mono_arch_output_basic_block (), and emit the real implementation in arch_emit_got_access () in aot-compiler.c.
AOTed code cannot contain literal constants like addresses etc. All occurrences of those should be replaced by an OP_AOTCONST.
PLT entries are emitted by arch_emit_plt_entry () in aot-compiler.c. Each PLT entry has a corresponding slot in the GOT. The PLT entry should load this GOT slot, and branch to it, without clobbering any argument registers or the return value. Since the return address is not updated, the AOT code obtains the address of the PLT entry by disassembling the call site which branched to the PLT entry. This is done by the mono_arch_get_call_target () function in tramp-<ARCH>.c. The information needed to resolve the target of the PLT entry is in the AOT tables, and an offset into these tables should be emitted as a word after the PLT entry. The mono_arch_get_plt_info_offset () function in tramp-<ARCH>.c is responsible for retrieving this offset. After the call is resolved, the GOT slot used by the PLT entry needs to be updated with the new address. This is done by the mono_arch_patch_plt_entry () function in tramp-<ARCH>.c.
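The layout of a PLT entry followed by its info offset can be illustrated with a Python sketch. It is not the implementation of mono_arch_get_plt_info_offset (). The entry size, the zero bytes standing in for the jump instruction and both function names are assumptions, and real entries are architecture specific.

```python
import struct

JUMP_SIZE = 6               # hypothetical size of the jump, in bytes
ENTRY_SIZE = JUMP_SIZE + 4  # jump followed by a 32-bit info offset


def emit_plt_entry(info_offset):
    """Emit placeholder jump bytes followed by the offset into the AOT tables."""
    return bytes(JUMP_SIZE) + struct.pack("<I", info_offset)


def plt_info_offset(plt_code, entry_addr):
    """Read the word stored right after the jump of the entry at entry_addr."""
    return struct.unpack_from("<I", plt_code, entry_addr + JUMP_SIZE)[0]
```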
- Currently, when an AOT module is loaded, all of its dependent assemblies are also loaded eagerly, and these assemblies need to be exactly the same as the ones loaded when the AOT module was created (‘hard binding’). Non-hard binding should be allowed.
- On x86, the generated code uses call 0, pop REG, add GOTOFFSET, REG to materialize the GOT address. Newer versions of gcc use a separate function to do this, maybe we need to do the same.
- Currently, we get vtable addresses from the GOT. Another solution would be to store the data from the vtables in the .bss section, so accessing them would involve less indirection.
- When saving information used to identify classes/methods, we use an ad-hoc encoding. An encoding similar to the metadata encoding should be used instead.