ASCII Mono
Introduction
This is a proposal for an optimisation to System.String
.
For historical reasons, System.String
uses the UCS-2 character encoding, that is, UTF-16 without surrogate pairs.
However, most strings in typical .NET applications consist solely of ASCII characters, leading to wasted space: half of the bytes in a string are likely to be null bytes!
Since strings are immutable, we can scan the character data when the string is constructed, then dynamically select an encoding, thereby saving 50% of string memory in most cases.
Working Version
A working version of this work is currently hosted here:
https://github.com/evincarofautumn/mono/commits/feature-strings
Updating String
Strings currently have the following representation:
class String {
int length;
char firstChar;
}
Where &firstChar
is the starting address of the co-allocated string data. First we can observe that the length
field is a signed 32-bit integer (System.Int32
). Changing this to an unsigned integer (System.UInt32
) gives us a free bit, which we can use to tell whether the string is normal (UCS-2) or compact (ASCII):
class String {
uint taggedLength;
byte firstByte;
}
Here, (taggedLength & 1) == 0
indicates the non-compact encoding, for which (char*)&firstByte
is the start of the UCS-2 character data; (taggedLength & 1) == 1
indicates the compact encoding, for which the ASCII character data starts at (byte*)&firstByte
.
I use the low-order bit instead of the sign bit because it lets us get the length with a simple shift, regardless of encoding:
public int Length {
get {
return (int)(taggedLength >> 1);
}
}
Getting There: Updating Native Code
Many places in Mono unsafely access String
data, but they can be updated fairly easily: we can rename the fields, and use accessors that assert that a particular encoding is in use. However, we must be careful to verify that all those paths are covered by the test suite.
Getting There: Disabling fixed
on Strings
The following is a technique that helped us bootstrap the effort.
Every managed method that unsafely accesses String
character data must be updated to account for whether the String
is compact. This is tractable within corlib
, but there is some third-party code that uses strings unsafely.
The fixed
statement on strings calls a method get_OffsetToStringData
, which is used to adjust the fixed
pointer to refer to the character data, rather than the String
object. In ASCII Mono, we can make this method throw a NotSupportedException
with a message like
Unsafe access to string data is not supported by this runtime.
Now we’re sure that only corlib
-internal methods can access the String
data, because only those methods have access to the firstByte
field.
Once we have completed this auditing work, we are going to replace the get_OffsetToStringData
with a method that duplicates
any ASCII-strings into UTF-16 strings if the user happens to call fixed on a comapct string.
Getting there: Adding UnsafeApply
API
In order to update existing third-party code that uses strings unsafely, we need some kind of UnsafeApply
API:
public unsafe T UnsafeApply<T>
(Func<BytePtr, T> compact, Func<CharPtr, T> noncompact)
This accepts two callbacks, one for the case of the compact encoding, and one for the non-compact encoding. This isn’t ideal, because it’s neither safe nor particularly efficient (involving the allocation of delegates). But, on the bright side, that may discourage people from continuing to use unsafe code.
Adding Iterator
API
In order to simplify updating existing corlib
code, we add a private Iterator
API that allows iterating over String
data regardless of encoding, so we can efficiently avoid duplicating the code for char*
and byte*
.
The String.Iterator
interface would provide methods such as:
Iterator Advance (int offset = 1)
void CopyFrom (Iterator that, int count)
long Difference (Iterator that)
char Get (int index = 0)
void Set (char value, int index = 0)
int CharSize ()
IntPtr Pointer ()
And have two concrete implementations, CompactIterator
and NonCompactIterator
, returned by a new String
method GetIterator
like so:
private static unsafe Iterator GetIterator (IntPtr data, bool compact)
{
if (compact)
return new CompactIterator (data);
return new NonCompactIterator (data);
}
This requires the character data pointer be pinned from the outside. This ensures that it’s pinned for the lifetime of the iterator, and that only corlib
can use this API.
Phrasing the API in this way should let the JIT inline operations on concrete iterator types.
Updating StringBuilder
StringBuilder
is a linked list of mutable character arrays that can be frozen into a single String
using the ToString
method.
We add an additional Boolean to each chunk, indicating whether it’s compact (the default) or non-compact. When inserting non-ASCII characters into an ASCII chunk, the chunk degrades to UCS-2.
If all chunks of a StringBuilder
are compact, as they are most of the time, then the result of ToString
is compact.
Scanning Character Data
At first blush it may seem very costly to scan every string. However, each string should only be scanned at most once, and the longer the string, the bigger the memory savings when it (probably) turns out to be compact-representable.
Moreover, we can avoid scanning strings if we know ahead of time what the encoding should be; for example, concatenating two compact strings always yields a compact string.
Scanning UCS-2 data for compact-representability is as simple as testing every character with the mask (c & 0xFF80) == 0
, which is trivially unrollable and vectorizable. Likewise, we can scan UTF-8 data with the mask (c & 0x80) == 0
.
Real-world Testing
I’ve implemented a fairly stable prototype of this feature in Mono. It includes the stated changes to String
and StringBuilder
, as well as a fast vectorized scanner. It can build corlib
and run the Mono and corlib
test suites. With some effort, and patches to third-party libraries, it can run Xamarin Studio. For a large project using Roslyn code analysis, this leads to a ~10% savings in memory usage, with a small speed overhead.
Next Steps
- Deduplicate code by using the iterator API.
- Avoid allocating intermediate
char[]
arrays by using the iterator API. - Upstream changes to third-party libraries.
- Get feedback and harden code for correctness, safety, and security.