One of the things that GLib-based APIs got right from early on is that, generally speaking, our API boundaries expect UTF-8. When strings cross those boundaries, UTF-8 validation is a common step.
Therefore, it is unsurprising that GLib does a lot of UTF-8 validation, on strings of all sorts of lengths.
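For example, a typical boundary check looks something like this (the setter is hypothetical; g_utf8_validate() is the real GLib API):

```c
#include <glib.h>

/* Hypothetical setter showing a typical boundary check: reject
 * non-UTF-8 input before it reaches code that assumes valid UTF-8. */
gboolean
thing_set_title (const char *title)
{
  const char *end = NULL;

  if (!g_utf8_validate (title, -1, &end))
    {
      g_warning ("Invalid UTF-8 at byte offset %ld", (long) (end - title));
      return FALSE;
    }

  /* ... safe to treat title as UTF-8 from here on ... */
  return TRUE;
}
```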
The implementation we’ve had is fairly straightforward to read and reason about, though it is not winning any performance awards.
UTF-8 validation performance has been on my radar for some time. It was an issue back when I was working on VTE performance, and recently with GVariant I had to look for ways to mitigate it.
There are many options out there which can be borrowed from or inspired by, with varying levels of complexity.
Deterministic Finite Automaton
Björn Höhrmann wrote a branchless UTF-8 validator years ago which is the basis for many UTF-8 validators. Though once you really start making it integrate well with other API designs you’ll often find you need to add branches outside of it.
This DFA approach can be relatively fast, but it is sometimes slower than the venerable g_utf8_validate().
VTE uses a modified version of this currently for its UTF-8 validation. It allows some nice properties when processing streaming input like that of a PTY character stream.
Chromium uses a modified version of this which removes the ternary for even better performance.
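To give a feel for the technique, here is a minimal sketch of a table-driven validator in the same spirit (my own simplified states and byte classes, not Höhrmann's packed tables): every byte maps to a class, and a (state, class) lookup yields the next state, so the hot loop does nothing but two table lookups per byte.

```c
#include <stdint.h>
#include <string.h>

/* States: S_ACCEPT means "at a character boundary". The E0/ED/F0/F4
 * states constrain the second byte so that overlong encodings,
 * surrogates, and code points beyond U+10FFFF are rejected. */
enum {
  S_ACCEPT, S_CONT1, S_CONT2, S_CONT3,
  S_E0,   /* after 0xE0: second byte must be A0..BF */
  S_ED,   /* after 0xED: second byte must be 80..9F */
  S_F0,   /* after 0xF0: second byte must be 90..BF */
  S_F4,   /* after 0xF4: second byte must be 80..8F */
  S_REJECT,
  N_STATES
};

/* Byte classes reduce the transition table from 256 columns to 12. */
enum {
  C_ASCII,              /* 00..7F */
  C_CONT_LO,            /* 80..8F */
  C_CONT_MI,            /* 90..9F */
  C_CONT_HI,            /* A0..BF */
  C_LEAD2,              /* C2..DF */
  C_E0, C_LEAD3, C_ED,  /* E0 / E1..EC,EE..EF / ED */
  C_F0, C_LEAD4, C_F4,  /* F0 / F1..F3 / F4 */
  C_ILLEGAL,            /* C0..C1, F5..FF */
  N_CLASSES
};

static uint8_t byte_class[256];
static uint8_t transition[N_STATES][N_CLASSES];

static void
init_tables (void)
{
  for (int b = 0; b < 256; b++)
    {
      if (b <= 0x7F)       byte_class[b] = C_ASCII;
      else if (b <= 0x8F)  byte_class[b] = C_CONT_LO;
      else if (b <= 0x9F)  byte_class[b] = C_CONT_MI;
      else if (b <= 0xBF)  byte_class[b] = C_CONT_HI;
      else if (b <= 0xC1)  byte_class[b] = C_ILLEGAL;
      else if (b <= 0xDF)  byte_class[b] = C_LEAD2;
      else if (b == 0xE0)  byte_class[b] = C_E0;
      else if (b == 0xED)  byte_class[b] = C_ED;
      else if (b <= 0xEF)  byte_class[b] = C_LEAD3;
      else if (b == 0xF0)  byte_class[b] = C_F0;
      else if (b <= 0xF3)  byte_class[b] = C_LEAD4;
      else if (b == 0xF4)  byte_class[b] = C_F4;
      else                 byte_class[b] = C_ILLEGAL;
    }

  /* Any pair not listed below rejects, and S_REJECT is sticky. */
  memset (transition, S_REJECT, sizeof transition);

  transition[S_ACCEPT][C_ASCII] = S_ACCEPT;
  transition[S_ACCEPT][C_LEAD2] = S_CONT1;
  transition[S_ACCEPT][C_E0]    = S_E0;
  transition[S_ACCEPT][C_LEAD3] = S_CONT2;
  transition[S_ACCEPT][C_ED]    = S_ED;
  transition[S_ACCEPT][C_F0]    = S_F0;
  transition[S_ACCEPT][C_LEAD4] = S_CONT3;
  transition[S_ACCEPT][C_F4]    = S_F4;

  /* Plain continuation states accept any continuation byte. */
  for (int c = C_CONT_LO; c <= C_CONT_HI; c++)
    {
      transition[S_CONT1][c] = S_ACCEPT;
      transition[S_CONT2][c] = S_CONT1;
      transition[S_CONT3][c] = S_CONT2;
    }

  /* Constrained second bytes. */
  transition[S_E0][C_CONT_HI] = S_CONT1;
  transition[S_ED][C_CONT_LO] = S_CONT1;
  transition[S_ED][C_CONT_MI] = S_CONT1;
  transition[S_F0][C_CONT_MI] = S_CONT2;
  transition[S_F0][C_CONT_HI] = S_CONT2;
  transition[S_F4][C_CONT_LO] = S_CONT2;
}

static int
utf8_valid (const uint8_t *buf, size_t len)
{
  uint8_t state = S_ACCEPT;

  /* The hot loop is branch-free: two table lookups per byte. */
  for (size_t i = 0; i < len; i++)
    state = transition[state][byte_class[buf[i]]];

  /* Valid only if we ended on a character boundary. */
  return state == S_ACCEPT;
}
```

After calling init_tables() once up front, utf8_valid() rejects overlong sequences, surrogates, and truncated tails, the same properties g_utf8_validate() checks for.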
simdutf
One of the more boundary-pushing implementations is simdutf, used by the WebKit project. It is written in C++ and is extremely well regarded performance-wise.
Being written in C++ does present challenges to integrate into a C abstraction library, though not insurmountable.
The underlying goal of a SIMD (single-instruction, multiple-data) approach is to look at more than a single character at a time using wide-registers and clever math.
c-utf8
Another implementation out there is the c-utf8 project. It is used by dbus-broker and libvarlink.
Like simdutf, it attempts to take advantage of SIMD, but by using some very simple compiler built-ins. It is also simple C code (no assembly) with just a few GCC-isms, all of which can be removed or replaced with alternatives for other compilers such as MSVC.
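As a flavor of the approach (a simplified sketch, not c-utf8's actual code, which also handles alignment and uses compiler built-ins), the ASCII fast path can consume eight bytes per iteration by testing the high bits of a whole machine word at once:

```c
#include <stdint.h>
#include <string.h>

/* Simplified word-at-a-time ASCII fast path in the style of c-utf8.
 * Returns the number of leading ASCII bytes in the buffer. */
static size_t
skip_ascii (const char *str, size_t len)
{
  const char *p = str;

  /* Eight bytes per iteration: any non-ASCII byte sets a high bit. */
  while (len >= 8)
    {
      uint64_t word;

      memcpy (&word, p, sizeof word);
      if (word & UINT64_C (0x8080808080808080))
        break;

      p += 8;
      len -= 8;
    }

  /* Finish (or locate the offending byte) one byte at a time. */
  while (len > 0 && ((uint8_t) *p & 0x80) == 0)
    {
      p++;
      len--;
    }

  return (size_t) (p - str);
}
```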
What to Choose
Just to throw up an idea to see if it could be shot down, I suggested we look at these for GLib (issue #3481) to replace what we have.
I haven’t been able to get as good performance from the DFA approach as from the SIMD approach, so I felt I needed to choose between c-utf8 and simdutf.
Given the ease of integration, I went with the c-utf8 implementation. It required removing the case 0x01 ... case 0x7F: GCC-ism. Additionally, I simplified a number of the macros which are part of the sibling c-util projects in favor of GLib equivalents.
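Case ranges are a GCC/Clang extension that MSVC rejects; the replacement is an explicit range check, roughly like this (a hypothetical sketch, the actual patch may differ):

```c
/* Hypothetical sketch of the portability fix. */
static int
is_ascii (unsigned char c)
{
#if defined(__GNUC__)
  switch (c)
    {
    case 0x01 ... 0x7F: /* GCC-ism: case range */
      return 1;
    default:
      return 0;
    }
#else
  /* Portable equivalent: an explicit range check. */
  return c >= 0x01 && c <= 0x7F;
#endif
}
```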
I opted to drop __builtin_assume_aligned(expr,n) and use __attribute__((aligned(n))) instead. It generates the same code on GCC but has the benefit of an equivalent syntax on MSVC in the form of __declspec(align(n)), in the same syntax position (before the declarator). That means a simple preprocessor macro per compiler gets the same effect.
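A hypothetical version of such a macro (the name is mine, not the patch's):

```c
/* Hypothetical per-compiler alignment macro. Both attributes sit in
 * the same position, before the declarator. */
#if defined(_MSC_VER)
# define ALIGNED(n) __declspec(align(n))
#else
# define ALIGNED(n) __attribute__((aligned(n)))
#endif

/* Usage: request 8-byte alignment for a buffer. */
static ALIGNED(8) unsigned char scratch[64];
```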
Performance Numbers
The c-utf8 project has a nice benchmark suite for testing UTF-8 validation performance. It benchmarks against a trivial implementation, which matches what GLib has at the time of writing, and also against strlen(), an extremely optimized piece of the stack.
I modeled the same tests but using g_utf8_validate() with my integration patches, to ensure we were getting the same performance as c-utf8, which we are.
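The gist of such a measurement looks something like this (a rough sketch, not the c-utf8 suite itself; only stock GLib API is used):

```c
/* Rough micro-benchmark sketch: time g_utf8_validate() over a large
 * ASCII buffer and compare against strlen() over the same data. */
#include <glib.h>
#include <string.h>

int
main (void)
{
  gsize len = 64 * 1024 * 1024;
  gchar *buf = g_malloc (len + 1);
  gint64 begin, usec;

  memset (buf, 'a', len);
  buf[len] = '\0';

  begin = g_get_monotonic_time ();
  gboolean ok = g_utf8_validate (buf, (gssize) len, NULL);
  usec = g_get_monotonic_time () - begin;
  g_print ("validate: %" G_GINT64_FORMAT " usec (valid=%d)\n", usec, ok);

  begin = g_get_monotonic_time ();
  gsize n = strlen (buf);
  usec = g_get_monotonic_time () - begin;
  g_print ("strlen:   %" G_GINT64_FORMAT " usec (len=%" G_GSIZE_FORMAT ")\n",
           usec, n);

  g_free (buf);
  return 0;
}
```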
For long ASCII strings, you can expect more than 10x speed-ups (on my Xeon it was 13x, and on my Apple Silicon 12x). For long multi-byte content (2-to-4 byte sequences), you can expect 4x-6x improvements.
I wouldn’t expect much difference for small strings, as you won’t see much of a benefit until you can get to word-size operations. As long as we don’t regress there, I think this is a good direction to go in.
Merge Request !4319.