The other day I found myself perusing some disassembly to get an idea of the code’s complexity. I do that occasionally because I find it the quickest way to determine if something is out of whack.
While I was there, I noticed a rather long `_get_type()` function. More importantly, I only saw one exit point (a `retq` instruction on x86_64).
That piqued my interest, because `_get_type()` functions are expected to be fast. In the fast path (when the type has already been registered), I’d expect them to check for a non-zero static `GType type_id` and, if set, return it. That is a handful of instructions at most, so what gives?
The first thing that came to mind is that I use `-O0 -g -fno-omit-frame-pointer` on my local builds so that I get good debug symbols and so that Linux perf can unwind the stack when profiling. So let’s disable that.
Now I’ve got everything rebuilt with the defaults (`-O2 -fomit-frame-pointer`). I see a couple of exit points, but still what appears to be too many for the fast path. And what is this `__stack_chk_fail@plt` I see?
A quick search yields some information about `-fstack-protector`, a compiler feature from the last half-decade or so that employs various tricks to detect stack corruption. Distributions seem to enable it by default using `-fstack-protector-strong`, which tries to add stack checks only to code it thinks is accessing stack-allocated data.
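One way to see that in action (a quick experiment, assuming a typical Linux GCC toolchain) is to compile a small function that touches a stack buffer to assembly, with and without the flag, and look for the canary check:

```shell
# Hypothetical one-off experiment: a function with a local stack buffer,
# compiled both ways. Passing the buffer to an external function keeps
# -O2 from optimizing it away entirely.
cat > demo.c <<'EOF'
void use (char *p);
int f (void) { char buf[16]; use (buf); return buf[0]; }
EOF

gcc -S -O2 -fstack-protector-strong demo.c -o demo-protected.s
gcc -S -O2 -fno-stack-protector    demo.c -o demo-plain.s

# The protected version emits a call to __stack_chk_fail; the plain one
# does not.
grep __stack_chk_fail demo-protected.s
```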
So a quick recompile with `-fno-stack-protector` to disable the security feature and, sure enough, a proper fast path emerges. We drop from 15 instructions (with 2 conditional jumps) to 5 instructions (with 1 conditional jump).
So the next question is: “Well okay, is this worth making faster?”
That’s a complicated question. The code was certainly faster before that feature was enabled by default. The more instructions you execute, the more code has to be loaded into the instruction cache (which is very small). To load instructions means pushing others out. So even if the code itself isn’t much faster, it can prevent the surrounding code from being faster.
Furthermore, we do a lot of `_get_type()` calls. They are used when doing function precondition checks, virtual methods, signals, type checks, interface lookups, checking a type for conformance, altering the reference count, marshaling data, accessing properties… you get the idea.
So I mucked through about five different ways of trying to make things faster without disabling the stack protector, without much luck. The type-registration code accesses some local data via macros, and nothing got me any closer to those magic 5 instructions.
GCC, since version 4.4, has allowed you to disable the stack protector on a per-function basis: just add `__attribute__((optimize("no-stack-protector")))` to the function prototype.
```c
#if G_GNUC_CHECK_VERSION(4, 4)
# define G_GNUC_NO_STACK_PROTECTOR \
    __attribute__((optimize("no-stack-protector")))
#else
# define G_GNUC_NO_STACK_PROTECTOR
#endif

GType foo_get_type (void) G_GNUC_NO_STACK_PROTECTOR;
```
Now we get back to our old (faster) version of the code.
```
48 8b 05 b9 15 20 00    mov    0x2015b9(%rip),%rax
48 85 c0                test   %rax,%rax
74 0c                   je     400ac8
48 8b 05 ad 15 20 00    mov    0x2015ad(%rip),%rax
c3                      retq
```
But what’s the difference, you ask?
I put together a test that I can run on a number of machines. It was faster in every case (as expected), some by as much as 50% (likely due to various caches).
| Arch | OS | Machine | Speedup |
|---|---|---|---|
| ARM | Ubuntu 14.04 | Odroid X2 | +24% |
| x86_64 | Fedora 28 | X1 Carbon Gen3 | +25.5% |
| x86_64 | Fedora 28 | i7 gen7 NUC | +12.25% |
| x86_64 | Fedora 28 | Surface Book 2 i7 (gen8) | +12.5% |
| x86_64 | Fedora 27 | Onda Tablet | +50.6% |
| x86 | Debian | netbook | +12.5% |
It’s my opinion that one place where it makes sense to keep things very fast (and reduce instruction-cache blow-out) is a type system. That code gets run a lot, intermixed with all the code you really care about.