More compiler fun

Yesterday’s post about -fstack-protector challenges (see “Compiler complexities” below) leads us to today’s post. If you haven’t read it, read that one first.

Basically, the workaround I had at the time was to just disable -fstack-protector for the get_type() functions. It certainly made things faster, but it was a compromise: the get_type() functions can have user-provided code inserted into them via macros like G_DEFINE_TYPE_EXTENDED() and friends, and disabling the protector for the whole function disables it for that user code too.
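For illustration, here is roughly what that looks like (FooBar, the interface choice, and foo_bar_initable_iface_init are hypothetical names, not anything from the patch):

/* The final argument is user-provided code that the macro expands
 * directly into the body of foo_bar_get_type(), which is why
 * blanket-disabling -fstack-protector there also strips protection
 * from user code. */
G_DEFINE_TYPE_EXTENDED (FooBar, foo_bar, G_TYPE_OBJECT, 0,
                        G_IMPLEMENT_INTERFACE (G_TYPE_INITABLE,
                                               foo_bar_initable_iface_init))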

A real solution should return the hot path to pre-stack-protector performance without sacrificing the protection gained by using it.

So I spent some time today making that happen. In Bugzilla #795180 I’ve added some patches which break the get_type() functions into two: one containing the hot path, and a second that does the full type registration, protected by our g_once_init_enter(). If we add the magic __attribute__((noinline)) to the function doing the full type registration, it can’t be inlined into the fast path. That keeps stack-allocated data out of the hot function, letting it pass the stack-protector sniff test (and therefore not incur the stack-checking wrath).
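A minimal sketch of the shape of that split, using a hypothetical Foo type (names and details are illustrative, not the actual patch):

#include <glib-object.h>

typedef struct { GObject parent_instance; } Foo;
typedef struct { GObjectClass parent_class; } FooClass;

static void foo_class_init (FooClass *klass) { }
static void foo_init       (Foo *self) { }

/* Forced out-of-line so it can't be inlined back into the hot path;
 * it keeps its stack protection. */
static GType foo_get_type_once (void) __attribute__((noinline));

GType
foo_get_type (void)
{
  static gsize type_id = 0;

  /* Hot path: no stack-allocated locals, so -fstack-protector-strong
   * leaves this function uninstrumented. */
  if (g_once_init_enter (&type_id))
    g_once_init_leave (&type_id, foo_get_type_once ());

  return type_id;
}

static GType
foo_get_type_once (void)
{
  /* Cold path: the full registration (and any user-provided code from
   * G_DEFINE_TYPE_EXTENDED()) lives here, still protected. */
  return g_type_register_static_simple (G_TYPE_OBJECT,
                                        g_intern_static_string ("Foo"),
                                        sizeof (FooClass),
                                        (GClassInitFunc) foo_class_init,
                                        sizeof (Foo),
                                        (GInstanceInitFunc) foo_init,
                                        0);
}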

The best part is that applications and libraries only need to be recompiled to get the speedup (the functions are generated by macro expansion, so no source changes are required).

Not a bad experience diving into some compiler bits this week.

Compiler complexities

The other day I found myself perusing some disassembly to get an idea of the code’s complexity. I do that occasionally because I find it the quickest way to determine if something is out of whack.

While I was there, I noticed a rather long _get_type() function. More importantly, I only saw one exit point (a retq instruction on x86_64).

That piqued my interest, because _get_type() functions are expected to be fast. In the fast path (when the type has already been registered), I’d expect them to check for a non-zero static GType type_id and, if so, return it. That’s a handful of instructions at most, so what gives?
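In other words, I’d expect the fast path to look morally like this (a simplification, not the actual macro expansion):

#include <glib-object.h>

GType
foo_get_type (void)
{
  static GType type_id;  /* set once, on first registration */

  if (type_id == 0)
    {
      /* slow path: register the type and store it in type_id */
    }

  /* fast path: a load, a test, and a return */
  return type_id;
}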

The first thing that came to mind is that I use -O0 -g -fno-omit-frame-pointer on my local builds so that I get good debug symbols and ensure that Linux perf can unwind the stack when profiling. So let’s disable those flags and rebuild.

With everything rebuilt using the defaults (-O2 -fomit-frame-pointer), I see a couple of exit points, but still what appears to be too many for the fast path. And what is this __stack_chk_fail@plt I see?

A quick search yields some information about -fstack-protector, a recent (well, half-decade-old) compiler feature that employs various tricks to detect stack corruption. Distributions seem to enable this by default using -fstack-protector-strong, which tries to add stack checks only to code it thinks is accessing stack-allocated data.
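To see what “strong” means in practice, consider a pair of functions like these (a standalone illustration, unrelated to GLib):

#include <string.h>

/* Has a stack-allocated buffer that could be overrun, so
 * -fstack-protector-strong inserts a canary check here. */
int
with_buffer (const char *s)
{
  char buf[32];

  strncpy (buf, s, sizeof buf - 1);
  buf[sizeof buf - 1] = '\0';
  return buf[0];
}

/* Touches no stack data, so it is left uninstrumented. */
int
without_buffer (int a, int b)
{
  return a + b;
}

Compiling this with gcc -O2 -fstack-protector-strong -S and inspecting the assembly shows the __stack_chk_fail call only in the first function.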

So after a quick recompile with -fno-stack-protector to disable the security feature, sure enough, a proper fast path emerges. We drop from 15 instructions (with 2 conditional jumps) to 5 instructions (with 1 conditional jump).

So the next question is: “Well okay, is this worth making faster?”

That’s a complicated question. The code was certainly faster before the feature was enabled by default. The more instructions you execute, the more code has to be loaded into the instruction cache (which is very small), and loading new instructions means pushing others out. So even if this code itself isn’t much slower, the extra instructions can prevent the surrounding code from being faster.

Furthermore, we do a lot of _get_type() calls. They are used when doing function precondition checks, virtual methods, signals, type checks, interface lookups, checking a type for conformance, altering the reference count, marshaling data, accessing properties, … you get the idea.
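As one example of how these calls sneak in, a typical precondition check expands into a type check that calls the _get_type() function on every invocation (a simplified sketch; the FooBar names are hypothetical):

/* Per GObject convention, the type-check macro expands to roughly: */
#define FOO_IS_BAR(obj) \
  (G_TYPE_CHECK_INSTANCE_TYPE ((obj), foo_bar_get_type ()))

void
foo_bar_frobnicate (FooBar *self)
{
  /* This line calls foo_bar_get_type() every time we're called. */
  g_return_if_fail (FOO_IS_BAR (self));

  /* ... */
}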

So I mucked through about five different approaches, trying to see if I could make things faster without disabling the stack protector, without much luck. The type-registration code accesses some local data via the macros, and nothing got me any closer to those magic 5 instructions.

GCC, since version 4.4, has allowed you to disable the stack-protector on a per-function basis. Just add __attribute__((optimize("no-stack-protector"))) to the function prototype.

#if G_GNUC_CHECK_VERSION(4, 4)
# define G_GNUC_NO_STACK_PROTECTOR \
  __attribute__((optimize("no-stack-protector")))
#else
# define G_GNUC_NO_STACK_PROTECTOR
#endif

GType foo_get_type (void) G_GNUC_NO_STACK_PROTECTOR;

Now we get back to our old (faster) version of the code.

 48 8b 05 b9 15 20 00    mov    0x2015b9(%rip),%rax   
 48 85 c0                test   %rax,%rax
 74 0c                   je     400ac8
 48 8b 05 ad 15 20 00    mov    0x2015ad(%rip),%rax   
 c3                      retq   

But what’s the difference, you ask?

I put together a test that I can run on a number of machines. It was uniformly faster in every case (as expected), some by as much as 50% (likely due to differences in the various caches).

Arch    OS            Machine                     Speedup
ARM     Ubuntu 14.04  Odroid X2                   +24%
x86_64  Fedora 28     X1 Carbon Gen3              +25.5%
x86_64  Fedora 28     i7 gen7 NUC                 +12.25%
x86_64  Fedora 28     Surface Book 2 i7 (gen8)    +12.5%
x86_64  Fedora 27     Onda Tablet                 +50.6%
x86     Debian        netbook                     +12.5%

It’s my opinion that one place where it makes sense to keep things very fast (and reduce instruction-cache blow-out) is a type system. That code gets run a lot, intermixed with all the code you really care about.