Frame pointers and other practical near-term solutions

I’m the author/maintainer of Sysprof. I took over maintainership from Søren Sandmann Pedersen years ago so we could integrate it with GNOME Builder. In the process we really expanded its use cases beyond just being a sampling profiler. It is a system-wide profiler that makes it very easy to see what is going on and augment that data with counters, logs, embedded files, marks, custom data sources and more.

It provides:

  • Callgraphs (based on stacktraces obtained via perf, previously via a custom kernel module).
  • A binary capture format which makes it very convenient to work with capture files to write custom tooling, unlike perf.data or pprof.
  • Accessory collection of counters (cpu, net, mem, cpufreq, energy consumption, battery charge, etc), logs, files, and marks with visualizers for them.
  • A “control fd” that can be sendmsg()d to peers or inherited by subprocesses w/ SYSPROF_CONTROL_FD=n set allowing them to request a ring buffer (typically per-thread) to send accessory information which is muxed into the capture file.
  • Memory profilers (via LD_PRELOAD) which use the above to collect allocation records and callgraphs to visualize temporaries, leaks, etc.
  • Integration with language runtimes like GJS (GNOME’s SpiderMonkey used by GNOME Shell and various applications) to use timers and SIGPROF to unwind/collect samples and mux into captures.
  • Integration with platform libraries like GLib, GTK, Mutter, and Pango which annotate the recordings with information about the GPU, input events, background operations, etc.
  • The ability to decode symbols at the tail of a recording and insert those mappings (including kallsyms) into the snapshot. This allows users to give me a .bz2 recording I can open locally without the same binary versions or Linux distribution.

This has been incredibly useful for us in GNOME because we can take someone who is having a problem, run Sysprof, and they can paste/upload the callgraph and we can address it quickly. It’s a major contributor to why we’ve been able to make GNOME so much faster in recent releases.

The Breakdown

Where this all breaks down is when we have poor/unreliable stack traces.

For example, most people want to view a callgraph upside down, starting from the application’s entry points. If you can only unwind a few frames, you’re out of luck here because you can’t trace deep enough to reach the instruction pointer for main() (or similar).

You can, of course, ask perf to give you some 8 KB of stack data on every sample. Sysprof takes thousands of samples per second, so this grows quickly. Worse, unwinding that data takes longer than reading this post. Nobody does this unless someone shows up with a pile of money.
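To put a rough number on that growth, here is a back-of-the-envelope sketch. The 4,000 samples per second figure is an assumption chosen only to stand in for "thousands"; it is not a measured Sysprof rate, and the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * 1024

def stack_copy_rate(stack_bytes_per_sample, samples_per_second):
    """Bytes of raw stack data copied out per second of profiling."""
    return stack_bytes_per_sample * samples_per_second

# Assumed workload: 8 KiB of stack per sample, 4,000 samples per second.
rate = stack_copy_rate(8 * KIB, 4000)
print(f"{rate / MIB:.2f} MiB/s, {rate * 60 / MIB:.0f} MiB per minute")
# prints: 31.25 MiB/s, 1875 MiB per minute
```

That is only the cost of copying the raw stack bytes. The unwinding work still has to happen afterwards for every one of those samples.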

It’s such a problem that I made GLib/GTK avoid using libffi marshalers internally (libffi has exception unwind data but no frame pointers) so that it wouldn’t break Linux’s frame-pointer unwinder.

Beyond that, we started building the Flatpak org.freedesktop.Sdk with -fno-omit-frame-pointer so that we could profile software while writing it. (What a concept!) GNOME OS is also compiled with frame pointers, so when really tricky things come up, many of us just use that instead of Fedora.

Yes there are cases where leaf functions are missed or not 100% correct, but it hasn’t been much of an issue compared to truncated stacks or stacks with giant holes at library boundaries because the Linux kernel frame-pointer unwinder fails.
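For readers who have not looked at how a frame-pointer unwinder works, the following is a minimal simulation, not the kernel's implementation. It assumes the usual x86-64 layout where the slot at the frame pointer holds the caller's frame pointer and the next slot holds the return address. The `memory` dict and all addresses are invented for the example.

```python
WORD = 8  # bytes per stack slot on x86-64

def unwind_frame_pointers(memory, fp, max_depth=128):
    """Walk a chain of saved frame pointers and collect return addresses.

    memory maps an address to the 8-byte value stored there (a stand-in
    for reading the sampled task's stack).
    """
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_depth:
        saved_fp = memory.get(fp)
        ret = memory.get(fp + WORD)
        if saved_fp is None or ret is None:
            break  # unreadable memory: the trace is truncated here
        return_addresses.append(ret)
        if saved_fp != 0 and saved_fp <= fp:
            break  # the stack grows down, so a caller must be at a higher address
        fp = saved_fp
    return return_addresses

# Three well-formed frames, innermost first.
good = {0x1000: 0x2000, 0x1008: 0xAAA,
        0x2000: 0x3000, 0x2008: 0xBBB,
        0x3000: 0x0,    0x3008: 0xCCC}

# Same stack, but the middle frame used the register for something else.
broken = dict(good)
broken[0x2000] = 0x10
```

With `good` the walk reaches the outermost frame. With `broken` it stops after two entries, which is the kind of truncated stack described above when one library in the chain was built without frame pointers.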

Practical Solutions

It’s not that frame pointers are great or anything, it’s that reliability tends to be the most important characteristic. So many parts of our platform can cause profiling to give inaccurate results.

I’m not advocating for frame pointers, I’m advocating for “Works OOTB”. Fixing Python and BoringSSL to emit better instructions in the presence of frame pointers is minuscule compared to the alternatives suggested thus far.

Compiling our platforms with frame pointers is the single easiest thing we could do to ensure that we can make big wins going forward until we have a solution that reliably works across a number of failure scenarios I’ll lay out below.

Profiling Needs

One necessity we have when doing desktop engineering is that some classes of problems occur in the interaction of components rather than one badly behaved component on its own.

Seeing profiling data across all applications, whether they are already running or are spawned by D-Bus or systemd during the profiling session, is a must. Catching problematic processes during their startup (or failure to start up) is critical.

That means that pre-computing unwind tables is inherently unreliable for us unless we can stall any process until unwind tables are generated and uploaded for an eBPF unwinder. This may be possible, but in and of itself will skew some sorts of profiling results due to the additional latency as well as memory pressure for unwind tables (which are considerably larger than just emitting the frame pointers in the binary’s .text section).

I suspect that doing something like QEMU’s live migration strategy may be an option, but again with all the caveats that it is going to perturb some sorts of results.

  1. Load eBPF program to cause all remapping of pages that are X^W to SIGSTOP. Notify an agent to set up unwind tables.
  2. Load unwind or DWARF data for all X^W pages mapped, generate system-wide tables.
  3. Handle incoming requests for SIGSTOP’d processes
  4. Upload new unwind table data
  5. Repeat from #3
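The steps above could be modeled, very loosely, as the toy agent below. Everything here is hypothetical: `UnwindAgent`, `on_exec_mapping`, and `load_table` are invented names, there is no eBPF or signal handling, and "uploading" is just storing into a dict. It only shows the shape of the loop and where each stopped process has to wait.

```python
from collections import deque

class UnwindAgent:
    """Toy model of an agent that builds unwind tables for stopped processes."""

    def __init__(self, load_table):
        self.load_table = load_table  # hypothetical: builds a table for one file
        self.tables = {}              # file path -> table (stands in for "uploaded")
        self.stopped = deque()        # (pid, path) pairs waiting on the agent

    def on_exec_mapping(self, pid, path):
        """Step 1: a process mapped executable pages and was stopped."""
        self.stopped.append((pid, path))

    def run_once(self):
        """Steps 3 to 5: serve pending requests, return the pids to resume."""
        resumed = []
        while self.stopped:
            pid, path = self.stopped.popleft()
            if path not in self.tables:
                # Step 4: only files we have not seen need new table data.
                self.tables[path] = self.load_table(path)
            resumed.append(pid)  # a real agent would let the process continue here
        return resumed
```

Even in this simplified form, every new process sits in the queue until the agent gets to it, and the table cache only helps when processes map the same files.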

But, even if this were a solution today, it has a number of situations that it flat out doesn’t handle well.

Current Hurdles

  • Startup of new processes incurs latency, which for some workloads relying on fork()/exec() may perturb results, especially for file-based work queue processing.
  • Static binaries are a thing now. Even beyond C/C++, both Rust and Go are essentially statically linked and increasing in use across all our tooling (podman, toolbx, etc) as well as desktop applications (gtk-rs).

    This poses a huge issue. The amount of unwind data we need to load increases significantly because we can’t rely on MAP_SHARED from shared libraries to reduce the total footprint.

    Again, we’re looking for whole system profiling here.

    The tables become so large that they push out resident memory from the things you’re trying to profile to the point that you’re really not profiling what you think you are.

  • Containers, if allowed to version skew or if we’re unable to resolve mappings to the same inode, present a similar challenge (more below in Podman and how that is a disaster on Fedora today).

    Thankfully, from the Flatpak perspective, it’s very good at sharing inodes across the hard-link farm. However, application binaries are increasingly Rust.

  • ORC overhead in the kernel comes in at about 5 MB of extra memory at a savings of a few hundred KB of frame pointer instructions. The value, of course, is that your instruction cache is tighter. Imagine fleets that are all static binaries and how intractable this becomes quickly. Machines at capacity will struggle to profile when you need it most.
  • Unwinding with eBPF appears to currently require exfiltrating the data via side-channels (BPF maps) rather than from the perf event stream. This can certainly be fixed with the right unwinder hooks in the kernel, but currently requires agents to not only set up unwind tables but to provide access to the unwind stacks too. My personal viewpoint is that these stacks should be part of the perf data stream, not a secondary data stream to be merged.
  • If we can’t do this thousands of times per second, it’s not fast enough.
  • If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk. You can’t build unwind tables, so accounting in that process breaks.
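The upgraded-RPM case in the last bullet can be detected, at least in part, by comparing the inode recorded in /proc/$pid/maps with what is on disk now. The sketch below shows that check only. The helper names are invented, it ignores the device field and the CRC, and it is not how Sysprof itself implements this.

```python
import os

def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a dict (format per proc(5))."""
    fields = line.split(None, 5)
    addr_range, perms, offset, dev, inode = fields[:5]
    path = fields[5].rstrip("\n") if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return {"start": start, "end": end, "perms": perms,
            "offset": int(offset, 16), "dev": dev,
            "inode": int(inode), "path": path}

def mapping_matches_disk(mapping):
    """True if the path on disk still refers to the inode that was mapped."""
    try:
        return os.stat(mapping["path"]).st_ino == mapping["inode"]
    except OSError:
        # Missing file, "(deleted)" suffix, or a path from another namespace.
        return False
```

When this returns False, the bytes the process has mapped are no longer reachable through that path, so any unwind table built from the file on disk would describe the wrong binary.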

ELF/DWARF Parsing and Privileged Processes

Parsing DWARF/.eh_frame data is very much an under-researched problem in the security community. The process that sets up perf and/or BPF programs would need to do this so that unwind tables can be uploaded. You probably want the agent to be in control of that, but also very much want the parsing sandboxed with something like bwrap (Bubblewrap) at minimum to protect the privileged agent.

Generating Missing .eh_frame Data

A fantastic entry from OOPSLA 2019 talks about both validating .eh_frame data and synthesizing it when missing by analyzing assembly instructions. This is very neat, but it also means that compiler tooling should be doing things like this to automatically generate proper .eh_frame data in the presence of inline assembly. Currently, you must get that correct by manually writing DWARF data in your assembly. Notable bugs in both LLVM and glibc have caused problems here.

Read more from the incredibly well written and implemented OOPSLA 2019 submission [PDF].

libffi and .eh_frame

Libffi does dynamically generate enough information to unwind a stack in-process across C++ exceptions. However, this is a lot more problematic if you have an agent generating unwind tables out of process. To get that data you have to map in user-space memory from the application (say /proc/$pid/mem) to access those pages and then trust that the memory isn’t malicious to the agent.

Blown FUSEs

Podman does this thing (at least for user-namespace containers on Fedora) where all the image content is served via FUSE. That means when you try to resolve the page table mappings in user-space to find the binary to locate symbols from, you are basically out of luck.

Sysprof goes to great pains, parsing layers of Podman’s JSON image information to discover what the mapping should have been. This is somewhat limited because we can only do it for the user that is running the sysprof client (as those stack frame instruction pointers are symbolized client-side in libsysprof). Doing this from an agent would require uploading that state to the agent to request integration into the unwind tables.

We have to do the same for Flatpak, but thankfully /.flatpak-info contains everything we need to do that symbol resolution.

Subvolumes and OSTree deployments further complicate this matter because of indirection of mounts/mountinfo. We have to resolve through subvol=* for example to locate the correct file for the path provided in /proc/$pid/maps.
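To make the mountinfo indirection concrete, here is a small sketch that parses /proc/$pid/mountinfo lines (field layout per proc(5)) and translates a path as the process sees it into a path relative to the top of the underlying filesystem. The helper names and the sample device are invented, and the real resolution in Sysprof handles many more cases than this.

```python
def parse_mountinfo_line(line):
    """Extract the fields needed to resolve a path from one mountinfo line."""
    left, _, right = line.rstrip("\n").partition(" - ")
    fields = left.split()
    fs_type, source, super_opts = right.split(" ", 2)
    # "subvol=/root" -> ("subvol", "/root"); flags like "rw" map to "".
    opts = dict(opt.partition("=")[::2] for opt in super_opts.split(","))
    return {"root": fields[3], "mount_point": fields[4],
            "fs_type": fs_type, "source": source,
            "subvol": opts.get("subvol")}

def path_within_filesystem(mounts, path):
    """Return (mount source, path from the filesystem's top level), or None."""
    def covers(mount):
        prefix = mount["mount_point"].rstrip("/")
        return path == mount["mount_point"] or path.startswith(prefix + "/")

    best = max((m for m in mounts if covers(m)),
               key=lambda m: len(m["mount_point"]), default=None)
    if best is None:
        return None
    relative = path[len(best["mount_point"].rstrip("/")):]
    return best["source"], (best["root"].rstrip("/") + relative) or "/"

# Hypothetical Btrfs layout with separate root and home subvolumes.
SAMPLE = [
    "45 1 0:34 /root / rw,relatime shared:1 - btrfs /dev/sda3 rw,ssd,subvolid=257,subvol=/root",
    "60 45 0:34 /home /home rw,relatime shared:30 - btrfs /dev/sda3 rw,ssd,subvolid=256,subvol=/home",
]
mounts = [parse_mountinfo_line(line) for line in SAMPLE]
```

Here a path such as /usr/lib64/libc.so.6 from /proc/$pid/maps resolves to /root/usr/lib64/libc.so.6 on the device, which is the file an out-of-process tool would actually have to open.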

Again, since we need to build unwind tables up-front, this needs to be resolved when the profiler starts up. We can’t rely on /proc/$pid/mem because there is no guarantee the section will be mapped or that we’ll be able to discover which map it was without the ELF header (which too may no longer be mapped). Since the process will likely have closed the FD after mmap(), we need to locate the proper files on disk.

Thankfully, in Sysprof we store the inode/CRC so that we can symbolize something useful if they’re incorrect, even if it’s a giant “Hey this is broken” callgraph entry.

In a world without frame-pointers, you have very little luck at making profilers reliably work in production unless you can resolve all of these issues.

There has been a lot of talk about how we can do new unwinders, and that is seriously great work! The issues above are real ones today and will become even bigger in the coming years, so we’ll need all the creativity we can get to wrangle these things together.

Again, I don’t think any of us like frame pointers, just that they are generally lower effort today while also being reasonably reliable.

It might be the most practical near-term solution to enable frame pointers across Fedora today, while we push to get the rest of the system integrated to robustly support alternative unwinding capabilities.