Using Leak Sanitizer with JHBuild

For a subset of GNOME modules, I’m still using jhbuild. I also spend a great deal of time tracking down memory bugs in various libraries. So it is very handy to have libasan.so working with meson -Db_sanitize=address.

To make things work, you currently need to:

  • dnf install libasan liblsan (or similar for your OS).
  • Use meson from git (0.48 development), for this bug fix.
  • Configure your meson projects with -Db_sanitize=address.
  • Create a suppression file for leaks out of our control.
  • Set some environment variables in ~/.config/jhbuildrc.

Here is an example of what I put in ~/.config/lsan_suppressions.txt.

leak:FcCharSetCreate
leak:FcLangSetCreate
leak:__nptl_deallocate_tsd
leak:_g_info_new_full
leak:dconf_engine_watch_fast
leak:g_get_language_names_with_category
leak:g_intern_string
leak:g_io_module_new
leak:g_quark_init
leak:libfontconfig.so.1

And add this to ~/.config/jhbuildrc.

import os
os.environ['LSAN_OPTIONS'] = 'suppressions=' + \
    os.path.expanduser('~/.config/lsan_suppressions.txt')

This has helped me track down a number of bugs in various modules this week and it would be useful if other people were doing it too.

Keeping those headers aligned

One dead give-away of a GNOME/Gtk programmer is how they format their headers. For the better part of two decades, many of us have been trying to keep things aligned. Whether this is cargo-culted or of real benefit depends on the reader. Generally, I find them easier to filter through.

Unfortunately, none of indent, clang-format, nor uncrustify have managed to exactly represent our style which makes automated code formatting tools rather problematic. Especially in an automated fashion.

For example, notice how the types and trailing asterisks, stay aligned, in multiple directions.

FOO_EXPORT
void   foo_do_something_async  (Foo                  *self,
                                const gchar * const  *params,
                                GCancellable         *cancellable,
                                GAsyncReadyCallback   callback,
                                gpointer              user_data);
FOO_EXPORT
Bar   *foo_do_something_finish (Foo                  *self,
                                GAsyncResult         *result,
                                GError              **error);

Keeping that sort of code aligned is quite a pain. Even for vim users who can fairly easily repeat commands. Worse, it can explode patches into unreadable messiness.

Anyway, I added a new command in Builder last night that will format these in this style so long as you don’t do anything to trip it up. Just select a block of function declarations, and run format-decls from the command bar.

It doesn’t yet handle vtable entries, but that shouldn’t be too painful. Also, it doesn’t handle miscellaneous other C code in-between declarations (except G_GNUC_* macros, __attribute_() etc.

A new completion engine for Builder

Since my initial announcement of Builder at GUADEC in 2014, I’ve had a vision in the back of my mind about how I’d like completion to work in Builder. However, there have been more important issues to solve and I’m just one person. So it was largely put on the back burner because after a few upstream patches, the GtkSourceView design was good enough.

However, as we start to integrate more external tooling into Builder, the demands and design of what those completion layers expect of the application have changed. And some of that is in conflict with the API/ABI we have in the long-term stable versions of GtkSourceView.

So over the past couple of weeks, I’ve built a new completion engine for Builder that takes these new realities into account.

A screenshot of Builder's new completion engine showing results from clang in a C source file.

It has a number of properties I wanted for Builder such as:

Reduced Memory and CPU Usage

Some tooling wants to give you a large set of proposals for completion and then expects the IDE to filter in the UI process. Notably, this is how Clang works. That means that a typical Gtk application written in C could easily have 25,000 potential completion proposals.

In the past we mitigated this through a number of performance tricks, but it still required creating thousands of GObjects, linked lists, queues, and such. That is an expensive thing to do on a key-press, especially when communicating with a sub-process used for crash-isolation.

So the new completion provider API takes advantage of GListModel which is an interface that focuses on how to have a collection of GObjects which don’t need to be “inflated” until they’ve been requested. In doing so, we can get our GVariant IPC message from the gnome-builder-clang sub-process as a single allocation. Then, as results are requested by the completion display, a GObject is inflated on demand to reference an element of that larger GVariant.

In doing so, we provide a rough upper bound on how many objects need to be created at any time to display the results to the user. We can also still sort and filter the result set without having to create a GObject to represent the proposal. That’s a huge win on memory allocator churn.

Consistent and Convenient Refiltering

Now that we have external tooling that expects UI-side refiltering of proposals, we need to make that easier for tooling to do without having to re-query. So the fuzzy search and highlighting tools have been moved into IdeCompletion for easy access by completion providers.

As additional text is provided for completion, the providers are notified to perform filters on their result set. Since the results are GListModel-based, everything updates in the UI out-of-band nicely with a minimal number of gsignal emissions. Compare this to GtkTreeModel which has to emit signals for every row insertion, change, or deletion!

Alternative Styling

When working with completions for programming languages, we’re often dealing with 3 potential groups of content. The return value, the name and possible parameters, and miscellaneous data. To get the styling we want for all of this, I chose to forgo the use of GtkTreeView and use widgets directly. That means that we can use CSS like we do everywhere else. But also, it means that some additional engineering is required.

We only want to create widgets for the visible rows, because otherwise we’re wasting memory and time crunching CSS for things that won’t be seen. We also want to avoid creating new widgets every time the visible range of proposals is changed.

The result is IdeCompletionListBox which is a GtkBox containing GtkListBoxRow and some GtkSizeGroups to give things a columnar effect. Because the number of created widgets is small things stay fast and snappy while giving us the desired effect. Notably, it implements GtkScrollable so if you place it in a GtkScrolledWindow you still get the expected behavior.

Further more, we can adjust the window sizing and placement to be more natural for code-related proposals.

Dynamic Priority Control

We want the ability to change the priority of some completion providers based on the context of the completion. The new design allows for providers to increase their priority when they know they have something of high-importance given some piece of contextual knowledge.

Long term, we also want to provide an API for providers to register a small number of suggested completions that will be raised to the top-level, regardless of what provider gave them. This is necessary instead of having global “scoring” since that would require both O(n) scans of the data set as well as coming up with a strategy to score disparate systems (and search engines prove that rarely works well).

More to do

There are still a couple things that I think we should address that may influence the API design more. For example:

  • How should we handle string interpolation? A simplified API for completions when working inside of strings might be useful. Think strftime(), printf(), etc as potential examples here.
  • The upcoming Gtk+ 3.24 release will give us access to the move_to_rect() API. Combined with some Wayland xdg_popup improvements, this could allow us to make our display widget more flexible.
  • Parameter completion is still a bit of an annoying process. We could probably come up with a strategy to make the display look a lot better here.
  • Give some tweaks and knobs for how much and what to complete (just function name vs parameters and types).

Conclusions

Rarely do I write any code that doesn’t have bugs. Now that this is landing in Builder Nightly soon, I could use some more testing and bug filing from the community at large.

I’m very happy with the improvements over the past couple of months. Between getting Clang out of process and this to allow us to make clang completion fast, I think we’re in a much better place.

We can’t get this design into older GtkSourceView releases, but we can probably look at some form of integration into what will eventually integrate with Gtk4. I would be very happy if it influenced new API releases of the library so that we don’t need to carry the implementation downstream.

Engineering Journals vs User Support

A major thanks to everyone involved in the gitlab migration. It’s no doubted a huge leap forward for GNOME on so many fronts.

Before we lose that momentum, I’d like to bring up in the collective minds of our project, what I consider, a separate problem. That is Bug Tracking versus User Support.

We’ve gotten a lot of flack over the years for brash or abrupt comments in our bug trackers. And while I can see that from a certain angle, I think in many cases it’s due to the miss-use of bug trackers as a user support mechanism.

For example, I’ve always seen a bug tracker as an engineering journal: a way to keep track of symptoms, potential solutions and trade-offs, and finally, the chosen solution. One valuable aspect to this is that you don’t have to go mucking through unrelated information or prose to pick out the valuable bits. In other words, terseness.

Where as user support has a different focus. The focus is on the user and their symptom, making them feel heard, and do our best to convert that information into a succinct chain of information for the engineering journal.

Treating engineering journals and user support as equivalent is preventing us from doing either with proficiency.

We generally don’t have this focus in F/OSS. It requires a set of skills that many of us have not cultivated and probably should. In addition, we should encourage those that already have these skills to join us.

But that raises the question: is gitlab the right place to do user support?

If GNOME were to advance Free Software by taking user support to the next level, what would that look like and what tooling do we need? Is that worth investing in now that we have many applications to support in addition to the desktop plumbing?

Hopefully we can have some discussion about that on the beach in Almería, Spain for GUADEC 2018.

Moving clang out of process

For the past couple of weeks, Builder from git-master has come with a new gnome-builder-clang subprocess. Instead of including libclang in the UI process, we now proxy all of that work to the subprocess. This should have very positive effect on memory usage within the UI process. It will also simplify the process of using valgrind/ASAN and obtaining useful results. In the future, we’ll teach the subprocess supervisor to recycle subprocesses if they consume too much memory.

That is rather important because compilers wreak havoc on the memory allocator. However, it presents a series of challenges too.

Once you move all of that stuff out of process you need to find an efficient way to communicate between processes. The design that I came up with is plain GVariant messages delivered over a stdin/stdout pipe to the subprocess. This is very similar to the Language Server Protocol but lets us cheat in an important way.

We sometimes need to pass large messages between processes including changes to unsaved buffers or completion results. Clang prefers to do client-side filtering of completion results so the list often includes some 25,000+ results. Not exactly the type of thing you want to parse into lots of tiny json objects and strings. By using GVariant as our message format we ensure that we only have a single memory allocation for the entire result set (and can use C strings embedded in the variant too!).

By altering some of our APIs to use interfaces like GListModel we end up more lazy in terms of object creation, use less memory, and fragment significantly less. One area this will be changing in the near future is our new completion engine. It is based on what we learned about other async APIs in Builder along with using GListModel for efficiency.

One thing that is not well known is that g_variant_get_child_value() is O(1) due to how arrays are laid out in GVariant. This couples well with g_list_model_get_item().

This has been a big step forward, but there is more to do.

More compiler fun

Yesterday’s post about -fstack-protector challenges leads us to today’s post. If you haven’t read it, go back and read it first.

Basically, the workaround I had at the time was to just disable -fstack-protector for the get_type() functions. It certainly made things faster, but it was a compromise. The get_type() functions can have user-provided code inserted into them via macros like G_DEFINE_TYPE_EXTENDED() and friends.

A real solution should manage to return the performance of the hot-path back to pre-stack-protector performance without sacrificing the the protection gained by using it.

So I spent some time today to make that happen. In bugzilla #795180 I’ve added some patches which break the get_type() functions into two. One containing the hot path, and a second that does the full type registration which is protected by our g_once_init_enter(). If we add the magic __attribute__((noinline)) to the function doing the full type registration, it can’t be inline’d into the fast path allowing it to pass the stack-protector sniff test (and therefore, not incur the stack checking wrath).

The best part is that applications and libraries only need to be recompiled to get the speedup (due to macro expansion).

Not a bad experience diving into some compiler bits this week.

Compiler complexities

The other day I found myself perusing through some disassembly to get an idea of the code’s complexity. I do that occasionally because I find it the quickest way to determine if something is out of whack.

While I was there, I noticed a rather long _get_type() function. It looked a bit long and more importantly, I only saw one exit point (retq instruction on x86_64).

That piqued my interest because _get_type() functions are expected to be fast. In the fast-path (when they’ve already been registered), I’d expect them to check for a non-zero static GType type_id and if so return it. That is just a handful of instructions at most, so what gives?

The first thing that came to mind is that I use -O0 -g -fno-omit-frame-pointer on my local builds so that I get good debug symbols and ensure that Linux-perf can unwind the stack when profiling. So let’s disable that.

Now I’ve got everything rebuilt with the defaults (-O2 -fomit-frame-pointer). Now I see a couple exit points, but still what appears to be too many for the fast path and what is this __stack_chk_fail@plt I see?

A quick search yields some information about -fstack-protector which is a recent (well half-decade) compiler feature that employs various tricks to detect stack corruption. Distributions seem to enable this by default using -fstack-protector-strong. That tries to only add stack checks to code it thinks is accessing stack allocated data.

So quick recompile with -fno-stack-protector to disable the security feature and sure enough, a proper fast path emerges, We drop from 15 instructions (with 2 conditional jumps) to 5 instructions (with 1 conditional jump).

So the next question is: “Well okay, is this worth making faster?”

That’s a complicated question. The code was certainly faster before that feature was enabled by default. The more instructions you execute, the more code has to be loaded into the instruction cache (which is very small). To load instructions means pushing others out. So even if the code itself isn’t much faster, it can prevent the surrounding code from being faster.

But furthermore, we do a lot of _get_type() calls. They are used when doing function precondition checks, virtual methods, signals, type checks, interface lookups, checking a type for conformance, altering the reference count, marshaling data, accessing properties, … you get the idea.

So I mucked through about 5 different ways trying to see if I could make things faster without disabling the stack protector, without much luck. The way types are registered access some local data via macros. Nothing seemed to get me any closer to those magic 5 instructions.

GCC, since version 4.4, has allowed you to disable the stack-protector on a per-function basis. Just add __attribute__((optimize("no-stack-protector"))) to the function prototype.

#if G_GNUC_CHECK_VERSION(4, 4)
# define G_GNUC_NO_STACK_PROTECTOR \
  __attribute__((optimize("no-stack-protector")))
#else
# define G_GNUC_NO_STACK_PROTECTOR
#endif

GType foo_get_type (void) G_GNUC_NO_STACK_PROTECTOR;

Now we get back to our old (faster) version of the code.

 48 8b 05 b9 15 20 00    mov    0x2015b9(%rip),%rax   
 48 85 c0                test   %rax,%rax
 74 0c                   je     400ac8
 48 8b 05 ad 15 20 00    mov    0x2015ad(%rip),%rax   
 c3                      retq   

But what’s the difference you ask?

I put together a test that I can run on a number of machines. It was unilaterally faster in each case (as expected), but some by as much as 50% (likely due to various caches).

Arch OS Type Speedup
ARM Ubuntu 14.04 Odroid X2 +24%
x64_64 Fedora 28 X1 Carbon Gen3 +25.5%
x86_64 Fedora 28 i7 gen7 NUC +12.25%
x86_64 Fedora 28 Surface Book 2 i7 (gen8) +12.5%
x86_64 Fedora 27 Onda Tablet +50.6%
x86 Debian netbook +12.5%

It’s my opinion that one place where it makes sense to keep things very fast (and reduce instruction cache blow-out) is a type system. That code gets run a lot intermixed between all the code you really care about.

GTask and Threaded Workers

GTask is super handy, but it’s important you’re very careful with it when threading is involved. For example, the normal threaded use case might be something like this:

state = g_slice_new0 (State);
state->frob = get_frob_state (self);
state->baz = get_baz_state (self);

task = g_task_new (self, cancellable, callback, user_data);
g_task_set_task_data (task, state, state_free);
g_task_run_in_thread (task, state_worker_func);

The idea here is that you create your state upfront, and pass that state to the worker thread so that you don’t race accessing self-> fields from multiple threads. The “shared nothing” approach, if you will.

However, even this isn’t safe if self has thread usage requirements. For example, if self is a GtkWidget or some other object that is expected to only be used from the main-thread, there is a chance your object could be finalized in a thread.

Furthermore, the task_data you set could also be finalized in the thread. If your task data also holds references to objects which have thread requirements, those too can be unref’d from the thread (thereby cascading through the object graph should you hit this undesirable race).

Such can happen when you call g_task_return_pointer() or any of the other return variants from the worker thread. That call will queue the result to be dispatched to the GMainContext that created the task. If your CPU task-switches to that thread before the worker thread has released it’s reference you risk the chance the thread holds the last reference to the task.

In that situation self and task_data will both be finalized in that worker thread.

Addressing this in Builder

We already have various thread pools in Builder for work items so it would be nice if we could both fix the issue in our usage as well as unify the thread pools. Additionally, there are cases where it would be nice to “chain task results” to avoid doing duplicate work when two subsystems request the same work to be performed.

So now Builder has IdeTask which is very similar in API to GTask but provides some additional guarantees that would be very difficult to introduce back into the GTask implementation (without breaking semantics). We do this by passing the result and the threads last ownership reference to the IdeTask back to the GMainContext at the same time, ensuring the last unref happens in the expected context.

While I was at it, I added a bunch of debugging tools for myself which caught some bugs in my previous usage of GTask. Bugs were filed, GTask has been improved, yadda yadda.

But I anticipate the threading situation to remain in GTask and you should be aware of that if you’re writing new code using GTask.

Device Integration

I’ve been working on some groundwork features to sneak into Builder 3.28 so that we can build upon them for 3.30. In particular, I’ve started to land device abstractions. The goal around this is to make it easier to do cross-architecture development as well as support devices like phones, tablets, and IoT.

For every class and kind of device we want to support, some level of integration work will be required for a great experience. But a lot of it shares some common abstractions. Builder 3.28 will include some of that plumbing.

For example, we can detect various Qemu configurations and provide cross-architecture building for Flatpak. This isn’t using a cross-compiler, but rather a GCC compiled for aarch64 as part of the Flatpak SDK. This is much slower than a proper cross-compiler toolchain, but it does help prove some of our abstractions are working correctly (and that was the goal for this early stage).

Some of the big things we need to address as we head towards 3.30 include running the app on a remote device as well as hooking up gdb. I’d like to see Flatpak gain the support for providing a cross-compiler via an SDK extension too.

I’d like to gain support for simulators in 3.30 as well. My personal effort is going to be based around GNU-based systems. However, if others join in and want to provide support for other sorts of devices, I’d be happy to get the contributions.

Anyway, Builder 3.28 can do some interesting things if you know where to find them. So for 3.30 we’ll polish them up and make them useful to vastly more people.

Introducing deviced

Over the past couple of weeks I’ve been heads down working on a new tool along with Patrick Griffis. The purpose of this tool is to make it easier to integrate IDEs and other tooling with GNU-based gadgets like phones, tablets, infotainment, and IoT devices.

Years ago I was working on a GNOME-integrated home router with davidz which sadly we never finished. One thing that was obvious to me in that moment of time was that I will not do another large scale project until I have better tooling. That was Builder’s genesis and device integration is what will make it truly useful to myself and others who love playing with GNU-friendly gadgets.

Now, building an IDE is a long process. There is a ton of code to write, trade-offs to work through, and persistence beyond what any reasonable programmer would voluntarily sign up for. But the ends justify the slog.

So what we’ve created is uninterestingly called “deviced”. It currently has three components. A deviced daemon lives on the target device that we’re interested in writing software for. A GObject-based libdeviced library provides access to discover and connect to devices and do interesting things on them. Lastly, devicectl is a readline-based command line tool that allows you to interact with these devices without having to write a program using libdeviced.

The APIs in libdeviced are appropriately abstracted so that we can provide different transports in the future. Currently, we only have network-based communication but we will implement a USB transport in the not-too-distant future. Other protocols such as SSH or custom micro-controllers can be added. Although something like SSH is more complex because it’d be the combination of both a protocol and how to run commands to get the intended effect, which is non-portable. It will be possible to support devices that do not run deviced, but that is currently out of scope.

To allow devices to be discover-able, deviced will broadcast it’s presence using mDNS on networks it is configured to listen (based on network-manager connection UUID). Long term my goal is that you can configure deviced access in Control Center, similar to “Sharing and Privacy”. The network protocol is rather simple as it’s just JSON-RPC over TLS with self-signed certificates. When a client connects to the daemon, a gnome-shell notification is presented allowing you to accept the connection. At that point, the client certificate is saved for future validation.

Our libdeviced library is GObject introspectable and should therefore work with a number of languages.

Right now, only Flatpak applications are supported, but we have abstractions to allow for contributions to support additional application layers like docker or plain old .desktop files. Currently you can push flatpak applications and runtimes to the device and install them and run them. If you have a new enough Flatpak, you can do delta updates.

It can even bridge multiple PTY devices for a shell, which isn’t really meant to be an SSH replacement, but more of a single abstraction we can use to be able to control a debugger and inferior from the IDE tooling.

There are still lots of little bugs to shake out and more bits to implement, but this is a pretty sweet 2-week proof of concept.

https://gitlab.gnome.org/chergert/deviced/

Here is a 20 second demo running on a single machine. It’s the same when using multiple machines except you get the notification on the programmable device rather than on your workstation. Obviously for IoT devices we’d need to create some sort of freedesktop notification bridge or alternate notification mechanism.

Anytime you work on a new project people will inevitably ask “why not just use XYZ”. In this case, I would expect both SSH and ADB to fall into that category. Most importantly, libdeviced is going to be about providing a single “remote device” abstraction for us in Builder. So it’s reasonable that we could abstract both of those systems from libdeviced. But neither of those provide the work-flow I envision for out-of-box experience, hence the deviced daemon. In the ADB case, it will be very difficult to get code upstream and released to distributions as it is increasingly unlikely our use-case is interesting to upstream. There were experimental patches to ADB a couple years ago to support flatpak so we didn’t take on this effort without considering our options. Ultimately, this prototype was to see the feasibility of making something that solves our problems while not locking us out of supporting other systems in the future.