one more attempt at getting a blog going

January 26, 2016

So every year or two I try to get my blog going again, end up doing a post or two, and then drop off the map again. 2016’s New Year’s resolution is to blog more, so here’s one more go… I’ll just give a summary of some of the things I did in the last few days (mostly rambling, hope that’s okay).

spontaneous sigbus in applications under wayland: Matthias was getting spontaneous crashes from wayland clients when using his desktop for an extended period of time. Normally when it hit him, several apps would go down at once. The stack trace showed it dying in pixman in an SSE function. My first thought was some sort of memory alignment problem in the image buffer, since I know SSE2 has alignment restrictions. I read through the code as carefully as I could and did come up with some cleanups, but nothing to suggest incorrect alignment. pixman handled 4-byte-aligned image data fine, and we were using 4 bytes per pixel.

My next guess was that perhaps the filesystem backing the shared memory was getting full. GDK was using /tmp, which gets used for a lot of other random things, so it was at least a plausible theory. Indeed, if I ran /usr/bin/fallocate to allocate a large file and fill up my own /tmp, I could reproduce the failure. So my first inclination was to switch to shm_open() instead of open() so that /dev/shm would get used in place of /tmp. The downside is that there’s no shm_open() equivalent to mkstemp(), so the patch included some hairy code for generating a non-clashing shm segment name.

My next idea was to use memfd_create(), which landed in the kernel a couple of years ago and still has that new car smell. Turns out there’s no glibc wrapper for it yet, though a patch was posted. I chatted with the libc maintainer and the author of the patch; it just fell through the cracks for non-technical reasons, so it will hopefully land in the future. In the meantime, I did the patch using syscall() directly. Normally when resorting to something as ad hoc as a raw syscall() call, I would do configure checks and fallback code, but since this is all wayland anyway, and not relevant on legacy systems, we can get away without that baggage.
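For the curious, the raw syscall() route looks roughly like the sketch below. This is not the actual GDK patch: the helper names create_anonymous_shm_fd() and map_shm_fd() are made up for illustration, and it assumes Linux kernel headers new enough to define SYS_memfd_create. The name passed to the kernel is only a debugging label, so nothing has to be unique and the hairy name-generation code goes away.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback in case the installed libc headers don't define the flag. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/* Hypothetical helper: returns an fd for an anonymous, memory-backed
 * file of `size` bytes, or -1 with errno set. */
static int
create_anonymous_shm_fd (const char *debug_name, off_t size)
{
        int fd;

        /* Assumes no glibc wrapper is available, so the system call
         * is invoked by number. */
        fd = (int) syscall (SYS_memfd_create, debug_name, MFD_CLOEXEC);
        if (fd < 0)
                return -1;

        /* A new memfd is empty; give it the size of the buffer. */
        if (ftruncate (fd, size) < 0) {
                int saved_errno = errno;

                close (fd);
                errno = saved_errno;
                return -1;
        }

        return fd;
}

/* Hypothetical helper: maps the file so pixel data can be written.
 * Returns MAP_FAILED on error, like mmap() itself. */
static void *
map_shm_fd (int fd, size_t size)
{
        return mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

A caller would create the fd, map it, draw into the mapping, and pass the fd to the compositor over the wayland socket. There is deliberately no fallback path here, matching the reasoning above about not needing configure checks.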

another round at a bug nvidia users encounter at the login screen: Last year I got a report of a very strange bug. For users of the proprietary nvidia driver, after they typed their password and hit enter, the login screen would just freeze up and login wouldn’t proceed until they switched VTs or hit escape. This was an extremely strange bug that had multiple facets explained here. Unfortunately, one of the four fixes ended up causing a mysterious crasher for a user, and it wasn’t strictly needed as long as one or more of the other fixes went in, so I reverted it until I could investigate the crash in more detail. Not having the patch did have a downside, though: it meant an animation stayed running in the background, using up CPU. Recently, it’s been release time, and I wanted a fix in, so I threw the patch back to master and chatted with the reporter on IRC to debug the crasher with him. After some back and forth we discovered the problem and fixed a memory leak, too.

A couple of weeks ago someone on IRC asked for my help debugging a really strange mouse input issue. He told me his pointer wasn’t working very well, and I asked him to switch VTs and switch back to see if that helped. My thought was that perhaps the X server wasn’t getting access to the input device back from logind and that switching VTs and switching back would fix it. He reported happily that it did fix the problem, and then went into greater detail explaining the symptoms. It wasn’t that the mouse didn’t work at all, it was that it only worked for one application at a time (usually firefox or gnome-terminal but also other applications). That application could change tabs and click, but its window couldn’t be dragged around and no other windows could gain focus. Keyboard hotkeys worked, but he couldn’t open up the activities overview, and no system modal dialogs would display. This sounded a lot like a stuck pointer grab, but it was weird that it was happening in more than one application, since stuck pointer grabs are normally application-specific bugs. After switching VTs he couldn’t reproduce the problem anymore, so I told him to come grab me in person (we work in the same building) if it happened again. Earlier today he said it happened again, so I visited his cubicle. After some debugging, I discovered I could unbreak the system temporarily by attaching gdb to the X server and calling a function to clear all grabs. Then as long as I clicked on window title bars, the system stayed functional, but as soon as I clicked inside the application content area, the clicked application would get a stuck grab and the problem would manifest. I was able to reproduce the problem with xterm, so I knew it wasn’t toolkit specific. I was a little stumped, so I did a chvt call and got him going again, then went back to my cube and thought about it for a few minutes. That’s when I remembered two things:

when a mouse button is pressed, the X server implicitly gives the client a pointer grab until the button is released (by design)

when changing VTs, the X server sends mouse release events for every button of every input device

So the conclusion I drew was that, perhaps, the X server wasn’t releasing the implicit pointer grab when the button was released. To test that theory, I needed to see what the behavior of the system would be like if the mouse button was held down while he was using it. I asked him (with his now functional system) to plug a second mouse in, hold down the button from it and then interact with the system using his first mouse. Sure enough, the system behaved in the same broken way as when the problem manifested. So this odd problem really was as if the X server somehow missed a button release, or thought the mouse button was stuck down. It was at this point he realized he had a third, wireless mouse turned on, and smooshed into his backpack. oh. I guess that mystery is solved.

Compositor stack consolidation: one issue we have right now with wayland is that mode setting is split across two parts of the stack. Some of the drm calls happen in cogl and some happen in mutter. We actually get the drm fd from mutter, pass it into cogl, and then pull it back out from cogl at various times. We figure out output setup in mutter using drm apis, then stuff the configuration into a cogl abstraction, then unpack it and apply it using drm apis back in cogl. We do cursor handling in mutter directly. It’s pretty haphazard what happens where, and it would be good to consolidate the drm code in one place. Mutter seems like the best place since it offers the dbus api for doing output configuration, and since it’s the more actively maintained component. I think consolidating the code will make it much easier down the line to rework how we do rendering, for instance to use EGL extensions instead of GBM. One idea, since cogl isn’t that well maintained and has a lot of code for other platforms that we’ll never need, is to ship a sort of cogl lite in the mutter tree. The problem, though, is that clutter is well maintained, and it relies on cogl, too. Well, it turns out Jonas was thinking about the same problem, and he came to the realization that there’s a lot of compositor functionality mutter needs that would be hard to add to clutter and wouldn’t be useful to normal clutter clients. So he proposed merging clutter and cogl forks into mutter. So I’m working on that, stay tuned.

Well, this post is a bit wordy so I’ll cut it off here.

Posted by halfline
Filed in General

5 Responses to “one more attempt at getting a blog going”

Cole Robinson Says:

January 27, 2016 at 11:14 am
Here’s my vote for more rambling blog posts, I enjoyed this :) That mystery mouse story was fun

- Michael Catanzaro Says:
  
  January 29, 2016 at 9:21 pm
  Great ending :D
  
Kristian Høgsberg Says:

January 27, 2016 at 1:18 pm
Big \o/ for merging cogl and clutter into mutter!

hashem Says:

January 28, 2016 at 6:33 pm
Great write up! I laughed out loud at the wireless mouse in bag. I hope you do more posts.

» leaking buffers in wayland Ray Strode Says:

February 1, 2016 at 1:15 pm
[…] in my last blog post I mentioned Matthias was getting SIGBUS when using wayland for a while. You may remember that I […]


Ray Strode