Fedora Workstation and the quest for stability and robustness

One of the things that makes me really happy in terms of the public reception to the Fedora Workstation is all the people calling out how stable and solid it is, as this was and is one of our big goals from the start of the Fedora Workstation effort.

From the start we wanted to bury the old idea of Fedora being only for people who didn’t mind risking a lot of instability in return for being on the so called bleeding edge. We also wanted to bury the related idea that by using Fedora you where basically alpha testing highly unstable and unfinished software for Red Hat Enterprise Linux. Yet at the same time we did want to preserve and build upon the idea that Fedora is a great operating system if you want to experience a lot of the latest and greatest new developments as they are happening. At first glance those two goals might seem a bit contradictory, but we decided that we should be able to do both by both adjusting our policies a bit and also by relying more on the Fedora retrace server as our bug fixing prioritization tool.

So in terms of policies the division of Fedora into a distinct server and workstation images and also the clearer separation of the spins, allowed us to start making decisions without worrying so much how they affected other usecases than our own. Because sometimes what from a user perspective seems like a bug or something being broken was non-workstation policy decisions getting in the way of the desktop behaving as expected, for instance firewall rules hindering basic desktop functions.

Secondly we incorporated a more careful approach into what and when we brought in new stuff, meaning we still try to keep on top of major upstream developments and be a leading edge system, but at the same time we do a little mental exercise for each decision to make sure its a decision that makes us ‘leading edge’ and not ‘bleeding edge’. And if we really want something in, but it isn’t 100% ready for prime time yet we do what we have done with Wayland or the GTK3 port of LibreOffice, we make it available as an option for early adopters, but we default to the safer choice while we work out the last wrinkles. (Btw, if you are interested in progress on Wayland, Kevin Martin, sent out an emailing with a link to a good Wayland development status just before the Holidays.

The final piece of the puzzle is regularly checking and identifying important bugs from the Fedora retrace server. Because like almost all developers we get way more bug reports than we realistically can ever address, so having the data from the retrace server allows us to easily identify the crashes that affect the most users, and just as importantly lets us filter out the bug reports that are likely caused by users installing weird stuff on their system. When we started using retrace various desktop modules tended to dominate the top 3 pages when sorting bugs based on count, but due to a continuous effort over the last few years desktop modules appearing in the top crashers list are few and far between and when they do appear we make sure to get fixes done quickly for them. So if you ever wonder if the data collected by these kind of systems are actually helping developers working on the software you use better, I can say that it is true for Fedora for sure.

That said I thought it could be interesting to explain a bit the challenges we have with tracking our progress in this area. So lets start by looking at a graph I pulled from the retrace server.
fedora-bug-statistics
Looking at that graph one could say that it is clear that we have made great strides in improving system stability and I do believe that is the case, however the graphs doesn’t truly prove that inconclusively, they are just an indication. The reason they are not hard evidence is that there are a lot of things you need to take into consideration when reading them. First of all they are not adjusted based on total user population, which means that if you win or lose a lot of users between releases it can create an appearance of increased instability or decreased instability which is actually due to increase or decrease in user population, not in ‘how well does the system run on an individual users system’. So from what we see through other metrics our user population has been increasing since we launched the Fedora Workstation which means we shouldn’t be getting any ‘help’ in these graphs from a declining user population.

A second reason is that there are a lot of false positives being reported here, for instance we had an issue for a long while that the Intel graphics drivers generating a ton of this crash reports without it actually being crashes as such. So while they did represent bugs that should ideally be fixed they where not issues you might actually have noticed as a user of the system. So we spent some effort between Fedora Workstation 21 and Fedora Workstation 22 to reduce the amount of noise caused by this, which was an useful effort for us in terms of reducing noise in our retrace server, but from a user perspective it didn’t really make a tangible difference. And even with our efforts there are a still a lot of kernel issues showing up here which are not impacting users in a way that they are likely to perceive as the system being unstable.

A third item that might in a given release skewer the statistics is that we currently don’t differentiate between Fedora Workstation and spins in the statistics, which means that there might be issues caused by one of the spins generating a lot of bug reports against a module, but that might be a bug or an API usage issue that is not triggered by the Workstation edition and thus those items appearing or disappearing might affect the statistics, but as a user of the Fedora Workstation you would never experience it.

So keeping this is mind the retrace server is an important tool for us and one that at least gives us a decent indication of how we are doing with quality. But we can always do better so we will keep reviewing the reports we get through the ABRT and retrace systems and I also do strong recommend any application or library maintainers out there to look into what major issues are reported against their own modules.