Update on Newton, the Wayland-native accessibility project

Several months ago, I announced that I would be developing a new accessibility architecture for modern free desktops. Now, I’m happy to provide an update on this project, code-named Newton. Before I begin, I’d like to thank the Sovereign Tech Fund for funding this work, and the GNOME Foundation for managing the contract.

A word on the name

When choosing a working name for this project, I decided to follow the convention established by Wayland itself, and followed by a couple of other projects including the Weston compositor, of naming Wayland and related projects after places in New England. Newton, Massachusetts is the town where the Carroll Center for the Blind is located.

Demo

Here’s a demo of me using a modified GNOME OS image with a couple of GTK 4 apps running inside Flatpak sandboxes without the usual accessibility exception.

[Video demo]

Builds for testing

The following builds are based on GNOME 46.2 with my current unmerged Newton modifications. The corresponding Git branches are linked below.

I’ve also built a Flatpak repository, but it isn’t currently signed, so it doesn’t have a .flatpakrepo file. You can add it manually with this command:

flatpak remote-add --user --no-gpg-verify newton https://mwcampbell.us/gnome/gnome-46-newton/repo/

Because the Flatpak repository is based on GNOME 46, you can use Flatpak apps that were built for GNOME 46 with the Newton version of the org.gnome.Platform runtime. You can install that runtime with this command:

flatpak install newton org.gnome.Platform

Source repositories

Here are the links to the unmerged Newton branches of the individual components:

Here are my branches of the Buildstream metadata repositories, used to build the GNOME OS image and Flatpak runtime:

  • freedesktop-sdk
  • gnome-build-meta

Only the last of these repositories, gnome-build-meta, needs to be checked out directly. With it, one should be able to reproduce my builds.

If you want to do your own builds of the relevant components, my addition to the Orca README has instructions. The Orca GitLab project linked above is also a good place to provide end-user feedback.

What’s working so far

I’ve now implemented enough of the new architecture that Orca is basically usable on Wayland with some real GTK 4 apps, including Nautilus, Text Editor, Podcasts, and the Fractal client for Matrix. Orca keyboard commands and keyboard learn mode work, with either Caps Lock or Insert as the Orca modifier. Mouse review more or less works, and flat review is working too. The Orca command to left-click the current flat review item works for standard GTK 4 widgets.

As shown in the recorded demo above, Newton-enabled applications can run inside a Flatpak sandbox without the usual exception for the AT-SPI bus, that is, with the --no-a11y-bus option to flatpak run. Support for such sandboxing was one of the major goals of this project.

The standard GTK text widgets, including GtkEntry and GtkTextView, have fairly complete support. In particular, when doing a Say All command, the caret moves as expected. I was also careful to support the full range of Unicode, including emoji with combining characters such as skin tone modifiers.
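
To make the Unicode point concrete, here’s a small self-contained Rust example (using the unicode-segmentation crate, not the actual GTK or AccessKit code) showing why a caret must move by grapheme clusters rather than by individual code points:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "👍🏽" is a thumbs-up emoji followed by a skin tone modifier:
    // two Unicode scalar values, but one user-perceived character.
    let s = "a👍🏽b";

    // Iterating by char would split the emoji from its modifier.
    assert_eq!(s.chars().count(), 4);

    // Grapheme clusters match what a caret (and Say All) should step over.
    let graphemes: Vec<&str> = s.graphemes(true).collect();
    assert_eq!(graphemes, vec!["a", "👍🏽", "b"]);
}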

What’s broken or not done yet

The GNOME Shell UI itself is not yet using Newton, but AT-SPI. That UI is still accessible with the Newton versions of Mutter and Orca, but because it’s being accessed via AT-SPI, its performance is not representative of Newton, and mouse review doesn’t work for it.

Synthesizing mouse events isn’t yet supported on Wayland. This means that while the Orca command for left-clicking the current flat review item is expected to work for standard GTK 4 widgets, that command doesn’t work for widgets that don’t support the appropriate accessible action, and the right-click command doesn’t work.

AccessKit doesn’t currently support sentences as text boundaries. This means that Orca’s Say All command falls back to reading by line, leading to unnatural breaks in the speech.

The GTK AccessKit implementation doesn’t yet support out-of-tree text widgets that implement the GtkAccessibleText interface, such as the GTK 4 version of the vte terminal widget. This means that GTK 4-based terminal apps like kgx don’t yet work with Newton. I don’t yet know how I’ll solve this, as the current GtkAccessibleText interface is not a good fit for the push-based approach of AccessKit and Newton.

Text attributes such as font, size, style, and color aren’t yet exposed. AccessKit has properties for these attributes, but the AccessKit AT-SPI backend, much of which is reused by the Newton AT-SPI compatibility library, doesn’t yet support them.

Tables aren’t yet supported. AccessKit has properties for tables, and the GTK AccessKit backend is setting these properties, but the AccessKit AT-SPI backend doesn’t yet expose these properties.

Some states, such as “expanded”, “has popup”, and “autocomplete available”, aren’t yet exposed.

I’m aware that some GTK widgets don’t have the correct roles yet.

When Caps Lock is set as the Orca modifier, you can’t yet toggle the state of Caps Lock itself by pressing it twice quickly.

Orca is the only assistive technology supported so far. In particular, assistive technologies that are implemented inside GNOME Shell, like the screen magnifier, aren’t yet supported.

Bonus: Accessible GTK apps on other platforms

Because we decided to implement Newton support in GTK by integrating AccessKit, this also means that, at long last, GTK 4 apps will be accessible on Windows and macOS as well. The GTK AccessKit implementation is already working on Windows, and it shouldn’t be much work to bring it up on macOS. To build and test on Windows, check out the GTK branch I linked above and follow the instructions in its README. I’ve built and tested this GTK branch with both Visual Studio (using Meson and the command-line Visual C++ tools) and MSYS 2. I found that the latter was necessary for testing real-world apps like gnome-text-editor.

Architecture overview

Toolkits, including GTK, push accessibility tree updates through the new accessibility Wayland protocol in the wayland-protocols repository linked above. The latest accessibility tree update is part of the surface’s committed state, so the accessibility tree update is synchronized with the corresponding visual frame. The toolkit is notified when any accessibility clients are interested in receiving updates for a given surface, and when they want to stop receiving updates, so the toolkit can activate and deactivate its accessibility implementation as needed. This way, accessibility only consumes CPU time and memory when it’s actually being used. The draft Wayland protocol definition includes documentation with more details.

Assistive technologies or other accessibility clients currently connect to the compositor through a D-Bus protocol, defined in the Mutter repository linked above. By exposing this interface via D-Bus rather than Wayland, we make it easy to withhold this communication channel from sandboxed applications, which shouldn’t have this level of access. Currently, an assistive technology can find out about a surface when it receives keyboard focus or when the pointer moves inside it, and can then ask to start receiving accessibility tree updates for that surface.
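
To make that flow concrete, here’s a rough, hypothetical sketch of the AT-side logic; none of these type or method names come from the actual Newton client library:

use std::collections::HashSet;

// Hypothetical surface identifier assigned by the compositor.
type SurfaceId = u64;

// Events the compositor delivers to the AT over D-Bus.
enum CompositorEvent {
    KeyboardFocus(SurfaceId),
    PointerEntered(SurfaceId),
}

struct AtClient {
    subscribed: HashSet<SurfaceId>,
}

impl AtClient {
    fn handle(&mut self, event: CompositorEvent) {
        let surface = match event {
            CompositorEvent::KeyboardFocus(s) => s,
            CompositorEvent::PointerEntered(s) => s,
        };
        // The AT only learns about a surface through these events; once
        // it knows about one, it can ask for tree updates.
        if self.subscribed.insert(surface) {
            // In the real protocol, this would be a D-Bus method call
            // asking the compositor to start sending updates.
            println!("subscribing to surface {surface}");
        }
    }
}

fn main() {
    let mut client = AtClient { subscribed: HashSet::new() };
    client.handle(CompositorEvent::KeyboardFocus(7));
    client.handle(CompositorEvent::PointerEntered(7)); // already subscribed; no-op
}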

The same D-Bus interface also provides an experimental method of receiving keyboard events and intercepting (“grabbing”) certain keys. This is essential functionality for a screen reader such as Orca. We had originally planned to implement a Wayland solution for this requirement separately, but decided to prototype a solution as part of the Newton project to unblock realistic testing and demonstration with Orca. We don’t yet know how much of this design for keyboard event handling will make it to production.

The compositor doesn’t process accessibility tree updates; it only passes them through from applications to ATs. This is done using file descriptor passing. Currently, the file descriptors are expected to be pipes, but I’ve thought about using shared memory instead. The latter would allow the AT to read the accessibility tree update without having to block on I/O; this could be useful for ATs that run inside Mutter itself, such as the screen magnifier. (The current Newton prototype doesn’t yet work with such ATs.) I don’t know which approach is overall better for performance though, especially when one takes security into account.
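
As a rough illustration of the pipe-based approach, here’s a sketch of the AT side reading one update; the 4-byte length-prefix framing is purely my assumption for this example, not the actual wire protocol:

use std::fs::File;
use std::io::Read;
use std::os::fd::{FromRawFd, RawFd};

// Read one serialized tree update from a pipe fd received via
// fd passing. (A long-lived reader would keep the File around.)
fn read_tree_update(fd: RawFd) -> std::io::Result<Vec<u8>> {
    // Safety: assumes we own this fd, having received it over D-Bus.
    let mut pipe = unsafe { File::from_raw_fd(fd) };

    let mut len_buf = [0u8; 4];
    pipe.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;

    // This blocking read is the I/O wait that a shared-memory scheme
    // could avoid for ATs running inside the compositor process.
    let mut payload = vec![0u8; len];
    pipe.read_exact(&mut payload)?;
    Ok(payload)
}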

The serialization format for accessibility tree updates is currently JSON, but I’m open to alternatives. Obviously we need to come to a decision on this before this work can go upstream. The JSON schema isn’t yet documented; so far, all code that serializes and deserializes these tree updates is using AccessKit’s serialization implementation.
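
Since the schema isn’t documented yet, the following is only a simplified stand-in for AccessKit’s real types, but it illustrates the shape of the data that travels from the application to the AT:

use serde::{Deserialize, Serialize};

// Simplified, hypothetical stand-ins for AccessKit's tree-update types.
#[derive(Serialize, Deserialize)]
struct Node {
    id: u64,
    role: String,
    name: Option<String>,
    children: Vec<u64>,
}

#[derive(Serialize, Deserialize)]
struct TreeUpdate {
    // Only nodes that are new or changed since the previous update.
    nodes: Vec<Node>,
    focus: u64,
}

fn main() -> serde_json::Result<()> {
    let update = TreeUpdate {
        nodes: vec![Node {
            id: 1,
            role: "button".into(),
            name: Some("OK".into()),
            children: vec![],
        }],
        focus: 1,
    };
    // This JSON blob is what would flow through the pipe, from the
    // application, through the compositor, to the AT.
    println!("{}", serde_json::to_string_pretty(&update)?);
    Ok(())
}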

In addition to tree updates, this architecture also includes one other form of serialized data: accessibility action requests. These are passed in the reverse direction, from the AT to the application via the compositor, again using file descriptor passing. Supported actions include moving the keyboard focus, clicking a widget, setting the text selection or caret position, and setting the value of a slider or similar widget. The notes about serialization of tree updates above also apply to action requests.
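
Action requests are much smaller than tree updates. A hypothetical enum (again, my own stand-in, not AccessKit’s actual types) conveys the flavor:

use serde::{Deserialize, Serialize};

// Hypothetical action requests sent from the AT back to the application.
#[derive(Serialize, Deserialize)]
#[serde(tag = "action")]
enum ActionRequest {
    Focus { target: u64 },
    Click { target: u64 },
    SetTextSelection {
        anchor: u64,
        anchor_offset: usize,
        focus: u64,
        focus_offset: usize,
    },
    SetValue { target: u64, value: f64 },
}

fn main() -> serde_json::Result<()> {
    let req = ActionRequest::Click { target: 42 };
    // Prints: {"action":"Click","target":42}
    println!("{}", serde_json::to_string(&req)?);
    Ok(())
}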

Note that the compositor is the final authority on which tree updates are sent to the ATs at what time, as well as on which surface has the focus. This is in contrast with AT-SPI, where ATs receive tree updates directly from applications, and any application can claim to have the focus at any time. This is important for security, especially for sandboxed applications.

Open architectural issues

The biggest unresolved issue at this point is whether the push-based approach of Newton, the motivation for which I described in the previous post, will have unsolvable drawbacks, e.g. for large text documents. The current AccessKit implementation for GtkTextView pushes the full content of the document, with complete text layout information. On my brand-new desktop computer, this performs well even when reading an 800 KB ebook, but of course there are bigger documents, and not every machine is that fast. We will likely want to explore ways of incrementally pushing parts of the document based on what’s visible, adding and removing paragraphs as they go in and out of view; a sketch of this idea follows below. The challenge is to do this without breaking screen reader functionality that people have come to depend on, such as Orca’s Say All command. My best idea about how to handle this didn’t occur to me until after I had finished the current implementation. In any case, we should start testing the current, naive implementation and see how far it takes us.
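
To make the incremental idea concrete, here’s one possible (entirely hypothetical, unimplemented) windowing scheme, where only paragraphs near the viewport exist in the pushed tree and scrolling produces node additions and removals:

use std::collections::HashSet;
use std::ops::Range;

struct DocumentView {
    paragraph_count: usize,
    pushed: HashSet<usize>, // paragraph indices currently in the tree
}

impl DocumentView {
    // Called on scroll; returns (added, removed) paragraph indices,
    // which would become node insertions and removals in the next
    // tree update pushed to the compositor.
    fn slide_to(&mut self, visible: Range<usize>, margin: usize) -> (Vec<usize>, Vec<usize>) {
        let start = visible.start.saturating_sub(margin);
        let end = (visible.end + margin).min(self.paragraph_count);
        let wanted: HashSet<usize> = (start..end).collect();

        let added: Vec<usize> = wanted.difference(&self.pushed).copied().collect();
        let removed: Vec<usize> = self.pushed.difference(&wanted).copied().collect();
        self.pushed = wanted;
        (added, removed)
    }
}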

The current AT protocol mentioned above doesn’t provide a way for ATs to explore all accessible surfaces on the desktop; they can only find out about an accessible surface if it receives keyboard focus or if the pointer moves inside it. A solution to this problem may be necessary for ATs other than Orca, or for automated testing tools which currently use AT-SPI.

The current architecture assumes that each Wayland surface has a single accessibility tree. There isn’t yet an equivalent to AT-SPI’s plugs and sockets, to allow externally generated subtrees to be plugged into the surface’s main tree. Something like this may be necessary for web rendering engines.

I’m not yet sure how I’ll implement Newton support in the UI of GNOME Shell itself. That UI runs inside the same process as the compositor, and isn’t implemented as Wayland surfaces but as Clutter actors (the Wayland surfaces themselves map to Clutter actors). So the existing AccessKit Newton backend won’t work for this UI as it did for GTK. One option would be for Mutter to directly generate serialized tree updates without going through AccessKit. That would require us to finalize the choice of serialization format sooner than we otherwise might. While not as convenient as using the AccessKit C API as I did in GTK, that might be the least difficult option overall.

Newton doesn’t expose screen coordinates, for individual accessible nodes or for the surfaces themselves. ATs are notified when the pointer moves, but the compositor only gives them the accessible surface ID that the pointer is inside, and the coordinates within that surface. I don’t yet have a solution for explore-by-touch, alternative input methods like eye-tracking, or ATs that want to draw overlays on top of accessible objects (e.g. a visual highlight for the screen reader cursor).

Next steps

The earlier section on what’s broken or not done yet includes several issues that should be straightforward to fix. I’ll fix as many of these as possible in the next several days.

But the next major milestone is to get my GTK AccessKit integration reviewed and merged. Since Newton itself isn’t yet ready to go upstream, the immediate benefit of merging GTK AccessKit support would be accessibility on Windows and macOS. The current branch, which uses the prototype Newton backend for AccessKit, can’t be merged, but it wouldn’t be difficult to optionally support AccessKit’s AT-SPI backend instead, while keeping the Newton version on an unmerged branch.

The main challenge I need to overcome before submitting the GTK AccessKit integration for review is that the current build system for the AccessKit C bindings is not friendly to distribution packagers. In particular, one currently has to have rustup and a Rust nightly toolchain installed in order to generate the C header file, and there isn’t yet support for installing the header file, library, and CMake configuration files in FHS-compliant locations. Also, that build process should ideally produce a pkg-config configuration file. My current gnome-build-meta branch has fairly ugly workarounds for these issues, including a pre-generated C header file checked into the repository. My current plan for solving the nightly Rust requirement is to commit the generated header file to the AccessKit repository. I don’t yet know how I’ll solve the other issues; I might switch from CMake to Meson.

The other major thing I need to work on soon is documentation. My current contract with the GNOME Foundation is ending soon, and we need to make sure that my current work is documented well enough that someone else can continue it if needed. This blog post itself is a step in that direction.

Help wanted: testing and measuring performance

I have not yet systematically tested and measured the performance of the Newton stack. To be honest, measuring performance isn’t something that I’m particularly good at. So I ask that Orca users try out the Newton stack in scenarios that are likely to pose performance problems, such as large documents as discussed above. Then, when scenarios that lead to poor performance are identified, it would be useful to have someone who is skilled with a profiler or similar tools help me investigate where the bottlenecks actually are.

Other desktop environments

While my work on Newton so far has been focused on GNOME, I’m open to working with other desktop environments as well. I realize that the decision to use D-Bus for the AT client protocol won’t be universally liked; I suspect that wlroots-based compositor developers in particular would rather implement a Wayland protocol extension. Personally, I see the advantages and disadvantages of both approaches, and am not strongly attached to either. One solution I’m considering is to define both D-Bus and Wayland protocols for the interface between the compositor and ATs, and support both protocols in the low-level Newton client library, so each compositor can implement whichever one it prefers. Anyway, I’m open to feedback from developers of other desktop environments and compositors.

Conclusion

While the Newton project is far from done, I hope that the demo, builds, and status update have provided a glimpse of its potential to solve long-standing problems with free desktop accessibility, particularly as the free desktop world continues to move toward Wayland and sandboxing technologies like Flatpak. We look forward to testing and feedback from the community as we keep working to advance the state of the art in free desktop accessibility.

Thanks again to the Sovereign Tech Fund and the GNOME Foundation for making this work possible.

Automated testing of GNOME accessibility features

GNOME is participating in the December 2023 – February 2024 round of Outreachy. As part of this project, our interns Dorothy Kabarozi and Tanju Achaleke have extended our end-to-end tests to cover some of GNOME’s accessibility features.

End-to-end testing, also known as UI testing, involves simulating user interactions with GNOME’s UI. In this case we’re using a virtual machine which runs GNOME OS, so the tests run on the latest, in-development version of GNOME built from the gnome-build-meta integration repo. The tests send keyboard and mouse events to drive the UI in the VM, and use fuzzy screenshot comparisons to assert correct behavior. We use a tool called openQA to develop and run the tests.

Some features are easier to test than others. So far we’ve added tests for the following accessibility features:

  • High contrast theme
  • Large text theme
  • Always-visible scrollbars
  • Audio over-amplification (boost volume above 100%)
  • Visual alerts (flash screen when the error ‘bell’ sound plays)
  • Text-to-speech using Speech Dispatcher
  • Magnifier (zoom)
  • On-screen keyboard

In this screenshot you can see some of the tests:

[Screenshot of openqa.gnome.org showing tests like a11y_high_contrast, a11y_large_text, etc.]

Here’s a link to the actual test run from the screenshot: https://openqa.gnome.org/tests/3058

These tests run every time the gnome-build-meta integration repo is updated, so we can very quickly detect if a code change in the ‘main’ branch of any GNOME module has unintentionally caused a regression in some accessibility feature.

GNOME’s accessibility features are seeing some design and implementation improvements at the moment, thanks to several volunteer contributors, investments from the Sovereign Tech Fund and Igalia, and more. As improvements land, the tests will need updating too. Screenshots can be updated using openQA’s web UI, available at https://openqa.gnome.org; instructions are available. The tests themselves live in openqa-tests.git and are simple Perl programs using openQA’s testapi. Of course, merge requests to extend and improve the tests are very welcome.

One important omission from the testsuite today is Orca, the GNOME screen reader. Tanju spent a long time trying to get this to work; while we do have a test that verifies text-to-speech using Speech Dispatcher, Orca itself is more complicated, and we’ll need to spend more time figuring out how best to set up end-to-end tests for screen reading.

If you have feedback on the tests, we’d love to hear from you over on the GNOME Discourse forum.

A new accessibility architecture for modern free desktops

My name is Matt Campbell, and I’m delighted to announce that I’m joining the GNOME accessibility team to develop a new accessibility architecture. After providing some brief background information on myself, I’ll describe what’s wrong with the current Linux desktop accessibility architecture, including a design flaw that has plagued assistive technology developers and users on multiple platforms, including GNOME, for decades. Then I’ll describe how two of the three current browser engines have solved this problem in their internal accessibility implementations, and discuss my proposal to extend this solution to a next-generation accessibility architecture for GNOME and other free desktops.

Introducing myself

While I’m new to the GNOME development community, I’m no stranger to accessibility. I’m visually impaired myself, and I’ve been working on accessibility in one form or another for more than 20 years. Among other things:

  • I contributed to the community of blind Linux users from 1999 through 2001. I modified the ZipSlack mini-distro to include the Speakup console screen reader, developed the trplayer command-line front-end for RealPlayer, and helped several new users get started.
  • From 2003 to 2004, I developed a talking browser based on the Mozilla Gecko engine; it ran on both Windows and Linux.
  • Starting in 2004, I developed a Windows screen reader, called System Access, for Serotek (which has since been acquired by my current company, Pneuma Solutions).
  • I later worked on the Windows accessibility team at Microsoft, where I contributed to the Narrator screen reader and the UI Automation API, from mid 2017 to late 2020. (Rest assured, the non-compete clause in my employment agreement with Microsoft expired long ago.)
  • For the past two years, I have also been the lead developer of AccessKit, a cross-platform accessibility abstraction for GUI toolkits. My upcoming work on the GNOME accessibility architecture will build on the work I’ve been doing on AccessKit.

The problems we need to solve

The free desktop ecosystem has changed dramatically since the original GNOME accessibility team, led by Sun Microsystems, designed the original Assistive Technology Service Provider Interface (AT-SPI) in the early 2000s. Back then, per-application security sandboxing was, at best, a research project. X11 was the only free windowing system in widespread use, and it was taken for granted that each application would both know and control the position of each of its windows in global screen coordinates. Obviously, with the rise of Flatpak and Wayland, all of these things have changed, and the GNOME accessibility stack must adapt.

But in my opinion, AT-SPI also has the same fatal flaw as most other accessibility APIs, going back to the 1990s. The first programmatic accessibility API, Microsoft Active Accessibility (MSAA), was introduced in 1997. Sun implemented the Java Access Bridge (JAB) for Windows not long after. What MSAA, the JAB, and AT-SPI all have in common is that their performance is severely limited by the latency of multiple inter-process communication (IPC) round trips. The more recent UI Automation API (introduced in 2005) mitigated this problem somewhat with a caching system, as did AT-SPI, but that has never been a complete solution, especially when assistive technologies need to traverse text documents. For decades, those of us who have developed Windows screen readers have been so determined to work around this IPC bottleneck that we’ve relied, to varying degrees, on the ability to inject some of our code into application processes, so we can more efficiently fetch the information we need and do our own IPC. Needless to say, this approach has grave drawbacks for security and robustness, and it’s not an option on any platform other than Windows. We need a better solution.

The solution: Push-based accessibility

We can find such a solution in the internal accessibility architecture of some modern browsers, particularly Chromium and (more recently) Firefox. As you may know, these browsers have a multi-process architecture, where a sandboxed process, known as the content process or renderer process, renders web content and executes JavaScript. These sandboxed processes have all of the information needed to produce an accessibility tree, but for various reasons, it’s still optimal or even necessary to implement the platform accessibility APIs in the main, unsandboxed browser process. So, to prevent the multi-process architecture from further degrading browser performance for assistive technology users, these browsers internally implement what I’ll call a push architecture. There’s still IPC happening internally, but the renderer process initially pushes a complete serialized snapshot of an accessibility tree, followed by incremental tree updates. The platform accessibility API implementations in the main browser process can then respond immediately to queries using a local copy of the accessibility tree. This is in contrast with the pull-based platform accessibility APIs, where assistive technologies or other clients pull information about one node at a time, sometimes incurring IPC round trips for one property at a time. In terms of latency, the push-based approach is far more efficient. You can learn more in Jamie Teh’s blog post about Firefox’s Cache the World project.
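
The essence of the push model fits in a few lines. Here’s a minimal, hypothetical sketch of the consumer side (the names are mine, not Chromium’s or Firefox’s): the first update delivers a complete snapshot, later updates overwrite only the nodes that changed, and every query afterwards is a purely local lookup:

use std::collections::HashMap;

type NodeId = u64;

#[derive(Clone)]
struct Node {
    role: String,
    name: Option<String>,
    children: Vec<NodeId>,
}

// The consumer's local copy of the accessibility tree.
struct TreeCache {
    nodes: HashMap<NodeId, Node>,
    root: NodeId,
}

impl TreeCache {
    // Both the initial snapshot and later incremental updates arrive as
    // (id, node) pairs; changed nodes overwrite their previous versions.
    // (Node removal is omitted for brevity.)
    fn apply_update(&mut self, update: Vec<(NodeId, Node)>) {
        for (id, node) in update {
            self.nodes.insert(id, node);
        }
    }

    // Queries and traversals never leave the process: no IPC round trips.
    fn descendant_count(&self, id: NodeId) -> usize {
        let node = &self.nodes[&id];
        1 + node.children.iter().map(|&c| self.descendant_count(c)).sum::<usize>()
    }
}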

Ever since I learned about Chromium’s internal accessibility architecture more than a decade ago, I have believed that assistive technologies would be more robust and responsive if the push-based approach were applied across the platform accessibility stack, all the way from the application to the assistive technology. If you’re a screen reader user, you have likely noticed that if an application becomes unresponsive, you can’t find out anything about what is currently in the application window, while, in a modern, composited windowing system, a sighted user can still see the last rendered frame. A push-based approach would ensure that the latest snapshot of the accessibility tree is likewise always available, in its entirety, to be queried by the assistive technology in any way that the AT developer wants to. And because an AT would have access to a local copy of the accessibility tree, it can quickly perform complex traversals of that tree, without having to go back and forth with the application to gather information. This is especially useful when implementing advanced commands for navigating web pages and other complex documents.

What’s not changing

Before I go further, I want to reassure readers that many of the fundamentals of accessibility are not changing with this new architecture. An accessible UI is still defined by a tree of nodes, each of which has a role, a bounding rectangle, and other properties. Many of these properties, such as name, value, and the various state flags, will be familiar to anyone who has already worked with AT-SPI, the GTK 4 accessibility API, or the legacy ATK. It’s true that we’ll have to add several new properties, especially for text nodes. However, I believe that most of the work that application and toolkit developers have already done to implement accessibility will still be applicable.

Risks and benefits

I’ve written a more detailed proposal for this new architecture, including a discussion of various risks. The risk that concerns me most at this point is that I’m not yet entirely sure how we’ll handle large, complex documents in non-web applications such as LibreOffice. I specifically mention non-web applications here because web applications, such as the various online office suites, are already limited in this respect by the performance of the browser itself, including the internal push-based accessibility implementations of Chromium and Firefox. My guess is that, with this new architecture, applications such as LibreOffice will need to present a virtualized accessible view of the document, similar to what web applications are already doing. We’ll need to make sure that we implement this without giving up features that assistive technology users have come to expect, particularly when it comes to efficiently navigating large documents. But I believe this is feasible.

I’d like to close with a couple of exciting possibilities that my proposal would enable, which I believe make it worth the risks. Let’s start with accessible screenshots. Anyone who uses a screen reader knows how common, and how frustrating, it is to come across screenshots with no useful alternate text (alt text). Even when an author has made an effort to provide useful alt text, it’s still just a flat string. Imagine how much more useful it would be to have access to the full content and structure of the information in the screenshot, as if you were accessing the application that the screenshot came from (assuming the app itself was accessible). With a push-based accessibility architecture, a screenshot tool can quickly grab a full snapshot of the accessibility tree from the window being captured. From there, I don’t think it would be too difficult to propose a way of including a serialized accessibility tree in the screenshot image file. Getting such a thing standardized might be more difficult, but the push architecture would at least eliminate a major technical barrier to accessible screenshots.

Taking this idea a step further, because the proposed push architecture includes incremental tree updates as well as full snapshots, it would also become feasible to implement accessibility in streaming screen-sharing applications. This obviously includes one-on-one remote desktop use cases; imagine extending VNC or RDP with the ability to push accessibility tree updates. But what’s more exciting to me is the potential to add accessibility in one-to-many use cases, such as the screen-sharing features found in online meeting spaces. And while this too might be difficult to standardize, one could even imagine including accessibility information in standard video container formats such as MP4 or WebM, making visual information accessible in everything from conference talks to product demo videos and online courses.

Conclusion

Too often, free desktop platforms struggle to keep up with their proprietary counterparts. That’s not a criticism; it’s just a fact that free desktop projects don’t have the resources of the proprietary giants. But here, we have an opportunity to leap ahead, by betting on a different accessibility architecture. The push approach has already been proven by two of the three major browser engines, and while there are risks in extending this approach outside of the browser context, I strongly believe that the potential benefits are too big to ignore.

Over the next year, we will experiment with this new approach in a prototype, collect feedback from various stakeholders across the ecosystem, and hopefully build a new, stronger foundation for state-of-the-art accessibility in the free desktop world. If you have questions or ideas, don’t hesitate to reach out or leave a comment!

A new blog!

Welcome to the new GNOME accessibility team blog! People have been organizing over the past year or so to improve a11y across the project, and this blog will be a venue for making this kind of work more visible.

We’ll also have some exciting announcements to share soon, so stay tuned!