Wandering in the symlink forest forever

Last week, Philip Withnall told me that Meson has built-in support for generating code coverage reports: just configure with -Db_coverage=true, run your tests with ninja test, then run ninja coverage-{text,html,xml} to generate the report in the format of your choice. The XML format is compatible with Cobertura’s output, which is convenient since Endless’s Jenkins is already configured to consume Cobertura XML generated by Autotools projects using our EOS_COVERAGE_REPORT macro. So it was a simple matter of adding gcovr to the build environment, running ninja coverage-xml after the tests, and moving the report to the right place for Jenkins to find it. It worked well on the projects I tested, so I decided to enable it for all Meson projects built in our CI. Sure, I thought, it’s not so useful for our forks of GNOME and third-party projects, but it’s harmless and saves adding per-project config, right?

Fast-forward to yesterday, when someone noticed that a systemd build had been stuck on the ninja coverage-xml step for 16 hours. Uh oh.

It turns out that gcovr follows symlinks when scanning for coverage files, but didn’t check for cycles. systemd’s test suite generates a fake sysfs tree, with many circular references via symlinks. For example, there are 64 self-referential ttyX trees:

$ ls -l build/test/sys/devices/virtual/tty/tty1
total 12
-rw-r--r-- 1 wjt wjt    4 Oct  9 12:16 dev
drwxr-xr-x 2 wjt wjt 4096 Oct  9 12:16 power
lrwxrwxrwx 1 wjt wjt   21 Oct  9 12:16 subsystem -> ../../../../class/tty
-rw-r--r-- 1 wjt wjt   16 Oct  9 12:16 uevent
$ ls -l build/test/sys/devices/virtual/tty/tty1/subsystem/tty1
lrwxrwxrwx 1 wjt wjt 30 Oct  9 12:16 build/test/sys/devices/virtual/tty/tty1/subsystem/tty1 -> ../../devices/virtual/tty/tty1
$ readlink -f build/test/sys/devices/virtual/tty/tty1/subsystem/tty1
/home/wjt/src/endlessm/systemd/build/test/sys/devices/virtual/tty/tty1

And, worse, all other ttyY trees are accessible via the symlinks from each ttyX tree. The kernel caps the number of symlinks per path to 40 before lookups fail with ELOOP, but that’s still 64^40 paths to resolve, just for the fake ttys. Quite a big number!

The fix is straightforward: maintain a set of visited (st_dev, st_ino) pairs while walking the tree, and prune subtrees we’ve already visited. I tried adding a similar highly self-referential symlink graph to the gcovr test suite, so that it would run in reasonable time if the fix works and essentially never terminate if it does not. Unfortunately, pytest has exactly the same bug: while searching for tests to run, it gets lost wandering in the symlink forest forever.
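The visited-set idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the actual gcovr patch: the function name walk_without_cycles is made up for this example, and it assumes a platform where (st_dev, st_ino) uniquely identifies a directory.

```python
import os
import tempfile


def walk_without_cycles(root):
    """Yield file paths under root, following symlinks but entering
    each directory at most once (hypothetical helper, not gcovr's code)."""
    root_stat = os.stat(root)
    # Directories already entered, identified by (device, inode).
    visited = {(root_stat.st_dev, root_stat.st_ino)}

    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        keep = []
        for name in dirnames:
            try:
                # os.stat() follows symlinks, so a link and its target
                # produce the same (st_dev, st_ino) pair.
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # dangling link, ELOOP, permission error, ...
            key = (st.st_dev, st.st_ino)
            if key not in visited:
                visited.add(key)
                keep.append(name)
        # Assigning in place tells os.walk() which subtrees to descend into.
        dirnames[:] = keep

        for name in filenames:
            yield os.path.join(dirpath, name)
```

One consequence of pruning this way is that a directory reachable by several routes is reported only under whichever path the walk reaches first, which is fine when the goal is just to find every coverage file once.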

This bug is a good metaphor for my habit of starting supposedly-quick side-projects.