London Airports

So the airports in the UK went crazy today after a terror plot was uncovered. The upshot is severe restrictions on what you can take on as hand luggage, and a fair number of flight cancellations.

The restrictions mean you can’t carry laptop computers on board. Instead they want you to check them through and trust them to the baggage handlers …

I’m meant to be flying back to Australia on Saturday, so we’ll see what happens. I’m not particularly looking forward to getting home with a broken laptop.

Launchpad entered into Python bug tracker competition

The Python developers have been looking for a new bug tracker, and essentially put out a tender for people interested in providing one. Recently I have been working on getting Launchpad’s entry ready, which mainly involved working on SourceForge import.

The entry is now up, and our demonstration server is up and running with a snapshot of the Python bug tracker data.

As a side effect of this, we’ve got fairly good SourceForge tracker import support now, which we should be able to use if other projects want to switch away from SF.

Vote Counting and Board Expansion

Recently one of the Gnome Foundation directors quit, and there has been a proposal to expand the board by two members. In both cases, the proposed new members have been taken from the list of candidates who did not win seats at the last election, working down from the highest vote getter.

While at first this sounds sensible, the voting system we use doesn’t provide a way of finding out who would have been selected for the board if a particular candidate was removed from the ballot.

The current voting system gives each foundation member N votes to assign to N candidates (where N is the number of seats on the board). The votes are then tallied for each candidate, and the N candidates with the most votes get the seats.
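
To make the mechanics concrete, here is a rough sketch of that tally in Python (the candidate names and ballots are made up purely for illustration):

    from collections import Counter

    def tally(ballots, num_seats):
        # each ballot is the set of candidates (up to num_seats of them)
        # that one foundation member voted for
        counts = Counter()
        for ballot in ballots:
            counts.update(ballot)
        # the num_seats candidates with the most votes win
        return counts.most_common(num_seats)

    # hypothetical ballots for a seven-seat election
    ballots = [
        {"A", "B", "C", "D", "E", "F", "G"},
        {"A", "C", "E", "G", "H", "I", "J"},
        {"B", "C", "D", "F", "H", "I", "J"},
    ]
    print(tally(ballots, 7))

The problem is visible in the data itself: if a candidate is struck out after the fact, the tally records nothing about which other candidate each of their supporters would have voted for instead.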

If we look at last year’s results, there were 119 people who voted for Luis. If Luis had not been a candidate, then those 119 people would have used that vote to pick other candidates. The difference in the number of votes received by Vincent (the board member receiving the least votes) and Quim (the unsuccessful candidate with the most votes) was just 16, so those extra 119 votes could easily have affected the ordering of the remaining candidates.

Furthermore, if the election had been for nine seats rather than seven, each foundation member would have had an additional two votes to cast.

This particular problem would not be an issue with a preferential voting system where each foundation member lists all the candidates in their order of preference. If a board member drops out, it is trivial to recalculate the results with that candidate removed: the relative orderings of the other candidates on the ballot are preserved. It is also possible to calculate the results for a larger number of seats.
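
With ranked ballots the recalculation really is trivial. A sketch (ignoring which preferential counting method would actually be used, since that isn’t specified here):

    def without_candidate(ballots, withdrawn):
        # drop the withdrawn candidate from every ranked ballot,
        # preserving the relative order of the remaining candidates
        return [[c for c in ballot if c != withdrawn] for ballot in ballots]

    ranked_ballots = [
        ["A", "B", "C", "D"],
        ["B", "D", "A", "C"],
    ]
    # re-run whatever count is used, as if "B" had never stood
    print(without_candidate(ranked_ballots, "B"))
    # [['A', 'C', 'D'], ['D', 'A', 'C']]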

Of course, all the candidates from the last election would make great board members so it isn’t so much of an issue in this case, but it might be worth considering for next time.

JHBuild Updates

Progress on JHBuild has continued (although I haven’t done much in the last week or so). Frederic Peters of JhAutobuild fame now has a CVS account to maintain the client portion of that project in tree.

Perl Modules (#342638)

One of the other things that Frederic has been working on is support for building Perl modules (which use a Makefile.PL instead of a configure script). His initial patch worked fine for tarballs, but by switching over to the new generic version control code in jhbuild it was possible to support Perl modules maintained in any of the supported version control systems without extra effort.
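
For reference, building one of these modules boils down to running perl Makefile.PL and then make. A very rough sketch of the steps involved (this is illustrative, not the actual jhbuild code, and the prefix handling is simplified):

    import subprocess

    def build_perl_module(srcdir, prefix):
        # generate the Makefile (ExtUtils::MakeMaker), then build and install
        subprocess.check_call(["perl", "Makefile.PL", "PREFIX=" + prefix],
                              cwd=srcdir)
        subprocess.check_call(["make"], cwd=srcdir)
        subprocess.check_call(["make", "install"], cwd=srcdir)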

Speed Up Builds (#313997)

One of the other suggestions for jhbuild that came up a while ago was to make it “eleventy billion times faster”. In essence, this meant adding a mode where it would only rebuild modules that had changed. While the idea has merit, the proposed implementation had some problems (it used the output of “cvs update” to decide whether things had changed).

I’d like to get something like this implemented, preferably with three possible behaviours:

  1. Build everything (the current behaviour).
  2. Build only modules that have changed.
  3. Build only modules that have changed, or have dependencies that have changed.

The second option is obviously the fastest, and is a useful option for collections of modules that should be API stable. The third option is essentially an optimisation of the first option. For both the second and third options, it is necessary to be able to tell whether the code in a module has been updated. The easiest way to do this is to record an identifier for the tree state, chosen so that the identifier differs after an update.

The identifier should also be cheap to calculate, so it will probably depend on the underlying version control system:

  • CVS – a hash of the names and versions of all files in the tree, which can be constructed by reading the CVS/Root, CVS/Repository and CVS/Entries files (see the sketch below).
  • Subversion – a combination of (a) the repository UUID, (b) the path of the tree inside the repository and (c) the youngest revision for this subtree in the repository.
  • Arch – the output of “baz tree-id”.
  • Bzr – the working tree’s revision ID.
  • Darcs – a hash of the sequence of patches representing the tree, maybe?
  • Tarballs – the version number for the tarball.

On a successful build, the ID for the tree would be recorded. On subsequent builds, the ID gets recalculated after updating the tree. The new and old IDs are then used to decide on whether to build the module or not, according to the chosen policy.
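
For the CVS case from the list above, here is a sketch of the sort of identifier I have in mind (assuming the standard CVS/Entries layout; this is not the final jhbuild code):

    import os, hashlib

    def cvs_tree_id(topdir):
        # hash the name and revision of every file recorded in the
        # CVS/Entries files under topdir; an update that changes any
        # revision (or adds or removes files) changes the hash
        entries = []
        for dirpath, dirnames, filenames in os.walk(topdir):
            entries_file = os.path.join(dirpath, "CVS", "Entries")
            if not os.path.exists(entries_file):
                continue
            for line in open(entries_file):
                # file entries look like "/name/revision/timestamp/options/tag"
                if line.startswith("/"):
                    name, revision = line.split("/")[1:3]
                    relpath = os.path.relpath(os.path.join(dirpath, name), topdir)
                    entries.append("%s %s" % (relpath, revision))
        entries.sort()
        return hashlib.sha1("\n".join(entries).encode()).hexdigest()

Storing this string after a successful build and comparing it with the value recomputed after the next update would be enough to implement the second and third policies above.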

Statistics of Breath Testing

Yesterday there were some news reports about the opposition party in Victoria making an FOI request and finding that the breath testers used to check blood alcohol content routinely under-report readings by up to 20%. They used this to show that the devices were recording a pass for some people who were actually a little over the limit. On the face of it this sounds like a problem, but when you look at the statistics the automatic reduction makes sense.

The main point is that breathalyzer tests are not completely accurate. Let’s consider the case where a breathalyzer that makes no adjustment gives a 0.05 BAC reading. We’d expect the probability distribution for the real BAC to be a normal distribution with 0.05 BAC as the mean:

[graph of a normal distribution centred on 0.05 BAC]

So the real BAC may be either above or below 0.05. Given that it is only an offence to have a BAC above 0.05, the test would only give even odds that the person had broken the law. That would make it pretty useless for getting a conviction.

If the displayed reading on the breathalyzer is automatically reduced by two standard deviations, the picture is different. For a displayed reading of 0.05, the real BAC will still be normally distributed, but the mean will be offset:

[graph of the same distribution, with the mean shifted two standard deviations above 0.05]

This gives roughly a 97.5% probability that the real BAC is above 0.05. So while removing the automatic reduction might catch more people over the 0.05 limit, it would also drastically increase the number of people wrongly caught while below the limit.
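
To put numbers on that (using scipy, and picking an arbitrary standard deviation since I don’t know the real error margin of these devices):

    from scipy.stats import norm

    limit = 0.05
    sigma = 0.005   # assumed measurement error, purely for illustration

    # unadjusted device: a displayed 0.05 means the real BAC is centred
    # on 0.05, so the driver is over the limit with probability 0.5
    print(1 - norm.cdf(limit, loc=0.05, scale=sigma))

    # device that subtracts two standard deviations: a displayed 0.05
    # means the real BAC is centred on 0.05 + 2*sigma, so the driver is
    # over the limit with probability ~0.977 (the "97.5%" figure above
    # is the usual rule-of-thumb rounding)
    print(1 - norm.cdf(limit, loc=0.05 + 2 * sigma, scale=sigma))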

To reduce the number of false negatives without increasing the false positives, the real answer is to use a more accurate test so that the error margins are lower.

JHBuild Improvements

I’ve been doing most JHBuild development in my bzr branch recently. If you have bzr 0.8rc1 installed, you can grab it here:

bzr branch http://www.gnome.org/~jamesh/bzr/jhbuild/jhbuild.dev

I’ve been keeping a regular CVS import going at http://www.gnome.org/~jamesh/bzr/jhbuild/jhbuild.cvs using Tailor, so changes people make to module sets in CVS make their way into the bzr branch. I’ve used a small hack so that merges back into CVS get recorded correctly in the jhbuild.cvs branch:

  1. Apply the diff between jhbuild.cvs and jhbuild.dev to my CVS checkout and commit.
  2. Modify tailor to pause before committing to jhbuild.cvs.
  3. While tailor is paused, run bzr revert followed by a merge of the same changes from jhbuild.dev.
  4. Let tailor complete the commit.

It’s a bit of a hack, but it allows me to do repeated merges from the CVS import to my development branch (and back again). It also means that any file moves I do in my bzr branch are reflected in the CVS import when merged.

So now when filing bug reports on jhbuild, you can submit fixes in the form of bzr branches as well as patches.

So, on to the improvements:

Generic Version Control Interface

Previously, to add support for a new version control system the following additions were needed:

  • Some code to invoke the version control utility to make checkouts and update working trees.
  • Code to implement the build state machine for modules using the version control system (these classes would generally derive from AutogenModule, which implemented most of the build logic).
  • Code to create instances of the above module type when parsing .modules files.

This was quite a bit of work, and in the end would only help if the code in question was set up to build the same way as most Gnome modules (i.e. with an autogen.sh script and autotools). If you wanted to build a module using Python distutils out of Subversion, a new module type would be needed. If you wanted to build a distutils module from a tarball, that would be another module type again.

With the new system, the different version control support modules provide a common interface. This means that a single module type is capable of implementing the build state machine for any version control system. Similarly, it should now be possible to implement distutils module support such that it will work with any supported version control system.
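
As a rough sketch of the idea (the class and method names here are illustrative rather than the exact ones in the jhbuild code):

    import subprocess

    class Branch:
        # what each version control backend provides, so that a single
        # module type can drive checkouts and updates for any VCS
        def __init__(self, srcdir):
            self.srcdir = srcdir

        def checkout(self):
            raise NotImplementedError

        def update(self):
            raise NotImplementedError

    class SubversionBranch(Branch):
        def __init__(self, url, srcdir):
            Branch.__init__(self, srcdir)
            self.url = url

        def checkout(self):
            subprocess.check_call(["svn", "checkout", self.url, self.srcdir])

        def update(self):
            subprocess.check_call(["svn", "update"], cwd=self.srcdir)

An autotools or distutils module type then only needs to call checkout() or update() on whatever branch object it is given, and never has to care which version control system the code lives in.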

This work is not yet finished, though. A bit more work needs to be done to parse version control system agnostic module definitions from .modules files. Once that is done, a fair bit of the current syntax can be deprecated and eventually removed, and adding support for a new version control system shouldn’t take more than 100-200 lines.

Module Type Simplifications

As well as reducing the number of module types that need to be maintained in JHBuild, I’ve been working on simplifying the code in these module types. Previously, each stage of a module build was represented by a method call on the module type. The return value of the method was used to say (a) whether the stage succeeded, (b) what the next state would be, and (c) if an error occurred, what alternative next states to go to (e.g. offering to rerun autogen.sh).

With the new system, the next state and error states are declared as attributes on the method object. The method can indicate a failure by raising a particular exception. This greatly simplifies the cases where a build stage involves a number of separate actions that could each fail individually, since the exception cuts processing short without the error checks getting in the way of the code.
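
In other words, something along these lines (a simplified sketch rather than the exact jhbuild code; run_command() is a hypothetical stand-in for however commands actually get executed):

    import os

    class CommandError(Exception):
        # raised by a build stage when one of its commands fails
        pass

    class AutogenModule:
        def do_configure(self):
            # any action that fails just raises CommandError, ending the
            # stage early without error checks cluttering the code
            if not os.path.exists(os.path.join(self.srcdir, "autogen.sh")):
                raise CommandError("no autogen.sh script found")
            self.run_command(["./autogen.sh", "--prefix", self.prefix])

        # the state machine information now lives on the method object
        do_configure.next_state = "build"
        do_configure.error_states = ["force_checkout", "configure"]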

There are still a few module build stages not converted to the new system since their next state depends on various config settings (e.g. if running “make check” has been enabled or not). Since these generally involve skipping a stage based on some criteria, the plan is to move the logic to the stage being skipped, which should simplify things further.

New Default Branch Format in Bzr

One of the new features in the soon-to-be-released bzr 0.8 is the new “knit” storage format.

When comparing the size of the repository data for jhbuild with “knit” and “metadir” formats (metadir is just the old storage format with repository, branch and checkout bookkeeping separated), I see the following:

                     metadir    knit
    Size             9.9MB      5.5MB
    Number of files  1267       307

The reason for the smaller number of files is that information about all revisions in the repository is now stored together rather than in separate files. So the file count comes out at a constant plus two times the number of tracked files (a knit index file and a knit data file for each tracked file). For comparison, the CVS repository I imported this from was 4.4MB, and comprised 143 files.

As well as reducing storage requirements, the new knit repository format is designed to reduce network traffic. With the current weave repository format, the weave file for each file touched by a commit gets rewritten to include the contents of the new revision. In contrast to this, the information about the new revision can simply be appended to the knit data file and the knit index file updated to match. This means publishing a branch to a server via sftp mainly involves append operations, resulting in a nice speed up.

Similarly when pulling new changes from a published branch, bzr only needs to download a knit index to find out which sections of the knit data are missing locally. It can then ask for just the changed sections (by an HTTP range request or a partial read with sftp), rather than downloading the entire contents of the changed weaves.

Overall, this should make bzr 0.8 a lot more usable than 0.7 for various network operations.

Repositories in Bzr

One of the new features coming up in the next release of bzr is support for shared repositories. This provides a way to reduce the disk space needed to store multiple related branches. To understand how repositories work, it helps to know a bit about how branches are stored by bzr.

[bzr repository diagram]

There are three concepts that make up a bzr branch:

  1. A checkout or working tree. These are the source files you are working with, representing the state of the source code at some recorded revision plus any local changes you’ve made. In the diagram, this is the red node.
  2. The branch, consisting of a linear sequence of revisions. This is represented by the blue nodes in the diagram. Note that there may be multiple paths from the first revision to the current revision due to branching and merging. The branch revision history indicates the path that was taken by this particular branch.
  3. The repository, being a store of the text of all the revisions in the ancestry of the branch, plus metadata about those revisions. This essentially stores information about every node and edge in the diagram.

In previous versions of bzr, this information was not clearly separated. However, with the new default branch format in bzr 0.8 these parts are separated, and a particular directory need not contain all three of them, which is what makes the space savings and performance improvements possible.

One of the biggest space savings comes from sharing the repository data between branches. If a particular branch does not contain any repository information, bzr will recursively check the parent directories until it finds a repository. If a collection of branches shares some of its history, the single shared repository will be significantly smaller than the space used if each branch had its own repository data.

Another way to reduce disk usage is to create branches without checkouts. This is useful when publishing a branch, since people pulling or merging from that branch don’t use the checkout files.

Finally, it is possible to create a checkout which does not contain branch or repository data, instead containing a pointer to where that data is located. This is quite useful when combined with a central shared repository.

So how big is this space saving? For my JHBuild conversion, the repository data totals 10MB, the branch data totals 100KB, and a checkout is 1.4MB.

So publishing a second branch without shared repositories means another 10MB of storage (a bit more if I include a checkout at the published location). If I use shared repositories, the cost of the second branch is 100KB plus an amount proportional to the size of the changes I make on that branch. So for many projects, the cost of publishing another branch is lost in the noise.