States in Version Control Systems

Elijah has been writing an interesting series of articles comparing different version control systems. While the previous articles have been very informative, I think the latest one was a bit muddled. What follows is an expanded version of my comment on that article.

Elijah starts by making an analogy between text editors and version control systems, which I think is quite a useful analogy. When working with a text editor, there is a base version of the file on disk, and the version you are currently working on, which will become the next saved version.

This does map quite well to the concepts of most VCS’s. You have a working copy that starts out identical to a base tree from the branch you are editing. You make local changes and eventually commit, creating a new base tree for future edits.

In addition to these two “states”, Elijah goes on to list three more states that are actually orthogonal to the original two. These additional states refer to certain categorisations of files within the working copy, rather than particular versions of files or trees. Rather than simplifying things, I believe that mingling the two concepts together is more likely to cause confusion. I think this is evident from the fact that the additional states do not fit the analogy we started with.

Versioned and Unversioned Files

If you are going to use a version control system seriously, it is worth understanding how files within a working copy are managed. Rather than thinking of a flat list of possible states, I think it is helpful to think of a hierarchy of categories. The most basic categorisation is whether a file is versioned or not.

Versioned files are those whose state will be saved when committing a new version of the tree. Conversely, unversioned files exist in the working copy but are not recorded when committing new versions of the tree.

This concept does not map very well to the original text editor analogy. If text editors did support such a feature, it would be the ability to add paragraphs to the document that do not get stored to disk when you save, but would persist inside the editor.

Types of Versioned Files

There are various ways to categorise versioned files, but here are some fairly generic ones that fit most VCS’s.

  1. unchanged
  2. modified
  3. added
  4. removed

Each of these categorisations is relative to the base tree for the working copy. The modified category contains both files whose contents have changed and files whose metadata has changed (e.g. files that have been renamed).

The removed category is interesting because files in this category don’t actually exist in the working copy. That said, the VCS knows that such files did exist, so it knows to delete the files when committing the next version of the tree.

Types of Unversioned Files

There are two primary categories for unversioned files:

  1. ignored
  2. unknown

The ignored category consists of unversioned files that the VCS knows the user does not want added to the tree (either through a set of default file patterns, or because the user explicitly said the file should be ignored). Object files and executables built from source code in the tree are prime examples of files that the user would want to ignore.

The unknown category is a catch-all for any other unversioned file in the tree. This is what Elijah referred to as “limbo” in his article.
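
To make the hierarchy concrete, here is a minimal sketch in Python 3 of how a file could be placed into one of these categories. It is my own illustration rather than code from any particular VCS: the classify function, its arguments and the ignore patterns are all hypothetical, and real systems track a lot more than this.

```python
from fnmatch import fnmatch

# Hypothetical default ignore patterns, chosen only for this illustration.
IGNORE_PATTERNS = ("*.o", "*.pyc", "*~")


def classify(path, base, versioned, disk, ignore_patterns=IGNORE_PATTERNS):
    """Place `path` in one of the categories described above.

    base      -- {path: content} for the base tree
    versioned -- set of paths currently tracked in the working copy
    disk      -- {path: content} for files present on disk
    """
    if path in versioned:
        if path not in base:
            return "added"
        # A tracked file that is missing from disk also lands here; whether
        # that counts as "removed" is one of the ways real systems differ.
        if disk.get(path) != base[path]:
            return "modified"
        return "unchanged"
    if path in base:
        # In the base tree but no longer tracked.
        return "removed"
    if path in disk:
        if any(fnmatch(path, pattern) for pattern in ignore_patterns):
            return "ignored"
        return "unknown"
    raise KeyError(path)


base = {"README": "hello", "main.c": "int x;", "old.c": "int y;"}
versioned = {"README", "main.c", "new.c"}
disk = {"README": "hello", "main.c": "int z;", "new.c": "",
        "main.o": "", "notes.txt": ""}

for name in sorted(set(base) | set(disk)):
    print(name, classify(name, base, versioned, disk))
```

The first test is whether the file is versioned at all, and only then does the function look at the finer categories, which mirrors the hierarchy rather than a flat list of states.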

Differences between VCS’s

These concepts are roughly applicable to most version control systems, but there are differences in how the categories are handled. Some of the areas where they differ are:

  • Are newly created files in the working copy counted as added or unknown?
    Some VCS’s (or configurations of VCS’s) don’t have a concept of unknown files. In such a system, newly created files will be treated as added rather than unknown.
  • Are unknown files allowed in the working copy when committing?
    One of the issues Elijah brought up was forgetting to add new files before commit. Some VCS’s avoid this problem by not letting you commit a tree with unknown files.
  • When renaming a versioned file, does it count as a single modified file, or a removed file and an added file?
    This one is a basic question of whether the VCS supports renames or not.
  • If I delete a versioned file, is it put in the removed category automatically?
    With some VCS’s you need to explicitly tell them that you are removing a file. With others it is enough to delete the file on disk.

These differences are the sorts of things that affect the workflow for the VCS, so are worth investigating when comparing different systems.

Signed Revisions with Bazaar

One useful feature of Bazaar is the ability to cryptographically sign revisions. I was discussing this with Ryan on IRC, and thought I’d write up some of the details as they might be useful to others.

Anyone who remembers the past security compromises of GNOME and Debian servers should be able to understand the benefits of being able to verify the integrity of a source code repository after such an incident. Rather than requiring all revisions made since the last known safe backup to be examined, much of the verification could be done mechanically.

Turning on Revision Signing

The first thing you’ll need to do is get a PGP key and configure GnuPG to use it. The GnuPG handbook is a good reference on doing this. As the aim is to provide some assurance that the revisions you publish were really made by you, it’d be good to get the key signed by someone.

Once that is done, it is necessary to configure Bazaar to sign new revisions. The easiest way to do this is to edit ~/.bazaar/bazaar.conf to look something like this:

[DEFAULT]
email = My Name <me@example.com>
create_signatures = always

Now when you run “bzr commit”, a signature for the new revision will be stored in the repository. With this configuration change, you will be prompted for your pass phrase when making commits. If you’d prefer not to enter it repeatedly, there are a few options available:

  1. Install gpg-agent, and use it to remember your pass phrase in the same way you use ssh-agent.
  2. Install the gnome-gpg wrapper, which lets you remember your pass phrase in your GNOME keyring. To use gnome-gpg, you will need to add an additional configuration value: “gpg_signing_command = gnome-gpg”.

Signatures are transferred along with revisions when you push or pull a branch, perform merges, etc.

How Does It Work?

So what does the signature look like, and what does it cover? There is no command for printing out the signatures, but we can access them using bzrlib. As an example, let’s look at the signature on the head revision of one of my branches:

>>> from bzrlib.branch import Branch
>>> b = Branch.open('http://bazaar.launchpad.net/~jamesh/storm/reconnect')
>>> b.last_revision()
'james.henstridge@canonical.com-20070920110018-8e88x25tfr8fx3f0'
>>> print b.repository.get_signature_text(b.last_revision())
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

bazaar-ng testament short form 1
revision-id: james.henstridge@canonical.com-20070920110018-8e88x25tfr8fx3f0
sha1: 467b78c3f8bfe76b222e06c71a8f07fc376e0d7b
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFG8lMHAa+T2ZHPo00RAsqjAJ91urHiIcu4Bim7y1tc5WtR+NjvlACgtmdM
9IC0rtNqZQcZ+GRJOYdnYpA=
=IONs
-----END PGP SIGNATURE-----

>>>

If we save this signature to a file, we can verify it with a command like “gpg --verify signature.txt” to prove that it was made using my PGP key. Looking at the signed text, we see three lines:

  1. An identifier for the checksum algorithm. This is included to future-proof old signatures should the need arise to alter the checksum algorithm at a later date.
  2. The revision ID that the signature applies to. Note that this is the full globally unique identifier rather than the shorter numeric identifiers that are only unique in the context of an individual branch.
  3. The checksum, in SHA1 form.
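
As a small illustration, the three lines can be pulled apart with a few lines of Python 3. This is a sketch of my own rather than anything in bzrlib: parse_short_testament is a hypothetical helper, and it assumes the signed text has exactly the layout shown above.

```python
# The signed text from the example above, without the PGP armour.
SIGNED_TEXT = """\
bazaar-ng testament short form 1
revision-id: james.henstridge@canonical.com-20070920110018-8e88x25tfr8fx3f0
sha1: 467b78c3f8bfe76b222e06c71a8f07fc376e0d7b
"""


def parse_short_testament(text):
    """Split the signed text into (format line, revision ID, SHA1 sum)."""
    lines = text.strip().splitlines()
    format_line = lines[0]
    # The remaining lines are "key: value" pairs.
    fields = dict(line.split(": ", 1) for line in lines[1:])
    return format_line, fields["revision-id"], fields["sha1"]


format_line, revision_id, sha1 = parse_short_testament(SIGNED_TEXT)
print(format_line)
print(revision_id)
print(sha1)
```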

For the current signing algorithm, the checksum is made over the long form testament for the revision, which can easily be verified:

$ bzr branch http://bazaar.launchpad.net/~jamesh/storm/reconnect
$ cd reconnect
$ bzr testament --long > testament.txt
$ sha1sum testament.txt
467b78c3f8bfe76b222e06c71a8f07fc376e0d7b  testament.txt

Looking at the long form testament, we can see what the signature ultimately covers:

  1. The revision ID
  2. The name of the committer
  3. The date of the commit
  4. The parent revision IDs
  5. The commit message
  6. A list of the files that comprise the source tree for the revision, along with SHA1 sums of their contents
  7. Any revision properties

So if the revision testament matches the revision signature and the revision signature validates, you can be sure that you are looking at the same code as the person who made the signature.

It is worth noting that while the signature makes an assertion about the state of the tree at that revision, the only thing it tells you about the ancestry is the revision IDs of the parents. If you need assurances about those revisions, you will need to check their signatures separately. One of the reasons for this is that you might not know the full history of a branch if it has ghost revisions (as might happen when importing code from certain foreign version control systems).
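
To show what checking the ancestry separately involves, here is a minimal sketch in Python 3. It does not use bzrlib: the revision graph is a plain dictionary mapping each revision ID to its parent IDs, the set of revisions with verified signatures is assumed to have been computed some other way, and audit_ancestry is a hypothetical name.

```python
def audit_ancestry(head, parents, verified):
    """Walk the ancestry of `head` and report what cannot be vouched for.

    parents  -- {revision ID: list of parent revision IDs}
    verified -- set of revision IDs whose signatures have been checked

    Returns (unsigned, ghosts): revisions without a verified signature, and
    revisions that are referenced as parents but absent from the graph.
    """
    unsigned, ghosts = [], []
    seen = set()
    stack = [head]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        if rev not in parents:
            # We only know the ID of a ghost, so there is nothing to verify.
            ghosts.append(rev)
            continue
        if rev not in verified:
            unsigned.append(rev)
        stack.extend(parents[rev])
    return sorted(unsigned), sorted(ghosts)


graph = {"c": ["b", "x"], "b": ["a"], "a": []}
print(audit_ancestry("c", graph, verified={"a", "c"}))
```

In this toy graph the head revision is signed, but its parent "b" is not and "x" is a ghost, so a valid signature on the head alone says nothing about either of them.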

Signing Past Revisions

If you’ve already been using Bazaar but had not enabled revision signing, it is likely that you’ve got a bunch of unsigned revisions lying around. If that is the case, you can sign the revisions in bulk using the “bzr sign-my-commits” command. It will go through all revisions in the ancestry, and generate signatures for all the commits that match your committer ID.

Verifying Signatures in Bulk

To verify all signatures found in a repository, John Arbash Meinel’s signing plugin can be used, which provides a “bzr verify-sigs” command. It can be installed with the following commands:

$ mkdir -p ~/.bazaar/plugins
$ bzr branch http://bzr.arbash-meinel.com/plugins/signing/ ~/.bazaar/plugins/signing

When the command is run it will verify the integrity of all the signatures, and give a summary of how many revisions each person has signed.

Bazaar bundles as part of a review process

In my previous article, I outlined Bazaar’s bundle feature. This article describes how the Bazaar developers use bundles as part of their development and code review process.

Proposed changes to Bazaar are generally posted as patches or bundles to the development mailing list. Each change is discussed on the mailing list (often going through a number of iterations), and ultimately approved or rejected by the core developers. To aid in managing these patches, Aaron Bentley (one of the developers) wrote a tool called Bundle Buggy.

Bundle Buggy watches messages sent to the mailing list, checking for messages containing patches or bundles. It then creates an entry on the web site displaying the patch, and lets developers add comments (which get forwarded to the mailing list).

Now while Bundle Buggy can track plain patches, a number of its time saving features only work for bundles:

  1. Automatic rejection of superseded patches: when working on a feature, it is common to go through a number of iterations. When going through the list of pending changes, the developers don’t want to see all the old versions. Since a bundle describes a Bazaar branch, and it is trivial to check whether one branch is an extension of another, Bundle Buggy can tell which bundles are obsolete and remove them from the list.
  2. Automatic marking of merged bundles: the canonical way to know that a patch has been accepted is for it to be merged to mainline. Each Bazaar revision has a globally unique identifier, so we can easily check to see if the head revision of the bundle is in the ancestry of mainline. When this happens, Bundle Buggy automatically marks the bundle as merged.

Using these techniques the list of pending bundles is kept under control.
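
Both checks boil down to asking whether one revision is in the ancestry of another. The following Python 3 sketch shows the idea on a toy revision graph. It is my own illustration and not Bundle Buggy’s code: the graph is a plain dictionary of parent IDs, and the function names are made up.

```python
def ancestry(head, parents):
    """Return the set of revision IDs reachable from `head`, including it."""
    seen = set()
    stack = [head]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(parents.get(rev, ()))
    return seen


def is_merged(bundle_head, mainline_head, parents):
    """A bundle is merged once its head is in the ancestry of mainline."""
    return bundle_head in ancestry(mainline_head, parents)


def is_superseded(old_head, new_head, parents):
    """An old bundle is obsolete if a newer bundle extends its branch."""
    return old_head != new_head and old_head in ancestry(new_head, parents)


# m1 and m2 are mainline revisions; f1 and f2 are two iterations of a feature.
graph = {"m1": [], "f1": ["m1"], "f2": ["f1"], "m2": ["m1", "f2"]}
print(is_superseded("f1", "f2", graph))
print(is_merged("f2", "m2", graph))
```

Because revision IDs are globally unique, neither check needs to look at file contents at all.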

Further Possibilities

Of course, these aren’t the only things that can be done to save time in the review process. Another useful idea is to automatically try to merge pending bundles or branches to see if they can still be merged without conflicts. This can be used as a way to put the ball back in the contributor’s court, obligating them to fix the problem before the branch can be reviewed.

This sort of automation is not only limited to projects using a mailing list for code review. The same techniques could be applied to a robot that scanned bug reports in the bug tracker (e.g. Bugzilla) for bundles, and updated their status accordingly.

Bazaar Bundles

This article follows on from the series of tutorials on using Bazaar that I have neglected for a while. This article is about the bundle feature of Bazaar. Bundles are to Bazaar branches what patches are to tarballs or plain source trees.

Context/unified diffs and the patch utility are arguably among the most important inventions that enable distributed development:

  • The patch is a self contained text file, making it easy to send as an email attachment or attach to a bug report.
  • The size of the patch is proportional to the size of the changes rather than the size of the source tree. So submitting a one line fix to the Linux kernel is as easy as a one line fix for a small one person project.
  • Even if the destination source tree has moved forward since the patch was created, the patch utility does a decent job of applying the changes using heuristics to match the surrounding context. Human intervention is only needed if the edits are to the same section of code.
  • As patches are human readable text files, they are a convenient form to review the code changes.
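
As a quick illustration of the size point, Python 3’s standard difflib module can produce a unified diff. This is only a sketch: make_patch and the file name are hypothetical, and the “source file” is just a thousand generated lines.

```python
import difflib


def make_patch(old_text, new_text, name="file.txt"):
    """Return a unified diff between two versions of a text file."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="a/" + name,
        tofile="b/" + name,
    ))


# A 1000 line file with a one line change in the middle.
old = "".join("line %d\n" % i for i in range(1000))
new = old.replace("line 500\n", "line five hundred\n")

patch = make_patch(old, new)
print(patch)
```

The resulting patch contains the changed line and a few lines of context either side, no matter how large the file is.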

Of course, patches do have their limitations:

  • The unified diff format doesn’t convey file moves, instead showing the entire file content being removed and then added again. If the file was changed in addition to being moved, the change can easily be missed when reviewing the patch.
  • Changes to binary files are omitted from the patch. While we can’t expect such changes to be represented in a human readable form, it’d be nice for them to be represented in a way that they can be applied at the other end.
  • The patch doesn’t record any intermediate steps in the creation of the change. This can be worked around by sending a sequence of patches that each build on the previous one, but this requires a fair bit of attentiveness on the part of the patch creator.
  • If the project in question is using some form of version control, the changes in the patch will likely be attributed to the person who applied the patch rather than the person who made the patch.

Using distributed version control solves these limitations, but simply publishing a branch and telling someone to pull from it does not provide all the benefits of a patch. For one, the person reviewing the changes needs to be online to merge the branch and evaluate the changes.

Second, the contributor of the change needs somewhere to host the branch. Even though finding a place to host the branch may not be difficult (for example, anyone can host their branches on Launchpad), uploading the branch may be more effort than the contributor cares for (uploading a branch the size of the Linux kernel will take a while, for instance). That branch would need to remain available until the changes were accepted.

For Bazaar, bundles provide a solution to this problem. A bundle is effectively a “branch diff”, which can then be used to integrate a set of revisions into a repository assuming it contains the revisions from the target branch. At this point, those changes can be merged or pulled.
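
One way to picture a “branch diff” is as the set of revisions that are in the ancestry of my branch but not in the ancestry of the target branch. The Python 3 sketch below models just that idea on a toy revision graph. It is a conceptual illustration with made up names, and says nothing about how Bazaar actually encodes a bundle.

```python
def ancestry(head, parents):
    """Return the set of revision IDs reachable from `head`, including it."""
    seen = set()
    stack = [head]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(parents.get(rev, ()))
    return seen


def bundle_revisions(my_head, target_head, parents):
    """Revisions a bundle would need to carry: mine minus the target's."""
    return ancestry(my_head, parents) - ancestry(target_head, parents)


def can_apply(target_head, parents, repository):
    """The receiving repository must already hold the target's revisions."""
    return ancestry(target_head, parents) <= set(repository)


graph = {"u1": [], "u2": ["u1"], "mine1": ["u2"], "mine2": ["mine1"]}
print(sorted(bundle_revisions("mine2", "u2", graph)))
print(can_apply("u2", graph, repository={"u1", "u2"}))
```

Just as with a patch, the amount of data depends on the revisions I have added rather than on the length of the upstream history.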

So how do we produce a bundle? Let’s start by creating a branch of the project we want to contribute to. For this example, we’ll create a branch of Mailman to make our changes. As Mailman is using Launchpad to host its branches, I can use the shorthand implemented by the Launchpad Bazaar plugin to create my branch:

bzr branch lp:mailman mailman.jamesh
cd mailman.jamesh
# make my changes here
bzr commit

After I am happy with my changes, I can create a bundle of those changes:

bzr bundle > my-changes.diff

As mentioned earlier, a bundle is essentially a diff between two branches. As I did not specify any branch in the above command, Bazaar uses the parent branch, which in this case will be the upstream Mailman branch. If we look at my-changes.diff, we will see a text file with three general sections:

  1. A short header identifying the file as a bundle and giving the last commit message, author and date
  2. A unified diff made between the last common revision with the parent and the head of our branch (this bit is also convenient to review).
  3. Some extra book keeping data. If I’d made multiple commits, this would include data needed to reconstruct the other revisions in the bundle.

I can now submit this bundle in the same way that I’d submit a patch: as an email attachment or in the bug tracker.

To merge the bundle, a developer simply needs to save the bundle to disk and use “bzr merge” on it:

bzr merge my-changes.diff
bzr commit

This will have the same effect as if they merged a branch with those changes. The “bzr log” output will show the merged revisions and “bzr annotate” will credit the changes to the person who made them rather than the person who merged it.

So next time you want to submit a patch to a project that uses Bazaar, consider submitting a bundle instead.

FM Radio in Rhythmbox – The Code

Previously, I posted about the FM radio plugin I was working on. I just posted the code to bug 168735. A few notes about the implementation:

  • The code only supports Video4Linux 2 radio tuners (since that’s the interface my device supports, and the V4L1 compatibility layer doesn’t work for it). It should be possible to port it to support both protocols if someone is interested.
  • It does not pass the audio through the GStreamer pipeline. Instead, you need to configure your mixer settings to pass the audio through (e.g. unmute the Line-in source and set the volume appropriately). It plugs in a GStreamer source that generates silence to work with the rest of the Rhythmbox infrastructure. This does mean that the volume control and visualisations won’t work.
  • No properties dialog yet. If you want to set titles on the stations, you’ll need to edit rhythmdb.xml directly at the moment.
  • The code assumes that the radio device is /dev/radio0.

Other than that, it all works quite well (I’ve been using it for the last few weeks).

Development

I developed this plugin in Bazaar using Jelmer’s bzr-svn plugin. It produces a repeatable import, so I should be able to cross merge with anyone else producing branches with it.

It is also possible to use bzr-svn to merge Bazaar branches back into the original Subversion repository through the use of a lightweight checkout.

For anyone wanting to play with my Bazaar branch, it is published in Launchpad and can be grabbed with the following command:

bzr branch lp:~jamesh/rhythmbox/fmradio rhythmbox