ZeroConf support for Bazaar

When at conferences and sprints, I often want to see what someone else is working on, or to let other people see what I am working on. Usually we end up pushing up to a shared server and using that as a way to exchange branches. However, this can be quite frustrating when competing for outside bandwidth when at a conference.

It is possible to share the branch from a local web server, but that still means you need to work out the addressing issues.

To make things easier, I wrote a simple Bazaar/Avahi plugin. It provides a command “bzr share”, which does the following:

  • Scan the directory for any Bazaar branches it contains.
  • Start up the Bazaar server to listen on a TCP port and share the given directory.
  • Advertise each of the branches via mDNS using Avahi. They are shared using the branch nickname.
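The branch-scanning step might look something like this sketch (find_branches is a hypothetical helper, not the plugin's actual code; it simply treats any directory containing a .bzr control directory as a branch):

```python
import os

def find_branches(root):
    """Yield directories under root that contain a Bazaar branch,
    identified by the presence of a .bzr control directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        if '.bzr' in dirnames:
            yield dirpath
            dirnames.remove('.bzr')  # don't descend into the control dir
```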

For the client side, the plugin implements a “bzr browse” command that will list the Bazaar branches being advertised on the local network (the name and the bzr:// URL). Using the two commands together, it is trivial to share branches locally or find what branches people are sharing.

I am not completely satisfied with how things work, and have a few ideas for how to improve things:

  1. Provide a dummy transport that lets people pull from branches by their advertised service name. This would essentially just redirect from scheme://$SERVICE/ to bzr://$HOST:$PORT/$PATH.
  2. Maybe provide more control over the names the branches get advertised with. Perhaps this isn’t so important though.
  3. Make “bzr share” start and stop advertising branches as they get added/removed, and handle branch nicknames changing (at this point, it is pretty much blue sky though).
  4. Perhaps some form of access control. I’m not sure how easy this is within the smart server protocol, but it should be possible to query the user over whether to accept a connection or not.

It will be interesting to see how well this works at the next sprint or conference.

American Hoax

One of the books I read over the holidays was American Hoax by Charles Firth. This book is apparently what he was working on while the rest of the Chaser guys were doing The Chaser’s War on Everything.

The book is an exploration of different aspects of U.S. politics from an outsider’s perspective. Of course, it is a bit difficult to get a good view of the different political movements from the outside, so he creates a number of characters representing different points on the political spectrum as a way to become part of the various movements. Each of these characters was given a suitable back story, including Wikipedia articles for some and websites for others.

The book is definitely worth a read. I am not sure if it has been released outside of Australia yet, but is available from Dymocks.

Python time.timezone / time.altzone edge case

While browsing the log of one of my Bazaar branches, I noticed that the commit messages were being recorded as occurring in the +0800 time zone even though WA had switched over to daylight saving time.

Bazaar stores commit dates as a standard UNIX seconds since epoch value and a time zone offset in seconds. So the problem was with the way that time zone offset was recorded. The code in bzrlib that calculates the offset looks like this:

def local_time_offset(t=None):
    """Return offset of local zone from GMT, either at present or at time t."""
    # python2.3 localtime() can't take None
    if t is None:
        t = time.time()

    if time.localtime(t).tm_isdst and time.daylight:
        return -time.altzone
    else:
        return -time.timezone

Now the tm_isdst flag was definitely being set on the time value, so it must have something to do with one of the time module constants being used in the function. Looking at the values, I was surprised:

>>> time.timezone
-28800
>>> time.altzone
-28800
>>> time.daylight
0

So the time module thinks that I don’t have daylight saving, and the alternative time zone has the same offset as the main time zone (+0800). This seems a bit weird since time.localtime() says that the time value is in daylight saving time.

Looking at the Python source code, the way these variables are calculated on Linux systems goes something like this:

  1. Get the current time as seconds since the epoch.
  2. Round this down to a whole number of “years” (365 days plus 6 hours each, to be exact).
  3. Pass this value to localtime(), and record the tm_gmtoff value from the resulting struct tm.
  4. Add half a year to the rounded seconds since epoch, and pass that to localtime(), recording the tm_gmtoff value.
  5. The smaller of the two tm_gmtoff values (i.e. standard time) is negated and stored as time.timezone, and the larger as time.altzone (the negation is because these constants count seconds west of UTC). If the two offsets differ, time.daylight is set to true.
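The steps above can be sketched in Python like so (this mirrors the C logic described; note that newer CPython builds may instead obtain these values from the C library via tzset()):

```python
import calendar
import time

YEAR = (365 * 24 + 6) * 3600  # 365 days plus 6 hours, as in CPython's timemodule.c

def utc_offset(t):
    # calendar.timegm() treats a struct_time as UTC, so this recovers the
    # local zone's UTC offset at time t (the tm_gmtoff value, in effect).
    return calendar.timegm(time.localtime(t)) - t

t = int(time.time() / YEAR) * YEAR      # steps 1-2: round down to a whole "year"
jan_offset = utc_offset(t)              # step 3: offset near the start of the year
jul_offset = utc_offset(t + YEAR // 2)  # step 4: offset near the middle of the year
# Step 5: standard time is the smaller UTC offset; the negation is because
# time.timezone counts seconds *west* of UTC.
timezone = -min(jan_offset, jul_offset)
altzone = -max(jan_offset, jul_offset)
daylight = int(jan_offset != jul_offset)
```

If the offsets at those two sample points happen to be equal, as they were in Perth, daylight saving is invisible to this heuristic.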

Unfortunately, the UTC offset used in Perth at the beginning of 2006 and the middle of 2006 was +0800, so +0800 gets recorded as the daylight saving time zone too. In the new year, the problem should correct itself, but this highlights the problem of relying on these constants.

Unfortunately, the time.localtime() function from the Python standard library does not expose tm_gmtoff, so there isn’t an easy way to correctly calculate this value.
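One workaround that avoids the module constants entirely is calendar.timegm(): it interprets a struct_time as UTC, so feeding it the local broken-down time and subtracting the real timestamp yields the true offset, DST included. A sketch of a fixed local_time_offset():

```python
import calendar
import time

def local_time_offset(t=None):
    """Return the offset of the local zone from UTC at time t, in seconds."""
    if t is None:
        t = time.time()
    # timegm() treats a struct_time as UTC, so converting the local
    # broken-down time back to a timestamp and subtracting the real
    # timestamp leaves the local UTC offset, DST included.
    return calendar.timegm(time.localtime(t)) - int(t)
```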

With the patch I did for pytz to parse binary time zone files, it would be possible to use the /etc/localtime zone file with the Python datetime module without much trouble, so that’s one option. It would be nice if the Python standard library provided an easy way to get this information though.

Recovering a Branch From a Bazaar Repository

In my previous entry, I mentioned that Andrew was actually publishing the contents of all his Bazaar branches with his rsync script, even though he was only advertising a single branch. Yesterday I had a need to actually do this, so I thought I’d detail how to do it.

As a refresher, a Bazaar repository stores the revision graph for the ancestry of all the branches stored inside it. A branch is essentially just a pointer to the head revision of a particular line of development. So if the branch has been deleted but the data is still in the repository, recovering it is a simple matter of discovering the identifier for the head revision.

Finding the head revision

Revisions in a Bazaar repository have string identifiers. While the identifiers can be almost arbitrary strings (there are some restrictions on the characters they can contain), the ones Bazaar creates when you commit are of the form “$email-$date-$random”. So if we know the person who committed the head revision and the date it was committed, we can narrow down the possibilities.

For this sort of low-level operation, it is easiest to use the Python bzrlib interface (this is the guts of Bazaar). Let’s say that we want to recover a head revision committed by foo@example.com on 2006-12-01. We can get all the matching revision IDs like so:

>>> from bzrlib.repository import Repository
>>> repo = Repository.open('repository-directory')
>>> possible_ids = [x for x in repo.all_revision_ids()
...                 if x.startswith('foo@example.com-20061201')]

Now if you’re working on multiple branches in parallel, it is likely that the matching revisions come from different lines of development. To help work out which revision ID we want, we can look at the branch-nick revision property, which is recorded in each commit. If the nickname hasn’t been set explicitly for the branch we’re after, it defaults to the base directory name of the branch. We can easily loop through each of the revisions and print the nicknames:

>>> for rev_id in sorted(possible_ids):
...     rev = repo.get_revision(rev_id)
...     print rev_id
...     print rev.properties['branch-nick']

We can then take the last revision ID that has the nickname we are after. Since lexical sorting of these revision IDs will have sorted them in date order, it should be the last revision. We can check the log message on this revision to make sure:

>>> rev = repo.get_revision('head-revision-id')
>>> print rev.message

If it doesn’t look like the right revision, you can try some other dates (the dates in the revision identifiers are in UTC, so it might have recorded a different date to the one you remembered). If it is the right revision, we can proceed onto recovering the branch.
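The reason lexical sorting works here is that the embedded dates use a fixed-width YYYYMMDDHHMMSS form, so for ids from a single committer, string order matches chronological order. A quick illustration (the ids are made up):

```python
# Revision ids for one committer, deliberately out of order:
ids = [
    'foo@example.com-20061201093000-aaaa',
    'foo@example.com-20061130120000-bbbb',
    'foo@example.com-20061201083000-cccc',
]
# The fixed-width timestamp makes string order match chronological
# order, so the head candidate is simply the last element.
print(sorted(ids)[-1])  # → foo@example.com-20061201093000-aaaa
```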

Recovering the branch

Once we know the revision identifier, recovering the branch is easy. First we create a new empty branch inside the repository:

$ cd repositorydir
$ bzr init branchdir

We can now use the pull command with a specific revision identifier to recover the branch:

$ cd branchdir
$ bzr pull -r revid:head-revision-id .

It may look a bit weird that we are pulling from a branch that contains no revisions into itself, but since the repository for this empty branch contains the given revision it does the right thing. And since bzr pull canonicalises the branch’s history, the new branch should have the same linear revision history as the original branch.

Recovering the branch from someone else’s repository

The above method assumes that you can create a branch in the repository. But what if the repository belongs to someone else, and you only have read-only access to the repository? You might want to do this if you are trying to recover one of the branches from Andrew’s Java GNOME repository 🙂

The easy way is to copy all the revisions from the read-only repository into one you control. First we’ll create a new repository:

$ bzr init-repo repodir

Then we can use the Repository.fetch() bzrlib routine to copy the revisions:

>>> from bzrlib.repository import Repository
>>> remote_repo = Repository.open('remote-repo-url')
>>> local_repo = Repository.open('repodir')
>>> local_repo.fetch(remote_repo)

When that command completes, you’ll have a local copy of all the revisions and can proceed as described above.

Re: Pushing a bzr branch with rsync

This article responds to some of the points in Andrew’s post about Pushing a bzr branch with rsync.

bzr rspush and shared repositories

First of all, to understand why bzr rspush refuses to operate on a non-standalone branch, it is worth looking at what it does:

  1. Download the revision history of the remote branch, and check to see that the remote head revision is an ancestor of the local head revision. If it is not, error out.
  2. If it is an ancestor, use rsync to copy the local branch and repository information to the remote location.

Now if you bring shared repositories into the mix, and the local and remote repositories contain different sets of branches, then step (2) is liable to delete revision information needed by the branches that don’t exist locally. This is not merely a theoretical concern if you do development from multiple machines (e.g. a desktop and a laptop) and publish to the same repository.

Storage Formats and Hard linking

The data storage format used by Bazaar was designed to be cross platform and compact. The compactness is important for the dumb/passive server mode, since the on-disk representation has a large impact on how much data needs to be transferred to pull or update a branch.

The representation chosen effectively has one “knit” file per file in the repository, which is only ever appended to (with deltas to the previous revision, and occasional full texts), plus a “knit index” file per knit that describes the data stored inside the knit. Knit index files are much smaller than their corresponding knit files.

When pushing changes, it is a simple matter of downloading the knit index, working out which revisions are missing, appending those to the knit, and updating the index. When pulling changes, the knit index is downloaded and then the required sections of the knit file are fetched (e.g. via an HTTP range request).
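The pull side can be illustrated with a hypothetical helper (not bzrlib's actual code): given index entries mapping revisions to byte extents in the knit, and the set of revisions we already have, compute the coalesced byte ranges to request.

```python
def ranges_to_fetch(index, have):
    """Given index entries as (revision_id, offset, length) tuples sorted
    by offset, return coalesced (start, end) byte ranges covering the
    knit records for revisions not already present locally."""
    ranges = []
    for rev_id, offset, length in index:
        if rev_id in have:
            continue
        end = offset + length
        if ranges and ranges[-1][1] == offset:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((offset, end))
    return ranges
```

Adjacent missing records collapse into a single range, so a pull that needs several consecutive revisions costs only one range request for that knit.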

The fact that the knit files get appended to is what causes problems with hard linked trees. Unfortunately the SFTP protocol doesn’t really provide a way to tell whether a file has multiple links or a way to do server side file copies, so while it would be possible to break the links locally, it would not be possible when updating a remote branch.

Furthermore, relying on hard links for compact local storage of related branches introduces platform compatibility problems. Win32 does not support hard links (update: apparently they are supported, but hidden in the UI), and while Mac OS X does support them, its HFS+ file system implements them very inefficiently (see this article for a description).

Rsync vs. The Bazaar smart server

As described above, Bazaar is already sending deltas across the wire. However, it is slower than rsync because it waits on more round trips. The smart server is intended to eventually resolve this discrepancy. It is a fairly recent development though, so it hasn’t achieved its full potential (the development plan has been to get Bazaar talking to the smart server first, and then start accelerating specific operations).

When it is more mature, a push operation will effectively work out which revisions the remote machine is missing, and then send a bundle of just that revision data in one go, which is about the same number of round trips as you’d get with rsync.
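The “work out which revisions are missing” step amounts to a walk back through the revision graph, stopping at revisions the remote already has. A hypothetical sketch (not the smart server's actual protocol code):

```python
def revisions_to_send(graph, local_head, remote_revs):
    """Given graph mapping revision id -> list of parent ids, collect the
    revisions reachable from local_head that the remote lacks."""
    to_send, pending, seen = [], [local_head], set()
    while pending:
        rev = pending.pop()
        if rev in seen or rev in remote_revs:
            continue  # already visited, or the remote has it (and its ancestry)
        seen.add(rev)
        to_send.append(rev)
        pending.extend(graph.get(rev, []))
    return to_send
```

Everything this returns would then be bundled and sent in a single request, rather than negotiated revision by revision.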

This has the potential to be faster than an equivalent rsync:

  • Usually each revision only modifies a subset of the files in the tree. By checking which files have been changed in the revisions to be transferred, Bazaar will only need to open those knit files. In contrast, rsync will check every file in the repository.
  • In Andrew’s rsync script, the entire repository plus a single branch are transferred to the server. So while only one branch is published, the revision information for all branches is transferred, and it is not too difficult to reconstruct those branches from that data (depending on what else is in the repository, this could be a problem). In contrast, Bazaar only transfers the revisions that are part of the branch being pushed.

So it is worth watching the development of the smart server over the next few months: it is only going to get faster.