December 2006 – James Henstridge

Python time.timezone / time.altzone edge case

Post author: James Henstridge
Post published: 31 December, 2006
Post category: Uncategorised

While browsing the log of one of my Bazaar branches, I noticed that commits were still being recorded in the +0800 time zone even though WA had switched over to daylight saving.

Bazaar stores commit dates as a standard UNIX seconds since epoch value and a time zone offset in seconds. So the problem was with the way that time zone offset was recorded. The code in bzrlib that calculates the offset looks like this:

import time

def local_time_offset(t=None):
    """Return offset of local zone from GMT, either at present or at time t."""
    # python2.3 localtime() can't take None
    if t is None:
        t = time.time()

    if time.localtime(t).tm_isdst and time.daylight:
        return -time.altzone
    else:
        return -time.timezone

Now the tm_isdst flag was definitely being set on the time value, so it must have something to do with one of the time module constants being used in the function. Looking at the values, I was surprised:

>>> time.timezone
-28800
>>> time.altzone
-28800
>>> time.daylight
0

So the time module thinks that I don’t have daylight saving, and the alternative time zone has the same offset as the main time zone (+0800). This seems a bit weird since time.localtime() says that the time value is in daylight saving time.

Looking at the Python source code, the way these variables are calculated on Linux systems goes something like this:

1. Get the current time as seconds since the epoch.
2. Round this down to the start of the year (using a year of 365 days plus 6 hours, to be exact).
3. Pass this value to localtime(), and record the tm_gmtoff value from the resulting struct tm.
4. Add half a year to the rounded seconds since epoch, and pass that to localtime(), recording the tm_gmtoff value.
5. The earlier of the two offsets is stored as time.timezone and the later as time.altzone. If these two offsets differ, then time.daylight is set to True.
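The steps above can be paraphrased in Python roughly as follows. This is a sketch, not the interpreter's actual code: guess_zone_constants and perth_offset are names invented for illustration, the callback stands in for localtime()'s tm_gmtoff, the changeover instants for Perth are my assumptions, and so is the rule for which offset becomes time.timezone when the two differ. The rounding is done with integer division (rounding down), which is what the values I observed imply.

```python
# Length of the "year" used for rounding: 365 days plus 6 hours, in seconds.
YEAR = (365 * 24 + 6) * 3600


def guess_zone_constants(now, utc_offset_at):
    """Sample the UTC offset at two points in the year, half a year apart.

    utc_offset_at(t) is a stand-in for localtime(t).tm_gmtoff.
    Returns (timezone, altzone, daylight) using the time module's sign
    convention (seconds west of UTC).
    """
    year_start = (int(now) // YEAR) * YEAR
    first = utc_offset_at(year_start)
    second = utc_offset_at(year_start + YEAR // 2)
    # Assumption: the smaller UTC offset is treated as standard time.
    standard, alternate = min(first, second), max(first, second)
    return -standard, -alternate, int(first != second)


def perth_offset(t):
    """Hypothetical tm_gmtoff for Perth over the 2006/2007 summer."""
    dst_start = 1165082400  # 2006-12-02 18:00 UTC (assumed changeover)
    dst_end = 1174759200    # 2007-03-24 18:00 UTC (assumed changeover)
    return 32400 if dst_start <= t < dst_end else 28800


# Probing on 31 December 2006 samples January and July 2006 only,
# both of which fall before daylight saving started.
print(guess_zone_constants(1167523200, perth_offset))  # (-28800, -28800, 0)
```

Running the same probe with a timestamp a day or two into 2007 samples January 2007 instead, so the two offsets differ and the daylight flag comes out non-zero.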

Unfortunately, the UTC offset used in Perth at the beginning of 2006 and the middle of 2006 was +0800, so +0800 gets recorded as the daylight saving time zone too. In the new year, the problem should correct itself, but this highlights the problem of relying on these constants.

To make matters worse, the time.localtime() function from the Python standard library does not expose tm_gmtoff, so there isn’t an easy way to correctly calculate this value.
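A possible workaround, sketched here as an illustration rather than as the fix that went into bzrlib, is to avoid the module constants altogether: take the broken-down local time for the timestamp, convert it back to seconds as though it were UTC, and subtract the original timestamp. Both calendar.timegm() and time.localtime() are standard library functions; the sketch assumes whole-second precision is good enough.

```python
import calendar
import time


def local_time_offset(t=None):
    """Return the local zone's offset from UTC in seconds, now or at time t."""
    if t is None:
        t = time.time()
    t = int(t)
    # localtime() applies whatever rules are in force at time t.
    # Reading that broken-down time back as if it were UTC gives a value
    # that differs from t by exactly the local UTC offset.
    return calendar.timegm(time.localtime(t)) - t


print(local_time_offset())
```

Because the offset is derived from localtime() for the specific timestamp, it does not depend on what the zone happened to be doing in January or July of the current year.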

With the patch I did for pytz to parse binary time zone files, it would be possible to use the /etc/localtime zone file with the Python datetime module without much trouble, so that’s one option. It would be nice if the Python standard library provided an easy way to get this information though.

Recovering a Branch From a Bazaar Repository

Post author: James Henstridge
Post published: 18 December, 2006
Post category: Uncategorised

In my previous entry, I mentioned that Andrew was actually publishing the contents of all his Bazaar branches with his rsync script, even though he was only advertising a single branch. Yesterday I had a need to actually do this, so I thought I’d detail how to do it.

As a refresher, a Bazaar repository stores the revision graph for the ancestry of all the branches stored inside it. A branch is essentially just a pointer to the head revision of a particular line of development. So if the branch has been deleted but the data is still in the repository, recovering it is a simple matter of discovering the identifier for the head revision.

Finding the head revision

Revisions in a Bazaar repository have string identifiers. While the identifiers can be almost arbitrary strings (there are some restrictions on the characters they can contain), the ones Bazaar creates when you commit are of the form “$email-$date-$random”. So if we know the person who committed the head revision and the date it was committed, we can narrow down the possibilities.

For this sort of low-level operation, it is easiest to use the Python bzrlib interface (this is the guts of Bazaar). Let’s say that we want to recover a head revision committed by foo@example.com on 2006-12-01. We can get all the matching revision IDs like so:

>>> from bzrlib.repository import Repository
>>> repo = Repository.open('repository-directory')
>>> possible_ids = [x for x in repo.all_revision_ids()
...                 if x.startswith('foo@example.com-20061201')]

Now if you’re working on multiple branches in parallel, it is likely that the matching revisions come from different lines of development. To help work out which revision ID we want, we can look at the branch-nick revision property, which is recorded in each commit. If the nickname hadn’t been set explicitly for the branch we’re after, it will default to the base directory name of the branch. We can easily loop through the revisions and print the nicknames:

>>> for rev_id in sorted(possible_ids):
...     rev = repo.get_revision(rev_id)
...     print rev_id
...     print rev.properties['branch-nick']

We can then take the last revision ID that has the nickname we are after. Since lexical sorting of these revision IDs will have sorted them in date order, it should be the last revision. We can check the log message on this revision to make sure:

>>> rev = repo.get_revision('head-revision-id')
>>> print rev.message

If it doesn’t look like the right revision, you can try some other dates (the dates in the revision identifiers are in UTC, so it might have recorded a different date to the one you remembered). If it is the right revision, we can proceed to recovering the branch.
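The selection step described above (take the last matching ID in sorted order) can be sketched without bzrlib at all. The function name pick_head and the revision IDs below are made up for illustration; in practice the mapping would be built from repo.get_revision(rev_id).properties['branch-nick'] as in the loop shown earlier.

```python
def pick_head(nick_by_rev_id, wanted_nick):
    """Return the lexically last revision ID whose branch nick matches, or None."""
    matches = sorted(rev_id for rev_id, nick in nick_by_rev_id.items()
                     if nick == wanted_nick)
    return matches[-1] if matches else None


# Made-up revision IDs in the "$email-$date-$random" form, mapped to nicknames.
candidates = {
    'foo@example.com-20061201031500-aaaaaaaaaaaaaaaa': 'feature-x',
    'foo@example.com-20061201094210-bbbbbbbbbbbbbbbb': 'bugfix-y',
    'foo@example.com-20061201101733-cccccccccccccccc': 'feature-x',
}
print(pick_head(candidates, 'feature-x'))
```

This relies on the IDs sharing the same committer prefix, so that sorting them as strings orders them by the embedded timestamp.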

Recovering the branch

Once we know the revision identifier, recovering the branch is easy. First we create a new empty branch inside the repository:

$ cd repositorydir
$ bzr init branchdir

We can now use the pull command with a specific revision identifier to recover the branch:

$ cd branchdir
$ bzr pull -r revid:head-revision-id .

It may look a bit weird that we are pulling from a branch that contains no revisions into itself, but since the repository for this empty branch contains the given revision it does the right thing. And since bzr pull canonicalises the branch’s history, the new branch should have the same linear revision history as the original branch.

Recovering the branch from someone else’s repository

The above method assumes that you can create a branch in the repository. But what if the repository belongs to someone else, and you only have read-only access to the repository? You might want to do this if you are trying to recover one of the branches from Andrew’s Java GNOME repository 🙂

The easy way is to copy all the revisions from the read-only repository into one you control. First we’ll create a new repository:

$ bzr init-repo repodir

Then we can use the Repository.fetch() bzrlib routine to copy the revisions:

>>> from bzrlib.repository import Repository
>>> remote_repo = Repository.open('remote-repo-url')
>>> local_repo = Repository.open('repodir')
>>> local_repo.fetch(remote_repo)

When that command completes, you’ll have a local copy of all the revisions and can proceed as described above.

Re: Pushing a bzr branch with rsync

Post author: James Henstridge
Post published: 15 December, 2006
Post category: Uncategorised

This article responds to some of the points in Andrew’s post about Pushing a bzr branch with rsync.

bzr rspush and shared repositories

First of all, to understand why bzr rspush refuses to operate on a non-standalone branch, it is worth looking at what it does:

1. Download the revision history of the remote branch, and check to see that the remote head revision is an ancestor of the local head revision. If it is not, error out.
2. If it is an ancestor, use rsync to copy the local branch and repository information to the remote location.

Now if you bring shared repositories into the mix, and there is a different set of branches in the local and remote repositories, then step (2) is liable to delete revision information needed by those branches that don’t exist locally. This is not a theoretical concern if you do development from multiple machines (e.g. a desktop and a laptop) and publish to the same repository.
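The ancestry check in the first step amounts to a walk over the revision graph. The sketch below is a toy version working on a plain dictionary of parent links; is_ancestor and the revision names are hypothetical, and it is not how bzr rspush is actually implemented.

```python
def is_ancestor(parents, candidate, head):
    """Return True if candidate is head itself or reachable from it via parents."""
    seen = set()
    pending = [head]
    while pending:
        rev = pending.pop()
        if rev == candidate:
            return True
        if rev in seen:
            continue
        seen.add(rev)
        pending.extend(parents.get(rev, ()))
    return False


# Toy graph: rev-3 merges rev-2 and side-1; other-1 is an unrelated sibling.
parents = {
    'rev-1': [],
    'rev-2': ['rev-1'],
    'side-1': ['rev-1'],
    'rev-3': ['rev-2', 'side-1'],
    'other-1': ['rev-1'],
}
print(is_ancestor(parents, 'rev-2', 'rev-3'))    # True: safe to overwrite
print(is_ancestor(parents, 'other-1', 'rev-3'))  # False: branches have diverged
```

The check only covers the one branch being pushed, which is why it says nothing about revisions that other branches in a shared remote repository might still need.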

Storage Formats and Hard linking

The data storage format used by Bazaar was designed to be cross platform and compact. The compactness is important for the dumb/passive server mode, since the on-disk representation has a large impact on how much data needs to be transferred to pull or update a branch.

The representation chosen effectively has one “knit” file per file in the repository, which is only ever appended to (with deltas to the previous revision, and occasional full texts), plus a “knit index” file per knit that describes the data stored inside the knit. Knit index files are much smaller than their corresponding knit files.

When pushing changes, it is a simple matter of downloading the knit index, working out which revisions are missing, appending those to the knit and updating the index. When pulling changes, the knit index is downloaded and the required sections of the knit file are downloaded (e.g. via an HTTP range request).
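To make the push logic concrete, here is a toy in-memory model of an append-only file with a separate index. It is only meant to illustrate the idea: ToyKnit and push are invented names, the records are opaque byte strings, and none of this reflects Bazaar's real on-disk format.

```python
class ToyKnit(object):
    """Append-only data blob plus an index of (revision id, offset, length)."""

    def __init__(self):
        self.data = b''
        self.index = []

    def append(self, rev_id, record):
        # Existing bytes are never rewritten; new records go on the end.
        self.index.append((rev_id, len(self.data), len(record)))
        self.data += record

    def revision_ids(self):
        return [rev_id for rev_id, _, _ in self.index]

    def read(self, rev_id):
        # The index says which byte range to fetch, so only that slice is read.
        for entry_id, offset, length in self.index:
            if entry_id == rev_id:
                return self.data[offset:offset + length]
        raise KeyError(rev_id)


def push(local, remote):
    """Append to remote every record it lacks, in local order; return their ids."""
    present = set(remote.revision_ids())
    missing = [r for r in local.revision_ids() if r not in present]
    for rev_id in missing:
        remote.append(rev_id, local.read(rev_id))
    return missing


local, remote = ToyKnit(), ToyKnit()
for rev_id, record in [('r1', b'full text'), ('r2', b'delta-a'), ('r3', b'delta-b')]:
    local.append(rev_id, record)
remote.append('r1', b'full text')
print(push(local, remote))  # ['r2', 'r3']
```

The read() method mirrors the pull case: the small index is consulted first, and only the byte range for the wanted revision needs to be transferred.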

The fact that the knit files get appended to is what causes problems with hard linked trees. Unfortunately the SFTP protocol doesn’t really provide a way to tell whether a file has multiple links or a way to do server side file copies, so while it would be possible to break the links locally, it would not be possible when updating a remote branch.

Furthermore, relying on hard links for compact local storage of related branches introduces platform compatibility problems. ~~Win32 does not support hard links~~ (update: apparently they are supported, but hidden in the UI), and while MacOS X does support them its HFS+ file system has a very inefficient implementation (see this article for a description).

Rsync vs. The Bazaar smart server

As described above, Bazaar is already sending deltas across the wire. However, it is slower than rsync because it waits on more round trips. The smart server is intended to eventually resolve this discrepancy. It is a fairly recent development though, so it hasn’t achieved its full potential (the development plan has been to get Bazaar to talk to the smart server first, and then start accelerating certain operations).

When it is more mature, a push operation would effectively work out which revisions the remote machine is missing, and then send a bundle of just that revision data in one go, which is about the same number of round trips as you’d get with rsync.

This has the potential to be faster than an equivalent rsync:

Usually each revision only modifies a subset of the files in the tree. By checking which files have been changed in the revisions to be transferred, Bazaar will only need to open those knit files. In contrast, rsync will check every file in the repository.
In Andrew’s rsync script, the entire repository plus a single branch are transferred to the server. While only one branch is transferred, the revision information for all branches will be transferred. It is not too difficult to reconstruct the branches from that data (depending on what else is in the repository, this could be a problem). In contrast, Bazaar only transfers the revisions that are part of the branch being transferred.

So it is worth watching the development of the smart server over the next few months: it is only going to get faster.

Chilli Beer

Post author: James Henstridge
Post published: 11 December, 2006
Post category: Uncategorized

Got around to tasting the latest batch of home-brew beer recently: a chilli beer. It came out very nicely: very refreshing but with a chilli aftertaste in the back of your throat. You can definitely taste the chilli after drinking a pint 🙂.

I used a beer kit as a base, since I haven’t yet had the patience to do a brew from scratch. The ingredients were:

A Black Rock Mexican Lager beer kit.
1kg of Coopers brewing sugar.
About 20 red chillis.
Caster sugar for carbonation.

I took half the chillis, cut off the stems and chopped them up roughly (in hindsight, it probably would have been enough to cut them lengthwise). I then covered them with a small amount of water in a pot and pasteurised them in the oven at 80°C for about half an hour. The wort was then prepared as normal, but with the pasteurised chillis added before the yeast.

After the fermentation was complete (about a week later), I cut up the remaining chillis (a fair bit smaller this time – they need to easily fit through the neck of a bottle) and pasteurised them the same way as the first batch. This lot was added to the bottles along with the priming sugar.

The beer tasted pretty good 4 weeks after bottling, and it should improve further with time.

UTC+9

Post author: James Henstridge
Post published: 4 December, 2006
Post category: Uncategorized

Daylight saving started yesterday, for the first time in Western Australia since the summer of 1991/1992. The legislation finally passed the upper house on 21st November (12 days before the transition date). The updated tzdata packages were released on 27th November (6 days before the transition). So far, there hasn’t been an updated package released for Ubuntu (see bug 72125).

One thing brought up in the Launchpad bug was that not all applications used the system /usr/share/zoneinfo time zone database. So other places that might need updating include:

Evolution has a database in /usr/share/evolution-data-server-$version/zoneinfo/ that is in iCalendar VTIMEZONE format.
Java has a database in /usr/lib/jvm/java-$version/jre/lib/zi. This uses a different binary file format.
pytz (used by Zope 3 and Launchpad among others) has a database consisting of generated Python source files.

All of the above time zone databases are based on the same source time zone information, but need to be updated individually and in different ways.

In a way, this is similar to the zlib security problems from a few years back: the same problem duplicated in many packages and needing to be fixed over and over again. Perhaps the solution is the same too: get rid of the duplication so that in future only one package needs updating.

As a start, I put together a patch to pytz so that it uses the same format binary time zone files as found in /usr/share/zoneinfo (bug 71227). This still means it has its own time zone database, but it goes a long way towards being able to share the system time zone database. It’d be nice if the other applications and libraries with their own databases could make similar changes.
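For a feel of what reading those binary files involves, the sketch below decodes just the fixed-size header that every compiled zone file starts with, following the layout described in the tzfile(5) manual page. It is not the pytz patch itself; parse_tzif_header is a name made up for this example, and a real reader would go on to parse the transition times and offsets that follow the header.

```python
import struct


def parse_tzif_header(data):
    """Decode the fixed 44-byte header at the start of a compiled zone file."""
    if len(data) < 44 or data[:4] != b'TZif':
        raise ValueError('not a compiled time zone file')
    # Byte 4 is the format version; bytes 5-19 are reserved.
    version = data[4:5]
    # Six big-endian 32-bit counts describe the sizes of the tables that follow.
    names = ('ttisgmtcnt', 'ttisstdcnt', 'leapcnt',
             'timecnt', 'typecnt', 'charcnt')
    counts = struct.unpack('>6l', data[20:44])
    header = dict(zip(names, counts))
    header['version'] = version
    return header


# A synthetic header, built by hand so the example does not depend on any
# particular file being installed.
sample = b'TZif' + b'\x00' * 16 + struct.pack('>6l', 0, 0, 0, 4, 2, 8)
print(parse_tzif_header(sample)['timecnt'])  # 4
```

The same function can be pointed at the first 44 bytes of a file under /usr/share/zoneinfo opened in binary mode; timecnt is the number of recorded transitions and typecnt the number of distinct local time types.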

For people using Windows, there is an update from Microsoft. Apparently you need to install one update now, and then a second update next year — I guess Windows doesn’t support multiple transition rules like Linux does. The page also lists a number of applications that will malfunction and not know about the daylight saving shift, so I guess that they have similar issues of some applications ignoring the system time zone database.