urlparse considered harmful

Over the weekend, I spent a number of hours tracking down a bug caused by the cache in the Python urlparse module. The problem has already been reported as Python bug 1313119, but has not been fixed yet.

First a bit of background. The urlparse module does what you’d expect and parses a URL into its components:

>>> from urlparse import urlparse
>>> urlparse('http://www.gnome.org/')
('http', 'www.gnome.org', '/', '', '', '')

As well as accepting byte strings (which you’d be using at the HTTP protocol level), it also accepts Unicode strings (which you’d be using at the HTML or XML content level):

>>> urlparse(u'http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')

As the result is immutable, urlparse implements a cache of up to 20 previous results. Unfortunately, the cache does not distinguish between byte strings and Unicode strings, so parsing a byte string may return unicode components if the result is in the cache:

>>> urlparse('http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')

When you combine this with Python’s automatic promotion of byte strings to unicode when concatenating with a unicode string, it can really screw things up when you do want to work with byte strings. If you hit such a problem, the code may all look correct, but the problem may have been introduced up to 20 urlparse calls ago. Even if your own code never passes in Unicode strings, one of the libraries you use might be doing so.
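
The mechanism is easy to reproduce with a toy cache. This is a minimal sketch, not urlparse’s actual code: make_cached and describe are hypothetical names made up for the illustration. Python 3 no longer treats byte strings and text strings as equal, so the sketch uses 1 and 1.0 instead, which compare equal and hash alike in the same way that 'x' and u'x' do in Python 2.

```python
def make_cached(func, key=lambda arg: arg):
    """Wrap func with a dict cache; key() decides what the cache is keyed on."""
    cache = {}

    def wrapper(arg):
        k = key(arg)
        if k not in cache:
            cache[k] = func(arg)
        return cache[k]

    return wrapper


def describe(value):
    # The result records the type of the argument it was computed from.
    return (type(value).__name__, value)


# Keyed on the value alone: equal values of different types share a slot.
naive = make_cached(describe)
# Keyed on (type, value): each type gets its own slot.
typed = make_cached(describe, key=lambda arg: (type(arg), arg))

naive(1.0)
naive_second = naive(1)   # cache hit, returns the result computed for 1.0
typed(1.0)
typed_second = typed(1)   # cache miss, computed afresh for the int

print(naive_second[0], typed_second[0])
```

Keying the cache on the argument’s type as well as its value is one way such a cache could avoid handing back a result of the wrong type.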

The problem affects more than just the urlparse function. The urljoin function from the same module is also affected since it uses urlparse internally:

>>> from urlparse import urljoin
>>> urljoin('http://www.ubuntu.com/', '/news')
u'http://www.ubuntu.com/news'

It seems safest to avoid the module altogether if possible, at least until the underlying bug is fixed.

Storm Released

This week at the EuroPython conference, Gustavo Niemeyer announced the release of Storm and gave a tutorial on using it.

Storm is a new object relational mapper for Python that was developed for use in some Canonical projects, and we’ve been working on moving Launchpad over to it. I’ll discuss a few of the nice features of the package:

Loose Binding Between Database Connections and Classes

Storm has a much looser binding between database connections and the classes used to represent records in particular tables. The standard way of querying the database uses a store object:

for obj in store.find(SomeClass, conditions):
    # do something with obj (which will be a SomeClass instance)

Some things to note about this syntax:

  • The class used to represent rows in the table is passed to find(), so it is possible to have multiple classes representing a single table. This can be useful with large tables where you are only interested in a few columns in some cases.
  • The class used to represent the table is not bound to a particular connection. So instances of it can come from different stores.

Lockstep Iteration

As well as iterating over a single table, a Storm result set can iterate over multiple tables together. For instance, if we have a table representing people and a table representing email addresses (where each person can have multiple email addresses), it is possible to iterate over them in lockstep:

for person, email in store.find((Person, Email), Person.id == Email.person):
    print person.name, email.address

Automatic Flushing Before Queries

One of the gotchas when using SQLObject was the way it locally cached updates to tables. This is a great way to reduce the number of updates sent to the database, but it could produce unexpected results when performing subsequent SELECT queries. It was up to the programmer to remember to flush changes before doing a query.

With Storm, the store will flush pending changes automatically before performing the query.
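
The idea can be sketched with a toy store on top of sqlite3. This is only an illustration of flush-before-query, not Storm’s implementation: ToyStore, its methods and the person table are all hypothetical names invented for the example.

```python
import sqlite3


class ToyStore:
    """Hypothetical stand-in for an ORM store; not Storm's real code."""

    def __init__(self, connection):
        self.connection = connection
        self.pending = []  # buffered (sql, params) writes not yet sent

    def add(self, name):
        # Buffer the write locally instead of hitting the database.
        self.pending.append(
            ("INSERT INTO person (name) VALUES (?)", (name,)))

    def flush(self):
        for sql, params in self.pending:
            self.connection.execute(sql, params)
        self.pending = []

    def find_names(self):
        # Flush first, so the query sees every buffered change.
        self.flush()
        rows = self.connection.execute(
            "SELECT name FROM person ORDER BY id")
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
store = ToyStore(conn)
store.add("Alice")
names = store.find_names()
print(names)
```

Without the flush() call inside find_names(), the query would come back empty even though a row had already been added to the store.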

Easy To Execute Raw SQL

An ORM can really help when developing a database driven application, but sometimes plain old SQL is a better fit. Storm makes it easy to execute raw SQL against a particular store with the store.execute() method. This method returns an object that you can iterate over to get the tuples from the result set. It also makes sure that any local changes have been flushed before executing the query.

Nice Clean Code

After working with SQLObject for a while, I have found Storm to be a breath of fresh air. The internals are clean and nicely laid out, which makes hacking on it very easy. It was developed using test-driven development methodology, so there is an extensive test suite that makes it easy to validate changes.

ZeroConf support for Bazaar

When at conferences and sprints, I often want to see what someone else is working on, or to let other people see what I am working on. Usually we end up pushing up to a shared server and using that as a way to exchange branches. However, this can be quite frustrating when competing for outside bandwidth when at a conference.

It is possible to share the branch from a local web server, but that still means you need to work out the addressing issues.

To make things easier, I wrote a simple Bazaar/Avahi plugin. It provides a command “bzr share”, which does the following:

  • Scan the directory for any Bazaar branches it contains.
  • Start up the Bazaar server to listen on a TCP port and share the given directory.
  • Advertise each of the branches via mDNS using Avahi. They are shared using the branch nickname.

For the client side, the plugin implements a “bzr browse” command that will list the Bazaar branches being advertised on the local network (the name and the bzr:// URL). Using the two commands together, it is trivial to share branches locally or find what branches people are sharing.

I am not completely satisfied with how things work, and have a few ideas for how to improve things:

  1. Provide a dummy transport that lets people pull from branches by their advertised service name. This would essentially just redirect from scheme://$SERVICE/ to bzr://$HOST:$PORT/$PATH.
  2. Maybe provide more control over the names the branches get advertised with. Perhaps this isn’t so important though.
  3. Make “bzr share” start and stop advertising branches as they get added/removed, and handle branch nicknames changing (at this point, it is pretty much blue sky though).
  4. Perhaps some form of access control. I’m not sure how easy this is within the smart server protocol, but it should be possible to query the user over whether to accept a connection or not.

It will be interesting to see how well this works at the next sprint or conference.

Python time.timezone / time.altzone edge case

While browsing the log of one of my Bazaar branches, I noticed that the commits were being recorded as occurring in the +0800 time zone even though WA had switched over to daylight saving.

Bazaar stores commit dates as a standard UNIX seconds since epoch value and a time zone offset in seconds. So the problem was with the way that time zone offset was recorded. The code in bzrlib that calculates the offset looks like this:

import time

def local_time_offset(t=None):
    """Return offset of local zone from GMT, either at present or at time t."""
    # python2.3 localtime() can't take None
    if t is None:
        t = time.time()

    if time.localtime(t).tm_isdst and time.daylight:
        return -time.altzone
    else:
        return -time.timezone

Now the tm_isdst flag was definitely being set on the time value, so it must have something to do with one of the time module constants being used in the function. Looking at the values, I was surprised:

>>> time.timezone
-28800
>>> time.altzone
-28800
>>> time.daylight
0

So the time module thinks that I don’t have daylight saving, and the alternative time zone has the same offset as the main time zone (+0800). This seems a bit weird since time.localtime() says that the time value is in daylight saving time.

Looking at the Python source code, the way these variables are calculated on Linux systems goes something like this:

  1. Get the current time as seconds since the epoch.
  2. Round this to the nearest year (365 days plus 6 hours, to be exact).
  3. Pass this value to localtime(), and record the tm_gmtoff value from the resulting struct tm.
  4. Add half a year to the rounded seconds since epoch, and pass that to localtime(), recording the tm_gmtoff value.
  5. The earlier of the two offsets is stored as time.timezone and the later as time.altzone. If these two offsets differ, then time.daylight is set to True.

Unfortunately, the UTC offset used in Perth at the beginning of 2006 and the middle of 2006 was +0800, so +0800 gets recorded as the daylight saving time zone too. In the new year, the problem should correct itself, but this highlights the problem of relying on these constants.
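
The steps above can be replayed for Perth in a few lines. This is a sketch of my reading of those steps rather than the actual C code: derive_constants and perth_gmtoff are hypothetical names, the rounding is assumed to be downwards, the choice of which sample becomes timezone and which altzone is an assumption (it makes no difference here since both samples agree), and the single hard-coded transition is a simplification of the real rules.

```python
YEAR = (365 * 24 + 6) * 3600  # seconds in 365 days plus 6 hours


def derive_constants(now, gmtoff_at):
    """Replay the sampling steps; returns (timezone, altzone, daylight).

    gmtoff_at(t) is a hypothetical callback standing in for localtime():
    it returns the UTC offset in seconds east of Greenwich at time t.
    """
    start = (now // YEAR) * YEAR            # assumed to round downwards
    jan_zone = -gmtoff_at(start)            # sample at the start of the year
    july_zone = -gmtoff_at(start + YEAR // 2)  # sample half a year later
    if jan_zone < july_zone:
        # Assumed handling for the southern hemisphere, where the
        # January sample is the daylight saving one.
        timezone, altzone = july_zone, jan_zone
    else:
        timezone, altzone = jan_zone, july_zone
    return timezone, altzone, jan_zone != july_zone


DST_START = 1165082400  # 2006-12-02 18:00 UTC, i.e. 02:00 on 3 Dec at +0800


def perth_gmtoff(t):
    # Simplified: +0800 before the transition, +0900 afterwards.
    return 9 * 3600 if t >= DST_START else 8 * 3600


now = 1165190400  # 2006-12-04 00:00 UTC, just after the transition
constants = derive_constants(now, perth_gmtoff)
actual_offset = perth_gmtoff(now)
print(constants, actual_offset)
```

Both samples fall before the transition, so the derived constants report -28800 twice with no daylight saving, while the offset actually in effect is 32400 seconds.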

Unfortunately, the time.localtime() function from the Python standard library does not expose tm_gmtoff, so there isn’t an easy way to correctly calculate this value.

With the patch I did for pytz to parse binary time zone files, it would be possible to use the /etc/localtime zone file with the Python datetime module without much trouble, so that’s one option. It would be nice if the Python standard library provided an easy way to get this information though.
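
For readers on newer interpreters: Python 3.3 and later do expose tm_gmtoff on the value returned by time.localtime() where the platform supports it, which allows a version of the function that does not depend on the module constants. The following is a sketch under that assumption, not the code bzrlib uses; the fallback branch simply keeps the constant-based logic shown earlier.

```python
import time


def local_time_offset(t=None):
    """Return the local zone's offset from UTC in seconds at time t."""
    if t is None:
        t = time.time()
    local = time.localtime(t)
    # tm_gmtoff is only present on newer Pythons / supporting platforms.
    gmtoff = getattr(local, "tm_gmtoff", None)
    if gmtoff is not None:
        return gmtoff
    # Fallback: the constant-based calculation, with its known edge case.
    if local.tm_isdst > 0 and time.daylight:
        return -time.altzone
    return -time.timezone


offset_now = local_time_offset()
print(offset_now)
```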

Recovering a Branch From a Bazaar Repository

In my previous entry, I mentioned that Andrew was actually publishing the contents of all his Bazaar branches with his rsync script, even though he was only advertising a single branch. Yesterday I had a need to actually do this, so I thought I’d detail how to do it.

As a refresher, a Bazaar repository stores the revision graph for the ancestry of all the branches stored inside it. A branch is essentially just a pointer to the head revision of a particular line of development. So if the branch has been deleted but the data is still in the repository, recovering it is a simple matter of discovering the identifier for the head revision.

Finding the head revision

Revisions in a Bazaar repository have string identifiers. While the identifiers can be almost arbitrary strings (there are some restrictions on the characters they can contain), the ones Bazaar creates when you commit are of the form “$email-$date-$random”. So if we know the person who committed the head revision and the date it was committed, we can narrow down the possibilities.

For this sort of low level operation, it is easiest to use the Python bzrlib interface (this is the guts of Bazaar). Let’s say that we want to recover a head revision committed by foo@example.com on 2006-12-01. We can get all the matching revision IDs like so:

>>> from bzrlib.repository import Repository
>>> repo = Repository.open('repository-directory')
>>> possible_ids = [x for x in repo.all_revision_ids()
...                 if x.startswith('foo@example.com-20061201')]

Now if you’re working on multiple branches in parallel, it is likely that the matching revisions come from different lines of development. To help work out which revision ID we want, we can look at the branch-nick revision property of each revision, which is recorded in each commit. If the nickname hadn’t been set explicitly for the branch we’re after, it will take the base directory name of the branch as a default. We can easily loop through each of the revisions and print the nicknames:

>>> for rev_id in sorted(possible_ids):
...     rev = repo.get_revision(rev_id)
...     print rev_id
...     print rev.properties['branch-nick']

We can then take the last revision ID that has the nickname we are after. Since lexical sorting of these revision IDs will have sorted them in date order, it should be the last revision. We can check the log message on this revision to make sure:

>>> rev = repo.get_revision('head-revision-id')
>>> print rev.message

If it doesn’t look like the right revision, you can try some other dates (the dates in the revision identifiers are in UTC, so it might have recorded a different date to the one you remembered). If it is the right revision, we can proceed to recovering the branch.
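
The prefix filtering and sorting used above can be tried without a repository. This is a minimal sketch with made-up revision IDs: candidate_ids is a hypothetical helper, and the exact timestamp and random-suffix layout in the sample IDs is an assumption based on the “$email-$date-$random” form.

```python
def candidate_ids(all_ids, email, date):
    """Return the IDs for a committer and date, in lexical (date) order."""
    prefix = "%s-%s" % (email, date)
    return sorted(rev_id for rev_id in all_ids if rev_id.startswith(prefix))


# Made-up IDs standing in for the repository's full list of revisions.
all_ids = [
    "foo@example.com-20061201153000-bbbbbbbbbbbbbbbb",
    "bar@example.com-20061201120000-cccccccccccccccc",
    "foo@example.com-20061201091500-dddddddddddddddd",
    "foo@example.com-20061130235900-aaaaaaaaaaaaaaaa",
]

matches = candidate_ids(all_ids, "foo@example.com", "20061201")
head_guess = matches[-1]  # the latest matching commit sorts last
print(head_guess)
```

Because the date comes before the random suffix and has a fixed width, sorting the strings orders them by commit time regardless of what the suffix contains.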

Recovering the branch

Once we know the revision identifier, recovering the branch is easy. First we create a new empty branch inside the repository:

$ cd repositorydir
$ bzr init branchdir

We can now use the pull command with a specific revision identifier to recover the branch:

$ cd branchdir
$ bzr pull -r revid:head-revision-id .

It may look a bit weird that we are pulling from a branch that contains no revisions into itself, but since the repository for this empty branch contains the given revision it does the right thing. And since bzr pull canonicalises the branch’s history, the new branch should have the same linear revision history as the original branch.

Recovering the branch from someone else’s repository

The above method assumes that you can create a branch in the repository. But what if the repository belongs to someone else, and you only have read-only access to the repository? You might want to do this if you are trying to recover one of the branches from Andrew’s Java GNOME repository :)

The easy way is to copy all the revisions from the read-only repository into one you control. First we’ll create a new repository:

$ bzr init-repo repodir

Then we can use the Repository.fetch() bzrlib routine to copy the revisions:

>>> from bzrlib.repository import Repository
>>> remote_repo = Repository.open('remote-repo-url')
>>> local_repo = Repository.open('repodir')
>>> local_repo.fetch(remote_repo)

When that command completes, you’ll have a local copy of all the revisions and can proceed as described above.

UTC+9

Daylight saving started yesterday: the first time since 1991/1992 summer for Western Australia. The legislation finally passed the upper house on 21st November (12 days before the transition date). The updated tzdata packages were released on 27th November (6 days before the transition). So far, there hasn’t been an updated package released for Ubuntu (see bug 72125).

One thing brought up in the Launchpad bug was that not all applications used the system /usr/share/zoneinfo time zone database. So other places that might need updating include:

  • Evolution has a database in /usr/share/evolution-data-server-$version/zoneinfo/ that is in iCalendar VTIMEZONE format.
  • Java has a database in /usr/lib/jvm/java-$version/jre/lib/zi. This uses a different binary file format.
  • pytz (used by Zope 3 and Launchpad among others) has a database consisting of generated Python source files.

All of the above time zone databases are based on the same source time zone information, but need to be updated individually and in different ways.

In a way, this is similar to the zlib security problems from a few years back: the same problem duplicated in many packages and needing to be fixed over and over again. Perhaps the solution is the same too: get rid of the duplication so that in future only one package needs updating.

As a start, I put together a patch to pytz so that it uses the same format binary time zone files as found in /usr/share/zoneinfo (bug 71227). This still means it has its own time zone database, but it goes a long way towards being able to share the system time zone database. It’d be nice if the other applications and libraries with their own databases could make similar changes.

For people using Windows, there is an update from Microsoft. Apparently you need to install one update now, and then a second update next year — I guess Windows doesn’t support multiple transition rules like Linux does. The page also lists a number of applications that will malfunction and not know about the daylight saving shift, so I guess that they have similar issues of some applications ignoring the system time zone database.

Playing Around With the Bluez D-BUS Interface

In my previous entry about using the Maemo obex-module on the desktop, Johan Hedberg mentioned that bluez-utils 3.7 included equivalent interfaces to the osso-gwconnect daemon used by the method. Since then, the copy of bluez-utils in Edgy has been updated to 3.7, and the necessary interfaces are enabled in hcid by default.

Before trying to modify the VFS code, I thought I’d experiment a bit with the D-BUS interfaces via the D-BUS python bindings. Most of the interesting method calls exist on the org.bluez.Adapter interface. We can easily get the default adapter with the following code:

import dbus

bus = dbus.SystemBus()
manager = dbus.Interface(
    bus.get_object('org.bluez', '/org/bluez'),
    'org.bluez.Manager')

adapter = dbus.Interface(
    bus.get_object('org.bluez', manager.DefaultAdapter()),
    'org.bluez.Adapter')

At this point, it is possible to perform discovery:

import dbus.glib
import gtk

def remote_device_found(addr, class_, rssi):
    print 'Found:', addr

# the name must match the callback connected to DiscoveryCompleted below
def discovery_completed():
    gtk.main_quit()

adapter.connect_to_signal('RemoteDeviceFound', remote_device_found)
adapter.connect_to_signal('DiscoveryCompleted', discovery_completed)

adapter.DiscoverDevices()
gtk.main()

It is also possible to configure periodic discovery, which will send signals about devices that get found, disappear, or change name, so we could easily implement the obex: directory listing that shows all the devices found that support OBEX-FTP. One thing that isn’t clear from the API documentation is what happens if multiple programs try to start or stop discovery at the same time. It looks like the second program will get an org.bluez.Error.InProgress error when it tries to begin discovery. Ideally discovery would stay active until the last program interested in the results closed. Maybe I am misunderstanding it a bit and you can actually use the interface in this mode.

When we want to actually do OBEX-FTP with the device, we can establish the rfcomm connection:

rfcomm = dbus.Interface(
    bus.get_object('org.bluez', manager.DefaultAdapter()),
    'org.bluez.RFCOMM')

# will return e.g. /dev/rfcomm0
devname = rfcomm.Connect(bluetooth_address, 'ftp')

# communicate with the phone via the new rfcomm device

rfcomm.Disconnect(devname)

So it should be possible to modify obex-method to function with only the daemons included in Ubuntu Edgy. All that’s left is to do the actual work :).

Launchpad entered into Python bug tracker competition

The Python developers have been looking for a new bug tracker, and essentially put out a tender for people interested in providing a bug tracker. Recently I have been working on getting Launchpad’s entry ready, which mainly involved working on SourceForge import.

The entry is now up, and our demonstration server is up and running with a snapshot of the Python bug tracker data.

As a side effect of this, we’ve got fairly good SourceForge tracker import support now, which we should be able to use if other projects want to switch away from SF.

Re: Lazy loading

Emmanuel: if you are using a language like Python, you can let the language keep track of your state machine for something like that:

import gobject

def load_items(treeview, liststore, items):
    for obj in items:
        liststore.append((obj.get_foo(),
                          obj.get_bar(),
                          obj.get_baz()))
        yield True
    treeview.set_model(liststore)
    yield False

def lazy_load_items(treeview, liststore, items):
    gobject.idle_add(load_items(treeview, liststore, items).next)

Here, load_items() is a generator that will iterate over a sequence like [True, True, ..., True, False]. The next() method is used to get the next value from the iterator. When used as an idle function with this particular generator, it results in one item being added to the list store per idle call until we get to the end of the generator body, where the “yield False” statement results in the idle function being removed.
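
The control flow can be tried without GTK. In this minimal sketch, run_idle is a hypothetical stand-in for the main loop’s idle handling (it just keeps calling the callback until it returns a false value), and a plain list takes the place of the list store. It is written for Python 3, where the generator is advanced with next(gen) rather than a .next method.

```python
def load_items(sink, items):
    """Generator: add one item per step, yielding True until finished."""
    for obj in items:
        sink.append(obj)
        yield True
    yield False


def run_idle(callback):
    """Toy idle loop: call callback until it returns a false value."""
    calls = 0
    while True:
        calls += 1
        if not callback():
            break
    return calls


loaded = []
gen = load_items(loaded, ["a", "b", "c"])
calls = run_idle(lambda: next(gen))
print(calls, loaded)
```

Three items take four idle calls: one per item, plus the final call that yields False and removes the idle function.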

For a lot of algorithms, this removes the need to design and debug a state machine equivalent. Of course, it is possible to do similar things in C but that’s even more obscure :).

pygpgme 0.1 released

Back in January I started working on a new Python wrapper for the GPGME library. I recently put out the first release:

http://cheeseshop.python.org/pypi/pygpgme/0.1

This library allows you to encrypt, decrypt, sign and verify messages in the OpenPGP format, using gpg as the backend. In general, it stays fairly close to the C API with the following changes:

  • Represent C structures as Python classes where appropriate (e.g. contexts, keys, etc). Operations on those data types are converted to methods.
  • The gpgme_data_t type is not exposed directly. Instead, any Python object that looks like a file object can be passed (including StringIO objects).
  • In cases where there are gpgme_op_XXXX() and gpgme_op_XXXX_result() function pairs, these have been replaced by a single gpgme.Context.XXXX() method. Errors are returned in the exception where appropriate.
  • No explicit memory management. As expected for a Python module, memory management is automatic.

The module also releases the global interpreter lock over calls that fork gpg subprocesses. This should make the module multithread friendly.

This code is being used inside Launchpad to verify incoming email and help manage users’ PGP public keys.

In other news, gnome-gpg 0.4 made it into dapper, so users of the next Ubuntu release can see the improvements.