Using Storm with Django

I’ve been playing around with Django a bit for work recently, and it has been interesting to see which choices its developers made differently from Zope 3.  There were a few things that surprised me:

  • The ORM and database layer defaults to autocommit mode rather than using transactions.  This seems like an odd choice given that all the major free databases support transactions these days.  While autocommit might work fine when a web application is under light use, it is a recipe for problems at higher loads.  By using transactions that last for the duration of the request, the testing you do is more likely to help with the high load situations.
  • While there is a middleware class to enable request-duration transactions, it only covers the database connection.  There is no global transaction manager to coordinate multiple DB connections or other resources.
  • The ORM appears to only support a single connection for a request.  While this is the most common case and should be easy to code with, allowing an application to expand past this limit seems prudent.
  • The tutorial promotes schema generation from Python models, which I feel is the wrong choice for any application that is likely to evolve over time (i.e. pretty much every application).  I’ve written about this previously and believe that migration based schema management is a more workable solution.
  • It poorly reinvents thread local storage in a few places.  This isn’t too surprising for things that existed prior to Python 2.4, and probably isn’t a problem for its default mode of operation.
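
To make the first point concrete, here is a minimal sketch of a request-duration transaction.  It uses the standard library’s sqlite3 module rather than Django’s database layer, and the names in_request_transaction, open_demo_db and failing_request are made up for illustration.  The idea is only that every write made while handling a request is committed or rolled back as a unit:

```python
import sqlite3


def in_request_transaction(conn, handler):
    """Run handler(conn) in one transaction: commit on success, roll back on error."""
    conn.execute("BEGIN")
    try:
        result = handler(conn)
    except Exception:
        # Any failure while handling the request discards all of its writes.
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return result


def open_demo_db():
    # isolation_level=None stops the sqlite3 module from issuing its own BEGIN,
    # so the explicit BEGIN/COMMIT/ROLLBACK above are the only ones in play.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE orders (item TEXT)")
    return conn


def failing_request(conn):
    # Hypothetical view: it writes a row and then fails part way through.
    conn.execute("INSERT INTO orders VALUES ('widget')")
    raise RuntimeError("view blew up after the first write")


def count_orders(conn):
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In autocommit mode the row written by failing_request would stay in the table after the error; with the wrapper it is rolled back along with the rest of the request.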

Other than these things I’ve noticed so far, it looks like a nice framework.

Integrating Storm

I’ve been doing a bit of work to make it easy to use Storm with Django.  I posted some initial details on the mailing list.  The initial code has been published on Launchpad but is not yet ready to merge. Some of the main details include:

  • A middleware class that integrates the Zope global transaction manager (which requires just the zope.interface and transaction packages).  There doesn’t appear to be any equivalent functionality in Django, and this made it possible to reuse the existing integration code (an approach that has been taken to use Storm with Pylons).  It will also make it easier to take advantage of other future improvements (e.g. only committing stores that are used in a transaction, two phase commit).
  • Stores can be configured through the application’s Django settings file, and are managed as long lived per-thread connections.
  • A simple get_store(name) function is provided for accessing per-thread stores within view code.
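
The per-thread store lookup can be pictured roughly as follows.  This is not the code from the Launchpad branch, just a sketch of the idea under my own assumptions: register_store and the factory registry are hypothetical, and any zero-argument callable stands in for whatever creates a real Storm store from the settings.

```python
import threading

# Hypothetical registry: store name -> zero-argument callable that builds a store.
_store_factories = {}

# Each thread sees its own attributes on this object.
_per_thread = threading.local()


def register_store(name, factory):
    """Record how to create the named store (stand-in for reading settings)."""
    _store_factories[name] = factory


def get_store(name):
    """Return this thread's store for name, creating it on first use."""
    stores = getattr(_per_thread, "stores", None)
    if stores is None:
        stores = _per_thread.stores = {}
    if name not in stores:
        stores[name] = _store_factories[name]()
    return stores[name]
```

Repeated calls within one thread return the same object, while another thread asking for the same name gets a store of its own.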

What this doesn’t do yet is provide much integration with existing Django functionality (e.g. django.contrib.admin).  I plan to try and get some of these bits working in the near future.

Metrics for success of a DVCS

One point raised in the GNOME DVCS debate is that it is as easy to run “git diff” as it is to run “svn diff”, so the learning curve issue is moot.  I’d have to disagree here.

Traditional Centralised Version Control

With traditional version control systems (e.g. CVS and Subversion) as used by Free Software projects like GNOME, there are effectively two classes of users that I will refer to as “committers” and “patch contributors”:

[Diagram: Centralised VCS Users]

Patch contributors are limited to read only access to the version control system.  They can check out a working copy to make changes, and then produce a patch with the “diff” command to submit to a bug tracker or send to a mailing list.  This is where new contributors start, so it is important that it be easy to get started in this mode.

Once a contributor is trusted enough, they may be given write access to the repository, moving them to the committers group.  They now have access to more functionality from the VCS, including the ability to checkpoint changes into focused commits, possibly on branches.  The contributor may still be required to go through patch review before committing, or may be given free rein to commit changes as they see fit.

Some problems with this arrangement include:

  • New developers are given a very limited set of tools to do their work.
  • If a developer goes to the trouble of learning the advanced features of the version control system, they are still limited to the read only subset if they decide to start contributing to another project.

Distributed Workflow

A DVCS allows anyone to commit to their own branches and provides the full feature set to all users.  This splits the “committers” class into two classes:

[Diagram: Distributed VCS Users]

The social aspect of the “committers” group now becomes the group of people who can commit to the main line of the project – the core developers. Outside this group, we have people who make use of the same features of the VCS as the core developers but do not have write access to the main line: their changes must be reviewed and merged by a core developer.

I’ve left the “patch contributor” class in the above diagram because not all contributors will bother learning the details of the VCS.  For projects I’ve worked on that used a DVCS, I’ve still seen people send simple patches (either from the “xxx diff” command, or as diffs against a tarball release) and I don’t think that is likely to change.

Measuring Success

Making the lives of core developers better is often brought up as a reason to switch to a DVCS (e.g. through features like offline commits, local cache of history, etc).  I’d argue that making life easier for non core contributors is at least as important.  One way we can measure this is by looking at whether such contributors are actually using VCS features beyond what they could with a traditional centralised setup.

By looking at the relative numbers of contributors who submit regular patches and those that either publish branches or submit changesets we can get an idea of how much of the VCS they have used.

It’d be interesting to see the results of a study based on contributions to various projects that have already adopted DVCS.  Although I don’t have any reliable numbers, I can guess at two things that might affect the results:

  1. Familiarity for existing developers.  There is a lot of cross pollination in Free Software, so it isn’t uncommon for a new contributor to have worked on another project beforehand.  Using a VCS with a familiar command set (or the same VCS) can help here.
  2. A gradual learning curve.  New contributors should be able to get going with a small command set, and easily learn more features as they need them.

I am sure that there are other things that would affect the results, but these are the ones that I think would have the most noticeable effects.

DVCS talks at GUADEC

Yesterday, a BoF was scheduled for discussion of distributed version control systems with GNOME.  The BoF session did not end up really discussing the issues of what GNOME needs out of a revision control system, and some of the examples Federico used were a bit snarky.

We had a more productive meeting in the session afterwards where we went over some of the concrete goals for the system.  The list from the blackboard was:

  • Contributor collaboration (i.e. let anyone use the tool rather than just core developers).
  • Distro ⇔ distro and distro ⇔ upstream collaboration.
  • Host GNOME source code repositories
  • Code review
  • Server side hooks
  • Translators: what to do?
  • Enforced checks
  • Offline operations
  • Documentation authors?
  • Support Win32/Mac (important for GTK)

The sys admin tasks were broken down to:

  • MAINTAINERS file syntax checking
  • PO file syntax checking
  • CIA integration.
  • Commits mailing list
  • Check that commit messages are not empty
  • Trigger updates from commits (e.g. the web site module).
  • Release notes tarballs
  • Damned Lies support

It was clear from the discussion that neither Git nor Bazaar satisfied all of the criteria.

The Playground

John Carr did a great job setting up Bazaar mirrors of all the GNOME modules.  This provided an easy way for people to play around with Bazaar.  However, it only gave you half the experience since it didn’t provide a way to publish code and collaborate.

To aid in this, we have set up the bzr-playground.gnome.org machine, which any GNOME developer should be able to use to publish branches based on John’s imports.  Instructions on getting set up can be found on the wiki.  I hope that we will get a lot of people trying out this infrastructure.

We gave a presentation today on some of the things Bazaar provides that could be useful when hacking on GNOME.  Demoing bzr-playground was a bit problematic due to the internet connection problems at the venue, but I think we still showed some useful tools for local collaboration, searching and code review.

Meanwhile, Robert Collins has been working on some of the GNOME sysadmin features that Bazaar was lacking.  Among other things, he got Damned Lies working with both Subversion and Bazaar, with a test installation on the playground machine.

MySQL Announces Move to Bazaar

It has been a while coming, but MySQL has announced their move to Bazaar for version control.  It is great to finally see it announced publicly.

The published Bazaar branches include 8 years of history going back to MySQL 3.23.22, imported from the BitKeeper repositories.  So you can see a lot more than just the history since the switch: you can use all the normal Bazaar tools to see where the code came from and how it evolved.  Giuseppe Maxia has posted some instructions on how to check out the code for those who are interested.

I haven’t checked extensively, but I wouldn’t be surprised if this is the largest public code base managed with Bazaar.  I’ve known from personal experience working on Launchpad that it is capable of handling large trees, but it is good to have a high profile project to point at as an example now.

How not to do thread local storage with Python

The Python standard library contains a function called thread.get_ident().  It will return an integer that uniquely identifies the current thread at that point in time.  On most UNIX systems, this will be the pthread_t value returned by pthread_self(). At first look, this might seem like a good value to key a thread local storage dictionary with.  Please don’t do that.

The value uniquely identifies the thread only as long as it is running.  The value can be reused after the thread exits.  On my system, this happens quite reliably with the following sample program printing the same ID ten times:

import thread, threading

def foo():
    print 'Thread ID:', thread.get_ident()

for i in range(10):
    t = threading.Thread(target=foo)
    t.start()
    t.join()

If the return value of thread.get_ident() was used to key thread local storage, all ten threads would share the same storage. This is not generally considered to be desirable behaviour.

Assuming that you can depend on Python 2.4 (released 3.5 years ago), then just use a threading.local object. It will result in simpler code, correctly handle serially created threads, and you won’t hold onto TLS data past the exit of a thread.
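
As a rough sketch of what that looks like (written for Python 3, where the identifier function now lives at threading.get_ident(); the helper names bump_counter and run_in_new_thread are made up for this example), the following repeats the serial-thread experiment from above but keeps a counter in a threading.local object:

```python
import threading

# Attributes set on this object are visible only to the thread that set them,
# and are discarded when that thread exits.
_tls = threading.local()


def bump_counter():
    """Increment and return a counter stored on the current thread only."""
    _tls.count = getattr(_tls, "count", 0) + 1
    return _tls.count


def run_in_new_thread(func):
    """Run func() in a fresh thread, wait for it, and return its result."""
    result = []
    t = threading.Thread(target=lambda: result.append(func()))
    t.start()
    t.join()
    return result[0]


# Ten threads created one after another, as in the get_ident() example.
# Their identifiers may well be reused, but each starts with empty storage.
serial_counts = [run_in_new_thread(bump_counter) for _ in range(10)]
```

Every entry in serial_counts is 1.  A dictionary keyed on the thread identifier could instead hand a later thread the counter left behind by an earlier one.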

You will save yourself (or another developer) a lot of time at some point in the future. Debugging these problems is not fun when you combine code doing proper TLS with other code doing broken TLS.

Prague

I arrived in Prague yesterday for the Ubuntu Developer Summit.  Including time spent in transit in Singapore and London, the flights took about 30 hours.

As I was flying on BA, I got to experience Heathrow Terminal 5. It wasn’t quite as bad as some of the horror stories I’d heard.  There were definitely aspects that weren’t forgiving of mistakes.  For example, when taking the train to the “B” section there was a sign saying that if you accidentally got on the train when you shouldn’t have it would take 40 minutes to get back to the “A” section.

It is also quite difficult to find water fountains in the terminal, which is inexcusable given that they don’t let people bring their own water bottles.

I had been a bit worried that they’d lose my bag, but it arrived okay in Prague.  Jonathan was not so lucky.

As well as the Ubuntu and Canonical folks, there are a bunch of GNOME developers here, including Ryan, Murray, Olav, David and Lennart.  It will be an interesting week.

bzr commit --author

One of the features I recently discovered in Bazaar is the --author option for “bzr commit”.  This lets you make commits to a Bazaar branch on behalf of another person.  When used, the new revision credits two people: you as the committer and the other person as the author.

While Bazaar does make it easy for non-core contributors to send changes in a form that correctly attributes them (e.g. by publishing a branch or sending a bundle), I doubt we’ll ever see the end of pure patches.  Some cases include:

  • Patches based on a tarball release.   In these cases the contributor likely hasn’t even used the VCS.
  • People send simple diffs from e.g. “bzr diff” since that is sometimes the easiest solution (or what they do by default due to having transferred their knowledge from another VCS).
  • Some people use a VCS bridge so they can work with their favourite VCS.  They might not be able to provide their changes as Bazaar commits due to this.

The --author option lets you commit these changes in a way that credits the contributor for their work.  The author of the change will then be displayed in “bzr annotate” output and credited along with you in the “bzr log” output.

The feature is also used by a number of plugins such as bzr-rebase: if you replay or rebase someone else’s changes, the new revisions will credit you as the committer and the original committer as the author.

SSL caching on Firefox 3

Since upgrading to Ubuntu Hardy, I’ve been enjoying using Firefox 3.  The reduced memory usage has made a lot of other things nicer to use (I don’t feel like I need to buy more memory now).  One thing that is nice to see fixed is caching of SSL content.

In previous versions of Firefox, SSL content was never cached to disk with the default settings.  While you certainly don’t want all SSL content to be written to disk, a lot of it can be cached without problem.  For example, it is important that the CSS and JavaScript for a page be served via SSL to avoid man in the middle attacks (injecting arbitrary active content into a secure page is bad), but there isn’t much harm in caching them to disk: if the attacker can modify the disk cache then SSL probably doesn’t matter much.

Now it was possible to turn on disk caching in Firefox 2 through the browser.cache.disk_cache_ssl hidden option, but it had a serious drawback: the security information for resources was not saved in the disk cache so you’d get a broken padlock if resources were loaded from the cache.

Firefox 3 fixes up the disk cache to record the security information though, so turning on the disk_cache_ssl setting no longer results in a broken padlock.  But what about all the people using Firefox with its default settings (or those who do not want all SSL content cached to disk)?  For these users, the web server can still cause some content to be cached.

By sending the “Cache-Control: public” response header, the server can say that a resource can be stored in the disk cache.  Firefox 3 will respect this irrespective of the disk_cache_ssl setting.  This should bring Firefox back into parity with Internet Explorer with respect to network performance on SSL web sites.
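
On the server side, one way to do this is a small piece of WSGI middleware.  The sketch below is only an illustration under my own assumptions: the name cache_static_over_ssl, the choice of .css and .js as the cacheable suffixes, and the demo application are all made up, and a real site would decide for itself which responses are safe to mark as public.

```python
def cache_static_over_ssl(app, cacheable_suffixes=(".css", ".js")):
    """Wrap a WSGI app so successful responses for static paths are marked public."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if status.startswith("200") and path.endswith(cacheable_suffixes):
                # Replace any existing Cache-Control header with "public".
                headers = [(k, v) for k, v in headers
                           if k.lower() != "cache-control"]
                headers.append(("Cache-Control", "public"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware


def demo_app(environ, start_response):
    # Hypothetical application that serves the same body for every path.
    start_response("200 OK", [("Content-Type", "text/css")])
    return [b"body { color: black; }"]


def response_headers(app, path):
    """Call a WSGI app for path and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    b"".join(app({"PATH_INFO": path}, start_response))
    return captured["headers"]


wrapped = cache_static_over_ssl(demo_app)
```

With this in place, a request for /style.css gains the header while a request for a page such as /account is left alone, so sensitive content is still kept out of the disk cache under the default settings.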

Psycopg migrated to Bazaar

Last week we moved psycopg from Subversion to Bazaar.  I did the migration using Gustavo Niemeyer’s svn2bzr tool with a few tweaks to map the old Subversion committer IDs to the email address form conventionally used by Bazaar.

The tool does a good job of following tree copies and creating related Bazaar branches.  It doesn’t have any special handling for stuff in the tags/ directory (it produces new branches, as it does for other tree copies).  To get real Bazaar tags, I wrote a simple post-processing script to calculate the heads of all the branches in a tags/ directory and set them as tags in another branch (provided those revisions occur in its ancestry).  This worked pretty well except for a few revisions synthesised by a previous cvs2svn migration.  As these tags were from pretty old psycopg 1 releases I don’t know how much it matters.

As there is no code browsing set up on initd.org yet, I set up mirrors of the 2.0.x and 1.x branches on Launchpad to provide it.

It is pretty cool having access to the entire revision history locally, and should make it easier to maintain full credit for contributions from non-core developers.

Psycopg2 2.0.7 Released

Yesterday Federico released version 2.0.7 of psycopg2 (a Python database adapter for PostgreSQL).  I made a fair number of the changes in this release to make it more usable for some of Canonical’s applications.  The new release should work with the development version of Storm, and shouldn’t be too difficult to get everything working with other frameworks.

Some of the improvements include:

  • Better selection of exceptions based on the SQLSTATE result field.  This causes a number of errors that were reported as ProgrammingError to use a more appropriate exception (e.g. DataError, OperationalError, InternalError).  This was the change that broke Storm’s test suite as it was checking for ProgrammingError on some queries that were clearly not programming errors.
  • Proper error reporting for commit() and rollback(). These methods now use the same error reporting code paths as execute(), so an integrity error on commit() will now raise IntegrityError rather than OperationalError.
  • The compile-time switch that controls whether the display_size member of Cursor.description is calculated is now turned off by default.  The code was quite expensive and the field is of limited use (and not provided by a number of other database adapters).
  • New QueryCanceledError and TransactionRollbackError exceptions.  The first is useful for handling queries that are canceled by statement_timeout.  The second provides a convenient way to catch serialisation failures and deadlocks: errors that indicate the transaction should be retried.
  • Fixes for a few memory leaks and GIL misuses. One of the leaks was in the notice processing code that could be particularly problematic for long-running daemon processes.
  • Better test coverage and a driver script to run the entire test suite in one go.  The tests should all pass too, provided your database cluster uses unicode (there was a report just before the release of one test failing for a LATIN1 cluster).
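
To show why a dedicated exception for retryable failures is handy, here is a minimal sketch of a retry loop.  It is deliberately independent of psycopg2 so that it runs on its own: run_with_retry and the fake exception and operation are invented for this example, and with psycopg2 you would pass psycopg2.extensions.TransactionRollbackError as the retryable class and the connection’s rollback method as the rollback callable.

```python
def run_with_retry(operation, retryable, rollback, attempts=3):
    """Call operation(), rolling back and retrying when a retryable error occurs."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retryable:
            # The transaction is dead either way, so always roll it back.
            rollback()
            if attempt == attempts:
                raise


class FakeSerializationFailure(Exception):
    """Stand-in for a serialisation failure or deadlock reported by the database."""


class FlakyOperation:
    """Hypothetical unit of work that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise FakeSerializationFailure("could not serialize access")
        return "committed"


rollbacks = []
flaky = FlakyOperation(failures=2)
outcome = run_with_retry(flaky, FakeSerializationFailure,
                         rollback=lambda: rollbacks.append("rollback"))
```

Here the operation fails twice and succeeds on the third attempt, with a rollback after each failure.  Errors of any other type propagate immediately, which is the point of being able to catch only the ones that indicate a retry is worthwhile.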

If you’re using previous versions of psycopg2, I’d highly recommend upgrading to this release.

Future work will probably involve support for the DB-API two phase commit extension, but I don’t know when I’ll have time to get to that.