MySQL Announces Move to Bazaar

It has been a while coming, but MySQL has announced their move to Bazaar for version control, and it is great to finally see it announced publicly.

The published Bazaar branches include 8 years of history going back to MySQL 3.23.22, imported from the BitKeeper repositories.  So you can see a lot more than just the history since the switch: you can use all the normal Bazaar tools to see where the code came from and how it evolved.  Giuseppe Maxia has posted some instructions on how to check out the code for those who are interested.

I haven’t checked extensively, but I wouldn’t be surprised if this is the largest public code base managed with Bazaar.  I’ve known from personal experience working on Launchpad that it is capable of handling large trees, but it is good to have a high profile project to point at as an example now.

Psycopg Migrated to Bazaar

Last week we moved psycopg from Subversion to Bazaar.  I did the migration using Gustavo Niemeyer’s svn2bzr tool with a few tweaks to map the old Subversion committer IDs to the email address form conventionally used by Bazaar.
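
The tweak amounted to little more than a lookup table from Subversion usernames to Bazaar’s Name <email> form. A minimal sketch of the idea is below; the usernames, addresses and helper function are made up for illustration and are not svn2bzr’s actual interface.

    # Hypothetical committer map: the usernames and addresses are placeholders,
    # not the real psycopg committers' details.
    COMMITTER_MAP = {
        "alice": "Alice Example <alice@example.org>",
        "bob": "Bob Example <bob@example.org>",
    }

    def map_committer(svn_username):
        """Translate a Subversion committer ID to Bazaar's 'Name <email>' form."""
        return COMMITTER_MAP.get(
            svn_username, "%s <%s@example.org>" % (svn_username, svn_username))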

The tool does a good job of following tree copies and creating related Bazaar branches.  It doesn’t have any special handling for stuff in the tags/ directory (it produces new branches, as it does for other tree copies).  To get real Bazaar tags, I wrote a simple post-processing script to calculate the heads of all the branches in a tags/ directory and set them as tags in another branch (provided those revisions occur in its ancestry).  This worked pretty well, except for a few revisions synthesised by a previous cvs2svn migration.  As these tags were from pretty old psycopg 1 releases, I don’t know how much that matters.
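
The post-processing step looked roughly like the sketch below. It assumes bzrlib’s Python API (Branch.open, branch.tags.set_tag and the repository graph’s is_ancestor); the function and paths are illustrative rather than the exact script that was used.

    # Turn the branches under tags/ into real Bazaar tags: take each branch's
    # head revision and, if it is in the target branch's ancestry, tag it there.
    import os
    from bzrlib.branch import Branch

    def apply_tags(tags_dir, target_branch_path):
        target = Branch.open(target_branch_path)
        graph = target.repository.get_graph()
        tip = target.last_revision()
        for name in sorted(os.listdir(tags_dir)):
            head = Branch.open(os.path.join(tags_dir, name)).last_revision()
            if graph.is_ancestor(head, tip):
                target.tags.set_tag(name, head)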

As there is no code browsing set up on initd.org yet, I set up mirrors of the 2.0.x and 1.x branches on Launchpad to fill that role.

It is pretty cool having access to the entire revision history locally, and it should make it easier to maintain full credit for contributions from non-core developers.

Inkscape Migrated to Launchpad

Yesterday I performed the migration of Inkscape’s bugs from SourceForge.net to Launchpad. This was a full import of all their historic bug data – about 6900 bugs.

As the import only had access to the SF user names for bug reporters, commenters and assignees, it was not possible to link them up to existing Launchpad users in most cases. This means that duplicate person objects have been created with email addresses like $USERNAME@users.sourceforge.net.

If you are a Launchpad user and have previously filed or commented on Inkscape bugs, you can clean up the duplicate person object by going to the following URL and entering your $USERNAME@users.sourceforge.net address:

https://launchpad.net/people/+requestmerge

After following the instructions in the email confirmation, all references to the duplicate person will be fixed up to point at your primary account (so bug mail will go to your preferred email address rather than being redirected through SourceForge).

Schema Generation in ORMs

When Storm was released, one of the comments made was that it did not include the ability to generate a database schema from the Python classes used to represent the tables, a feature that is available in a number of competing ORMs. The simple reason for this is that we haven’t used schema generation in any of our ORM-using projects.

Furthermore, I’d argue that schema generation is not really appropriate for long-lived projects where the data stored in the database is important. Imagine developing an application along these lines:

  1. Write the initial version of the application.
  2. Generate a schema from the code.
  3. Deploy one or more instances of the application in production, and accumulate some data.
  4. Do further development on the application that involves modifications to the schema.
  5. Deploy the new version of the application.

In order to perform step 5, it will be necessary to modify the existing database to match the new schema. These changes might take a number of forms, including:

  • adding or removing a table
  • adding or removing a column from a table
  • changing the way data is represented in a particular column
  • refactoring one table into two related tables or vice versa
  • adding or removing an index

Assuming that you want to keep the existing data, it isn’t enough to simply represent the new schema in the updated application: you need to know how that new schema relates to the old one in order to migrate the existing data.

For some changes, such as the addition of a table, it is pretty easy to update the existing database given knowledge of the new schema. For others it is more difficult and will often require custom migration logic, so it is likely that you will need to write a custom script to migrate the schema and data.
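
As a rough illustration of what such a script can involve, here is a hedged sketch of a migration that changes how a value is represented. The table, columns and connection details are hypothetical, and psycopg2 is used only because it fits the rest of this post.

    # Hypothetical migration: prices move from a floating point dollar amount to
    # an integer number of cents. The schema change and the data migration have
    # to happen together, and the old column can only go once the data is copied.
    import psycopg2

    def migrate(dsn):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute("ALTER TABLE item ADD COLUMN price_cents integer")
        cur.execute(
            "UPDATE item SET price_cents = CAST(round(price * 100) AS integer)")
        cur.execute("ALTER TABLE item DROP COLUMN price")
        conn.commit()
        conn.close()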

Now we have two methods of building the database schema for the application:

  1. generate a schema from the new version of the application.
  2. generate a schema from the old version of the application, then run the migration script.

Are you sure that the two methods will result in the same schema? How about if we iterate the process another 10 times or so? As a related question, are you sure that the database environment your tests run under matches the production environment?

The approach we settled on with Launchpad development was to only deal with migration scripts and not generate schemas from the code. The migration scripts are formulated as a sequence of SQL commands to migrate the schema and data as needed. So to set up a new instance, a base schema is loaded and then patched up to the current schema. Each patch leaves a record in the database that it has been applied, so it is trivial to bring a database up to date, or to check that an application is in sync with the database.
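
A minimal sketch of that idea is below. It is not Launchpad’s actual tooling: the schema_patch table, the patch file naming, and the assumption that the base schema creates schema_patch itself are all illustrative.

    # Bring a database up to date by applying, in order, any numbered SQL patch
    # files that have not yet been recorded in the schema_patch table.
    import glob
    import os
    import psycopg2

    def upgrade(dsn, patch_dir):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute("SELECT name FROM schema_patch")
        applied = set(row[0] for row in cur.fetchall())
        for path in sorted(glob.glob(os.path.join(patch_dir, "patch-*.sql"))):
            name = os.path.basename(path)
            if name in applied:
                continue  # already applied on a previous run
            cur.execute(open(path).read())
            cur.execute("INSERT INTO schema_patch (name) VALUES (%s)", (name,))
        conn.commit()
        conn.close()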

When the schema is not generated from the code, it also means that the code can be simpler. As far as the Python ORM layer is concerned, does it matter what type of integer a field contains? Does the Python code care what indexes or constraints are defined for the table? By only specifying what is needed to effectively map data to Python objects, we end up with easy-to-understand code, free of annotations that probably couldn’t specify everything we want anyway.
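
For example, a Storm table class only needs the table name, the primary key and the columns the code actually touches. The class below is a hypothetical sketch rather than code from any real project:

    # A minimal Storm mapping: no index definitions, no constraints, no exact
    # integer widths. Those details live in the SQL schema, not here.
    from storm.locals import Int, Unicode

    class Bug(object):
        __storm_table__ = "bug"

        id = Int(primary=True)
        title = Unicode()
        owner_id = Int()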