Schema Generation in ORMs

When Storm was released, one of the comments was that it cannot generate a database schema from the Python classes used to represent the tables, a feature that is available in a number of competing ORMs. The simple reason for this is that we haven’t used schema generation in any of our ORM-using projects.

Furthermore, I’d argue that schema generation is not really appropriate for long-lived projects where the data stored in the database is important. Imagine developing an application along these lines:

  1. Write the initial version of the application.
  2. Generate a schema from the code.
  3. Deploy one or more instances of the application in production, and accumulate some data.
  4. Do further development on the application that involves modifications to the schema.
  5. Deploy the new version of the application.

In order to perform step 5, it will be necessary to modify the existing database to match the new schema. These changes might be in a number of forms, including:

  • adding or removing a table
  • adding or removing a column from a table
  • changing the way data is represented in a particular column
  • refactoring one table into two related tables or vice versa
  • adding or removing an index

Assuming that you want to keep the existing data, it isn’t enough to simply represent the new schema in the updated application: we need to know how that new schema relates to the old one in order to migrate the existing data.

For some changes, such as adding a table, it is pretty easy to update the database given only knowledge of the new schema. For others it is more difficult, and will often require custom migration logic. So it is likely that you will need to write a custom script to migrate the schema and data.
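As a rough illustration of why custom logic is needed, here is a minimal sketch of the "refactoring one table into two" case, using the sqlite3 module from the standard library. The table and column names (person, email, address) and the function name are made up for the example, and the SQL is SQLite-flavoured; a real migration would be written for whatever database is used in production.

```python
import sqlite3

def migrate_split_email(conn):
    """Move the hypothetical person.email column into its own table,
    keeping the existing rows."""
    conn.executescript("""
        ALTER TABLE person RENAME TO person_old;
        CREATE TABLE person (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE email (
            id INTEGER PRIMARY KEY,
            person INTEGER NOT NULL REFERENCES person(id),
            address TEXT NOT NULL
        );
        -- Copy the data across before dropping the old table.
        INSERT INTO person (id, name) SELECT id, name FROM person_old;
        INSERT INTO email (person, address)
            SELECT id, email FROM person_old WHERE email IS NOT NULL;
        DROP TABLE person_old;
    """)
    conn.commit()

# An "old version" database with some accumulated data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
conn.executemany(
    "INSERT INTO person (name, email) VALUES (?, ?)",
    [("Alice", "alice@example.com"), ("Bob", None)])

migrate_split_email(conn)

rows = conn.execute(
    "SELECT person.name, email.address FROM person "
    "JOIN email ON email.person = person.id").fetchall()
person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

Nothing in the new class definitions alone says that the old email column should become rows in the new table, which is exactly the knowledge the migration script has to carry.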

Now we have two methods of building the database schema for the application:

  1. Generate a schema from the new version of the application.
  2. Generate a schema from the old version of the application, then run the migration script.

Are you sure that the two methods will result in the same schema? How about if we iterate the process another 10 times or so? As a related question, are you sure that the database environment your tests are running under matches the production environment?
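To make the question concrete, here is a small sketch, again using sqlite3 from the standard library, that builds a database both ways and compares the result. The person table, the nickname column, and the describe() helper are all hypothetical; the point is only that a migration written by hand can drift from what the code would generate.

```python
import sqlite3

def describe(conn):
    """Map each table name to its columns as (name, type, notnull, pk)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        table: [(col[1], col[2], col[3], col[5])
                for col in conn.execute(f"PRAGMA table_info({table})")]
        for table in tables
    }

# Method 1: the schema as generated from the new version of the code.
fresh = sqlite3.connect(":memory:")
fresh.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "nickname TEXT NOT NULL DEFAULT '')")

# Method 2: the old schema, followed by a migration script that
# forgot the NOT NULL constraint on the new column.
migrated = sqlite3.connect(":memory:")
migrated.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
migrated.execute("ALTER TABLE person ADD COLUMN nickname TEXT")

schemas_match = describe(fresh) == describe(migrated)
```

Here schemas_match ends up False: the test database allows no NULL nicknames while the migrated production database does, and the application code gives no hint of the difference.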

The approach we settled on with Launchpad development was to only deal with migration scripts and not generate schemas from the code. The migration scripts are formulated as a sequence of SQL commands that migrate the schema and data as needed. So to set up a new instance, a base schema is loaded and then patched up to the current schema. Each patch leaves a record in the database that it has been applied, so it is trivial to bring a database up to date, or to check that an application is in sync with the database.
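The general idea can be sketched in a few lines with sqlite3. This is not Launchpad's actual tooling: the applied_patch table, the patch names, and the upgrade() and in_sync() helpers are assumptions made up for the illustration.

```python
import sqlite3

# Hypothetical base schema, including the table that records applied patches.
BASE_SCHEMA = """
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE applied_patch (name TEXT PRIMARY KEY);
"""

# Hypothetical patches as (identifier, SQL), listed in the order to apply them.
PATCHES = [
    ("001-add-nickname", "ALTER TABLE person ADD COLUMN nickname TEXT"),
    ("002-index-name", "CREATE INDEX person_name_idx ON person (name)"),
]

def applied_patches(conn):
    """Return the identifiers of all patches recorded in the database."""
    return {row[0] for row in conn.execute("SELECT name FROM applied_patch")}

def upgrade(conn):
    """Apply any patches not yet recorded, and return their identifiers."""
    done = applied_patches(conn)
    newly_applied = []
    for name, sql in PATCHES:
        if name in done:
            continue
        conn.execute(sql)
        # Record the patch so that it is never applied twice.
        conn.execute("INSERT INTO applied_patch (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied

def in_sync(conn):
    """Check that the database has exactly the patches the code expects."""
    return applied_patches(conn) == {name for name, _ in PATCHES}

conn = sqlite3.connect(":memory:")
conn.executescript(BASE_SCHEMA)
first_run = upgrade(conn)    # applies both patches
second_run = upgrade(conn)   # nothing left to do
```

Running upgrade() a second time is a no-op, and in_sync() gives the application a cheap check to make at startup.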

When the schema is not generated from the code, it also means that the code can be simpler. As far as the Python ORM layer is concerned, does it matter what type of integer a field contains? Does the Python code care what indexes or constraints are defined for the table? By only specifying what is needed to effectively map data to Python objects, we end up with easy-to-understand code without annotations that probably can’t specify everything we want anyway.

Storm Released

This week at the EuroPython conference, Gustavo Niemeyer announced the release of Storm and gave a tutorial on using it.

Storm is a new object-relational mapper for Python that was developed for use in some Canonical projects, and we’ve been working on moving Launchpad over to it. I’ll discuss a few of the nice features of the package:

Loose Binding Between Database Connections and Classes

Storm has a much looser binding between database connections and the classes used to represent records in particular tables. The standard way of querying the database uses a store object:

for obj in store.find(SomeClass, conditions):
    # do something with obj (which will be a SomeClass instance)

Some things to note about this syntax:

  • The class used to represent rows in the table is passed to find(), so it is possible to have multiple classes representing a single table. This can be useful with large tables where you are only interested in a few columns in some cases.
  • The class used to represent the table is not bound to a particular connection, so instances of it can come from different stores.

Lockstep Iteration

As well as iterating over a single table, a Storm result set can iterate over multiple tables together. For instance, if we have a table representing people and a table representing email addresses (where each person can have multiple email addresses), it is possible to iterate over them in lockstep:

for person, email in store.find((Person, Email), Person.id == Email.person):
    print(person.name, email.address)

Automatic Flushing Before Queries

One of the gotchas when using SQLObject was the way it locally cached updates to tables. This is a great way to reduce the number of updates sent to the database, but it could produce unexpected results when performing subsequent SELECT queries. It was up to the programmer to remember to flush changes before doing a query.

With Storm, the store flushes pending changes automatically before performing a query.
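The behaviour is easy to picture with a toy model. The ToyStore class below is purely illustrative and is not how Storm is implemented: it is a stand-in written with sqlite3 that buffers writes locally and sends them to the database before any read, so a query always sees the changes made earlier.

```python
import sqlite3

class ToyStore:
    """Illustrative only: buffers writes and flushes them before any read."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = []  # (sql, params) pairs not yet sent to the database

    def queue(self, sql, params=()):
        """Remember a change locally without touching the database."""
        self.pending.append((sql, params))

    def flush(self):
        """Send all buffered changes to the database, in order."""
        for sql, params in self.pending:
            self.conn.execute(sql, params)
        self.pending = []

    def query(self, sql, params=()):
        """Flush first, so the SELECT sees every queued change."""
        self.flush()
        return self.conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

store = ToyStore(conn)
store.queue("INSERT INTO person (name) VALUES (?)", ("Alice",))
names = store.query("SELECT name FROM person")
```

Without the flush() call inside query(), the SELECT would come back empty even though the insert had already been "made" from the program's point of view, which is the gotcha described above.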

Easy To Execute Raw SQL

An ORM can really help when developing a database-driven application, but sometimes plain old SQL is a better fit. Storm makes it easy to execute raw SQL against a particular store with the store.execute() method. This method returns an object that you can iterate over to get the tuples from the result set. It also makes sure that any local changes have been flushed before executing the query.

Nice Clean Code

After I had worked with SQLObject for a while, Storm has been a breath of fresh air. The internals are clean and nicely laid out, which makes hacking on it very easy. It was developed using a test-driven development methodology, so there is an extensive test suite that makes it easy to validate changes.