Schema Generation in ORMs

When Storm was released, one of the comments was that it does not include the ability to generate a database schema from the Python classes used to represent the tables, a feature that is available in a number of competing ORMs. The simple reason for this is that we haven’t used schema generation in any of our ORM-using projects.

Furthermore, I’d argue that schema generation is not really appropriate for long-lived projects where the data stored in the database is important. Imagine developing an application along these lines:

  1. Write the initial version of the application.
  2. Generate a schema from the code.
  3. Deploy one or more instances of the application in production, and accumulate some data.
  4. Do further development on the application that involves modifications to the schema.
  5. Deploy the new version of the application.

In order to perform step 5, it will be necessary to modify the existing database to match the new schema. These changes might take a number of forms, including:

  • adding or removing a table
  • adding or removing a column from a table
  • changing the way data is represented in a particular column
  • refactoring one table into two related tables or vice versa
  • adding or removing an index

Assuming that you want to keep the existing data, it isn’t enough to simply represent the new schema in the updated application: we need to know how that new schema relates to the old one in order to migrate the existing data. A renamed column, for instance, looks identical to a dropped column plus a new one if all you have is the new schema, yet the two interpretations have very different consequences for the data.

For some changes, such as the addition of a table, it is pretty easy to update the existing database given knowledge of the new schema. For others it is more difficult and will often require custom migration logic, so it is likely that you will need to write a custom script to migrate the schema and data.
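As a rough illustration, here is a minimal sketch of such a migration script. It uses the standard sqlite3 module purely for the example, and the person table with its single name column being split into first_name and last_name is hypothetical. The point is that the new schema alone cannot describe this change; the script has to encode how the old data maps onto the new columns.

    import sqlite3

    def migrate(conn):
        cur = conn.cursor()
        # Schema change: add the two new columns.
        cur.execute("ALTER TABLE person ADD COLUMN first_name TEXT")
        cur.execute("ALTER TABLE person ADD COLUMN last_name TEXT")
        # Data migration: derive the new columns from the old one.
        for person_id, name in cur.execute(
                "SELECT id, name FROM person").fetchall():
            first, _, last = name.partition(" ")
            cur.execute(
                "UPDATE person SET first_name = ?, last_name = ? WHERE id = ?",
                (first, last, person_id))
        conn.commit()

    if __name__ == "__main__":
        migrate(sqlite3.connect("app.db"))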

Now we have two methods of building the database schema for the application:

  1. Generate a schema from the new version of the application.
  2. Generate a schema from the old version of the application, then run the migration script.

Are you sure that the two methods will result in the same schema? How about after we iterate the process another 10 times or so? As a related question, are you sure that the database environment your tests run against matches the production environment?

The approach we settled on for Launchpad development was to deal only with migration scripts and not generate schemas from the code. The migration scripts are formulated as a sequence of SQL commands that migrate the schema and data as needed. To set up a new instance, a base schema is loaded and then patched up to the current schema. Each patch leaves a record in the database noting that it has been applied, so it is trivial to bring a database up to date, or to check that an application is in sync with the database.
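A minimal sketch of that patch-tracking idea might look like the following. This is not Launchpad’s actual tooling; the patches directory and the patch_applied table are hypothetical names. Each SQL patch file is applied at most once, in order, and recorded once it has run.

    import os
    import sqlite3

    def apply_pending_patches(conn, patch_dir="patches"):
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS patch_applied (name TEXT PRIMARY KEY)")
        applied = {row[0] for row in cur.execute("SELECT name FROM patch_applied")}
        # Apply each outstanding SQL patch in order, recording it afterwards.
        for name in sorted(os.listdir(patch_dir)):
            if not name.endswith(".sql") or name in applied:
                continue
            with open(os.path.join(patch_dir, name)) as f:
                cur.executescript(f.read())
            cur.execute("INSERT INTO patch_applied (name) VALUES (?)", (name,))
            conn.commit()

Checking that an application matches the database then reduces to comparing the patches it ships with against the contents of the tracking table.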

When the schema is not generated from the code, the code itself can also be simpler. As far as the Python ORM layer is concerned, does it matter what type of integer a field contains? Does the Python code care what indexes or constraints are defined for the table? By specifying only what is needed to map data to Python objects effectively, we end up with easy-to-understand code, free of annotations that probably can’t express everything we want anyway.
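For instance, a table mapped with Storm needs little more than the table name and the columns the Python code actually uses; column widths, indexes and constraints live in the SQL patches instead. The person table here is hypothetical.

    from storm.locals import Int, Unicode

    class Person(object):
        __storm_table__ = "person"
        id = Int(primary=True)
        name = Unicode()
        team_id = Int()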