New Default Branch Format in Bzr

One of the new features in the soon-to-be-released bzr 0.8 is the new “knit” storage format.

When comparing the size of the repository data for jhbuild with “knit” and “metadir” formats (metadir is just the old storage format with repository, branch and checkout bookkeeping separated), I see the following:

                   metadir    knit
Size               9.9MB      5.5MB
Number of files    1267       307

The reason for the smaller number of files is that information about all revisions in the repository is now stored together rather than in separate files. So the file count comes out at a constant plus 2 times the number of tracked files (a knit index file plus the knit data file). For comparison, the CVS repository I imported this from was 4.4MB, and comprised 143 files.

As well as reducing storage requirements, the new knit repository format is designed to reduce network traffic. With the current weave repository format, the weave file for each file touched by a commit gets rewritten to include the contents of the new revision. With knits, by contrast, the information about the new revision can simply be appended to the knit data file and the knit index file updated to match. This means publishing a branch to a server via sftp mainly involves append operations, resulting in a nice speedup.

Similarly, when pulling new changes from a published branch, bzr only needs to download a knit index to find out which sections of the knit data are missing locally. It can then ask for just the changed sections (by an HTTP range request or a partial read with sftp), rather than downloading the entire contents of the changed weaves.
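
To make the append-and-partial-read idea concrete, here is a minimal Python sketch. It is not bzr's actual knit format: the ToyKnit class, its record layout and its in-memory index are all invented for illustration, and the read_range method merely stands in for an HTTP range request or a partial sftp read.

```python
import io


class ToyKnit:
    """Toy append-only store: a data stream plus an index of byte ranges.

    Hypothetical illustration only; this is not bzr's real knit layout.
    """

    def __init__(self):
        self.data = io.BytesIO()  # stands in for the knit data file
        self.index = []           # stands in for the knit index:
                                  # (revision_id, offset, length) entries

    def add_revision(self, revision_id, payload):
        # Append the new record; bytes already written are never rewritten.
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(payload)
        self.index.append((revision_id, offset, len(payload)))

    def read_range(self, offset, length):
        # Stands in for an HTTP range request or a partial sftp read.
        self.data.seek(offset)
        return self.data.read(length)


def pull(local, remote):
    """Copy only the records the local store is missing.

    Returns the number of data bytes that had to be fetched.
    """
    have = {revision_id for revision_id, _, _ in local.index}
    fetched = 0
    for revision_id, offset, length in remote.index:
        if revision_id not in have:
            local.add_revision(revision_id, remote.read_range(offset, length))
            fetched += length
    return fetched
```

In this toy model a pull reads the whole (small) index but only the byte ranges of the missing records, which is the property the knit format is aiming for.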

Overall, this should make bzr 0.8 a lot more usable than 0.7 for various network operations.

Repositories in Bzr

One of the new features coming up in the next release of bzr is support for shared repositories. This provides a way to reduce the disk space needed to store multiple related branches. To understand how repositories work, it helps to know a bit about how branches are stored by bzr.

[bzr repository diagram]

There are three concepts that make up a bzr branch:

  1. A checkout or working tree. These are the source files you are working with. It represents the state of the source code at some recorded revision plus any local changes you’ve made. In the diagram on the right, it is represented as the red node.
  2. The branch, consisting of a linear sequence of revisions. This is represented by the blue nodes in the diagram. Note that there may be multiple paths from the first revision to the current revision due to branching and merging. The branch revision history indicates the path that was taken by this particular branch.
  3. The repository, being a store of the text of all the revisions in the ancestry of the branch, plus metadata about those revisions. This essentially stores information about every node and edge in the diagram.

In previous versions of bzr, these three parts were not clearly separated. However, with the new default branch format in bzr 0.8 they are, and a particular directory need not contain all three parts, which is what makes the space savings and performance improvements possible.

One of the biggest space savings is achieved from sharing the repository data between branches. If a particular branch does not contain any repository information, bzr will check each parent directory in turn until it finds a repository. If a collection of branches share some of their history, then the single shared repository will be significantly smaller than the space used if each branch had its own repository data.
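
The lookup described above can be sketched in a few lines of Python. This is an illustration of the idea rather than bzr's own code: the function name find_repository is made up, and the default marker path (.bzr/repository) is an assumption about the on-disk layout.

```python
import os


def find_repository(branch_dir, marker=os.path.join(".bzr", "repository")):
    """Walk up from branch_dir until a directory containing `marker` is found.

    The marker path is an assumption made for illustration.
    Returns the directory that holds the repository, or None.
    """
    current = os.path.abspath(branch_dir)
    while True:
        if os.path.isdir(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```

A branch that carries its own repository data is found on the first iteration; a branch inside a shared repository is resolved a few directories further up.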

Another way to reduce disk usage is to create branches without checkouts. This is useful when publishing a branch, since people pulling or merging from that branch don’t use the checkout files.

Finally, it is possible to create a checkout which does not contain branch or repository data, instead containing a pointer to where that data is located. This is quite useful when combined with a central shared repository.

So how big is this space saving? When I converted JHBuild to bzr, the repository data totalled 10MB, the branch data 100KB and a checkout 1.4MB.

So to publish a second branch without the use of shared repositories means another 10MB of storage (a bit more if I include a checkout at the published location). If I use shared repositories, the cost of the second branch is 100KB plus an amount proportional to the size of the changes I make on that branch. So for many projects, the cost of publishing another branch is lost in the noise.
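
As a rough worked example, the arithmetic can be written down in Python using the figures quoted above. The function name is invented for this sketch, and it deliberately ignores checkouts and the changes unique to each branch.

```python
REPO_KB = 10 * 1024   # repository data: about 10MB
BRANCH_KB = 100       # branch bookkeeping: about 100KB


def published_size_kb(branches, shared):
    """Approximate storage for `branches` published branches.

    Ignores checkouts and the extra revisions unique to each branch.
    """
    if shared:
        # One copy of the repository data, plus bookkeeping per branch.
        return REPO_KB + branches * BRANCH_KB
    # Every branch carries its own copy of the repository data.
    return branches * (REPO_KB + BRANCH_KB)


for count in (1, 2, 5):
    print(count, published_size_kb(count, shared=False),
          published_size_kb(count, shared=True))
```

With these numbers, the shared layout grows by about 100KB per extra branch, while the unshared layout grows by a little over 10MB each time.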

po/LINGUAS

One issue that was mentioned as a Gnome Goal was to switch packages to use a po/LINGUAS file.

The idea makes sense: translators only need to edit a simple text file to add a new translation to an application, rather than having to modify the configure.in/configure.ac file without breaking things. Unfortunately, the suggested way of supporting this is a pretty big hack. A better long-term solution would be to use the upstream gettext macros and po/Makefile.in.in infrastructure.

For a Gnome module that doesn’t use intltool, the following steps should work.

  1. Make sure the module is being built with Automake 1.8 or 1.9. If it isn’t, upgrade to 1.9.
  2. Create an m4 subdirectory in your project if it doesn’t exist, add it in CVS and then create and add an m4/.cvsignore file (there are a number of files that will get created here by gettext that you don’t want to check into CVS).
  3. Mark the m4 subdirectory as the macro dir in the configure.ac file:
    AC_CONFIG_MACRO_DIR([m4])
    

    And make sure that the macro dir gets checked if the makefile reruns aclocal:

    AC_SUBST([ACLOCAL_AMFLAGS], ["-I $ac_macro_dir \${ACLOCAL_FLAGS}"])
    
  4. If you aren’t using the gnome-common autogen.sh script, you will also need to make sure that aclocal is called with “-I m4”. If you are using the gnome-common script, then this will happen automatically.
  5. Remove the AM_GLIB_GNU_GETTEXT call from configure.ac and replace it with:
    AM_GNU_GETTEXT([external])
    AM_GNU_GETTEXT_VERSION([0.14.1])
    
  6. If you aren’t using the gnome-common autogen.sh script, change the call to glib-gettextize to autopoint, and make sure it gets run before aclocal (again, unneeded if you are using the gnome-common script).
  7. Now rerun autogen.sh so that autopoint gets run. This should result in a number of files getting created under m4, and some new files under po.
  8. Copy po/Makevars.template to po/Makevars and customise the variables. You might want to set DOMAIN to $(GETTEXT_PACKAGE) rather than $(PACKAGE). Add this new file in CVS.
  9. Update po/LINGUAS from the ALL_LINGUAS variable in configure.ac, and then remove the ALL_LINGUAS definition. Add po/LINGUAS to CVS.
  10. Finally update m4/.cvsignore and po/.cvsignore to ignore the new generated files.
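
Step 9 is easy to do by hand, but for modules with long language lists it can be scripted. The following Python sketch is one way to do it, under the assumption that configure.ac sets the variable with a double-quoted ALL_LINGUAS="..." assignment; both function names are invented here, and the leading comment line relies on gettext treating lines that start with # in po/LINGUAS as comments.

```python
import re


def extract_all_linguas(configure_text):
    """Return the language codes from an ALL_LINGUAS="..." assignment.

    Assumes a double-quoted shell assignment, possibly spanning lines.
    """
    match = re.search(r'^\s*ALL_LINGUAS="([^"]*)"', configure_text,
                      re.MULTILINE)
    if match is None:
        return []
    return match.group(1).split()


def format_linguas(languages):
    """Render the codes one per line, sorted, below a comment line."""
    lines = ["# Please keep this list sorted alphabetically."]
    lines.extend(sorted(languages))
    return "\n".join(lines) + "\n"
```

Feeding the contents of configure.ac to extract_all_linguas and writing the result of format_linguas to po/LINGUAS covers the first half of the step; removing the ALL_LINGUAS definition is still a manual edit.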

As I said at the start, this change is only appropriate for apps not using intltool, since intltool overwrites the po/Makefile.in.in file with an incompatible version.

To get things working with intltool, I believe it would make most sense to modify intltool as follows:

  • Make intltool provide some commands that are command line argument compatible with xgettext and msgmerge.
  • Make IT_PROG_INTLTOOL set XGETTEXT and MSGMERGE to the corresponding intltool commands.
  • Don’t overwrite po/Makefile.in.in.
  • If additional makefile rules are needed in the po subdirectory, install a po/Rules-intltool file containing them. The gettext M4 macros will include them into the resulting Makefile.

Using Tailor to Convert a Gnome CVS Module

In my previous post, I mentioned using Tailor to import jhbuild into a Bazaar-NG branch. In case anyone else is interested in doing the same, here are the steps I used:

1. Install the tools

First create a working directory to perform the import, and set up tailor. I currently use the nightly snapshots of bzr, which did not work with Tailor, so I also grabbed bzr-0.7:

$ wget http://darcs.arstecnica.it/tailor-0.9.20.tar.gz
$ wget http://www.bazaar-ng.org/pkg/bzr-0.7.tar.gz
$ tar xzf tailor-0.9.20.tar.gz
$ tar xzf bzr-0.7.tar.gz
$ ln -s ../bzr-0.7/bzrlib tailor-0.9.20/bzrlib

2. Prepare a local CVS Repository to import from

The import will run a lot faster with a local CVS repository. If you have a shell account on window.gnome.org, this is trivial to set up:

$ mkdir cvsroot
$ cvs -d `pwd`/cvsroot init
$ rsync -azP window.gnome.org:/cvs/gnome/jhbuild/ cvsroot/jhbuild/

3. Check for history inconsistency

As I discovered, Tailor will bomb out part way through the import if time goes backwards at some point in your CVS history. The quick fix for this is to directly edit the RCS ,v files to correct the dates. Since you are working with a copy of the repository, there isn’t any danger of screwing things up.

I wrote a small program to check an RCS file for such discontinuities:

http://www.gnome.org/~jamesh/code/backward-time.py

When editing the dates in the RCS files, make sure that you change the dates in the different files in a consistent way. You want to make sure that revisions in different files that are part of the same changeset still have the same date after the edits.

4. Create a Tailor config file

Here is the Tailor config file I used to import jhbuild:

#!
"""
[DEFAULT]
verbose = True
projects = jhbuild
encoding = utf-8

[jhbuild]
target = bzr:target
start-revision = INITIAL
root-directory = basedir/jhbuild.cvs
state-file = tailor.state
source = cvs:source
subdir = .
before-commit = remap_author
patch-name-format =

[bzr:target]
encoding = utf-8

[cvs:source]
module = jhbuild
repository = basedir/cvsroot
encoding = utf-8
"""

def remap_author(context, changeset):
    if '@' not in changeset.author:
        changeset.author = '%s <%s@cvs.gnome.org>' % (changeset.author,
                                                      changeset.author)
    return True

The remap_author function at the bottom maps the CVS user names to something closer to what bzr normally uses.
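
To see the effect of the hook outside of Tailor, it can be exercised with a stand-in object. In the sketch below, FakeChangeset is a hypothetical substitute that models only the author attribute (the real object Tailor passes has more to it), the function body is copied from the config so the snippet runs on its own, and the example.org address is made up.

```python
class FakeChangeset:
    """Hypothetical stand-in for Tailor's changeset; only `author` is modelled."""

    def __init__(self, author):
        self.author = author


def remap_author(context, changeset):
    # Same logic as the hook in the config file above.
    if '@' not in changeset.author:
        changeset.author = '%s <%s@cvs.gnome.org>' % (changeset.author,
                                                      changeset.author)
    return True


bare = FakeChangeset('jamesh')
remap_author(None, bare)
print(bare.author)   # jamesh <jamesh@cvs.gnome.org>

full = FakeChangeset('James Henstridge <james@example.org>')
remap_author(None, full)
print(full.author)   # unchanged, since it already contains an '@'
```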

5. Perform the conversion

Now it is possible to run the conversion:

$ python tailor-0.9.20/tailor -vv --configfile jhbuild.tailor

When the conversion is complete, you should be left with a bzr branch containing the history of the HEAD branch from CVS. Now is a good time to check that the converted bzr branch looks sane.

6. Use the new branch

Rather than using the converted branch directly, it is a good idea to branch off it and do the development there:

$ bzr branch jhbuild.cvs jhbuild.dev

The advantage of doing this is that you have the option of rsyncing in new changes to the CVS repository and running tailor again to incrementally import them. You can then merge those changes to your development branch.

Revision Control Migration and History Corruption

As most people probably know, the Gnome project is planning a migration to Subversion. In contrast, I’ve decided to move development of jhbuild over to bzr. This decision is a bit easier for me than for other Gnome modules because:

  • No need to coordinate with GDP or GTP, since I maintain the docs and there are no translations.
  • Outside of the moduleset definitions, the large majority of development and commits are done by me.
  • There aren’t really any interesting branches other than the mainline.

I plan to leave the Gnome module set definitions in CVS/Subversion though, since many people help in keeping them up to date, so leaving them there has some value.

I performed a test conversion using Tailor 0.9.20. My first attempt at performing the conversion failed part way through. Looking at what had been imported, it was apparent that the first few changesets created weren’t the first changesets I’d created in CVS. What was weirder still was the dates on those changesets: they were dated 1997, while I hadn’t started jhbuild until 2001.

It turns out that it was caused by clock skew on the CVS server back in September 2003, so the revision dates for a few files are not monotonic. I did the quick fix of directly editing the RCS files (I was working off a local copy of the repo), which allowed the conversion to run through to completion. The problem has been reported as bug #37 in Tailor’s bug tracker.

This made me a bit worried about whether the CVS to Subversion conversion script being used for the rest of the Gnome modules was also vulnerable to this sort of clock skew problem. Sure enough, it was, and the first real changeset of jhbuild had been imported as revision 323.

I did a bit more checking of the CVS repository, and found that there were 98 other modules exhibiting clock skew in their revision history, spread over 1245 files (some files with multiple points of skew). I’ve only checked the SVN test conversions of some of these modules, but all the ones I checked exhibited the same type of corruption.

It is going to be a fair bit of work cleaning it all up before the final conversion.