Local caching: A major distinguishing difference between VCSes

An interesting difference between the major VCSes concerns how much information is cached locally when one obtains a copy of the source code using the VCS. The amount of information obtained by default when one performs a checkout or clone with each of the five major VCSes is:

  • cvs – a working copy of the specified version of the source code files, plus information about which revision was checked out and where the repository is located.
  • svn – same as cvs, plus an extra copy of the specified version of the source code files
  • bzr, hg – same as svn, plus the remainder of the history of the current branch (i.e. cvs, plus a copy of the complete history of the current branch)
  • git – same as bzr & hg, plus the full history of all other branches in the repository as well.

Note that some systems have options to cache less than the default.

Benefits of local caching

The additional cached information can serve multiple purposes; for example, making operations faster (by using the local disk instead of the network), or allowing offline use. For example, nearly all operations in cvs other than edits of the working copy require network connectivity to the repository. In subversion, diffs between the version you checked out and your current working copy is fast due to the extra copy that was checked out, but other operations still require network connectivity. In bzr & hg, diffs against versions older than the checkout version, reverting to an older version, and getting a log of the changes on the branch can all be fast operations and can be done offline. In git, even comparing to a different branch, switching to a different branch, or getting a log of changes in any branch can be done quickly and offline.

This local caching also pushes another point: cvs and svn have limited utility when working offline. bzr, hg, and git allow quite a bit of offline use…in fact, it even makes sense to commit while offline (and then merge the local commit(s) and remote repositories later). Thus, one thinks of the local cache in such cases as being a repository itself. This has ramifications as well. Since the local cache is a repository, it means that it makes sense to think of updating from a different remote repository than you got your checkout/clone from, and of merging/pushing your changes to yet another location. This is the essence of being a VCS with distributed capabilities. This can be taken to the pathological extreme (resulting in the kernel development model), or one can use a more standard centralized model that simply has impressive offline capabilities (which is how Xorg runs), or one can pick something inbetween that suits them. One common case where someone might want to pick something in the middle is when an organization has multiple development sites (perhaps one in the US and one in Europe) and developers at the remote site would like to avoid the penalties associated with slow network connections. In such a case, there can be two “central” repositories which developers update from and commit to, with occasional merges between these centers. It can also be useful with developers gone to a conference wanting to work on the software and collaborate even when they don’t have connectivity to the “real” repository.

Another side effect of local caches being a repository is that it becomes extremely simple to mirror repositories.

Another interesting observation to make is that git allows the most offline use. There have been many times where I’ve wanted to work offline with cvs or svn projects (I’ve even resorted to rsyncing cvs repositories when I had access as a cheap hack to try to achieve this), and many times that I wished I had a copy of other branches and older versions while offline. bzr and hg are leaps and bounds better than cvs and svn in this regard, but they only partially solve this problem; using them would mean that I’d either need to manually do a checkout for every branch, that I’ll have to be online, or that I’ll have to do without information potentailly useful to me when I don’t have network connectivity. This is especially important considering that VCSes with distributed capabilities make merging easy, which encourages the use of more branches. Looking at the comparison this way, I’d really have to say that the extensive offline capabilities of git is a killer feature. I’m confused why other VCSes haven’t adopted as much local caching as git does (though I read somewhere that bzr may be considering it).

Disk usage — Client

When people see this list of varying amounts of local caching, they typically assume that disk usage is proportional to the amount of history cached, and thus believe that git will require hundreds of times the amount of diskspace to get a copy of the source code…with bzr and hg being somewhere inbetween. Reality is somewhat surprising; from my tests, the size of a checkout or clone from the various VCSes would rank in this order (with some approximate relative sizes to cvs checkouts in parentheses):

  • cvs (1)
  • git (1.92)
  • svn (2)
  • hg (2.05)
  • bzr (3.2) [*]

The main reason for git, hg, and bzr being so small relative to expectations is that source code packs well and these systems tend to be smart about handling metadata (information about the checkout and how to contact the server). However, there are some caveats here: my numbers (particularly for hg and bzr) aren’t based off as thorough studies as they should be, and the numbers have a higher than you’d expect variance (depends a lot on how well history of your project can pack, whether you have large files in the history that are no longer in the project, etc.) Also, while bzr and hg do automatic packing for the user, git requires the user to state when packing should be done. If the user never packs (i.e. never runs ‘git gc’) then the local repository can be much larger than a cvs or svn checkout. A basic rule of thumb is to just run ‘git gc’ after several commits, or whenever .git is larger than you think it should be.

I’ve done lots of git imports (git-cvsimport and git-svn make this easy), comparing dozens of cvs and svn repository checkouts to git ones. So I feel fairly confident about my number for git above. It does vary pretty wildly, though; e.g. for metacity it’d be 1.51 while for gtk+ it’d be 2.56[**]; I’ve seen ranges between about 0.3 and 6.0 on real world projects, so the 1.92 is just an overall mean. The hg numbers were based strictly off of converting git imports of both metacity and gtk+ to hg and taking an average of the relative difference of those (using the recent ‘hg convert’ command). My bzr number was based off importing metacity with bzr-svn and with git-svn and comparing those relative sizes (bzr-svn choked on gtk+, and I couldn’t get tailor to convert the existing git gtk+ repo to bzr).

[*] I did these tests before bzr-0.92 was out, which has a new experimental (and non-default) format that claims to drop this number significantly. I hear this new format is planned to become the default (with possibly a few tweaks) in a few months, so this is a benchmark that should be redone early next year. However, the existing number does show that bzr is already very close to an svn checkout in size despite bringing lots more information.

[**] For those wanting to duplicate, I ignored the space taken by the .git/svn directory, since that information is not representative of how much space a native git repository would take. It is interesting to note, though, that .git/svn/tags is ridiculously huge; to the point that I think it’s got to be a bug in the git-svn bridge.

Disk usage — “Central” Server

If what concerns you is the size of the repository on the central server, then the numbers are more dramatic. Benchmarks I’ve seen put git at about 1/3 the size of CVS and 1/10 the size of svn.

UPDATE: A number of people pointed me to the new named branches feature in hg that I was unaware of, which looks like it puts hg in the same category as git. Cool!