Local caching: A major distinguishing difference between VCSes

An interesting difference between the major VCSes concerns how much information is cached locally when one obtains a copy of the source code using the VCS. The amount of information obtained by default when one performs a checkout or clone with each of the five major VCSes is:

  • cvs – a working copy of the specified version of the source code files, plus information about which revision was checked out and where the repository is located.
  • svn – same as cvs, plus an extra copy of the specified version of the source code files.
  • bzr, hg – same as svn, plus the remainder of the history of the current branch (i.e. cvs, plus a copy of the complete history of the current branch)
  • git – same as bzr & hg, plus the full history of all other branches in the repository as well.

Note that some systems have options to cache less than the default.

Benefits of local caching

The additional cached information can serve multiple purposes; for example, making operations faster (by using the local disk instead of the network), or allowing offline use. Nearly all operations in cvs other than edits of the working copy require network connectivity to the repository. In subversion, diffs between the version you checked out and your current working copy are fast due to the extra copy that was checked out, but other operations still require network connectivity. In bzr & hg, diffs against versions older than the checkout version, reverting to an older version, and getting a log of the changes on the branch can all be fast operations and can be done offline. In git, even comparing to a different branch, switching to a different branch, or getting a log of changes in any branch can be done quickly and offline.
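As a concrete sketch (a throwaway repository built on the spot; all paths are hypothetical), here are a few of the operations git can answer entirely from the local clone, with no network access at all:

```shell
# Sketch: git operations answered entirely from local history.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email dev@example.com
git config user.name Dev

# Two quick commits to give us some history.
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# None of these touch the network; they only read .git:
git log --oneline            # full history of the branch
git diff HEAD~1 -- file.txt  # diff against an older revision
old=$(git show HEAD~1:file.txt)  # contents of an old version
echo "file.txt at HEAD~1 was: $old"
```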

This local caching also underscores another point: cvs and svn have limited utility when working offline. bzr, hg, and git allow quite a bit of offline use…in fact, it even makes sense to commit while offline (and then merge the local commit(s) with remote repositories later). Thus, one thinks of the local cache in such cases as being a repository itself. This has further ramifications: since the local cache is a repository, it makes sense to update from a different remote repository than the one you got your checkout/clone from, and to merge/push your changes to yet another location. This is the essence of being a VCS with distributed capabilities. This can be taken to the pathological extreme (resulting in the kernel development model), or one can use a more standard centralized model that simply has impressive offline capabilities (which is how Xorg runs), or one can pick something in between that suits them.

One common case where someone might want to pick something in the middle is when an organization has multiple development sites (perhaps one in the US and one in Europe) and developers at the remote site would like to avoid the penalties associated with slow network connections. In such a case, there can be two “central” repositories which developers update from and commit to, with occasional merges between these centers. It can also be useful when developers have gone to a conference and want to work on the software and collaborate even though they don’t have connectivity to the “real” repository.
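The commit-while-offline workflow can be sketched in a few commands (a throwaway pair of repositories stands in for the “central” server and a developer's machine; all paths are hypothetical):

```shell
# Sketch: the clone is itself a repository, so commits work offline
# and can be published to the shared repository later.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the "central" repository.
git init -q --bare central.git

# The developer's clone carries full history and accepts commits offline.
git clone -q central.git work 2>/dev/null
cd work
git config user.email dev@example.com
git config user.name Dev

# While offline: the commit lands in the local repository only.
echo hello > file.txt
git add file.txt
git commit -q -m "offline work"

# Back online: publish the local commits to the shared repository.
git push -q origin HEAD
```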

Another side effect of local caches being a repository is that it becomes extremely simple to mirror repositories.
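One minimal way to see this (assuming a reasonably modern git; the repository names are hypothetical) is that a bare clone is already a complete mirror, and keeping it current is a single fetch:

```shell
# Sketch: mirroring a repository with a bare clone.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the upstream repository to be mirrored.
git init -q upstream
git -C upstream config user.email dev@example.com
git -C upstream config user.name Dev
echo v1 > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "first"

# A bare clone carries the full history: an instant mirror.
git clone -q --bare upstream mirror.git

# Later updates are one fetch; the refspec copies every branch.
echo v2 > upstream/file.txt
git -C upstream commit -q -am "second"
git -C mirror.git fetch -q origin '+refs/heads/*:refs/heads/*'
```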

Another interesting observation is that git allows the most offline use. There have been many times when I’ve wanted to work offline with cvs or svn projects (I’ve even resorted to rsyncing cvs repositories, when I had access, as a cheap hack to try to achieve this), and many times when I wished I had a copy of other branches and older versions while offline. bzr and hg are leaps and bounds better than cvs and svn in this regard, but they only partially solve the problem; using them would mean that I’d either need to manually do a checkout for every branch, be online, or do without information potentially useful to me when I don’t have network connectivity. This is especially important considering that VCSes with distributed capabilities make merging easy, which encourages the use of more branches. Looking at the comparison this way, I’d really have to say that the extensive offline capabilities of git are a killer feature. I’m confused why other VCSes haven’t adopted as much local caching as git does (though I read somewhere that bzr may be considering it).

Disk usage — Client

When people see this list of varying amounts of local caching, they typically assume that disk usage is proportional to the amount of history cached, and thus believe that git will require hundreds of times the amount of disk space to get a copy of the source code…with bzr and hg being somewhere in between. Reality is somewhat surprising; from my tests, the sizes of checkouts or clones from the various VCSes rank in this order (with approximate sizes relative to cvs checkouts in parentheses):

  • cvs (1)
  • git (1.92)
  • svn (2)
  • hg (2.05)
  • bzr (3.2) [*]

The main reason for git, hg, and bzr being so small relative to expectations is that source code packs well and these systems tend to be smart about handling metadata (information about the checkout and how to contact the server). However, there are some caveats here: my numbers (particularly for hg and bzr) aren’t based on studies as thorough as they should be, and the numbers have higher variance than you’d expect (it depends a lot on how well the history of your project packs, whether you have large files in the history that are no longer in the project, etc.) Also, while bzr and hg do automatic packing for the user, git requires the user to state when packing should be done. If the user never packs (i.e. never runs ‘git gc’) then the local repository can be much larger than a cvs or svn checkout. A basic rule of thumb is to just run ‘git gc’ after several commits, or whenever .git is larger than you think it should be.
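The effect of packing is easy to observe in a throwaway repository (paths hypothetical): each commit leaves loose objects under .git/objects, and ‘git gc’ collapses them into a single compressed pack:

```shell
# Sketch: 'git gc' repacking loose objects into a pack file.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email dev@example.com
git config user.name Dev

# Each commit creates loose objects (commit, tree, blob).
for i in 1 2 3; do
  echo "revision $i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
done
loose_before=$(git count-objects | cut -d' ' -f1)

# Repack: the loose objects move into a compressed pack file.
git gc --quiet
loose_after=$(git count-objects | cut -d' ' -f1)
packs=$(ls .git/objects/pack/*.pack | wc -l)
echo "loose before: $loose_before, loose after: $loose_after, packs: $packs"
```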

I’ve done lots of git imports (git-cvsimport and git-svn make this easy), comparing dozens of cvs and svn repository checkouts to git ones. So I feel fairly confident about my number for git above. It does vary pretty wildly, though; e.g. for metacity it’d be 1.51 while for gtk+ it’d be 2.56[**]; I’ve seen ranges between about 0.3 and 6.0 on real-world projects, so the 1.92 is just an overall mean. The hg numbers were based strictly on converting git imports of both metacity and gtk+ to hg and averaging the relative differences (using the recent ‘hg convert’ command). My bzr number was based on importing metacity with bzr-svn and with git-svn and comparing the relative sizes (bzr-svn choked on gtk+, and I couldn’t get tailor to convert the existing git gtk+ repo to bzr).

[*] I did these tests before bzr-0.92 was out; it has a new experimental (and non-default) format that claims to drop this number significantly. I hear this new format is planned to become the default (possibly with a few tweaks) in a few months, so this is a benchmark that should be redone early next year. However, the existing number does show that bzr is already very close to an svn checkout in size despite bringing along lots more information.

[**] For those wanting to duplicate, I ignored the space taken by the .git/svn directory, since that information is not representative of how much space a native git repository would take. It is interesting to note, though, that .git/svn/tags is ridiculously huge; to the point that I think it’s got to be a bug in the git-svn bridge.

Disk usage — “Central” Server

If what concerns you is the size of the repository on the central server, then the numbers are more dramatic. Benchmarks I’ve seen put git at about 1/3 the size of CVS and 1/10 the size of svn.

UPDATE: A number of people pointed me to the new named branches feature in hg that I was unaware of, which looks like it puts hg in the same category as git. Cool!

14 thoughts on “Local caching: A major distinguishing difference between VCSes”

  1. A question you might research is why git requires that manual packing step and whether they plan to fix it; it’s a pretty lame “implementation detail leak” – why wouldn’t it just do the packing periodically, or in the background, or whatever? Even something lame like “pack every 10th commit” would seem fine. Make it configurable with manual pack as a (non-default) option 😉

  2. Note that with Bazaar it is possible to have a branch without a working tree. If you want to keep copies of several branches locally, this is worth doing.

    Assuming that the branches are stored in a single shared repository, the overhead will be ~40K plus the size of the changes (so if it is a branch that is fully merged into another one you have in the repo, it is just 40K).

    I guess the difference between this and git is how easy the workflow is. With git it happens by default, while with Bazaar you’d need to think about it (although there are plugins to make this sort of thing easier).

  3. Havoc: Yeah, I agree that it’s pretty lame. It’s definitely not the only ugly UI wart in git. But it wasn’t really the point of this particular post, so I only mentioned that detail in passing.

    I suspect the reason is that members of the git community care about things measured even down to the subsecond level, and automated packing means git takes a second or a few at a time when users could be doing something else. By deferring it, they allow the user to specify when they have a bit of time to burn. While I understand the desire to not give users ugly pauses for something like automatic garbage collection (something that has annoyed me at times in emacs and various Java programs), I think the tradeoff of adding a few seconds (if that!) to various benchmarks in order to simplify things for new users is well worth it. Besides, they could always add some arcane hidden option to use the old (current) behavior. I’m just guessing my opinion would be in the minority in the git community; a higher learning curve to save a few seconds here and there in the future seems to be worth it to them.

    James: Yeah, I’ve been reading about the shared repositories in bzr recently. They sound interesting, and sound like they at least make similar functionality possible. But it’s not a first-rate supported workflow (possibly the only one missing in bzr’s large list of supported workflows). Personally, I find this to be a deficiency in bzr. It’s sad too, because bzr does have many usability advantages, but everything I read about bzr seems to suggest that such a workflow has specifically been decided to be outside the scope of bzr.

  4. Oh, and in case anyone’s wondering: one of the two other posts I already have typed up mostly rags on git, which should balance out this one that praises it. You’ll have to wait two weeks, but it’ll come. 🙂

  5. FYI, automatic packing is in git master branch already [1]. It should be available in 1.5.4.

    [1] commit d4bb43ee273528064192848165f93f8fc3512be1

  6. are you sure that hg doesn’t copy _local_ branches when fetching the repo?

    if that is the case, then I think hg is just as good as git, just that it doesn’t force you to run any “hg gc” once in a while 😀

  7. Other interesting topics:
    patch management, stgit.
    visualization, gitweb, gitk.

    I use stgit continuously. stgit is designed for managing patch series like you see posted to lkml. It allows you to edit them, regenerate the series, and rebase to a new kernel version. quilt is similar and not based on a specific VCS. If your patches need public review you want to be using one of these systems.

    The visualization tools are nice too.

    Another important git feature is bisect. Bisect does a binary search in the commits to track down the one that caused the bug. This is a very useful feature, I’m not sure if any other VCS has it.

    I believe the latest git versions nag you at commit time if a ‘git gc’ would help. ‘git gc’ isn’t really critical; who cares if git is using 100MB more disk than it has to. If it bothers you, type ‘git gc’. It doesn’t significantly change the performance of git, it just reduces the disk space needs.

    Another fairly unique git feature is remotes. ‘git remote add linus git://linusrepo’ You can add all of the kernel repositories from other developers to your remotes. ‘git fetch <remote>’ pulls down the changes. What is neat about this is that git figures out all of the common commits between the repos and only pulls down the unique changes. After you fetch them, use ‘git merge’ to bring them into your local branch.

    If using stgit:
    git fetch linus (pull down the changes from Linus’ repo)
    stg rebase linus/master (merge in linus’ changes and fix up your patches to apply on them)
    stg export (export the patch set to disk or directly email it to lkml)

    Another git trick:
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git digispeaker
    cd digispeaker
    git config remote.origin.url http://git.digispeaker.com/projects/digispeaker-kernel.git
    git pull

    In this trick you pull in all of the kernel history from the high speed servers at kernel.org (or your local server). You then switch servers and pull in only my changes from my slow hosting provider.

    Of course if you already have a local kernel repo just add digispeaker as a remote and it will download in a few seconds.
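    The bisect feature mentioned above can be driven end to end in a throwaway repository (everything here is hypothetical: ten generated commits, where commit 7 introduces the “bug”). ‘git bisect run’ re-tests automatically at each step of the binary search:

```shell
# Sketch: 'git bisect run' finding the commit that introduced a bug.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email dev@example.com
git config user.name Dev

# Ten commits; from commit 7 onward, file.txt contains the "bug".
for i in 1 2 3 4 5 6 7 8 9 10; do
  if [ "$i" -ge 7 ]; then echo "BUG $i" > file.txt; else echo "ok $i" > file.txt; fi
  git add file.txt
  git commit -q -m "commit $i"
done

# Mark the endpoints: HEAD is bad, the root commit is known good.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" > /dev/null

# 'bisect run' executes the test at each step; exit 0 = good, non-zero = bad.
bad=$(git bisect run sh -c '! grep -q BUG file.txt' \
      | grep 'is the first bad commit' | cut -d' ' -f1)
bad_subject=$(git log -1 --format=%s "$bad")
git bisect reset > /dev/null
echo "first bad commit: $bad_subject"
```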

  8. Another interesting thing with git: its “object” longevity (here object is commit and file content data)

    With usual usage, git-gc’s packing step only compresses; it removes nothing. This means that even objects you have removed entirely (say, changes from a branch that you never merged in, and then deleted) don’t expire. Neither the future auto-packing “git gc --auto” nor the standard packing “git gc” will remove these objects. There is a second step that needs to be taken: “git gc --prune”. Even then, only unreachable objects will be removed, and git has another feature: the reflog. The reflog is a very special “branch” in that it records each position of HEAD over the last 30 days (by default). So removed branches won’t actually be pruned until 30 days have passed and the last reference to them is finally released.

  9. Addendum: Such objects are of course never transferred out of the repository via pushes or other means.

  10. engla: Well, similar to how you can pass the --prune flag to git gc, you can also run
    git reflog expire --all --expire-unreachable=0
    to get rid of deleted stuff (I’m pretty sure it’ll work for your deleted-branches case, and it’s also handy for immediately getting rid of “deleted” information when using git-filter-branch, at least older versions of it.)

    Anyway, I agree that object longevity is interesting in git, but it’s really not anything that I think most users would need to concern themselves with.
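    That expire-then-prune sequence can be demonstrated in a throwaway repository (paths hypothetical): a commit on a deleted, unmerged branch survives ‘git gc --prune=now’ while the reflog still references it, and only disappears once the reflog entries are expired:

```shell
# Sketch: object longevity - a deleted branch's commit survives gc
# until the reflog entries referencing it are expired.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email dev@example.com
git config user.name Dev

echo base > file.txt
git add file.txt
git commit -q -m base
main=$(git symbolic-ref --short HEAD)

# Commit on a topic branch, then delete the branch unmerged.
git checkout -q -b topic
echo topic > file.txt
git commit -q -am "topic work"
sha=$(git rev-parse HEAD)
git checkout -q "$main"
git branch -D topic > /dev/null

# The HEAD reflog still references the commit, so gc keeps it...
git gc --quiet --prune=now
type_before=$(git cat-file -t "$sha")   # still an object of type "commit"

# ...but once the reflog entries are expired, it can be pruned.
git reflog expire --all --expire-unreachable=now
git gc --quiet --prune=now
if git cat-file -e "$sha" 2>/dev/null; then gone=no; else gone=yes; fi
echo "before expire: $type_before, after expire+gc: gone=$gone"
```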

  11. If git remotes are just shortcuts for URLs, hg supports that. There’s no command for it; you have to edit .hg/hgrc, adding “[paths] \n linus = http://linusrepo”.

    Every DVCS only pulls the unique changes. It’s like the definition of a DVCS.
