Local caching: A major difference between VCSes

An interesting difference between the major VCSes concerns how much information is cached locally when one obtains a copy of the source code using the VCS. The amount of information obtained by default when one performs a checkout or clone with each of the five major VCSes is:

  • cvs – a working copy of the specified version of the source code files, plus information about which revision was checked out and where the repository is located.
  • svn – same as cvs, plus an extra copy of the specified version of the source code files
  • bzr, hg – same as svn, plus the remainder of the history of the current branch (i.e. cvs, plus a copy of the complete history of the current branch)
  • git – same as bzr & hg, plus the full history of all other branches in the repository as well.

Note that some systems have options to cache less than the default.

Benefits of local caching

The additional cached information can serve multiple purposes, such as making operations faster (by using the local disk instead of the network) or allowing offline use. For example, nearly all operations in cvs other than edits of the working copy require network connectivity to the repository. In subversion, diffs between the version you checked out and your current working copy are fast due to the extra copy that was checked out, but other operations still require network connectivity. In bzr & hg, diffs against versions older than the checkout version, reverting to an older version, and getting a log of the changes on the branch can all be fast operations and can be done offline. In git, even comparing to a different branch, switching to a different branch, or getting a log of changes in any branch can be done quickly and offline.
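The comparison above can be condensed into a small lookup table. This is only a toy sketch of the claims made in this post, not a model of the tools themselves; the numeric levels and the operation names are assumptions made up for illustration.

```python
# Toy model of the default local cache per VCS, as described in this post.
# The numeric levels and operation names are illustrative assumptions.
CACHE_LEVEL = {
    "cvs": 0,  # working copy plus checkout metadata
    "svn": 1,  # plus a pristine copy of the checked-out version
    "bzr": 2,  # plus the full history of the current branch
    "hg": 2,
    "git": 3,  # plus the full history of all other branches
}

# Smallest cache level at which each operation needs no network access.
NEEDED_LEVEL = {
    "edit working copy": 0,
    "diff against checked-out version": 1,
    "diff against older version": 2,
    "revert to older version": 2,
    "log of current branch": 2,
    "diff against another branch": 3,
    "switch branch": 3,
    "log of any branch": 3,
}


def works_offline(vcs, operation):
    """Return True if the post describes `operation` as possible offline."""
    return CACHE_LEVEL[vcs] >= NEEDED_LEVEL[operation]


for vcs in CACHE_LEVEL:
    offline = [op for op in NEEDED_LEVEL if works_offline(vcs, op)]
    print(f"{vcs}: {len(offline)} of {len(NEEDED_LEVEL)} operations offline")
```

The table only reflects default behaviour; as noted, some systems have options to cache less.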

This local caching also brings up another point: cvs and svn have limited utility when working offline. bzr, hg, and git allow quite a bit of offline use…in fact, it even makes sense to commit while offline (and then merge the local commit(s) with the remote repository later). Thus, one thinks of the local cache in such cases as being a repository itself. This has ramifications as well. Since the local cache is a repository, it makes sense to think of updating from a different remote repository than the one you got your checkout/clone from, and of merging/pushing your changes to yet another location. This is the essence of being a VCS with distributed capabilities. This can be taken to the pathological extreme (resulting in the kernel development model), or one can use a more standard centralized model that simply has impressive offline capabilities (which is how Xorg runs), or one can pick something in between that suits them. One common case where someone might want to pick something in the middle is when an organization has multiple development sites (perhaps one in the US and one in Europe) and developers at the remote site would like to avoid the penalties associated with slow network connections. In such a case, there can be two “central” repositories which developers update from and commit to, with occasional merges between these centers. It can also be useful for developers away at a conference who want to work on the software and collaborate even when they don’t have connectivity to the “real” repository.
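The two-site setup can be sketched with a deliberately simplified model in which a repository is nothing more than a named set of commit ids. The `Repo` class, the repository names, and the commit ids below are all hypothetical, and a real merge does far more than take a set union; the point is only that every copy can be pulled from or pushed to.

```python
class Repo:
    """Toy repository: just a named set of commit ids (an illustrative assumption)."""

    def __init__(self, name, commits=()):
        self.name = name
        self.commits = set(commits)

    def clone(self, name):
        # A clone starts with a full copy of the history.
        return Repo(name, self.commits)

    def commit(self, commit_id):
        # Committing needs no network: it only touches the local set.
        self.commits.add(commit_id)

    def pull(self, other):
        # Bring in everything the other repository has.
        self.commits |= other.commits

    def push(self, other):
        # Send everything we have to the other repository.
        other.commits |= self.commits


us = Repo("us-central", {"c1", "c2"})
eu = us.clone("eu-central")   # second "central" repository
laptop = eu.clone("laptop")   # a developer's local clone

laptop.commit("c3")           # work done offline
laptop.push(eu)               # later, publish to the nearby center
us.commit("c4")               # meanwhile, work lands at the other site

eu.pull(us)                   # occasional merge between the centers
us.pull(eu)
print(sorted(us.commits), sorted(eu.commits))
```

Nothing in the model distinguishes a "central" repository from a laptop clone, which is why either arrangement is possible.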

Another side effect of the local cache being a repository is that it becomes extremely simple to mirror repositories.

Another interesting observation to make is that git allows the most offline use. There have been many times where I’ve wanted to work offline with cvs or svn projects (I’ve even resorted to rsyncing cvs repositories when I had access as a cheap hack to try to achieve this), and many times that I wished I had a copy of other branches and older versions while offline. bzr and hg are leaps and bounds better than cvs and svn in this regard, but they only partially solve this problem; using them would mean that I’d need to manually do a checkout for every branch, be online, or do without information potentially useful to me when I don’t have network connectivity. This is especially important considering that VCSes with distributed capabilities make merging easy, which encourages the use of more branches. Looking at the comparison this way, I’d really have to say that the extensive offline capabilities of git are a killer feature. I’m confused why other VCSes haven’t adopted as much local caching as git does (though I read somewhere that bzr may be considering it).

Disk usage — Client

When people see this list of varying amounts of local caching, they typically assume that disk usage is proportional to the amount of history cached, and thus believe that git will require hundreds of times the amount of disk space to get a copy of the source code…with bzr and hg being somewhere in between. Reality is somewhat surprising; from my tests, the size of a checkout or clone from the various VCSes would rank in this order (with some approximate relative sizes to cvs checkouts in parentheses):

  • cvs (1)
  • git (1.92)
  • svn (2)
  • hg (2.05)
  • bzr (3.2) [*]

The main reason for git, hg, and bzr being so small relative to expectations is that source code packs well and these systems tend to be smart about handling metadata (information about the checkout and how to contact the server). However, there are some caveats here: my numbers (particularly for hg and bzr) aren’t based on studies as thorough as they should be, and the numbers have a higher variance than you’d expect (it depends a lot on how well the history of your project can pack, whether you have large files in the history that are no longer in the project, etc.). Also, while bzr and hg do automatic packing for the user, git requires the user to state when packing should be done. If the user never packs (i.e. never runs ‘git gc’) then the local repository can be much larger than a cvs or svn checkout. A basic rule of thumb is to just run ‘git gc’ after several commits, or whenever .git is larger than you think it should be.

I’ve done lots of git imports (git-cvsimport and git-svn make this easy), comparing dozens of cvs and svn repository checkouts to git ones. So I feel fairly confident about my number for git above. It does vary pretty wildly, though; e.g. for metacity it’d be 1.51 while for gtk+ it’d be 2.56[**]; I’ve seen ranges between about 0.3 and 6.0 on real world projects, so the 1.92 is just an overall mean. The hg numbers were based strictly off of converting git imports of both metacity and gtk+ to hg and taking an average of the relative difference of those (using the recent ‘hg convert’ command). My bzr number was based off importing metacity with bzr-svn and with git-svn and comparing those relative sizes (bzr-svn choked on gtk+, and I couldn’t get tailor to convert the existing git gtk+ repo to bzr).

[*] I did these tests before bzr-0.92 was out, which has a new experimental (and non-default) format that claims to drop this number significantly. I hear this new format is planned to become the default (with possibly a few tweaks) in a few months, so this is a benchmark that should be redone early next year. However, the existing number does show that bzr is already very close to an svn checkout in size despite bringing lots more information.

[**] For those wanting to duplicate, I ignored the space taken by the .git/svn directory, since that information is not representative of how much space a native git repository would take. It is interesting to note, though, that .git/svn/tags is ridiculously huge; to the point that I think it’s got to be a bug in the git-svn bridge.
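For anyone repeating the measurement, here is a minimal sketch of how such a relative size might be computed. The post does not say which tool produced its numbers, so this is an assumption rather than the method used: it sums apparent file sizes in bytes (a tool like du reports disk blocks, which can differ), and the function names and the `exclude` parameter are made up for illustration.

```python
import os


def tree_size(root, exclude=()):
    """Sum the sizes in bytes of all regular files under `root`.

    `exclude` lists paths relative to `root` whose subtrees are skipped,
    for example ".git/svn".
    """
    skip = {os.path.normpath(os.path.join(root, e)) for e in exclude}
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories so os.walk never descends into them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in skip
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # count real files only
                total += os.path.getsize(path)
    return total


def relative_size(clone, baseline, exclude=()):
    """Size of `clone` (minus excluded subtrees) divided by size of `baseline`."""
    return tree_size(clone, exclude) / tree_size(baseline)


# Hypothetical usage, with made-up directory names:
# relative_size("metacity-git", "metacity-cvs", exclude=[".git/svn"])
```

Passing `exclude=[".git/svn"]` mirrors the choice described in the footnote above of ignoring the git-svn bookkeeping directory.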

Disk usage — “Central” Server

If what concerns you is the size of the repository on the central server, then the numbers are more dramatic. Benchmarks I’ve seen put git at about 1/3 the size of CVS and 1/10 the size of svn.

UPDATE: A number of people pointed me to the new named branches feature in hg that I was unaware of, which looks like it puts hg in the same category as git. Cool!

The foundation board of directors

Andrew posted some particularly insightful comments recently, explaining a lot about the dynamics of communities with respect to the elected directors in their communities. Over the years, I have seen some of the various pieces of behavior he describes in the GNOME community, but didn’t have the experience he does to be able to piece it together so lucidly.

One particular point I’ve seen is that the GNOME board of directors seems to get down on themselves often for not accomplishing as much as they wanted. That’s unfortunate. Sure, things could always be better (I’ve often gotten down on myself for not being able to code all the ideas I’ve had as quickly as I wanted to). But I think they’re doing an awesome job.

I’m also looking forward to the elections this year. It looks like we have a field of awesome candidates. I almost think that I don’t need to bother voting, because I’d be ecstatic with any 7 of the 10 that are running.

Adoption of various VCSes

There are a lot of Version Control Systems out there, and one of the biggest criteria in selecting one to use is who else uses it. I’ll try to quickly summarize what I have learned about the adoption of various VCSes. There are many people who know more than me, but here’s some of the bits that I’ve picked up.

Perceived adoption from lots of reading

I have read many blog posts, comparisons, tutorials, news articles, reader comments (in blogs and at news sites), and emails (including various VCS project archives) about version control systems. In doing so, it is clear to me that some are frequently mentioned and deemed worthy of comparison by others, while many VCSes seem so obscure that they only appear in comparisons at sites that attempt to be exhaustive or completely objective (e.g. at wikipedia). Here are the ones I hear mentioned more frequently than others:

First rung: cvs, subversion, bazaar-ng, mercurial, tla/baz, and git.

Though bazaar perhaps belongs in a rung below (more on that in a minute). There are also several VCSes that are still mentioned often, but not as often as the ones above:

Second rung: svk, monotone, darcs, codeville, perforce, clearcase, and bitkeeper.

tla/baz died a few years ago (with both developers and users mostly abandoning it for other systems, though I hear tla got revived for maintenance-only changes). Also, bazaar-ng really straddles these two levels rather than being in the upper one, but I was one of the early adopters and it has relatively strong support in the GNOME community so it’s more relevant to me. Perforce, clearcase, and bitkeeper are proprietary and thus irrelevant to me (other than as a comparison point).

Adoption according to project records

Of the non-dead open source systems, here’s a list of links to who uses them plus some comments on the links:

  • bazaar-ng – WhoUsesBzr – wiki page name is inconsistent; it should be “ProjectsUsingBzr” (compare to wiki page names below) :-). The page is also slightly misleading; they claim drupal as a user but my searches show otherwise (turns out to just be a developer with an unofficial mirror). Hopefully there aren’t other cases like this.
  • codeville – NoPage – I wasn’t able to find any list of projects using codeville anywhere. In fact, I wasn’t able to find any projects claiming to use it either. It must have shown up in other people’s comparisons on the basis of its interesting merge algorithm.
  • cvs – NoPage – I don’t have a good reference page, and it’d likely go out-of-date quickly. However, while CVS is no longer developed and projects are switching from CVS in droves these days, it wasn’t very many years ago that cvs was ubiquitous and a near universal standard. Nearly everyone familiar with at least one vcs is familiar with cvs, making it a useful reference point. Also, it still has a pretty impressive installed base; I’m even forced to use it occasionally in the open source world as well as every day at work.
  • darcs – ProjectsUsingDarcs – I strongly appreciate the included list of projects that stopped using their VCS (and why). Bonus points to darcs for not hiding anything.
  • git – ProjectsUsingGit
  • mercurial – ProjectsUsingMercurial – I like how they make a separate list for projects with synchronized repositories (bzr and svk ought to adopt this practice, and maybe others)
  • monotone – ProjectsUsingMonotone – I really like the project stats provided.
  • subversion – open-source-projects-using-svn – wiki page name isn’t ProjectsUsingSvn; couldn’t they read everyone else’s minds and realize that they needed such a name to fit in with the standard naming scheme? 😉
  • svk – ProjectsUsingSVK – claims WINE, KDE, and Ruby on Rails as users; my simple searches showed otherwise (likely svk developers just knew of developers from those projects hosting their own unofficial svk mirrors). I don’t know if their other claimed users are accurate or not; I only checked these three.

Some adoption pages point to both the project home page and the project repositories, which is very helpful. The other adoption wiki pages should adopt that practice too, IMHO.

Adoption by “Big” users

Looking at the adoption pages listed above, each of the projects other than svk and codeville seems to have lots of users. Mostly small projects, but most projects probably are small and it is also easier for small projects to switch to a new VCS. The real test is whether VCSes are also capable of supporting large projects. I’d like to compare on that basis, but I’m unwilling to investigate how big each listed project is. So, I’ll instead compare based on (a) whether I’ve heard of the project before and know at least a little about it, and (b) whether I think of the project as big. This results in the following list of “big” users of various VCSes:

  • bazaar-ng – This is kind of surprising, but Ubuntu is the only case matching my definition above. As an added surprise, they aren’t in bzr’s list of users. (samba and drupal only have some unofficial users; and in the case of samba, I know they also have unofficial git users. Official adoption only for my comparison purposes; otherwise GNOME and KDE would be in lots of lists.)
  • codeville – none
  • cvs – Used to be used by virtually everything. Many projects still haven’t moved on yet.
  • darcs – none of the projects listed match my definition of “big” above
  • git – linux kernel (and many related projects), much of freedesktop.org (including Xorg, HAL, DBUS, cairo, compiz), OLPC, and WINE
  • mercurial – opensolaris, mozilla (update: apparently mozilla hasn’t converted quite yet)
  • monotone – tough case. I would have possibly said none here, noting gaim, er, pidgin, as the closest but their stats suggest two projects (Xaraya and OpenEmbedded) are big…and that pidgin is bigger than I realized. I guess I’m changing my rules due to their cool use of stats.
  • subversion – KDE, GNOME, GCC, Samba, Python, and others
  • svk – none

Brief notes about each system

As a quick additional comparison point for those considering adoption, I’ll add some very brief notes about each system that I’ve gathered from my reading or experience with the system. I’ll try to list both a good point and a bad point for each.

  • Free/Open source VCSes
    • bazaar-ng (bzr) – Developed and evangelized by Canonical (backers of the Ubuntu distribution). Designed to be easy to use and distributed, and often gets praise for those features. It received a bit of a black eye in the early days for being horribly slow (it made cvs look like a speed demon), though I hear that the speed issues have received lots of attention and changes (and brief recent usage seems to suggest that it’s a lot better). Annoyingly, it provides misleading and less-than-useful results when passing a date to diff (the implemented behavior is well documented and apparently intentional, it’s just crap).
    • codeville – Designed by Bram Cohen (inventor of bittorrent). People seem to find the merge algorithm introduced by codeville interesting. Doesn’t seem to have been adopted much, though, and it even appeared to have died for a while (going a year and a half between releases, with other updates hard to find as well). Seems to be picking back up again.
    • cvs – The VCS that all other VCSes compare to, both because of its recent ubiquity and because its well known flaws are easy to leverage in promoting new alternatives. The developers working on cvs decided its existing flaws could not be fixed without a rewrite, and thus created a new system called subversion. cvs is inherently centralized.
    • darcs – Really interesting and reportedly easy-to-use research system written by David Roundy (some physicist at OSU) that is based on patches rather than source trees. I believe this allows, for example, merging between source trees that do not necessarily have common history (touted as an advanced cherry-picking algorithm that no other VCS can yet match). However, this design has an associated “doppelganger” bug that can cause darcs to become wedged and which requires care from the user to avoid. From the descriptions of this bug, it sounds like something any big project would trigger all the time (it’s an operation I’ve seen happen lots in my GNOME maintenance even on modestly sized projects like metacity.) However, developers apparently can avoid this bug if they know about it and take steps to actively avoid triggering it. I think this is related to “the conflict bug”, which can cause darcs to be slow on large repository merging, but am not sure.
    • git – Invented by Linus Torvalds (inventor of the linux kernel). It has amazed a lot of people (including me) with its speed, and there are many benchmarks out there that are pretty impressive. I’ve heard/seen people claim that it is at least an order of magnitude faster than all other VCSes they’ve tried (from people who then list nearly all the major VCSes people think of as fast among the list of VCSes they’ve tried). It also has lots of interesting advanced features. However, versions prior to 1.5 were effectively unusable, requiring superhuman ability to learn how to use. The UI warts are being hammered away and git 1.5 and later is much better usability-wise; it’s now becoming a usable system once users first learn and understand a few differences from other systems, despite its few remaining warts here and there. The online tutorials have transformed into something welcoming for new users, though the man pages (which double as the built-in “--help” system) still remind me more of academic research articles written for a community of existing experts than of user documentation. Also, no official port to windows (without cygwin) exists yet, though one is apparently getting close. Interestingly, git seems to be highly preferred as a VCS among those I consider low-level hackers.
    • GNU Arch (tla/baz) – Invented by Tom Lord (who also tried to replace libc with his own rewrite). Both tla and baz are dead now with developers and users having moved on, for the most part. Proponents of these systems (particularly Tom) loudly evangelized the merits of distributed version control systems, which probably backfired since tla/baz were so horribly awful in terms of usability, complexity, quirkiness, and speed that these particular distributed VCSes really didn’t have any redeeming qualities or even salvageable pieces. (baz was written as a fork designed to make a usable tla which was backward compatible to tla; the developers eventually gave up and switched to bzr since this was an impossible goal.) I really wish I had the part of my life back I wasted learning and using these systems. And no, I don’t care about impartiality when it comes to them.
    • mercurial (hg) – Written by Matt Mackall (linux kernel developer). Started two days after git, it was designed to replace bitkeeper as the VCS for the kernel. Thus, like git, it focused on speed. While not as fast as git in most benchmarks I’ve seen, it has received lots of praise for being easier to learn, having more accessible documentation, working on Windows, and still being faster than most other VCSes. The community behind mercurial seems to be a bit smaller, however: it doesn’t have nearly as many plugins as bzr or git (let alone cvs or svn). Also, it annoyingly doesn’t accept a date as an argument to diff, unlike all the other major VCSes.
    • monotone (mtn) – Maintained by Nathaniel Smith and Graydon Hoare (who I don’t know of from elsewhere). The main thing I hear about this system is its focus on authentication of history to verify repository contents and changes. These ideas influenced and were adopted by git and mercurial. On the con side, it appears getting an initial copy can take an extraordinarily large amount of time; for example, if you look at the developer site for pidgin you’ll note that they provide detailed steps on how to get a checkout of pidgin that involves bypassing monotone since it’s too slow to handle this on its own.
    • subversion (svn) – Designed by former cvs maintainers to “be a better cvs”. It doesn’t suffer from many of the same warts as CVS; e.g. commits are atomic, files can be renamed without messing up project history, changes are per-commit rather than per-commit-per-file, and a number of operations are much faster than in cvs. Most users (myself included) feel that it is much nicer than CVS. Like CVS, svn remains inherently centralized and has no useful merge feature. Unlike CVS, half the point of tagging is inherently broken in svn as far as I can tell[*] (you can’t pass a tag to svn diff; you have to search the log trying to find the revision from which the tag was created and then use whatever revision you think is right as the revision number in svn diff).
    • svk – Invented by Chia-liang Kao and now developed by Best Practical Solutions (some random company). Designed to use the subversion repository format but allow decentralized actions. I know little about their system and am hesitant to comment as I can’t think of any good comments I’ve heard (nor more than a couple bad ones.) However, on the light side of things, I absolutely love their SVKAntiFUD page. On that page, in response to the question “svk is built on top of subversion, isn’t it over-engineered and fragile?” an additional note to the answer (claimed to have been added in 2005) states that “Spaghetti code can certainly not be called over-engineered.” While the history page of their wiki suggests it has been there for at least a year, I’m guessing the maintainers don’t know about this comment and will remove it as soon as someone points it out to them.
  • Proprietary (i.e. included only for comparison purposes) VCSes
    • bitkeeper – A system developed by BitMover Inc., founded by Larry McVoy. Gained prominence from its usage for a few years by the linux kernel. “Free Use” (as in no monetary cost) of the system by open source projects was revoked when Andrew Tridgell started reverse engineering the protocol (by telnetting to a server and typing “help”). Most users of this system seem to like it technically, but the free/open source crowd understandably often disliked its proprietary nature. I haven’t used the system, but think of it as being similar to mercurial (though I don’t know for sure if that’s the best match).
    • clearcase – Developed by (the Rational Software division of) IBM. Clearcase is an exceptionally unusual VCS in that I’ve never heard anyone I know mention a positive word about it. Literally. They all seem to have stories about how it seems to hinder progress far more than it helps. There has to be someone out there that likes it (it seems to have quite a number of users for a proprietary VCS despite being exceptionally expensive), but for some reason I haven’t run across them. Very weird. I believe it is actually lock-based instead of either distributed or inherently centralized, meaning that only one person can edit any given file at a time on a given branch. Sounds mind-bogglingly crazy to me.
    • perforce – Developed by Perforce Software, Inc. It seems that users of the system generally like it technically, and it has a free-of-charge clause for open source software development. My rough feeling is that Perforce is like CVS or subversion, but has a number of speed optimizations over those two. It is apparently even worse than cvs or svn for offline working, making editing not-already-opened files in the working copy problematic and error-prone unless online.

The major VCSes

Based on everything above, I consider the following VCSes to be the “major” ones:

cvs, svn, bzr, hg, and git.

I’ll add an “honorable mention” category for monotone and darcs (which bzr nearly belongs in as well, but passes based on the Canonical backing and much higher than average support by developers within the GNOME community). These five VCSes are the ones that I’ll predominantly be comparing between in my subsequent posts.

Update

[*] Kalle Vahlman in the comments points out that you can diff against a tag in svn, though it requires using atrocious syntax and a store of patience:

As much as I agree with [the claim] that SVN is just a prettier CVS, [it] isn’t really true. You can [run]:

svn diff http://svn.gnome.org/svn/metacity/tags/METACITY_2_21_1 http://svn.gnome.org/svn/metacity/trunk

to get differences between the tag and current trunk. If it looks horribly slow to you, it’s because you are on a very fast connection. IT IS SO SLOW IT MAKES LITTLE KITTENS WEEP. But it is possible anyway.
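The command quoted above follows a regular pattern, which can be sketched as a small helper that builds the argument list. The function name is made up for illustration, and it assumes the repository uses the conventional layout with tags under tags/ and mainline development under trunk, as the quoted URLs do.

```python
def svn_tag_diff_command(repo_url, tag, branch="trunk"):
    """Build the argv for an `svn diff` between a tag and a branch.

    Assumes the conventional repository layout (tags/ and trunk).
    """
    base = repo_url.rstrip("/")
    return ["svn", "diff", f"{base}/tags/{tag}", f"{base}/{branch}"]


command = svn_tag_diff_command(
    "http://svn.gnome.org/svn/metacity", "METACITY_2_21_1"
)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run`, though since both arguments are URLs the comparison still has to go through the server.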

There are a number of other good posts in the comments too, pointing out project adoption cases I potentially missed and noting additional issues with some systems that I won’t be comparing later.

Starting to compare Version Control Systems

As I blogged about some time ago, I decided to spend some time learning and comparing various version control systems (VCSes for short). Of course, there are many version control system comparisons out there already, and I’ve read countless other sources as well (blogs, articles + comments, archived mailing list messages found in random google searches, etc.). While some of these sources have very interesting information, they still don’t answer all the questions I had; in fact, even the types of analysis typically performed in these comparisons don’t cover everything I wanted to see. Here are some of the questions I have been considering:

  • What are the most important VCSes to consider?
  • Why are VCSes hard to learn? If someone learns one VCS, how much lower is their learning curve for switching to another?
  • What are the most common pitfalls that users experience with each of the major VCSes? Are there similarities across systems in the mistakes that users make?
  • Why are some systems more widely adopted than others? Are there certain qualities that make some systems more likely to be adopted by certain groups and less likely by others?
  • Why do some users of inherently centralized systems claim that “distributed”[1] systems are harder to learn? Why do users of distributed systems claim that they are *not* harder to learn? Why are there similar questions between the various “distributed” systems?
  • Which VCS is the “best” for a given individual/group? More importantly, what are the important criteria and where do various VCSes shine?
  • Why is there so much misunderstanding between users of different systems?
  • To what extent does the truism that “all software sucks” apply to VCSes?
  • Typical stuff: Which is the fastest at operation X (and by how much)? Which provides the most useful output (why is it more useful)? Which has the best add-ons? Which has the most relevant features? Which has the best documentation (how much better is it)? Which has killer features missing in others? etc.

I’m still far from answering all of them. However, I have learned a few things, and I figured it’d be a useful exercise to bore everyone to death by writing up some of my thoughts on the subject. So I’ll be doing that occasionally. Some of the things I write up will have comparisons similar to what you’d see elsewhere (but with my own slant which focuses on what seems relevant to me), while a few will analyze the subject from an angle different than what I have been able to find in other comparisons. I have a few posts mostly written up already, and may find time to write up a couple more after those.

Obvious Disclaimers: I’m no expert and am additionally error-prone, so I’ll likely make mistakes in my posts. I also won’t make any claims to be objective, as I don’t think it’s even possible to fully achieve. I will aim to be “reasonably” objective…but note that I have an opinion that placing too high a priority on objectivity makes it impossible to achieve much of the full usefulness of a comparison, limiting what “reasonable” can mean.

[1] As I have mentioned before, I think this is a somewhat misleading term; unfortunately, I don’t have a good replacement. Maybe something like “multi-centered”?