Git User’s Survey 2009

The Git community is running its annual user survey again.

We would like to ask you a few questions about your use of the Git version control system. This survey is mainly to understand who is using Git, how and why.

The results will be published to the Git wiki and discussed on the git mailing list.

The survey will be open from 15 July until 15 September 2009.

Please devote a few minutes of your time to filling out this simple questionnaire; it will greatly help the Git community to understand your needs, what you like about Git, and of course what you don’t like about it.

The survey can be found here:
http://tinyurl.com/GitSurvey2009
http://www.survs.com/survey?id=2PIMZGU0&channel=Q0EKJ3NF54

GNOME DVCS Survey results

The GNOME DVCS (Distributed Version Control System) Survey closed
about a week and a half ago, with responses from 579 different people with
svn accounts. (There are 1083 people with commit access to
GNOME SVN, so this is about a 53% response rate.) The survey was
intended to collect data related to a possible move for the GNOME
project from SVN to a distributed version control system in 2009, thus
questions about svn were included despite the fact that it is not
distributed. The results of the survey are shown below.  (I got the data from Behdad; the scripts I used to generate the plots can be found here.)

Bias

The plots of the data I present simply cover all the questions —
twice. Once to show the percentages of respondents with each answer
for the specific question, then again to contrast how those who
answered a given question differently had differing rankings for the
various VCSes. So the plots are as neutral as I think is possible.

I also add some commentary of my own, analyzing the data and noting
items that surprised me (I had several predictions about how the
survey would turn out; many of my predictions were right but there
were a number of surprises for me too). I don’t think it’s possible
to make such commentary unbiased. In fact, since I noticed a clear
front-runner in looking at the results, I thought it most useful to
look at that particular system, so the majority of my comments focus
on it. If you do not want my bias, ignore my comments and draw your
own conclusions from the data.

Survey Questions

First, let’s remind everyone what the survey questions were:

  • Your GNOME SVN user id
  • Do you currently maintain any GNOME modules in SVN?
    • Yes, I maintain multiple modules
    • Yes, I maintain a single module
    • No, I am not a maintainer
  • Do you currently develop any GNOME modules in SVN?
    • Yes, I develop multiple modules
    • Yes, I develop a single module
    • No, I do not develop any modules
  • Do you commit to GNOME SVN?
    • Yes, I regularly commit to GNOME SVN
    • Yes, I sometimes commit to GNOME SVN
    • No, I do not commit to GNOME SVN myself
  • How do you best characterize your current GNOME SVN contributions?
    • I develop code
    • I write documentation
    • I test
    • I translate
    • Other

    (Edit: I wish the question, “In which ways do you characterize
    your current GNOME SVN contributions?” had also been asked.
    It would be really interesting to see the results of such a
    select-all-that-apply question.)

  • Which of the following distributed version control systems are you familiar with? (select all that apply)
    • bzr
    • git
    • hg
  • How do you best summarize which DVCS systems you use *regularly*? (select all that apply)
    • bzr
    • git
    • hg
  • How do you feel about GNOME changing version control system to one of bzr, git, or hg in 2009?
    • Not again! We just switched systems, like, yesterday (no)
    • No strong feeling, I’d use whatever is provided
    • What’s wrong with SVN? (why?)
    • I do not care
    • Please do! Anything is better than svn (except for cvs of course)
    • Other
  • Which one do you prefer? Please rank the following:
    • anything other than svn (no preference)
    • bzr
    • git
    • hg
    • svn (no change)

Basic stats

Contribution statistics

Why do we attract so few people that self-identify as primarily being
documenters? Is it because people who get involved in documentation
then also get heavily involved in other areas and thus put themselves
in the “Other” category (most of the documenters I can think of
probably did this)? Are distros more likely to attract this kind of
volunteer? Do we just have a fundamental shortcoming somewhere?

DVCS familiarity statistics, and should we switch

Wow…we have an awful lot of people already familiar with other
VCSes. Over 60% familiar with git, and nearly half the people already
use it regularly? I knew there were a lot of people out there, but I
didn’t know it was that many. bzr and hg also have fairly strong
representation among the community (there are even 31 people who are
familiar with all three systems, and one person who regularly uses all
three — no I’m not that person). The number of people who regularly
use git still leads the other two systems by quite a bit; I thought
they (or at least bzr) would have caught up more by now but I guess
not.

The lion’s share of the votes on whether we should switch came either
from those who wanted to switch or from those who didn’t have a strong
feeling. Although only a small percentage (less than 3%) voted “no”,
that may have been due to the wording; for purposes of counting, the
“why?” column should be lumped with the “no”s. It’s a lighter no, but
still a no. The “other” column is a bit of a wildcard and represents
a somewhat significant cross-section of the community. As can be seen
in the next section, among this group who chose “other” in answer to
the question of whether we should switch, there was a preference for
git over the other systems.

VCS rankings

Note that I’ve created an extra plot derived from the other five, ‘Average rank’, which shows the average rank of each VCS (the number in parentheses for this extra plot is the number of people whose rankings were averaged). If the community were evenly divided, or if no one cared which system we used, then every VCS would have a rank of 3. So the relevant question in the average rank plot is how far from rank 3 each system is.

Note that the different graphs have different y-axis ranges, as was true with previous plots too. Sorry.

This set of plots really surprised me. I have often thought of git as
polarizing and expected it to have the most first place votes and the
most last place votes. It definitely got the most first place votes,
was close on second place votes, and significantly lagged all other
systems in second-to-last and last place votes. I was floored by
this.

Average rankings for different demographics

One question I was really interested in was which version control
system various demographics preferred. For example, there were a
significant number of people who selected “other” for whether we
should switch to another system. What’s their preference? Do
translators or testers have a different favorite system than coders?
Do maintainers of multiple modules have a different outlook than
non-maintainers? So, in this section I try to look into this
question. Note that in each plot, the number in parentheses is the number of people across whom the average was taken.

Average VCS ranking by maintenance/development load

It looks like VCS preference doesn’t change much relative to
maintenance and development load. However, I found it interesting
that bzr had its highest support among maintainers/developers of a
single module and that git had its highest support among
maintainers/developers of multiple modules. (Mercurial had more
support among non-maintainers and non-developers, though that may just
be a reflection of the latter demographic having less strongly held
opinions.) That matched my intuition about design choices of bzr and
git, what they were optimized for, and how it has reflected in their
usage. However, although I was correct about the trend, the size of
the trend turned out to be nearly negligible.

Average VCS ranking by commit frequency

Not much variance here either. As expected, it looks like regular committers have stronger opinions (average rankings further from 3) than occasional or non-committers.

Average VCS ranking by contribution type

I was surprised by these plots. I expected support for git
to be found almost exclusively among coders, but apparently that is
not the case at all. git is ranked highest by all groups other than
documenters. Documenters, though, do rank git dead last.

Some might suggest we discard the last plot given the tiny sample size
(only 4 people self-identify as being ‘primarily’ documenters!).
While there’s some merit to that claim, I find it to be the most
interesting plot (as a bit of a VCS junkie) since it is the only
non-VCS related demographic for which git does not come in first
place.

I also find the translator plot interesting (as a VCS junkie), as it’s
the only other such plot for which git does not have a commanding
preference lead over all other VCSes. Honestly, though, I was quite
surprised that git was even close to svn for translators, let alone
that it had a small lead.

Average VCS ranking by DVCS usage/familiarity

No real surprise here as far as the favorite goes — users who are familiar with or regularly use a certain system tend to prefer that system. However, git enjoys positive support in all cases and at least comes in second? I found that somewhat surprising. I thought it would get an average ranking lower than 3 by those familiar with or using bzr/hg — much as bzr, svn, and hg did among those familiar with or regularly using git.

Average VCS ranking by propensity to switch systems

Those who think we should switch want to go to git. Those who have no
strong preference or selected other, also had a preference for git.
Those who don’t care whether we switch, wonder what’s wrong with
subversion, or think we just shouldn’t switch, all prefer subversion.
Even among the latter group, git came in a positive second for the
“why?” and “I don’t care” groups.

Final thoughts

It looks like there’s a strong preference in the community toward
switching, and that git has a strong lead in preference among the
community, followed by svn, then bzr, then mercurial.

Among the non-VCS-related demographics, there were only two in which
git did not have a commanding lead: testers and documenters. Among
testers, git was still the preferred system, but it only marginally
led svn (and these two strongly led bzr and hg). Among documenters,
git came dead last by a large margin (while bzr came in a commanding
first). It would be interesting to find out why; perhaps we should
poll the 4 relevant people.

Among the VCS-related demographics, people familiar with or regularly
using a certain system tended to prefer that system. git always came
in a positive second, though. Also, those not wanting to switch
systems or not caring *at all* whether we switched strongly supported
subversion, while everyone else (including those with no strong
feeling about the switch) strongly preferred git. Even among the “why
switch” and “I don’t care” groups that preferred subversion, git came
in a positive second. Among the tiniest switch preference group,
those that don’t want to change systems at all, bzr was second
followed fairly closely by git.

I spent a lot more time discussing git than bzr or hg in my comments
here, but that was mostly a reflection of where it appeared in the
stats. As shown in the survey results, the other systems don’t appear
to be nearly as preferred in the community, so I simply didn’t discuss
them as much. I apologize if that makes my analysis look biased; as
I said at the beginning, feel free to ignore my analysis and draw your
own conclusions from the stats.

Many different kinds of revision specifiers

Version control systems each use their own method to refer to different versions (also known as ‘revisions’) of the repository. The choice of revision specification often reflects underlying data structures, and the choice of data structures often inhibits or enables various features for the system. Additionally, the methods of displaying and using revision specifiers can also affect the ease with which users can learn and use the new system.

Unfortunately, a full comparison is beyond the scope of this post. I will concentrate on simply introducing the basics and giving a flavor for how things are laid out, which itself is a long enough topic. While conclusions could be drawn with just the data and explanations presented here, I am intentionally avoiding doing so and leaving such to possible later posts. (Besides, bloody taxes and the brain-damaged US tax code have stripped me of any time that I would need to write such additional comparisons.)

Warning: My pictorial representations for each system will be crazier and more complex than usual (and even more lopsidedly complex for some systems than others) in order to keep things short while still showing what is possible.

cvs

Method

See cvs revision numbers and cvs branching basics, particularly figure 2.4 near the end of the branching basics section.

CVS has revision identifiers that are per-file, meaning that repositories at any given time are a combination of many different revisions (one for each file). Ignoring an ugly technical detail about the special revisions 1.1.1.1 and 1.1.1, the first version of a file is numbered 1.1. The next change to the file is recorded as 1.2, the next is 1.3, and so forth. If the user wants to create a branch, based on the 1.3 version of a file, then the branched version is 1.3.2.1. Changing and committing the file on the branch results in 1.3.2.2, then 1.3.2.3, etc. A second branch also created off of 1.3 would be numbered 1.3.4 instead of 1.3.2 (with actual commits numbered 1.3.4.x).

Note that branches are named by a revision with one less number (e.g. 1.4.2 is the name of the branch with commits numbered 1.4.2.x). As such, branch names refer to the beginning of the branch. Each file is branched separately, with per-file revision numbers (it is even possible to branch some files without branching others).

Tags are aliases for a specific version number. Since revisions are per-file, a given tag may refer to different revision numbers for different files (e.g. the ‘v1.0’ tag might refer to version 1.27 of foo.c, 1.36 of bar.h, and 1.218 of foobar.py).

Uniqueness of cvs revisions is not an issue since there is only one repository.
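
To make the numbering concrete, here is roughly how these per-file revisions, branches, and tags get used from the command line (a minimal sketch; the file, branch, and tag names are made up):

  # Show a file's history, including its per-file revision numbers (1.1, 1.2, ...)
  cvs log foo.c

  # Compare two specific revisions of a single file
  cvs diff -r 1.3 -r 1.4 foo.c

  # Create a branch rooted at the current revisions and switch the working copy to it
  cvs tag -b my-branch
  cvs update -r my-branch

  # Tag the current per-file revisions with a release name (an alias per file)
  cvs tag v1_0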

Picture

                       (etc)
                         |
             (etc)   1.4.4.3.2.2
               |         |
            1.4.4.5  1.4.4.3.2.1
               |         |
            1.4.4.4  (1.4.4.3.2)
               |     /
               |   /
            1.4.4.3
               |
  1.4.2.2   1.4.4.2
     |         |
  1.4.2.1   1.4.4.1
     |         |
  (1.4.2)   (1.4.4)
      \       /
       \     /
        \   /
         \ /
         1.4
          |
         1.3
          |
         1.2
          |
         1.1

svn

Method

See svn revisions and working with your branch, particularly figure 4.4 (the branching of one file’s history).

svn uses global revision identifiers, with the first revision being marked as 1, the second as 2, the third as 3, etc.

Branches have an unusual implementation in subversion; they are handled by a namespacing convention: a branch is the combination of revisions within the global repository that exist within a certain namespace. Creating a new branch is done by copying an existing set of files from one namespace to another, recorded as a revision itself.

Tags (an alias for a specific version in history) don’t exist in subversion. Instead, subversion again uses a namespacing convention identical to that done for branches (thus making tags and branches indistinguishable in subversion other than the chosen names), and users are merely discouraged from committing additional changes to files within a tag namespace.

Uniqueness of svn revisions is not an issue since there is only one repository.

Technically, a revision could simultaneously modify any combination of branches and tags by simply committing to all namespaces; however, this is typically discouraged and users only have a certain namespace checked out at a time.
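
Since branches and tags are just namespace copies, day-to-day use looks roughly like this (a sketch only; the repository URL is invented, while the branch and tag names match the picture below):

  # Create a branch: copy trunk into the branches/ namespace
  # (the copy itself is recorded as a new global revision)
  svn copy http://svn.example.org/repo/trunk \
           http://svn.example.org/repo/branches/proj-2-22 \
           -m "Create branch for the 2.22 series"

  # A "tag" is the same operation aimed at the tags/ namespace
  svn copy http://svn.example.org/repo/trunk \
           http://svn.example.org/repo/tags/RELEASE_2_22_2 \
           -m "Tag release 2.22.2"

  # Refer to history by global revision number
  svn log -r 18 http://svn.example.org/repo/branches/proj-2-20
  svn diff -r 15:21 http://svn.example.org/repo/trunk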

Picture

  trunk   branches/proj-2-22  branches/proj-2-20  tags/RELEASE_2_22_2
   24
                                                        23
                 22
   21
   20
                 19
                                     18
                                     17
   16
   15
                 14
                 13
                 12
   11
                                     10
    9
    8
    7
                                      6
    5
                                      4
    3
    2
    1

bzr

Method

See understanding bzr revision numbers and specifying bzr revisions.

bzr, like svn, uses 1, 2, 3, etc. for revision numbers. However, the revision numbers are always consecutive in a branch. Changes merged in from other branches are given three numbers per revision. For example, if changes were merged from a repository that has changes relative to revision 2, the changes would come into the current branch numbered 2.1.1, 2.1.2, 2.1.3, etc. If changes from more than one branch are relative to the same commit, then the middle number is used to distinguish commits from the different branches. Thus one would see another set of changes relative to commit 2 numbered as 2.2.1, 2.2.2, 2.2.3, 2.2.4, etc. (Versions of bzr older than 1.2 used more than 3 numbers in certain cases, but that is no longer true of current versions.) See the picture below to make this clearer.

Branches in bzr are done by creating separate directories (typically with their own repository), though one can set up shared repositories. Each branch will have its own numbering scheme for the revisions it stores, recording the order that the revisions entered that repository. (See below about uniqueness issues.)

Tags in bzr are an alias for a commit, and are stored as part of a branch.

Note that bzr revision numbers are not unique. If you have the same revision in two different repositories, they will not necessarily have the same revision number in both. bzr does store unique identifiers for revisions, known as revids (an example of which looks like Matthieu.Moy@imag.fr-20051026185030-93c7cad63ee570df), though they are not shown by default. Users can obtain these unique identifiers by passing the --show-ids flag to bzr log, and these revids can be used in place of the simpler default revision specifiers when prefixed with “revid:”.
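
To make this concrete, here is roughly how these specifiers get used from the command line (a sketch; the dotted revision numbers are taken from the picture below, and the revid is the example above):

  # Show history with branch-local (possibly dotted) revision numbers;
  # add --show-ids to also display the globally unique revids
  bzr log
  bzr log --show-ids

  # Refer to revisions by branch-local number...
  bzr diff -r 4.1.2..9

  # ...or unambiguously by revid
  bzr log -r revid:Matthieu.Moy@imag.fr-20051026185030-93c7cad63ee570df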

Picture

              12
              |
              11
            / | \
          /   |  \
        /     |   \
      10    4.1.5  4.2.2
       \   /  |      |
        \ /   |      |
         9    |    4.2.1
        / \   |   /
       /   \  |  /
       8    4.1.4
       |      |
       7    4.1.3
       | \    |
       |   \  |
       6    4.1.2
       |      |
       5    4.1.1
        \   /
          4
          |
          3
          |
          2
          |
          1

Note: The revision identifiers shown in this picture are dependent on merge order; the revisions 4.1.5, 4.2.1, and 4.2.2 could instead be numbered 4.2.1, 4.1.5 and 4.1.6 respectively if the merges done to obtain revision 11 were done in a different order.

git

Method

See Understanding git history: Commits, and naming git commits.

git uses cryptographic checksums (in particular, sha1sums) of repository contents as revision identifiers. These checksums are 40-character hexadecimal strings (e.g. 621ff6759414e2a723f61b6d8fc04b9805eb0c20). Each revision also knows which revision(s) it was derived from (known as the revision’s parent(s)).

Git can be used with one branch per directory like bzr or hg, but it is more common to have branches stored within the same directory/repository (thus the reason some refer to git as a ‘branch container’). In git, branches record the revision of the most recent commit for the branch; since each commit records its parent(s), a branch consists of its most recent commit plus all ancestors of that commit. When a new commit is made on a branch, the branch just records the new revision. Tags simply record a single revision, much like branches, but tags are not advanced when additional commits are made. Tags are not stored as part of a branch or in a revision-controlled file, though by default tags that point to commits that are downloaded are themselves downloaded as well.

git revisions are unique by design; if you have the same revision in two different repositories, the revision name for both will be the same.

git does provide more human-meaningful ways of referring to commits, in the form of simple suffixes used to count backwards in history from the tip of a branch (or backwards from a tag or commit). This includes methods for counting relative to different parents, making the suffixes have structural meaning. However, such methods are somewhat hidden; for example, they are not shown in the output of git log. This leaves many users unaware of how to take advantage of them, if they are aware of them at all. (A simple wrapper can get them to be shown, at the cost of a little time; they could be shown at negligible time cost with an integrated solution, but none exists to my knowledge.)
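
The suffix notation referred to here is the ~ and ^ syntax documented under “SPECIFYING REVISIONS”; a few illustrative examples (the branch and tag names are hypothetical):

  # The commit that branch 'master' points to, its parent, and its grandparent
  git show master
  git show master~1
  git show master~2

  # For a merge commit, ^1 is the first parent and ^2 is the second parent
  git show master^2

  # Suffixes can be chained, and hang off tags or raw sha1sums as well
  git show v1.0~3
  git show 621ff6759414e2a723f61b6d8fc04b9805eb0c20~1

  # Abbreviated sha1sums also work, as long as they are unambiguous
  git show 621ff67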

Picture

           650a6f...
              |
           caf806...
          /   |   \         719b9d...
        /     |     \       /
      /       |       \   /
 75cc2c...  147c0a... acac44...
      \       |         |
        \     |         |
         8f50e6...    8147be...
         /    |     /
       /      |   /
  9b39b2... 6e2cde...
    |         |
  01fa22... 1a9d90...
    |    \    |
    |      \  |
 46508c...  b6765c...
    |         |
 1c4e8d...  328638...
       \     /
       6627f7b...
          |
       754b42...
          |    \
          |      \
       d1879f...  fba5d0...
          |
       c962db...

hg

Method

See a hg tour through history, and section 2.4.1, “Changesets, revisions, and talking to other people”.

hg uses a method that may look like a mix of the methods used by git and bzr; it has two distinct methods of referring to each revision. Like git, hg uses sha1sums to refer to revisions (though it abbreviates them to fewer characters by default). Like bzr, hg uses simple increasing numbers (0, 1, 2, 3, etc.) to refer to revisions. Thus hg has one unique method to refer to revisions and another that is simple and easily manipulated by users. Each revision (or “changeset” in mercurial’s vocabulary) is of the form revision-number:changeset-identifier (e.g. 3:ff5d7b70a2a9).

Like bzr, branches in hg are typically done by creating separate directories (typically with their own repository). However, it also has named branches for naming branches within a repository, which are somewhat similar to git’s branches. (I have been told there are important distinctions between hg named branches and git branches, but I do not fully understand all the details; maybe someone will explain in the comments.)

mercurial has both tags and local tags, with (normal) tags being stored in an .hgtags file that is version controlled, and local tags being stored in a file that is not version controlled nor shared (cloned/pulled/pushed/etc.). Like most other systems, tags in hg are an alias for a specific commit.

The (abbreviated) sha1sum portion of hg revisions (the “changeset identifier”) is unique by design; if you have the same revision in two different repositories, the changeset identifier for both will be the same. The simple number portion of hg revisions (the “revision number”) is not unique. If you have the same revision in two different repositories, they will not necessarily have the same revision number in both.
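
In practice the two forms are interchangeable anywhere a revision is expected (a small sketch; the identifiers are the ones used as examples in this section):

  # Refer to a revision by local revision number or by (abbreviated) changeset id
  hg log -r 3
  hg log -r ff5d7b70a2a9

  # Either form works for other commands as well
  hg diff -r 3 -r 10

  # The tip of the repository, displayed in revision-number:changeset-id form
  hg tip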

Picture

             19:c87f92...
                |
             18:650a6f...
               |      \
        15:caf806...   \
         /     |        \
       /       |         \
      /        |          \
13:75cc2c... 14:147c0a... 17:acac44...
      \        /           |
        \     /            |
       12:8f50e6...      16:8147be...
         /    |        /
       /      |      /
9:9b39b2... 11:6e2cde...
    |         |
8:01fa22... 10:1a9d90...
    |    \    |
    |      \  |
5:46508c... 7:b6765c...
    |         |
4:1c4e8d... 6:328638...
       \     /
      3:6627f7b...
          |
      2:754b42...
          |
          |
      1:d1879f...
          |
      0:c962db...

Final notes

Each system uses a different scheme, and each scheme has its own advantages and disadvantages. Odds are that I am not aware of all the relative merits of these systems yet, though I do know some. Personally, I don’t think any of them are optimal (though I admit that optimality is a somewhat relative term given the inherent trade-offs involved). Unfortunately I’m going off-topic, as I said I wouldn’t be discussing advantages and disadvantages in this post, so I’ll shut my trap here…

How NOT to write user-friendly documentation

Warning: This is essentially a long rant about the built-in help of git, but with specific problems pointed out and some constructive suggestions buried at the very end.

When I was starting on my PhD in mathematics, there was a time I remember where the professor in one of my classes referred to a problem on our homework assignment as “a calculus problem.” Later during the week, the six or so students in the small class were talking about the homework and one of the brighter students (i.e. not me) hesitantly asked, “Did he really call that a calculus problem?” It turns out we were all stumped by the problem, and we were surprised (and scared) that the professor utterly trivialized it. We did all eventually solve the problem, and did ultimately agree that it was a generalization of ideas from calculus, but not until after getting a bit of a scare that we were in over our heads in graduate school.

In a similar way, Carl and others have argued that they just don’t see the big user-visible differences in the models offered by git and other VCSes. Many a git user seems to have had trouble understanding why other systems would be marketed as being easier than git. I am a lot slower than most of them, but I eventually figured it out and I do think that git could be just as easy if tweaked appropriately; however, it sure does seem to place every obstacle possible in the way of users discovering that.

This blog post is dedicated to what I believe is the single biggest obstacle in the way of users learning git: its built-in documentation (which doubles as its manpages, or vice-versa). Let me first point out that the git manpages are stellar from the point of view of comprehensiveness, or from the point of view of those writing their own UI for git. However, in this post I will judge these documents based on their friendliness to users who are new to git (which may not have been one of their design goals for those pages, so such judgement may come across as unfair).

Do not layer concepts — make it all or nothing

Imagine taking a computer user who is bright but unfamiliar with unix or common editors associated with unix, and giving them vi (or vim) to run. Also, imagine making them unable to use vi until they understand every feature and capability of it. Without a way to incrementally improve their productivity (better manuals, or a friend to explain things, etc.), most users will either give up or take an extremely long time to become productive. While this analogy is a bit of a stretch, for the most part this is precisely the way the git manpages are structured[1].

I’ll take push and pull as an example. If you take a look at git’s help for push or pull (or fetch), you’ll note that “refspecs” are featured prominently. Embedded in the explanation of refspecs, you’ll find concepts, terms, and other details such as rebasing, octopus, configuration file syntax and effects of such configuration options, fast-forwarding, nonlinear history, repository storage format, implementation details about git’s repository directory layout, local and remote repositories, remote tracking branches, and merge strategies, all inextricably intertwined (or so you’d think on your first 12 readings). Refspecs pack a lot of information into a short amount of space, making them very convenient to the expert; but for new users this just places a mountain in front of them getting started. Since push and pull are the methods of collaborating with other users and refspecs are necessary to their functioning, new users often feel stuck.
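
For contrast, the refspec needed in the common case is tiny: just a source ref, a colon, and a destination ref, and often it can be omitted entirely (a minimal sketch, not a substitute for the full rules; the branch names are made up):

  # Push the local branch 'fix-crash' to a branch of the same name on the remote
  git push origin fix-crash:fix-crash

  # Push whatever is currently checked out to a differently named remote branch
  git push origin HEAD:testing

  # Fetch the remote 'master' into a local branch named 'upstream-master'
  git fetch origin master:upstream-master

  # Most of the time no refspec is needed at all
  git push origin
  git pull origin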

In a similar way, when I was starting I thought that I would have to understand the index, reflogs, tree-ish’es, internal repository storage format, and a few other arcane things before I’d ever be able to do anything productive with git diff.

Assume that all details are equally useful

This section could likely be combined with the previous one, but I thought it was worth a special mention. It typically comes across as “this document is overloaded with detail”, though the existence of details is not necessarily the problem (you can have a comprehensive document and have it be understandable; take a look at the cvs, svn, or hg books at red-bean.com). The problem with documentation that assumes all details are equally useful (typically a subconscious assumption) is that it makes it extremely hard for the user to get started. git’s commands have boatloads of flags, and with few exceptions there is not much guidance at all about which are more important for just getting started. Users have to sift through it all and try to figure it out themselves.

The manpages also often specify details before concepts, erring as far as possible on the side of precision. For example, take the opening sentence of the description of the git diff command: “Show changes between two trees, a tree and the working tree, a tree and the index file, or the index file and the working tree.” After reading this, the user can only respond with comments like: “What’s the index file?” (or maybe even “is the ‘index file’ something slightly different than the ‘index’ I read about elsewhere, perhaps just a subset of it or how it behaves?”), “The working tree isn’t a tree?!?”, or, perhaps most likely, “Huh, what?” The git diff help tries, starting from the beginning, to introduce as many concepts as possible (without explanations) and just befuddles the user. For another example, take a look at the description of git checkout. For someone familiar with git, they won’t see anything wrong. In fact, I don’t see anything hard there anymore. However, do you know how many dozen times I tried to parse those two short paragraphs in attempting to learn git? I don’t — I long since lost count.

Require understanding of deeply nested commands

The documentation in git often defers explanation by referring to other (low-level) commands, in a way that makes users feel they have to understand all the low-level commands too.

Here’s an example. From the git push documentation:

The <src> side can be an arbitrary “SHA1 expression” that can be used as an argument to git-cat-file -t.

From the git-cat-file documentation, explaining the -t option:

Instead of the content, show the object type identified by <object>.

Finding the definition of <object> on the git-cat-file page:

The name of the object to show. For a more complete list of ways to spell object names, see “SPECIFYING REVISIONS” section in git-rev-parse

(More complete?!? There was no explanation here at all!) And for kicks, the very beginning of the SPECIFYING REVISIONS section of git-rev-parse:

A revision parameter typically, but not necessarily, names a commit object. They use what is called an extended SHA1 syntax. Here are various ways to spell object names. The ones listed near the end of this list are to name trees and blobs contained in a commit.

If the user gets to this point, they are likely to ask whether the object from the git-cat-file page refers to a commit object, a different kind of object, or a more general case that includes both. Also, they may ask: if it’s just a commit object, then part of the ensuing explanation is going to be about something other than what we are interested in…but will we be able to tell which parts of the explanation those are?

Here are some other examples, all of them taken from the first paragraph or two of the descriptions of each command. From git-checkout: “It updates the named paths in the working tree from the index file (i.e. it runs git-checkout-index -f -u)” (one has to love following up an unintelligible (to new users) explanation by referring to a lower-level command that assumes the user knows even more). From git-bisect: “This command uses git-rev-list --bisect option to drive the binary search…”. From git-log: “The command takes options applicable to the git-rev-list command to control what is shown and how, and options applicable to the git-diff-tree commands to control how the changes each commit introduce[d] are shown.”

Prefer implementational details to user-level concepts

Just one example I know of immediately…open up the git-checkout manpage and you’ll see a section titled “Detached Head”; it’s completely meaningless to new users. Trying to describe the concepts (“working with no active branch”, “working on an unnamed branch”?) would be much better. I think they’ve tried to stomp out issues in this category, so there are not as many as there used to be.

Make your terminology inconsistent

I’ll jump straight to a quick example from git: You put things in the “index” with “add” which is known as “staging”, you remove them with “reset” (or “rm --cached” if it’s the initial commit), you use flags like “--cached” to refer to index stuff in some places, use “HEAD” to avoid it (specifically in diff), and sometimes the flags related to it are “soft” vs “mixed” (which also need to be juggled with “hard” in git-reset). While users eventually learn all the synonyms and related terminology, it really slows learning down.

Use inconsistent conventions

The git manpages do not use consistent conventions. For example, taking a look at the synopsis lines you’ll see “<filepattern>” on git-add, “file” on git-annotate, and “-pNUM” on git-apply; why are non-literal items denoted with angle brackets in one case, uppercase in another, and no special markings in a third case? On the git-push manpage alone, you’ll see “--receive-pack=<git-receive-pack>” and also “--repo=all”; angle brackets in one case, and plain text in another? On the git-diff manpage you’ll see “<commit>{0,2}” and also “<path>…”, while on git-bundle you’ll see “refname…”, and on git-checkout-index you’ll see “[<file>]*”; so denoting more than one item is done with a variety of regexes, an ellipsis combined with angle brackets, or just a plain ellipsis.

The more ways you have of referring to things, the more likely users are to confuse them. One that bit me particularly hard was the “<commit>{0,2}” in the git-diff synopsis. Now, I use regular expressions daily and my usage of them spans emacs, grep, sed, perl, python, and probably other places I’m forgetting…but it still did not occur to me that this syntax in this context happened to be a regex. It looked somewhat similar to reflog notation in git, and I launched into all sorts of reading up on reflogs (and the documentation it depended on) trying to understand this detail. I also read the online tutorial, parts of the git user manual, blog sites, and even mailing list archives. I gave up and moved on, but never felt comfortable that I understood things…until months later when it dawned on me that this was a simple regex.

(And no, I don’t care that my post whining about inconsistent conventions is itself riddled with inconsistent conventions.)

What can be done to fix this?

Well…some people have fled to hg and bzr to avoid these problems in git. As far as built-in documentation, both of them have done well and have very nice documentation for new users; bzr’s is particularly well polished. Adopting one of those systems is a reasonable choice which I can’t fault people for making.

Now, I dislike reaming a project without mentioning something good or at least providing constructive feedback. In addition to spelling out some specific problems throughout this post, I came up with some constructive ideas for improvement: Easy GIT, a brainstorming session about making git more user-friendly, in the form of a usable demonstration. It also doubles as a transition tool, helping users switch from other systems (particularly svn). I don’t know if anyone wants to adopt any of my ideas or documentation, but it’s out there for people to evaluate (even if not yet complete).

I will also note that Govind Salinas had a similar idea and created pyrite (and did so before I started eg, though I didn’t learn of it until after I was using eg daily). So it seems that I am not the only one thinking along these lines…


[1] The online tutorial and user manual of git are different, but sadly users don’t want to read big long tutorials when they are already familiar with another “similar” system; rather they just use ‘git help’ and get the git manpages…and end up more confused than they started.

Happenings in the VCS world

It has been a long time since my last blog post on VCSes. I am getting back into the swing of things and will be making a few more posts. Besides, Olav doesn’t have enough to do and he wants more of my long rambling posts to digest.

The VCS world is becoming more and more interesting, even if it is also more and more frustrating. I’ll briefly point out a few things I have seen happen in the last few months that look cool, making this VCS post a little bit different than my others.

cvs

Stinking stingy CVS refuses to die…it seems to prefer slowly petrifying over the years or something. It was great a number of years ago, but there’s just so many better tools these days. However, there does appear to be a light at the end of the tunnel. The last place I am forced to use CVS (work) will finally be switching (to subversion) in a couple months. Woohoo!

svn

I haven’t seen any big changes in subversion itself (only one bug fix release has occurred). However, it looks like they are making progress on finally implementing useful merge functionality. This is interesting on a number of levels:

  1. This lack of functionality was one of the big reasons subversion sometimes looks like a (very well polished) antique rather than a modern system; will the incorporation of this feature be enough to stave off some of the ongoing defections to other systems?
  2. This may be interesting for those using bzr-svn, hgsvn, or git-svn — are users of such systems going to find it even easier to use their preferred tool?
  3. The main reason svn’s dozen or so ugly renaming bugs (some of which essentially result in corrupted data) have gone almost completely unnoticed is that most are only triggered in merge operations and subversion’s current merge functionality is so primitive and problematic that hardly anyone uses it. Further, svn’s roadmap clearly lists fixing the rename problems in a different release, after the merge fixes are included. Will the extra visibility that one problem will receive due to a different problem being fixed make subversion look more problematic or less?

This will be fun to watch.

On a separate note, it is interesting to see that subversion developers are considering adopting some features of distributed VCSes — sometime in the distant future. An easy to miss but interesting nugget from that email is the following:

Fortunately, we’ve pretty much agreed, IIRC, that we’re willing to punt on subdirectory detachability in working copies in order to get performance improvements.

I have often seen svn and cvs proponents argue that as one of the big advantages to those systems, yet it looks like the svn developers are willing to drop it. Very interesting indeed.

hg

Mercurial version 0.9.5 was released since I did my last round of VCS blog posts and it is on my system. hg-0.9.5 has quite a number of improvements; the one that particularly caught my eye was support for subversion as a source SCM in its convert functionality. When I first looked at mercurial, they suggested people use git-svn and then convert from git to hg. To me, that seemed to push people to just use git. It looks like this has changed.

I have often found it somewhat strange that mercurial doesn’t have more active vocal proponents. Usually one hears from the git or bzr proponents, but not so much from mercurial. Yet it has always had many of the advantages of both (and, in some ways seems to have the most svn-like UI, and would seem a more natural transition for svn converts). I guess it’s a case where having most of the advantages or capabilities of other systems (even multiple other systems) yet not clearly standing out in one particular area will rob you of the active advocates that you could otherwise have. Of course, maybe it’s like the linuxjournal reader’s choice awards phenomenon too; the noise or results that others hear may only be indicative of a certain small subset of the community.

bzr

A lot has happened in the Bazaar world. They had their big 1.0 release in mid-December and are now up to bzr-1.2. They have made impressive gains in performance, particularly with their adoption of the pack idea from git, and it appears they have at long last caught up to the leaders in the field in this area.

Near the end of last year, I corresponded with Ian Clatworthy about early versions of the “Main Competitors” writeups on the Why Choose Bazaar page. I pointed out some advantages of bzr he hadn’t included, mentioned how some bold claims had no accompanying proof, and pointed out some places where he seemed to be unaware of capabilities of other systems or where I disagreed with some of his claims. The final versions seem to have mixed results; part of my feedback was addressed (and more was addressed in follow-ups), but other parts were not. I’m particularly puzzled by the reluctance to investigate the existing capabilities of other systems and the willingness to claim features of bzr as advantages without determining whether they are actually unique. Regardless, though, while one does need to individually verify or discard each claim, the writeups are fairly impressive. I probably need to get back in touch with Ian again.

git

I’m so annoyed with Carl right now. He was the one who introduced me to git a number of years ago, and showed me some really cool things about it. I dropped it almost immediately at the time because it was way too hard to use. But, I’ve always been interested in it and made occasional attempts to tame the dragon ever since.

As many are aware, git has made huge strides towards usability in the 1.5 series, and has recently introduced automatic repacking in git-1.5.4. Because of all this work, I made diligent attempts to understand it over the last couple months. In doing so, I finally had the necessary epiphanies to feel I understand it. It turns out I was able to use it productively long before the uncomfortable feeling of I-don’t-really-understand-this-thing was finally expelled. The result? I found that there are several features of git not present in other systems that I am absolutely addicted to, but looking back on the journey I can’t say that it would be worth the effort for others to follow the same path, despite these awesome features. The thing is still too bloody hard to figure out.

One of my desires for my blog post series was to point out how horrible the git manpages (i.e. the built in help system for git) are for new users, but I felt uncomfortable doing so until I actually understood them. I was not able to understand even the synopsis of the git-diff manpage until a couple weeks ago. And I tried. Hard. Over days, weeks, and months. I read up on reflogs, the index, git’s storage format, the git tutorial and all kinds of other documentation. I feel stupid now, because I was just missing something simple and now seemingly obvious. But from what I can tell, little should-be-obvious-but-aren’t things like this are blocking lots of people from being able to use git.

Long story short: git has become far more usable…mere mortals can actually figure the system out (a big change from earlier versions) if they have an unusually large level of patience and motivation. git has some really awesome features, but I just can’t recommend it to others in its current state.

Limbo: Why users are more error-prone with git than other VCSes

Limbo is a term I use but VCS authors don’t; they tend to ignore a certain state that exists in all major VCSes (and, because they ignore it, give it no name), despite the fact that this state seems to be the largest source of user errors. I call this state limbo.

How to make git behave like other VCSes

Most potential git users probably don’t want to read this whole page, and would like their knowledge from usage of other VCSes to apply without learning how the index and limbo differ between git and their previous VCS (despite the really cool extra functionality it brings). This can be done by

  • Always using git diff HEAD instead of git diff, and
  • Always using git commit -a instead of git commit

Either make sure you always remember those extra arguments, or come back and read this page when you get a nasty surprise.
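
If remembering the extra arguments is the sticking point, one option is to bake them into aliases via git config (the alias names here are just suggestions, not anything standard):

  # Define the aliases once, globally
  git config --global alias.di "diff HEAD"
  git config --global alias.ci "commit -a"

  # Then use them in place of the bare commands
  git di
  git ci -m "Commit every change to tracked files, like other VCSes"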

The concept of Limbo

VCS users are accustomed to thinking of using their VCS in terms of two states — a working copy where local changes are made, and the repository where the changes are saved. However, the working copy is split into three sets (see also VCS concepts):

  • (explicitly) ignored — files inside your working copy that you explicitly told the VCS system to not track
  • index — the content in your working copy that you asked the VCS to track; this is the portion of your working copy that will be saved when you commit (in CVS, this is done using the CVS/Entries files)
  • limbo — not explicitly ignored, and not explicitly added. This is stuff in your working copy that won’t be checked in when you commit but you haven’t told the VCS to ignore, which typically includes newly created files.

The first state is identical across all major VCSes. The other two states are identical across cvs, svn, bzr, hg, and likely others. But git draws the boundary between the index and limbo differently.

One could imagine a VCS which just automatically saves all changes that aren’t in an explicitly ignored file (including newly created files) whenever a developer commits, i.e. a VCS where there is no limbo state. None of the major VCSes do this, however. There are various rationales for the existence of limbo: maybe developers are too lazy to add new files to the ignored list, perhaps they are unaware of some autogenerated files, or perhaps the VCS only has one ignore list and developers want to share it but not include their own temporary files in such a shared list. Whatever the reason, limbo is there in all major VCSes.

Changes in limbo are a large source of user error

The problem with limbo is that changes in this state are, in my experience, the cause of the most errors with users. If you create a new file and forget to explicitly add it, then it won’t be included in your commit (happens with all the major VCSes). Naturally, even those familiar with their VCS forget to do that from time to time. This always seems to happen when other changes were committed that depend on the new files, and it always happens just before the relevant developers go on vacation…leaving things in a broken state for me to deal with. (And sure, I return the favor on occasion when I simply forget to add new files.)

A powerful feature of git

Unlike other VCSes, git only commits what you explicitly tell it to. This means that without taking additional steps, the command “git commit” will commit nothing (in this particular case it typically complains that there’s nothing to commit and aborts). git also gives you a lot of fine-grained control over what to commit, more than most other VCSes. In particular, you can mark all the changes of a given file for subsequent committing, but unlike other VCSes this only means that you are marking the current contents of that file for commit; any further changes to the same file will not be included in subsequent commits unless separately added. Additionally, recent versions of git allow the developer to mark subsets of changes in an existing file for commit (pulling a handy feature from darcs). The power of this fine-grained choose-what-to-commit functionality is made possible due to the fact that git enables you to generate three different kinds of diffs: (1) just the changes marked for commit (git diff --cached), (2) just the changes you’ve made to files beyond what has been marked for commit (git diff), or (3) all the changes since the last commit (git diff HEAD).
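
Here is a short sequence showing the three kinds of diff side by side, plus the hunk-by-hunk marking mentioned above (a sketch; foo.c stands in for any already tracked file):

  # Edit a tracked file, then mark its current contents for commit
  echo "first change" >> foo.c
  git add foo.c

  # Edit the same file again without re-adding it
  echo "second change" >> foo.c

  git diff --cached   # (1) only what is marked for commit: "first change"
  git diff            # (2) only what is not yet marked: "second change"
  git diff HEAD       # (3) everything since the last commit: both changes

  # In recent versions, interactively pick individual hunks to mark for commit
  git add -p foo.c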

This fine-grained control can come in handy in a variety of special cases:

  • When doing conflict resolution from large merges (or even just reviewing a largish patch from a new contributor), hunks of changes can be categorized into known-to-be-good and still-needs-review subsets.
  • It makes it easier to keep “dirty” changes in your working copy for a long time without committing them.
  • When adding a feature or refactoring (or otherwise making changes to several different sections of the code), you can mark some changes as known-to-be-good and then continue making further changes or even adding temporary debugging snippets.

These are features that would have helped me considerably in some GNOME development tasks I’ve done in the past.

How git is more problematic

This decision to only commit changes that are explicitly added, and doing so at content boundaries rather than file boundaries, amounts to a shift in the boundary between the index and limbo. With limbo being much larger in git, there is also more room for user error. In particular, while this allows for a powerful feature in git noted above, it also comes with some nasty gotchas in common use cases as can be seen in the following scenarios:

  • Only new files included in the commit
    1. Edit bar
    2. Create foo
    3. Run git add foo
    4. Run git commit

    In this set of steps, users of other VCSes will be surprised that after step 4 the changes to bar were not included in the commit. git only commits changes when explicitly asked. (This can be avoided by either running git add bar before committing, or running git commit -a. The -a flag to commit means “Act like other VCSes — commit all changes to any files git already tracks”.)

  • Missing changes in the commit
    1. Create/edit the file foo
    2. Run git add foo
    3. Edit foo some more
    4. Run git commit

    In this set of steps, users of other VCSes will be surprised that after step 4 the version of foo that was committed was the version that existed at the time step 2 was run, not the version that existed when step 4 was run. That’s because step 2 is translated to mean mark the changes currently in the file foo for commit. (This can be avoided by running git add foo again before committing, or running git commit -a for step 4. A runnable transcript reproducing this scenario appears after this list.)

  • Missing edits in the generated patch
    1. Edit the tracked file foo
    2. Run git add foo
    3. Edit foo some more
    4. Run git diff

    In this set of steps, users of other VCSes will be surprised that at step 4 they only get a list of changes to foo made in step 3. To get a list of changes to foo made since the last commit, run git diff HEAD instead.

  • Missing file in the generated patch
    1. Create a new file called foo
    2. Run git add foo
    3. Run git diff

    In this set of steps, users of other VCSes will be surprised that at step 3 the file foo is not included in the diff (unless changes have been made to foo since step 2, but then only those additional changes will be shown). To get foo included in the diff, run git diff HEAD instead.
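
As a concrete illustration, here is a transcript reproducing the “Missing changes in the commit” scenario above in a throwaway repository (the file name and commit messages are arbitrary):

  mkdir demo && cd demo && git init
  echo "version 1" > foo
  git add foo
  git commit -m "Initial commit"

  echo "version 2" > foo          # edit foo
  git add foo                     # mark foo's *current* contents for commit
  echo "version 3" > foo          # edit foo some more
  git commit -m "Update foo"      # commits "version 2", not "version 3"

  git show HEAD:foo               # prints "version 2"
  git diff                        # still shows the uncommitted "version 3" edit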

These gotchas are there in addition to the standard gotcha exhibited in all the major VCSes:

How all the major VCSes are problematic

  • Missing file in the commit
    1. Edit bar
    2. Create a new file called foo
    3. Run vcs commit (where vcs is cvs, svn, hg, bzr…see below about git)

    In this set of steps, the edits in step 1 will be included in the commit, but the file foo will not be. The user must first run vcs add foo (again, replacing vcs with the relevant VCS being used) before committing in order to get foo included in the commit.

    It turns out that git actually can help the user in this case due to its default to only commit what it is explicitly told to commit; meaning that in this case it won’t commit anything and tell the user that it wasn’t told to commit anything. However, since nearly every tutorial on git[*] says to use git commit -a, users include that flag most of the time (60% of the time? 98%?). Due to that training, they’ll still get this nasty bug. However, they’re going to forget or neglect this flag sometimes, so they also get the new gotchas above.

[*] Recent versions of the official git tutorial being the only exception I’ve run across. It’s fairly thorough (make sure to also read part two), though it isn’t quite as explicit about the potential gotchas in certain situations.

How bzr, hg, and git mitigate these gotchas (and cvs and svn don’t)

These gotchas can be avoided by always running vcs status (again, replace vcs with the relevant VCS being used) and looking closely at the states the VCS lists files in. It turns out bzr, hg, and git are smart here and try to help the user avoid problems by showing the output of the status command when running a plain vcs commit (at the end of the commit message they are given to edit). This helps, but isn’t foolproof; I’ve somehow glossed over this extra bit of info in the past and still been bitten. Also, I’ll often either use the -m flag to specify the commit message on the command line (for tiny personal projects) or a flag to specify taking the commit message from a file (i.e. using -F in most vcses, -l in hg).

The concepts a user must learn to understand existing VCSes

Note: Most will not find this post as interesting as my previous posts or my next one. It was intended to help explain questions like “How much knowledge transfers to a new VCS if you’ve learned another?” and “Why do some claim that certain *types* of VCSes are easier to learn than others, while others claim that they are all pretty much equal?”, questions I mentioned in my first post. Most readers probably aren’t interested in those questions, and thus not in this entry, but I’m including it anyway.

Editors as an analogy

I sometimes see people arguing about whether text editors and word processors ought to automatically save with every change. While almost every existing editor has two states (the version being edited, plus the version on disk when last saved), some argue that it would simplify things to save on every change. Most editors stick with the two-state model, which from a Darwinian point of view would suggest it is the superior model overall. However, it is interesting to note that the multi-state model does come with its complications even for simple cases like this. The multi-state model for editors has stung just about everyone at least once in the past before they learned the appropriate habits. For example, many have lost data in the past due to exiting the app before saving, due to power outages, due to application crashes, or even due to OS and hardware failures. (These days, most editors have workarounds which mitigate these problems.) Also, users can’t use separate programs to copy or print or import the file on disk and use it unless they remembered to first save their latest changes. And users may be confused at first by extra files (foo.autosave, foo.bak, foo~, .#foo, etc.) that show up on their hard disk.

Virtually all editors use this two-state model (current edits, plus last version saved on disk), and nearly all computer users seem to have mastered it. At a basic level, VCSes use a similar model.

The multiple states of all major VCSes

All the major VCSes provide developers with their own little sandbox, or working copy, as well as a place for changes that are ready to be saved, called a repository. This maps almost directly to the concepts of standard editors — changes you make locally, and what version you last saved. Most any VCS guide will say that these are the two states you need to learn (I particularly remember reading several about CVS which said this.) It’s a convenient lie though. There are more than two states.

Compiling source code can create files that don’t need to be saved in the repository (others can regenerate them with the source). So, all the major VCSes have the concept of an ignore list; any files in the ignore list will not be saved in the repository. So we have three states so far: ignored files, local changes, and the repository.

Sadly, there’s another state that the working copy is split into. The major VCSes seem to have decided that developers may be too lazy to add files that shouldn’t be saved to the ignore list…or that they may be unaware of such files (editor autosave files, for example), or that developers may want shared ignore lists without adding some personal files to those shared lists. Whatever the reason, the major VCSes have another state which I call “limbo”, whose existence everyone seems to forget about. This state holds changes which aren’t explicitly added to the index (think CVS/Entries files in CVS) and thus will not be saved, but which are not explicitly ignored either. This state causes the most bugs in my experience, even with advanced users, because people simply forget to explicitly add new files to the index and thus those files don’t get saved with the rest of the changes. So we have four states so far (three being subsets of the working copy): explicitly ignored files, limbo, local changes that will be saved in the next commit, and the repository.

It turns out that the repository side also is split into multiple states. Developers want to be able to track what changes they themselves have made to their working copy, regardless of commits that have since been recorded in the (remote) repository. So, if you want to get a list of changes you’ve made, or the history that led up to your current working copy, it needs to be relative to the version of the repository that existed when you got your copy. That may not be the current version, because other developers could have recorded their changes in the (remote) repository. This also affects your ability to push your changes to the (remote) repository (by a “commit” in cvs or svn terminology), potentially requiring you to merge the various changes together. So, we have five states:

  • Substates of the working copy:
    1. (explicitly) ignored — files inside your working copy that you explicitly told the VCS system to not track
    2. index — the content in your working copy that you asked the VCS to track; this is the portion of your working copy that will be saved when you commit (in CVS, this is done using the CVS/Entries files)
    3. limbo — not explicitly ignored, and not explicitly added. This is stuff in your working copy that won’t be checked in when you commit but you haven’t told the VCS to ignore, which typically includes newly created files.
  • Substates of the repository:
    4. “checkout” version — the version of the code in your working copy before you started modifying it
    5. remote version — the version of the code currently saved in the remote repository

Not understanding these multiple states and the differences between them for the VCS you are using has varying consequences: not being able to take full advantage of your system, being unable to do some basic operations, or (worst case) introducing erroneous or incomplete changes.
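
To make these states concrete, here is a rough mapping onto svn commands (a sketch; the other major VCSes have close equivalents):

  svn status              # '?' entries are limbo (state 3); 'A'/'M' entries are indexed
                          #   changes (state 2); explicitly ignored files are hidden (state 1)
  svn status --no-ignore  # also list ignored files, shown with an 'I' marker
  svn diff                # compares your working copy against the checkout version (state 4),
                          #   not against whatever is currently in the remote repository
  svn status -u           # additionally marks files that are out of date with respect to
                          #   the remote repository (state 5)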

Similarities and differences between the major VCSes with these states

Most of these five states are similar between the major VCSes. State 1 (ignored files) is essentially identical between the systems. (The only difference is in the details of setting it up; for cvs it means editing .cvsignore, for svn it means modifying the svn:ignore property, for mercurial it means editing .hgignore, etc.) State 5 is also essentially identical. States 2 and 3 always sum up to everything in the working copy other than explicitly ignored files, so extending state 3 means shrinking state 2. Thus, we can get a feel for the differences between VCSes by looking at their differences in states 3 and 4.
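
For concreteness, setting up the ignore list looks roughly like this in each system (a sketch; the patterns are just examples):

  echo "*.o" >> .cvsignore                   # cvs: per-directory ignore file
  svn propset svn:ignore "*.o" .             # svn: per-directory property
  printf 'syntax: glob\n*.o\n' > .hgignore   # hg: glob patterns need the 'syntax: glob' line
  bzr ignore "*.o"                           # bzr: appends the pattern to .bzrignore
  echo "*.o" >> .gitignore                   # git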

The major distinction between inherently centralized VCSes (e.g. cvs and svn) and so-called distributed (I prefer the term “multi-centered”) VCSes comes with state 4. The differences in this state can be thought of as different choices along a continuum rather than a binary difference, however. The difference here is in how much information gets cached when one gets a copy. With CVS, you only get a working copy plus info about what version you checked out and where the repository is located. With SVN, you get the same as with CVS, but also an extra copy of all the files. Most distributed systems go a few steps further than svn and by default cache a copy of all the versions on the given branch. git, by default, caches a copy of all versions on all branches.

Caching extra information as part of state 4 can allow additional work to be done offline. cvs and svn are very limited in this respect, but the additional offline capabilities of the other systems come with the understanding that the local cache itself is a repository, and thus users need to understand both how to sync changes with the local repository and how to sync between the local repository and the remote one(s). In cvs and svn, it’s not useful to “sync with the local cache”; instead those systems automatically synchronize the local cache, the remote repository, and the indexed local changes all at once. Thus, cvs and svn users only need to learn a smaller set of “synchronization” commands (limited to “commit” and “update”).
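
In command terms, the difference looks roughly like this (a sketch):

  # cvs/svn: one repository, so two synchronization commands suffice
  svn update       # bring remote changes into the working copy
  svn commit       # send indexed local changes to the (remote) repository

  # bzr/hg/git: the local cache is itself a repository, so local and remote
  # synchronization are separate steps
  git commit       # record changes in the local repository
  git pull         # fetch changes from a remote repository and merge them
  git push         # publish local commits to a remote repository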

There is also a potential difference between VCSes in state 3. State 3 is, in my experience, where the most errors occur: users simply forget that their changes are in this state and forget to add them. Now, it turns out that all the VCSes I’ve looked at closely enough are identical here, except for git. (So if you know one of them, you already understand this aspect of all the other VCS systems other than git.) git extends the concept of limbo, turning the index into a high-level (and in-your-face) concept with some really cool features, but unfortunately this has the side-effect of making git even more error-prone for users. I’ll discuss this in more detail in my next post.

Local caching: A major distinguishing difference between VCSes

An interesting difference between the major VCSes concerns how much information is cached locally when one obtains a copy of the source code using the VCS. The amount of information obtained by default when one performs a checkout or clone with each of the five major VCSes is:

  • cvs – a working copy of the specified version of the source code files, plus information about which revision was checked out and where the repository is located.
  • svn – same as cvs, plus an extra copy of the specified version of the source code files
  • bzr, hg – same as svn, plus the rest of the history of the current branch (in other words, the same as cvs plus a copy of the complete history of the current branch)
  • git – same as bzr & hg, plus the full history of all other branches in the repository as well.

Note that some systems have options to cache less than the default.
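
The corresponding commands look roughly like this (a sketch; the URLs and module names are placeholders):

  cvs -d :pserver:anonymous@cvs.example.org:/cvsroot checkout module
  svn checkout http://svn.example.org/repos/project/trunk
  bzr branch http://bzr.example.org/project/trunk
  hg clone http://hg.example.org/project
  git clone git://git.example.org/project.git
  git clone --depth 1 git://git.example.org/project.git   # one example of caching less than the default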

Benefits of local caching

The additional cached information can serve multiple purposes; for example, making operations faster (by using the local disk instead of the network), or allowing offline use. For instance, nearly all operations in cvs other than edits of the working copy require network connectivity to the repository. In subversion, diffs between the version you checked out and your current working copy are fast due to the extra copy that was checked out, but other operations still require network connectivity. In bzr & hg, diffs against versions older than the checkout version, reverting to an older version, and getting a log of the changes on the branch can all be fast operations and can be done offline. In git, even comparing to a different branch, switching to a different branch, or getting a log of changes in any branch can be done quickly and offline.
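
For example, with a git clone all of the following work offline (a sketch; the tag and branch names are made up):

  git log                     # full history of the current branch
  git diff some-old-tag       # compare against an older version
  git checkout other-branch   # switch to another branch
  git log other-branch        # history of a branch you never explicitly checked out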

This local caching also drives home another point: cvs and svn have limited utility when working offline. bzr, hg, and git allow quite a bit of offline use…in fact, it even makes sense to commit while offline (and then merge the local commit(s) with the remote repository later). Thus, one thinks of the local cache in such cases as being a repository itself. This has ramifications as well. Since the local cache is a repository, it makes sense to think of updating from a different remote repository than the one you got your checkout/clone from, and of merging/pushing your changes to yet another location. This is the essence of being a VCS with distributed capabilities. This can be taken to the pathological extreme (resulting in the kernel development model), or one can use a more standard centralized model that simply has impressive offline capabilities (which is how Xorg runs), or one can pick something in between that suits them. One common case where someone might want to pick something in the middle is when an organization has multiple development sites (perhaps one in the US and one in Europe) and developers at the remote site would like to avoid the penalties associated with slow network connections. In such a case, there can be two “central” repositories which developers update from and commit to, with occasional merges between these centers. It can also be useful for developers at a conference who want to work on the software and collaborate even when they don’t have connectivity to the “real” repository.

Another side effect of local caches being a repository is that it becomes extremely simple to mirror repositories.

Another interesting observation is that git allows the most offline use. There have been many times when I’ve wanted to work offline with cvs or svn projects (I’ve even resorted to rsyncing cvs repositories, when I had access, as a cheap hack to try to achieve this), and many times when I wished I had a copy of other branches and older versions while offline. bzr and hg are leaps and bounds better than cvs and svn in this regard, but they only partially solve the problem; using them would mean that I’d either have to manually do a checkout for every branch, have to be online, or have to do without information potentially useful to me when I don’t have network connectivity. This is especially important considering that VCSes with distributed capabilities make merging easy, which encourages the use of more branches. Looking at the comparison this way, I’d really have to say that the extensive offline capabilities of git are a killer feature. I’m puzzled that other VCSes haven’t adopted as much local caching as git does (though I read somewhere that bzr may be considering it).

Disk usage — Client

When people see this list of varying amounts of local caching, they typically assume that disk usage is proportional to the amount of history cached, and thus believe that git will require hundreds of times the amount of disk space to get a copy of the source code…with bzr and hg being somewhere in between. Reality is somewhat surprising; from my tests, the size of a checkout or clone from the various VCSes would rank in this order (with approximate sizes relative to a cvs checkout in parentheses):

  • cvs (1)
  • git (1.92)
  • svn (2)
  • hg (2.05)
  • bzr (3.2) [*]

The main reason for git, hg, and bzr being so small relative to expectations is that source code packs well and these systems tend to be smart about handling metadata (information about the checkout and how to contact the server). However, there are some caveats here: my numbers (particularly for hg and bzr) aren’t based on studies as thorough as they should be, and the variance is higher than you’d expect (it depends a lot on how well the history of your project packs, whether you have large files in the history that are no longer in the project, etc.). Also, while bzr and hg do automatic packing for the user, git requires the user to state when packing should be done. If the user never packs (i.e. never runs ‘git gc’) then the local repository can be much larger than a cvs or svn checkout. A basic rule of thumb is to just run ‘git gc’ after several commits, or whenever .git is larger than you think it should be.
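
For example (a sketch):

  du -sh .git     # see how much space the local repository currently takes
  git gc          # repack loose objects; this often shrinks .git substantially
  du -sh .git     # compare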

I’ve done lots of git imports (git-cvsimport and git-svn make this easy), comparing dozens of cvs and svn repository checkouts to git ones. So I feel fairly confident about my number for git above. It does vary pretty wildly, though; e.g. for metacity it’d be 1.51 while for gtk+ it’d be 2.56[**]; I’ve seen ranges between about 0.3 and 6.0 on real-world projects, so the 1.92 is just an overall mean. The hg numbers were based strictly on converting git imports of both metacity and gtk+ to hg (using the recent ‘hg convert’ command) and averaging the relative differences. My bzr number was based on importing metacity with bzr-svn and with git-svn and comparing those relative sizes (bzr-svn choked on gtk+, and I couldn’t get tailor to convert the existing git gtk+ repo to bzr).
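
The conversions themselves look roughly like this (a sketch; the repository URLs and names are placeholders):

  git cvsimport -d :pserver:anonymous@cvs.example.org:/cvsroot -C project-git project
  git svn clone http://svn.example.org/repos/project/trunk project-git
  hg convert project-git project-hg    # requires enabling the convert extension in .hgrc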

[*] I did these tests before bzr-0.92 was out, which has a new experimental (and non-default) format that claims to drop this number significantly. I hear this new format is planned to become the default (with possibly a few tweaks) in a few months, so this is a benchmark that should be redone early next year. However, the existing number does show that bzr is already very close to an svn checkout in size despite bringing lots more information.

[**] For those wanting to duplicate, I ignored the space taken by the .git/svn directory, since that information is not representative of how much space a native git repository would take. It is interesting to note, though, that .git/svn/tags is ridiculously huge; to the point that I think it’s got to be a bug in the git-svn bridge.

Disk usage — “Central” Server

If what concerns you is the size of the repository on the central server, then the numbers are more dramatic. Benchmarks I’ve seen put git at about 1/3 the size of CVS and 1/10 the size of svn.

UPDATE: A number of people pointed me to the new named branches feature in hg that I was unaware of, which looks like it puts hg in the same category as git. Cool!

Adoption of various VCSes

There are a lot of version control systems out there, and one of the biggest criteria in selecting one to use is who else uses it. I’ll try to quickly summarize what I have learned about the adoption of various VCSes. There are many people who know more than I do, but here are some of the bits that I’ve picked up.

Perceived adoption from lots of reading

I have read many blog posts, comparisons, tutorials, news articles, reader comments (in blogs and at news sites), and emails (including various VCS project archives) about version control systems. In doing so, it is clear to me that some are frequently mentioned and deemed worthy of comparison by others, while many VCSes seem so obscure that they only appear in comparisons at sites that attempt to be exhaustive or completely objective (e.g. at wikipedia). Here are the ones I hear mentioned more frequently than others:

First rung: cvs, subversion, bazaar-ng, mercurial, tla/baz, and git.

Though bazaar perhaps belongs in a rung below (more on that in a minute). There are also several VCSes that are still mentioned often, but not as often as the ones above:

Second rung: svk, monotone, darcs, codeville, perforce, clearcase, and bitkeeper.

tla/baz died a few years ago (with both developers and users mostly abandoning it for other systems, though I hear tla got revived for maintenance-only changes). Also, bazaar-ng really straddles these two levels rather than being in the upper one, but I was one of the early adopters and it has relatively strong support in the GNOME community so it’s more relevant to me. Perforce, clearcase, and bitkeeper are proprietary and thus irrelevant to me (other than as a comparison point).

Adoption according to project records

Of the non-dead open source systems, here’s a list of links to who uses them plus some comments on the links:

  • bazaar-ng – WhoUsesBzr – wiki page name is inconsistent; it should be “ProjectsUsingBzr” (compare to wiki page names below) :-). The page is also slightly misleading; they claim drupal as a user but my searches show otherwise (turns out to just be a developer with an unofficial mirror). Hopefully there aren’t other cases like this.
  • codeville – NoPage – I wasn’t able to find any list of projects using codeville anywhere. In fact, I wasn’t able to find any projects claiming to use it either. It must have shown up in other peoples’ comparisons on the basis of its interesting merge algorithm.
  • cvs – NoPage – I don’t have a good reference page, and it’d likely go out-of-date quickly. However, while CVS is no longer developed and projects are switching from CVS in droves these days, it wasn’t very many years ago that cvs was ubiquitous and a near universal standard. Nearly everyone familiar with at least one vcs is familiar with cvs, making it a useful reference point. Also, it still has a pretty impressive installed base; I’m even forced to use it occasionally in the open source world as well as every day at work.
  • darcs – ProjectsUsingDarcs – I strongly appreciate the included list of projects that stopped using their VCS (and why). Bonus points to darcs for not hiding anything.
  • git – ProjectsUsingGit
  • mercurial – ProjectsUsingMercurial – I like how they make a separate list for projects with synchronized repositories (bzr and svk ought to adopt this practice, and maybe others)
  • monotone – ProjectsUsingMonotone – I really like the project stats provided.
  • subversion – open-source-projects-using-svn – wiki page name isn’t ProjectsUsingSvn; couldn’t they read everyone else’s minds and realize that they needed such a name to fit in with the standard naming scheme? 😉
  • svk – ProjectsUsingSVK – claims WINE, KDE, and Ruby on Rails as users; my simple searches showed otherwise (likely svk developers just knew of developers from those projects hosting their own unofficial svk mirrors). I don’t know if their other claimed users are accurate or not; I only checked these three.

Some adoption pages point to both the project home page and the project repositories, which is very helpful. The other adoption wiki pages should adopt that practice too, IMHO.

Adoption by “Big” users

Looking at the adoption pages listed above, each of the projects other than svk and codeville seems to have lots of users. Mostly small projects, but then most projects probably are small, and it is also easier for small projects to switch to a new VCS. The real test is whether VCSes are also capable of supporting large projects. I’d like to compare on that basis, but I’m unwilling to investigate how big each listed project is. So, I’ll instead compare based on (a) whether I’ve heard of the project before and know at least a little about it, and (b) whether I think of the project as big. This results in the following list of “big” users of various VCSes:

  • bazaar-ng – This is kind of surprising, but Ubuntu is the only case matching my definition above. As an added surprise, they aren’t in bzr’s list of users. (samba and drupal only have some unofficial users, and in the case of samba, I know they also have unofficial git users. I’m counting official adoption only for comparison purposes; otherwise GNOME and KDE would be in lots of lists.)
  • codeville – none
  • cvs – Used to be used by virtually everything. Many projects still haven’t moved on yet.
  • darcs – none of the projects listed match my definition of “big” above
  • git – linux kernel (and many related projects), much of freedesktop.org (including Xorg, HAL, DBUS, cairo, compiz), OLPC, and WINE
  • mercurial – opensolaris, mozilla (update: apparently mozilla hasn’t converted quite yet)
  • monotone – tough case. I would have possibly said none here, noting gaim, er, pidgin, as the closest, but their stats suggest two projects (Xaraya and OpenEmbedded) are big…and that pidgin is bigger than I realized. I guess I’m changing my rules due to their cool use of stats.
  • subversion – KDE, GNOME, GCC, Samba, Python, and others
  • svk – none

Brief notes about each system

As a quick additional comparison point for those considering adoption, I’ll add some very brief notes about each system that I’ve gathered from my reading or experience with the system. I’ll try to list both a good point and a bad point for each.

  • Free/Open source VCSes
    • bazaar-ng (bzr) – Developed and evangelized by Canonical (backers of the Ubuntu distribution). Designed to be easy to use and distributed, and it often gets praise for those features. It received a bit of a black eye early on for being horribly slow (it made cvs look like a speed demon), though I hear the speed issues have received lots of attention and changes (and brief recent usage suggests it’s a lot better). Annoyingly, it provides misleading and less-than-useful results when passing a date to diff (the implemented behavior is well documented and apparently intentional; it’s just crap).
    • codeville – Designed by Bram Cohen (inventor of bittorrent). People seem to find the merge algorithm introduced by codeville interesting. Doesn’t seem to have been adopted much, though, and it even appeared to have died for a while (going a year and a half between releases, with other updates hard to find as well). Seems to be picking back up again.
    • cvs – The VCS that all other VCSes compare to, both because of its recent ubiquity and because its well known flaws are easy to leverage in promoting new alternatives. The developers working on cvs decided its existing flaws could not be fixed without a rewrite, and thus created a new system called subversion. cvs is inherently centralized.
    • darcs – A really interesting research system, claimed to be easy to use, written by David Roundy (some physicist at OSU) and based on patches rather than source trees. I believe this allows, for example, merging between source trees that do not necessarily have common history (touted as an advanced cherry-picking algorithm that no other VCS can yet match). However, this design has an associated “doppelganger” bug that can cause darcs to become wedged and which requires care from the user to avoid. From the descriptions of this bug, it sounds like something any big project would trigger all the time (it’s an operation I’ve seen happen a lot in my GNOME maintenance, even on modestly sized projects like metacity). However, developers apparently can avoid this bug if they know about it and take steps to actively avoid triggering it. I think this is related to “the conflict bug”, which can cause darcs to be slow when merging large repositories, but I am not sure.
    • git – Invented by Linus Torvalds (inventor of the linux kernel). It has amazed a lot of people (including me) with its speed, and there are many benchmarks out there that are pretty impressive. I’ve heard/seen people claim that it is at least an order of magnitude faster than all other VCSes they’ve tried (from people who then list most of the major VCSes people think of as fast among the VCSes they’ve tried). It also has lots of interesting advanced features. However, versions prior to 1.5 were effectively unusable, requiring superhuman ability to learn. The UI warts are being hammered away, and git > 1.5 is much better usability-wise; it’s now a usable system once users first learn and understand a few differences from other systems, despite the few remaining warts here and there. The online tutorials have transformed into something welcoming for new users, though the man pages (which double as the built-in “--help” system) still remind me more of academic research articles written for a community of existing experts than of user documentation. Also, no official port to Windows (without Cygwin) exists yet, though one is apparently getting close. Interestingly, git seems to be highly preferred as a VCS among those I consider low-level hackers.
    • GNU Arch (tla/baz) – Invented by Tom Lord (who also tried to replace libc with his own rewrite). Both tla and baz are dead now, with developers and users having moved on for the most part. Proponents of these systems (particularly Tom) loudly evangelized the merits of distributed version control systems, which probably backfired since tla/baz were so horribly awful in terms of usability, complexity, quirkiness, and speed that these particular distributed VCSes really didn’t have any redeeming qualities or even salvageable pieces. (baz was written as a fork designed to make a usable, backward-compatible tla; the developers eventually gave up on that impossible goal and switched to bzr.) I really wish I had back the part of my life I wasted learning and using these systems. And no, I don’t care about impartiality when it comes to them.
    • mercurial (hg) – Written by Matt Mackall (linux kernel developer). Started two days after git, it was designed to replace bitkeeper as the VCS for the kernel. Thus, like git, it focused on speed. While not as fast as git in most benchmarks I’ve seen, it has received lots of praise for being easier to learn, having more accessible documentation, working on Windows, and still being faster than most other VCSes. The community behind mercurial seems to be a bit smaller, however: it doesn’t have nearly as many plugins as bzr or git (let alone cvs or svn). Also, it annoyingly doesn’t accept a date as an argument to diff, unlike all the other major VCSes.
    • monotone (mtn) – Maintained by Nathaniel Smith and Graydon Hoare (whom I don’t know of from elsewhere). The main thing I hear about this system is its focus on authentication of history to verify repository contents and changes. These ideas influenced and were adopted by git and mercurial. On the con side, it appears getting an initial copy can take an extraordinarily long time; for example, if you look at the developer site for pidgin you’ll note that they provide detailed steps on how to get a checkout of pidgin that involve bypassing monotone, since it’s too slow to handle this on its own.
    • subversion (svn) – Designed by former cvs maintainers to “be a better cvs”. It doesn’t suffer from many of the same warts as CVS; e.g. commits are atomic, files can be renamed without messing up project history, changes are per-commit rather than per-commit-per-file, and a number of operations are much faster than in cvs. Most users (myself included) feel that it is much nicer than CVS. Like CVS, svn remains inherently centralized and has no useful merge feature. Unlike in CVS, half the point of tagging is inherently broken in svn as far as I can tell[*] (you can’t pass a tag to svn diff; you have to search the log trying to find the revision from which the tag was created and then use whatever revision you think is right as the revision number in svn diff).
    • svk – Invented by Chia-liang Kao and now developed by Best Practical Solutions (some random company). Designed to use the subversion repository format but allow decentralized actions. I know little about their system and am hesitant to comment, as I can’t think of any good comments I’ve heard (nor more than a couple bad ones). However, on the light side of things, I absolutely love their SVKAntiFUD page. On that page, in response to the question “svk is built on top of subversion, isn’t it over-engineered and fragile?”, an additional note to the answer (claimed to have been added in 2005) states that “Spaghetti code can certainly not be called over-engineered.” While the history page of their wiki suggests it has been there for at least a year, I’m guessing the maintainers don’t know about this comment and will remove it as soon as someone points it out to them.
  • Proprietary (i.e. included only for comparison purposes) VCSes
    • bitkeeper – A system developed by BitMover Inc., founded by Larry McVoy. Gained prominence from its usage for a few years by the linux kernel. “Free Use” (as in no monetary cost) of the system by open source projects was revoked when Andrew Tridgell started reverse engineering the protocol (by telnetting to a server and typing “help”). Most users of this system seem to like it technically, but the free/open source crowd understandably often disliked its proprietary nature. I haven’t used the system, but think of it as being similar to mercurial (though I don’t know for sure if that’s the best match).
    • clearcase – Developed by (the Rational Software division of) IBM. Clearcase is an exceptionally unusual VCS in that I’ve never heard anyone I know mention a positive word about it. Literally. They all seem to have stories about how it seems to hinder progress far more than it helps. There has to be someone out there that likes it (it seems to have quite a number of users for a proprietary VCS despite being exceptionally expensive), but for some reason I haven’t run across them. Very weird. I believe it is actually lock-based instead of either distributed or inherently centralized, meaning that only one person can edit any given file at a time on a given branch. Sounds mind-bogglingly crazy to me.
    • perforce – Developed by Perforce Software, Inc. It seems that users of the system generally like it technically, and it has a free-of-charge clause for open source software development. My rough feeling is that Perforce is like CVS or subversion, but with a number of speed optimizations over those two. It is apparently even worse than cvs or svn for offline work, making it problematic and error-prone to edit files in the working copy that haven’t already been opened unless you are online.

The major VCSes

Based on everything above, I consider the following VCSes to be the “major” ones:

cvs, svn, bzr, hg, and git.

I’ll add an “honorable mention” category for monotone and darcs (a category bzr nearly belongs in as well, but it passes based on the Canonical backing and the much higher than average support from developers within the GNOME community). These five major VCSes are the ones that I’ll predominantly be comparing in my subsequent posts.

Update

[*] Kalle Vahlman in the comments points out that you can diff against a tag in svn, though it requires using atrocious syntax and a store of patience:

As much as I agree with [the claim] that SVN is just a prettier CVS, [it] isn’t really true. You can [run]:

svn diff http://svn.gnome.org/svn/metacity/tags/METACITY_2_21_1 http://svn.gnome.org/svn/metacity/trunk

to get differences between the tag and current trunk. If it looks horribly slow to you, it’s because you are on a very fast connection. IT IS SO SLOW IT MAKES LITTLE KITTENS WEEP. But it is possible anyway.

There are a number of other good posts in the comments too, pointing out project adoption cases I potentially missed and noting additional issues with some systems that I won’t be comparing later.

Starting to compare Version Control Systems

As I blogged about some time ago, I decided to spend some time learning and comparing various version control systems (VCSes for short). Of course, there are many version control system comparisons out there already, and I’ve read countless other sources as well (blogs, articles and their comments, archived mailing list messages found in random google searches, etc.). While some of these sources have very interesting information, they still don’t answer all the questions I had; in fact, even the types of comparisons typically performed don’t cover everything I wanted to see. Here are some of the questions I have been considering:

  • What are the most important VCSes to consider?
  • Why are VCSes hard to learn? If someone learns one VCS, how much lower is their learning curve for switching to another?
  • What are the most common pitfalls that users experience with each of the major VCSes? Are there similarities across systems in the mistakes that users make?
  • Why are some systems more widely adopted than others? Are there certain qualities that make some systems more likely to be adopted by certain groups and less likely by others?
  • Why do some users of inherently centralized systems claim that “distributed”[1] systems are harder to learn? Why do users of distributed systems claim that they are *not* harder to learn? Why are there similar questions between the various “distributed” systems?
  • Which VCS is the “best” for a given individual/group? More importantly, what are the important criteria and where do various VCSes shine?
  • Why is there so much misunderstanding between users of different systems?
  • To what extent does the truism that “all software sucks” apply to VCSes?
  • Typical stuff: Which is the fastest at operation X (and by how much)? Which provides the most useful output (why is it more useful)? Which has the best add-ons? Which has the most relevant features? Which has the best documentation (how much better is it)? Which has killer features missing in others? etc.

I’m still far from answering all of them. However, I have learned a few things, and I figured it’d be a useful exercise to bore everyone to death by writing up some of my thoughts on the subject. So I’ll be doing that occasionally. Some of the things I write up will have comparisons similar to what you’d see elsewhere (but with my own slant, which focuses on what seems relevant to me), while a few will analyze the subject from an angle different from what I have been able to find in other comparisons. I have a few posts mostly written up already, and may find time to write up a couple more after those.

Obvious Disclaimers: I’m no expert and am additionally error-prone, so I’ll likely make mistakes in my posts. I also won’t make any claims to be objective, as I don’t think full objectivity is even possible to achieve. I will aim to be “reasonably” objective…but note that, in my opinion, placing too high a priority on objectivity limits how useful a comparison can be, which in turn limits what “reasonable” can mean.

[1] As I have mentioned before, I think this is a somewhat misleading term; unfortunately, I don’t have a good replacement. Maybe something like “multi-centered”?