
GNOME DVCS Survey results

The GNOME DVCS (Distributed Version Control System) Survey completed
about a week and a half ago, with responses from 579 different people with
svn accounts. (There are 1083 people with commit access to
GNOME SVN, so this is about a 53% response rate.) The survey was
intended to collect data related to a possible move for the GNOME
project from SVN to a distributed version control system in 2009, thus
questions about svn were included despite the fact that it is not
distributed. The results of the survey are shown below. (I got the data from Behdad; the scripts I used to generate the plots can be found here.)

Bias

The plots of the data I present simply cover all the questions —
twice. Once to show the percentages of respondents with each answer
for the specific question, then again to contrast how those who
answered a given question differently had differing rankings for the
various VCSes. So the plots are as neutral as I think is possible.

I also add some commentary of my own, analyzing the data and noting
items that surprised me (I had several predictions about how the
survey would turn out; many of my predictions were right but there
were a number of surprises for me too). I don’t think it’s possible
to make such commentary unbiased. In fact, since I noticed a clear
front-runner in looking at the results, I thought it most useful to
look at that particular system, so the majority of my comments focus
on it. If you do not want my bias, ignore my comments and draw your
own conclusions from the data.

Survey Questions

First, let’s remind everyone what the survey questions were:

Your GNOME SVN user id
Do you currently maintain any GNOME modules in SVN?
- Yes, I maintain multiple modules
- Yes, I maintain a single module
- No, I am not a maintainer
Do you currently develop any GNOME modules in SVN?
- Yes, I develop multiple modules
- Yes, I develop a single module
- No, I do not develop any modules
Do you commit to GNOME SVN?
- Yes, I regularly commit to GNOME SVN
- Yes, I sometimes commit to GNOME SVN
- No, I do not commit to GNOME SVN myself
How do you best characterize your current GNOME SVN contributions?
- I develop code
- I write documentation
- I test
- I translate
- Other
(Edit: I wish the question, “In which ways do you characterize
your current GNOME SVN contributions?” had also been asked.
It would be really interesting to see the results of such a
select-all-that-apply question.)
Which of the following distributed version control systems are you familiar with? (select all that apply)
- bzr
- git
- hg
How do you best summarize which DVCS systems you use *regularly*? (select all that apply)
- bzr
- git
- hg
How do you feel about GNOME changing version control system to one of bzr, git, or hg in 2009?
- Not again! We just switched systems, like, yesterday (no)
- No strong feeling, I’d use whatever is provided
- What’s wrong with SVN? (why?)
- I do not care
- Please do! Anything is better than svn (except for cvs of course)
- Other
Which one do you prefer? Please rank the following:
- anything other than svn (no preference)
- bzr
- git
- hg
- svn (no change)

Basic stats

Contribution statistics

Why do we attract so few people that self-identify as primarily being
documenters? Is it because people who get involved in documentation
then also get heavily involved in other areas and thus put themselves
in the “Other” category (most of the documenters I can think of
probably did this)? Are distros more likely to attract this kind of
volunteer? Do we just have a fundamental shortcoming somewhere?

DVCS familiarity statistics, and should we switch

Wow…we have an awful lot of people already familiar with other
VCSes. Over 60% familiar with git, and nearly half the people already
use it regularly? I knew there were a lot of people out there, but I
didn’t know it was that many. bzr and hg also have fairly strong
representation among the community (there are even 31 people who are
familiar with all three systems, and one person who regularly uses all
three; no, I’m not that person). The number of people who regularly
use git still leads the other two systems by quite a bit; I thought
they (or at least bzr) would have caught up more by now but I guess
not.

The lion’s share of the votes on whether we should switch came either
from those who wanted to switch or from those who didn’t have a strong
feeling. Although only a small percentage (less than 3%) voted “no”,
that may have been due to the wording; for purposes of counting, the
“why?” column should be lumped with the “no”s. It’s a lighter no, but
still a no. The “other” column is a bit of a wildcard and represents
a somewhat significant cross-section of the community. As can be seen
in the next section, among this group who chose “other” in answer to
the question of whether we should switch, there was a preference for
git over the other systems.

VCS rankings

Note that I’ve created an extra plot derived from the other five, ‘Average rank’, which shows the average rank of each VCS (the number in parentheses for this extra plot is the number of people whose rankings were averaged). If the community were evenly divided, or if no one cared which system we used, then every VCS would have a rank of 3. So the relevant question in the average rank plot is how far from rank 3 each system is.
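To make the "rank of 3" baseline concrete, here is a minimal Python sketch of how an average rank could be computed. It is not taken from the plotting scripts mentioned above; the function name and the two ballots are made up for illustration. With five options ranked 1 through 5, two respondents with exactly opposite preferences average out to 3 for every option.

```python
from statistics import mean

# The five options respondents were asked to rank (1 = most preferred).
OPTIONS = ["no preference", "bzr", "git", "hg", "svn"]

def average_ranks(ballots):
    """Return {option: mean rank} for ballots mapping option -> rank (1..5)."""
    return {opt: mean(ballot[opt] for ballot in ballots) for opt in OPTIONS}

# Two hypothetical respondents with mirror-image preferences.
ballots = [
    {"no preference": 3, "bzr": 2, "git": 1, "hg": 4, "svn": 5},
    {"no preference": 3, "bzr": 4, "git": 5, "hg": 2, "svn": 1},
]

for option, rank in average_ranks(ballots).items():
    print(option, rank)
```

An evenly divided community lands on 3 everywhere, so the distance from 3 is what carries information in the average rank plot.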

Note that the different graphs have different y-axis ranges, as was true with previous plots too. Sorry.

This set of plots really surprised me. I have often thought of git as
polarizing and expected it to have the most first place votes and the
most last place votes. It definitely got the most first place votes,
was close on second place votes, and significantly lagged all other
systems in second-to-last and last place votes. I was floored by
this.

Average rankings for different demographics

One question I was really interested in was which version control
system various demographics preferred. For example, there were a
significant number of people who selected “other” for whether we
should switch to another system. What’s their preference? Do
translators or testers have a different favorite system than coders?
Do maintainers of multiple modules have a different outlook than
non-maintainers? So, in this section I try to look into this
question. Note that in each plot, the numbers in parentheses are the number of people across whom the average was taken.

Average VCS ranking by maintenance/development load

It looks like VCS preference doesn’t change much relative to
maintenance and development load. However, I found it interesting
that bzr had its highest support among maintainers/developers of a
single module and that git had its highest support among
maintainers/developers of multiple modules. (Mercurial had more
support among non-maintainers and non-developers, though that may just
be a reflection of the latter demographic having less strongly held
opinions.) That matched my intuition about design choices of bzr and
git, what they were optimized for, and how it has reflected in their
usage. However, although I was correct about the trend, the size of
the trend turned out to be nearly negligible.

Average VCS ranking by commit frequency

Not much variance here either. As expected, it looks like regular committers have stronger opinions (average rankings further from 3) than occasional or non-committers.

Average VCS ranking by contribution type

I was surprised by these plots. I expected support for git
to be found almost exclusively among coders, but apparently that is
not the case at all. git is ranked highest by all groups other than
documenters. Documenters, though, do rank git dead last.

Some might suggest we discard the last plot given the tiny sample size
(only 4 people self-identify as being ‘primarily’ documenters!).
While there’s some merit to that claim, I find it to be the most
interesting plot (as a bit of a VCS junkie) since it is the only
non-VCS related demographic for which git does not come in first
place.

I also find the translator plot interesting (as a VCS junkie), as it’s
the only other such plot for which git does not have a commanding
preference lead over all other VCSes. Honestly, though, I was quite
surprised that git was even close to svn for translators, let alone
that it had a small lead.

Average VCS ranking by DVCS usage/familiarity

No real surprise here as far as the favorite goes: users who are familiar with or regularly use a certain system tend to prefer that system. However, git enjoys positive support in all cases and at least comes in second, which I found somewhat surprising. I thought it would get an average ranking lower than 3 from those familiar with or using bzr/hg, much as bzr, svn, and hg did among those familiar with or regularly using git.

Average VCS ranking by propensity to switch systems

Those who think we should switch want to go to git. Those who have no
strong preference or selected “other” also preferred git.
Those who don’t care whether we switch, wonder what’s wrong with
subversion, or think we just shouldn’t switch, all prefer subversion.
Even among the latter group, git came in a positive second for the
“why?” and “I don’t care” groups.

Final thoughts

It looks like there’s a strong preference in the community toward
switching, and that git has a strong lead in preference among the
community, followed by svn, then bzr, then mercurial.

Among the non-VCS-related demographics, there were only two in which
git did not have a commanding lead: testers and documenters. Among
testers, git was still the preferred system, but it only marginally
led svn (and these two strongly led bzr and hg). Among documenters,
git came dead last by a large margin (while bzr came in a commanding
first). It would be interesting to find out why; perhaps we should
poll the 4 relevant people.

Among the VCS-related demographics, people familiar with or regularly
using a certain system tended to prefer that system. git always came
in a positive second, though. Also, those not wanting to switch
systems or not caring *at all* whether we switched strongly supported
subversion, while everyone else (including those with no strong
feeling about the switch) strongly preferred git. Even among the “why
switch” and “I don’t care” groups that preferred subversion, git came
in a positive second. Among the tiniest switch preference group,
those that don’t want to change systems at all, bzr was second
followed fairly closely by git.

I spent a lot more time discussing git than bzr or hg in my comments
here, but that was mostly a reflection of where it appeared in the
stats. As shown in the survey results, the other systems don’t appear
to be nearly as preferred in the community, so I simply didn’t discuss
them as much. I apologize if that makes my analysis look biased; as
I said at the beginning, feel free to ignore my analysis and draw your
own conclusions from the stats.

Many different kinds of revision specifiers

Version control systems each use their own method to refer to different versions (also known as ‘revisions’) of the repository. The choice of revision specification often reflects underlying data structures, and the choice of data structures often inhibits or enables various features for the system. Additionally, the methods of displaying and using revision specifiers can also affect the ease with which users can learn and use the new system.

Unfortunately, a full comparison is beyond the scope of this post. I will concentrate on simply introducing the basics and giving a flavor for how things are laid out, which itself is a long enough topic. While conclusions could be drawn with just the data and explanations presented here, I am intentionally avoiding doing so and leaving such to possible later posts. (Besides, bloody taxes and the brain-damaged US tax code have stripped me of any time that I would need to write such additional comparisons.)

Warning: My pictorial representations for each system will be crazier and more complex than usual (and even more lopsidedly complex for some systems than others) in order to keep things short while still showing what is possible.

cvs

Method

See cvs revision numbers and cvs branching basics, particularly figure 2.4 near the end of the branching basics section.

CVS has revision identifiers that are per-file, meaning that repositories at any given time are a combination of many different revisions (one for each file). Ignoring an ugly technical detail about the special revisions 1.1.1.1 and 1.1.1, the first version of a file is numbered 1.1. The next change to the file is recorded as 1.2, the next is 1.3, and so forth. If the user wants to create a branch, based on the 1.3 version of a file, then the branched version is 1.3.2.1. Changing and committing the file on the branch results in 1.3.2.2, then 1.3.2.3, etc. A second branch also created off of 1.3 would be numbered 1.3.4 instead of 1.3.2 (with actual commits numbered 1.3.4.x).

Note that branches are named by a revision with one less number (e.g. 1.4.2 is the name of the branch with commits numbered 1.4.2.x). As such, branch names refer to the beginning of the branch. Each file is branched separately, with per-file revision numbers (it is even possible to branch some files without branching others).
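The numbering rules above can be sketched in a few lines of Python. This is only an illustration of the scheme as described here, not code from CVS; the helper names are made up, and the "next branch gets the next even number" rule is my reading of the 1.3.2 / 1.3.4 example (it ignores the special vendor-branch revisions mentioned earlier).

```python
def parse(rev):
    """Split a dotted CVS revision or branch number into integers."""
    return tuple(int(part) for part in rev.split("."))

def branch_of(rev):
    """Branch a revision lives on: drop the last number (None for the trunk)."""
    parts = parse(rev)
    if len(parts) == 2:
        return None  # 1.1, 1.2, ... are trunk revisions
    return ".".join(str(p) for p in parts[:-1])

def branch_point(branch):
    """Revision a branch was created from: drop the branch's own number."""
    return ".".join(str(p) for p in parse(branch)[:-1])

def next_branch(base_rev, existing_branches):
    """Number for a new branch off base_rev, assuming even numbers 2, 4, 6, ..."""
    used = [parse(b)[-1] for b in existing_branches
            if branch_point(b) == base_rev]
    return f"{base_rev}.{max(used, default=0) + 2}"

print(branch_of("1.3.2.1"))            # 1.3.2
print(branch_point("1.3.2"))           # 1.3
print(next_branch("1.3", ["1.3.2"]))   # 1.3.4
```

Remember that all of this is per file: two files on the same branch will generally have unrelated numbers.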

Tags are aliases for a specific version number. Since revisions are per-file, a given tag may refer to different revision numbers for different files (e.g. the ‘v1.0’ tag might refer to version 1.27 of foo.c, 1.36 of bar.h, and 1.218 of foobar.py).

Uniqueness of cvs revisions is not an issue since there is only one repository.

Picture

                       (etc)
                         |
             (etc)   1.4.4.3.2.2
               |         |
            1.4.4.5  1.4.4.3.2.1
               |         |
            1.4.4.4  (1.4.4.3.2)
               |     /
               |   /
            1.4.4.3
               |
  1.4.2.2   1.4.4.2
     |         |
  1.4.2.1   1.4.4.1
     |         |
  (1.4.2)   (1.4.4)
      \       /
       \     /
        \   /
         \ /
         1.4
          |
         1.3
          |
         1.2
          |
         1.1

svn

Method

See svn revisions and working with your branch, particularly figure 4.4 (the branching of one file’s history).

svn uses global revision identifiers, with the first revision being marked as 1, the second as 2, the third as 3, etc.

Branches have an unusual implementation in subversion; they are handled by a namespacing convention: a branch is the combination of revisions within the global repository that exist within a certain namespace. Creating a new branch is done by copying an existing set of files from one namespace to another, recorded as a revision itself.

Tags (an alias for a specific version in history) don’t exist in subversion. Instead, subversion again uses a namespacing convention identical to that done for branches (thus making tags and branches indistinguishable in subversion other than the chosen names), and users are merely discouraged from committing additional changes to files within a tag namespace.

Uniqueness of svn revisions is not an issue since there is only one repository.

Technically, a revision could simultaneously modify any combination of branches and tags by simply committing to all namespaces; however, this is typically discouraged and users only have a certain namespace checked out at a time.
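As a rough mental model of the description above, here is a toy Python sketch (not how subversion is implemented): a single global revision counter, with branches and tags being nothing more than paths that a copy created. The class, its methods, and the tag name are hypothetical.

```python
class Repo:
    """Toy model: one global revision counter; branches and tags are just paths."""

    def __init__(self):
        self.log = []  # log[i] is (path, message) for revision i + 1

    def commit(self, path, message):
        self.log.append((path, message))
        return len(self.log)  # the new global revision number

    def copy(self, src, dst):
        # Branching or tagging is a copy, and the copy is itself a revision.
        return self.commit(dst, f"copy of {src}@{len(self.log)}")

    def history(self, namespace):
        """Revisions that touched exactly this namespace."""
        return [n for n, (path, _) in enumerate(self.log, start=1)
                if path == namespace]

repo = Repo()
repo.commit("trunk", "initial import")
repo.commit("trunk", "fix a typo")
branch_rev = repo.copy("trunk", "branches/proj-2-20")
repo.commit("branches/proj-2-20", "backport a fix")
repo.commit("trunk", "new feature")
tag_rev = repo.copy("branches/proj-2-20", "tags/RELEASE_2_20_1")

print(repo.history("trunk"))               # [1, 2, 5]
print(repo.history("branches/proj-2-20"))  # [3, 4]
```

Nothing in the model distinguishes the tag from the branch except its name, which is the point made above.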

Picture

  trunk   branches/proj-2-22  branches/proj-2-20  tags/RELEASE_2_22_2
   24
                                                        23
                 22
   21
   20
                 19
                                     18
                                     17
   16
   15
                 14
                 13
                 12
   11
                                     10
    9
    8
    7
                                      6
    5
                                      4
    3
    2
    1

bzr

Method

See understanding bzr revision numbers and specifying bzr revisions.

bzr, like svn, uses 1, 2, 3, etc. for revision numbers. However, the revision numbers are always consecutive in a branch. Merged in changes from other branches are given 3 numbers per revision. For example, if changes were merged from a repository that has changes relative to revision 2, the changes would come into the current branch numbered 2.1.1, 2.1.2, 2.1.3, etc. If changes from more than one branch are relative to the same commit, then the middle number is used to distinguish commits from the different branches. Thus one would see another set of changes relative to commit 2 numbered as 2.2.1, 2.2.2, 2.2.3, 2.2.4, etc. (Versions of bzr older than 1.2 used more than 3 numbers in certain cases, but that is no longer true of current versions.) See the picture below to make this clearer.
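A small Python sketch of how to read these numbers, assuming the current three-number form described above (the function and field names are mine, not bzr's):

```python
from typing import NamedTuple, Optional

class RevNo(NamedTuple):
    base: int               # mainline revision, or the one the merged work forked from
    branch: Optional[int]   # which merged branch off that base (None on the mainline)
    seq: Optional[int]      # position of the commit within that merged branch

def parse_revno(revno):
    """Parse '7' or '4.1.3' into its parts; other shapes are rejected."""
    parts = [int(p) for p in revno.split(".")]
    if len(parts) == 1:
        return RevNo(parts[0], None, None)
    if len(parts) == 3:
        return RevNo(*parts)
    raise ValueError(f"unexpected revision number: {revno!r}")

print(parse_revno("7"))      # a mainline commit
print(parse_revno("2.1.3"))  # third commit of the first branch merged in relative to 2
print(parse_revno("2.2.1"))  # first commit of a second such branch
```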

Branches in bzr are done by creating separate directories (typically with their own repository), though one can set up shared repositories. Each branch will have its own numbering scheme for the revisions it stores, recording the order that the revisions entered that repository. (See below about uniqueness issues.)

Tags in bzr are an alias for a commit, and are stored as part of a branch.

Note that bzr revision numbers are not unique. If you have the same revision in two different repositories, they will not necessarily have the same revision number in both. bzr does store unique identifiers for revisions, known as revids (an example of which looks like Matthieu.Moy@imag.fr-20051026185030-93c7cad63ee570df), though they are not shown by default. Users can obtain these unique identifiers by passing the --show-ids flag to bzr log, and these revids can be used in place of the simpler default revision specifiers when prefixed with “revid:”.

Picture

              12
              |
              11
            / | \
          /   |  \
        /     |   \
      10    4.1.5  4.2.2
       \   /  |      |
        \ /   |      |
         9    |    4.2.1
        / \   |   /
       /   \  |  /
       8    4.1.4
       |      |
       7    4.1.3
       | \    |
       |   \  |
       6    4.1.2
       |      |
       5    4.1.1
        \   /
          4
          |
          3
          |
          2
          |
          1

Note: The revision identifiers shown in this picture are dependent on merge order; the revisions 4.1.5, 4.2.1, and 4.2.2 could instead be numbered 4.2.1, 4.1.5 and 4.1.6 respectively if the merges done to obtain revision 11 were done in a different order.

git

Method

See Understanding git history: Commits, and naming git commits.

git uses cryptographic checksums (in particular, sha1sums) of repository contents as revision identifiers. These checksums are 40-character hexadecimal strings (e.g. 621ff6759414e2a723f61b6d8fc04b9805eb0c20). Each revision also knows which revision(s) it was derived from (known as the revision’s parent(s)).

Git can be used with one branch per directory like bzr or hg, but it is more common to have branches stored within the same directory/repository (thus the reason some refer to git as a ‘branch container’). In git, branches record the revision of the most recent commit for the branch; since each commit records its parent(s), a branch consists of its most recent commit plus all ancestors of that commit. When a new commit is made on a branch, the branch just records the new revision. Tags simply record a single revision, much like branches, but tags are not advanced when additional commits are made. Tags are not stored as part of a branch or in a revision controlled file, though by default tags that point to commits that are downloaded are themselves downloaded as well.
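The "a branch is just a pointer to its newest commit" idea can be sketched in Python as follows. This is a deliberately simplified model, not git's actual object format: real git hashes a commit object that also covers the tree, author, and dates, whereas this toy only hashes the parents and the message. All names here are hypothetical.

```python
import hashlib

commits = {}   # revision id -> (parent ids, message)
branches = {}  # branch name -> id of its most recent commit
tags = {}      # tag name -> id of one commit; never advanced

def commit(branch, message):
    """Create a commit on top of the branch tip and move the branch to it."""
    tip = branches.get(branch)
    parents = (tip,) if tip else ()
    payload = "\n".join(parents + (message,)).encode()
    rev = hashlib.sha1(payload).hexdigest()  # 40 hex characters
    commits[rev] = (parents, message)
    branches[branch] = rev
    return rev

def ancestors(rev):
    """The commit itself plus everything reachable through parent links."""
    seen, todo = [], [rev]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.append(current)
            todo.extend(commits[current][0])
    return seen

first = commit("master", "first")
second = commit("master", "second")
branches["topic"] = second   # creating a branch only records a revision
tags["v1.0"] = second
third = commit("topic", "third")

print(len(ancestors(branches["master"])))  # 2
print(len(ancestors(branches["topic"])))   # 3
```

After the last commit, "topic" has moved on while "master" and the tag still point at the second commit.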

git revisions are unique by design; if you have the same revision in two different repositories, the revision name for both will be the same.

git does provide more human-meaningful ways of referring to commits, in the form of simple suffixes used to count backwards in history from the tip of a branch (or backwards from a tag or commit). This includes methods for counting relative to different parents, making the suffixes have structural meaning. However, such methods are somewhat hidden; for example, they are not shown in the output of git log. This leaves many users unaware of how to take advantage of them, if they are aware of them at all. (A simple wrapper can get them to be shown, at the cost of a little time; they could be shown at negligible time cost with an integrated solution, but none exists to my knowledge.)
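For readers who have not met these suffixes: `~n` walks back n generations following first parents, and `^n` picks the n-th parent of a merge (a bare `^` means the first parent). The Python sketch below resolves such specifiers against a tiny made-up history; the commit names and the `resolve` helper are hypothetical, and only the plain `~` and `^` forms are handled.

```python
import re

# Toy history: M is a merge whose first parent is B and second parent is X.
parents = {
    "M": ["B", "X"],
    "B": ["A"],
    "X": ["A"],
    "A": [],
}
refs = {"master": "M"}  # branch name -> tip commit

def resolve(spec):
    """Resolve specs such as 'master~2' or 'master^2' to a commit name."""
    name = re.match(r"[^~^]+", spec).group()
    rev = refs.get(name, name)
    for op, digits in re.findall(r"([~^])(\d*)", spec[len(name):]):
        n = int(digits) if digits else 1
        if op == "~":
            for _ in range(n):          # n steps along first parents
                rev = parents[rev][0]
        elif n > 0:                     # rev^0 is the commit itself
            rev = parents[rev][n - 1]   # the n-th parent
    return rev

print(resolve("master^"))    # B
print(resolve("master^2"))   # X
print(resolve("master~2"))   # A
```

Because `^2` names a specific side of a merge, the suffixes describe the shape of the history rather than just a distance from the tip.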

Picture

           650a6f...
              |
           caf806...
          /   |   \         719b9d...
        /     |     \       /
      /       |       \   /
 75cc2c...  147c0a... acac44...
      \       |         |
        \     |         |
         8f50e6...    8147be...
         /    |     /
       /      |   /
  9b39b2... 6e2cde...
    |         |
  01fa22... 1a9d90...
    |    \    |
    |      \  |
 46508c...  b6765c...
    |         |
 1c4e8d...  328638...
       \     /
       6627f7b...
          |
       754b42...
          |    \
          |      \
       d1879f...  fba5d0...
          |
       c962db...

hg

Method

See a hg tour through history, and section 2.4.1, “Changesets, revisions, and talking to other people”.

hg uses a method that may look like a mix of the methods used by git and bzr; it has two distinct methods of referring to each revision. Like git, hg uses sha1sums to refer to revisions (though it abbreviates them to fewer characters by default). Like bzr, hg uses the numbers 1, 2, 3, etc. to refer to revisions. Thus hg has one unique method to refer to revisions and another that is simple and easily manipulatable by users. Each revision (or “changeset” in mercurial’s vocabulary) is of the form revision-number:changeset-identifier (e.g. 3:ff5d7b70a2a9).

Like bzr, branches in hg are typically done by creating separate directories (typically with their own repository). However, it also has named branches for naming branches within a repository, which are somewhat similar to git. (I have been told there are important distinctions between hg named branches and git branches, but I do not fully understand all the details; maybe someone will explain in the comments.)

Mercurial has both tags and local tags, with (normal) tags being stored in an .hgtags file that is version controlled, and local tags being stored in a file that is neither version controlled nor shared (cloned/pulled/pushed/etc.). Like most other systems, tags in hg are an alias for a specific commit.

The (abbreviated) sha1sum portion of hg revisions (the “changeset identifier”) is unique by design; if you have the same revision in two different repositories, the changeset identifier for both will be the same. The simple number portion of hg revisions (the “revision number”) is not unique. If you have the same revision in two different repositories, they will not necessarily have the same revision number in both.
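A short Python sketch of what this means in practice when comparing revisions from two repositories: only the changeset identifier is meaningful across repositories, while the leading number is local. The helper names are mine, and the prefix comparison is an assumption to cope with identifiers abbreviated to different lengths.

```python
from typing import NamedTuple

class Rev(NamedTuple):
    number: int       # local revision number; may differ between repositories
    changeset: str    # (abbreviated) sha1sum; the same everywhere

def parse(text):
    """Split 'revision-number:changeset-identifier' into its two parts."""
    number, changeset = text.split(":", 1)
    return Rev(int(number), changeset)

def same_revision(a, b):
    """Compare by changeset identifier only, ignoring the local numbers."""
    x, y = parse(a).changeset, parse(b).changeset
    return x.startswith(y) or y.startswith(x)

print(parse("3:ff5d7b70a2a9"))
# The same changeset can carry a different number in another repository:
print(same_revision("3:ff5d7b70a2a9", "5:ff5d7b70a2a9"))  # True
```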

Picture

             19:c87f92...
                |
             18:650a6f...
               |      \
        15:caf806...   \
         /     |        \
       /       |         \
      /        |          \
13:75cc2c... 14:147c0a... 17:acac44...
      \        /           |
        \     /            |
       12:8f50e6...      16:8147be...
         /    |        /
       /      |      /
9:9b39b2... 11:6e2cde...
    |         |
8:01fa22... 10:1a9d90...
    |    \    |
    |      \  |
5:46508c... 7:b6765c...
    |         |
4:1c4e8d... 6:328638...
       \     /
      3:6627f7b...
          |
      2:754b42...
          |
          |
      1:d1879f...
          |
      0:c962db...

Final notes

Each system uses a different scheme, each with its own advantages and disadvantages. Odds are that I am not aware of all the relative merits of these systems yet, though I do know some. Personally, I don’t think any of them are optimal (though I admit that optimality is a somewhat relative term given the inherent trade-offs involved). Unfortunately I’m going off-topic, as I said I wouldn’t be discussing advantages and disadvantages in this post, so I’ll shut my trap here…

Happenings in the VCS world

It has been a long time since my last blog post on VCSes. I am getting back into the swing of things and will be making a few more posts. Besides, Olav doesn’t have enough to do and he wants more of my long rambling posts to digest.

The VCS world is becoming more and more interesting, even if it is also more and more frustrating. I’ll briefly point out a few things I have seen happen in the last few months that look cool, making this VCS post a little bit different than my others.

cvs

Stinking stingy CVS refuses to die…it seems to prefer slowly petrifying over the years or something. It was great a number of years ago, but there are just so many better tools these days. However, there does appear to be a light at the end of the tunnel. The last place I am forced to use CVS (work) will finally be switching (to subversion) in a couple months. Woohoo!

svn

I haven’t seen any big changes in subversion itself (only one bug fix release has occurred). However, it looks like they are making progress on finally implementing useful merge functionality. This is interesting on a number of levels. (1) This lack of functionality was one of the big reasons subversion sometimes looks like a (very well polished) antique rather than a modern system; will the incorporation of this feature be enough to stave off some of the ongoing defections to other systems? (2) This may be interesting for those using bzr-svn, hgsvn, or git-svn; are users of such systems going to find it even easier to use their preferred tool? (3) The main reason svn’s dozen or so ugly renaming bugs (some of which essentially result in corrupted data) have gone almost completely unnoticed is that most are only triggered in merge operations, and subversion’s current merge functionality is so primitive and problematic that hardly anyone uses it. Further, svn’s roadmap clearly lists fixing the rename problems in a different release, after the merge fixes are included. Will the extra visibility that one problem will receive due to a different problem being fixed make subversion look more problematic or less? This will be fun to watch.

On a separate note, it is interesting to see that subversion developers are considering adopting some features of distributed VCSes — sometime in the distant future. An easy to miss but interesting nugget from that email is the following:

Fortunately, we’ve pretty much agreed, IIRC, that we’re willing to punt on subdirectory detachability in working copies in order to get performance improvements.

I have often seen svn and cvs proponents cite subdirectory detachability as one of the big advantages of those systems, yet it looks like the svn developers are willing to drop it. Very interesting indeed.

hg

Mercurial version 0.9.5 was released since I did my last round of VCS blog posts and it is on my system. hg-0.9.5 has quite a number of improvements; the one that particularly caught my eye was support for subversion as a source SCM in its convert functionality. When I first looked at mercurial, they suggested people use git-svn and then convert from git to hg. To me, that seemed to push people to just use git. It looks like this has changed.

I have often found it somewhat strange that mercurial doesn’t have more active vocal proponents. Usually one hears from the git or bzr proponents, but not so much from mercurial. Yet it has always had many of the advantages of both (and, in some ways, seems to have the most svn-like UI, and would seem a more natural transition for svn converts). I guess it’s a case where having most of the advantages or capabilities of other systems (even multiple other systems) yet not clearly standing out in one particular area will rob you of the active advocates that you could otherwise have. Of course, maybe it’s like the Linux Journal Readers’ Choice Awards phenomenon too; the noise or results that others hear may only be indicative of a certain small subset of the community.

bzr

A lot has happened in the Bazaar world. They had their big 1.0 release in mid-December and are now up to bzr-1.2. They have made impressive gains in performance, particularly with their adoption of the pack idea from git, and it appears they have at long last caught up to the leaders in the field in this area.

Near the end of last year, I corresponded with Ian Clatworthy about early versions of the “Main Competitors” writeups of the Why Choose Bazaar page. I pointed out some advantages of bzr he hadn’t included, mentioned how some bold claims had no accompanying proof, and pointed out some places where he seemed to be unaware of capabilities of other systems or where I disagreed with some of his claims. The final versions seem to have mixed results; part of my feedback was addressed (and more was addressed in follow-ups), but other parts were not. I’m particularly puzzled by the reticence to investigate the existing capabilities of other systems and the willingness to claim features of bzr as advantages without determining whether they are actually unique. Regardless, though, while one does need to individually verify or discard each claim, the writeups are fairly impressive. I probably need to get back in touch with Ian again.

git

I’m so annoyed with Carl right now. He was the one who introduced me to git a number of years ago, and showed me some really cool things about it. I dropped it almost immediately at the time because it was way too hard to use. But, I’ve always been interested in it and made occasional attempts to tame the dragon ever since.

As many are aware, git has made huge strides towards usability in the 1.5 series, and has recently introduced automatic repacking in git-1.5.4. Because of all this work, I made diligent attempts to understand it over the last couple months. In doing so, I finally had the necessary epiphanies to feel I understand it. It turns out I was able to use it productively long before the uncomfortable feeling of I-don’t-really-understand-this-thing was finally expelled. The result? I found that there are several features of git not present in other systems that I am absolutely addicted to, but looking back on the journey I can’t say that it would be worth the effort for others to follow the same path, despite these awesome features. The thing is still too bloody hard to figure out.

One of my desires for my blog post series was to point out how horrible the git manpages (i.e. the built-in help system for git) are for new users, but I felt uncomfortable doing so until I actually understood them. I was not able to understand even the synopsis of the git-diff manpage until a couple weeks ago. And I tried. Hard. Over days, weeks, and months. I read up on reflogs, the index, git’s storage format, the git tutorial and all kinds of other documentation. I feel stupid now, because I was just missing something simple and now seemingly obvious. But from what I can tell, little should-be-obvious-but-aren’t things like this are blocking lots of people from being able to use git.

Long story short: git has become far more usable…mere mortals can actually figure the system out (a big change from earlier versions) if they have an unusually large level of patience and motivation. git has some really awesome features, but I just can’t recommend it to others in its current state.