Limbo: Why users are more error-prone with git than other VCSes

Limbo is a term I use but VCS authors don’t. That’s because they tend to ignore a certain state that exists in all major VCSes, and so give it no name, despite the fact that this state seems to be the largest source of user errors. I call this state limbo.

How to make git behave like other VCSes

Most potential git users probably don’t want to read this whole page. They would like their knowledge from other VCSes to apply without learning how the index and limbo differ in git from their previous VCS (despite the really cool extra functionality that difference brings). This can be done by

  • Always using git diff HEAD instead of git diff, and
  • Always using git commit -a instead of git commit

Either make sure you always remember those extra arguments, or come back and read this page when you get a nasty surprise.

The concept of Limbo

VCS users are accustomed to thinking of using their VCS in terms of two states — a working copy where local changes are made, and the repository where the changes are saved. However, the working copy is split into three sets (see also VCS concepts):

  • (explicitly) ignored: files inside your working copy that you explicitly told the VCS not to track
  • index: the content in your working copy that you asked the VCS to track; this is the portion of your working copy that will be saved when you commit (in CVS, this is done using the CVS/Entries files)
  • limbo: not explicitly ignored, and not explicitly added. This is stuff in your working copy that won’t be checked in when you commit but that you haven’t told the VCS to ignore, which typically includes newly created files.

The first state is identical across all major VCSes. The other two states are split identically across cvs, svn, bzr, hg, and likely others. But git draws the boundary between the index and limbo differently.
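As a rough illustration of the three sets, here is a minimal Python sketch of the file-level split used by cvs, svn, bzr, and hg. It is a toy model, not how any real VCS is implemented; the function name classify, the sample file names, and the use of glob-style ignore patterns are all assumptions made for this example.

```python
from fnmatch import fnmatch

def classify(paths, tracked, ignore_patterns):
    """Sort working-copy paths into the three sets: ignored, index, limbo."""
    states = {"ignored": [], "index": [], "limbo": []}
    for path in sorted(paths):
        if path in tracked:
            states["index"].append(path)    # the VCS was asked to track it
        elif any(fnmatch(path, pat) for pat in ignore_patterns):
            states["ignored"].append(path)  # explicitly ignored
        else:
            states["limbo"].append(path)    # neither added nor ignored
    return states

# Hypothetical working copy: one tracked file, one ignored build product,
# and two files nobody has told the VCS about.
working_copy = ["main.c", "main.o", "util.c", "notes.txt"]
states = classify(working_copy, tracked={"main.c"}, ignore_patterns=["*.o"])
print(states)
```

Here util.c and notes.txt land in limbo: they will be silently left out of the next commit, which is exactly the situation the rest of this page is about.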

One could imagine a VCS which just automatically saves all changes that aren’t in an explicitly ignored file (including newly created files) whenever a developer commits, i.e. a VCS where there is no limbo state. None of the major VCSes do this, however. There are various rationales for the existence of limbo: maybe developers are too lazy to add new files to the ignored list, perhaps they are unaware of some autogenerated files, or perhaps the VCS only has one ignore list and developers want to share it but not include their own temporary files in such a shared list. Whatever the reason, limbo is there in all major VCSes.

Changes in limbo are a large source of user error

The problem with limbo is that changes in this state are, in my experience, the most common cause of user errors. If you create a new file and forget to explicitly add it, then it won’t be included in your commit (this happens with all the major VCSes). Naturally, even those familiar with their VCS forget to do that from time to time. This always seems to happen when other changes were committed that depend on the new files, and it always happens just before the relevant developers go on vacation…leaving things in a broken state for me to deal with. (And sure, I return the favor on occasion when I simply forget to add new files.)

A powerful feature of git

Unlike other VCSes, git only commits what you explicitly tell it to. This means that without taking additional steps, the command “git commit” will commit nothing (in this particular case it typically complains that there’s nothing to commit and aborts). git also gives you a lot of fine-grained control over what to commit, more than most other VCSes. In particular, you can mark all the changes of a given file for subsequent committing, but unlike other VCSes this only marks the current contents of that file for commit; any further changes to the same file will not be included in subsequent commits unless separately added. Additionally, recent versions of git allow the developer to mark subsets of changes in an existing file for commit (pulling a handy feature from darcs). This fine-grained choose-what-to-commit functionality is made possible by the fact that git can generate three different kinds of diffs: (1) just the changes marked for commit (git diff --cached), (2) just the changes you’ve made to files beyond what has been marked for commit (git diff), or (3) all the changes since the last commit (git diff HEAD).
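One way to picture the three diffs is as comparisons between three snapshots: the last commit, the content marked for commit, and the working copy. The Python sketch below is only a toy model under that assumption. It compares whole-file contents rather than producing real patches, and the names changed, head, index, and worktree are invented for illustration.

```python
def changed(old, new):
    """Return the sorted paths whose content differs between two snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

# Each snapshot maps a tracked path to its content (hypothetical values).
head     = {"foo": "v1", "bar": "v1"}   # the last commit
index    = {"foo": "v2", "bar": "v1"}   # foo's new content was marked for commit
worktree = {"foo": "v2", "bar": "v2"}   # bar was edited but never marked

cached_diff = changed(head, index)      # roughly: git diff --cached
plain_diff  = changed(index, worktree)  # roughly: git diff
head_diff   = changed(head, worktree)   # roughly: git diff HEAD

print(cached_diff, plain_diff, head_diff)
```

In this model the marked change to foo only shows up in the first and third comparisons, while the unmarked edit to bar only shows up in the second and third. Files git has never been told about are left out of the model entirely.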

This fine-grained control can come in handy in a variety of special cases:

  • When doing conflict resolution from large merges (or even just reviewing a largish patch from a new contributor), hunks of changes can be categorized into known-to-be-good and still-needs-review subsets.
  • It makes it easier to keep “dirty” changes in your working copy for a long time without committing them.
  • When adding a feature or refactoring (or otherwise making changes to several different sections of the code), you can mark some changes as known-to-be-good and then continue making further changes or even adding temporary debugging snippets.

These are features that would have helped me considerably in some GNOME development tasks I’ve done in the past.

How git is more problematic

This decision to only commit changes that are explicitly added, and doing so at content boundaries rather than file boundaries, amounts to a shift in the boundary between the index and limbo. With limbo being much larger in git, there is also more room for user error. In particular, while this allows for a powerful feature in git noted above, it also comes with some nasty gotchas in common use cases as can be seen in the following scenarios:

  • Only new files included in the commit
    1. Edit bar
    2. Create foo
    3. Run git add foo
    4. Run git commit

    In this set of steps, users of other VCSes will be surprised that after step 4 the changes to bar were not included in the commit. git only commits changes when explicitly asked. (This can be avoided by either running git add bar before committing, or running git commit -a. The -a flag to commit means “Act like other VCSes: commit all changes in any files git already tracks”.)

  • Missing changes in the commit
    1. Create/edit the file foo
    2. Run git add foo
    3. Edit foo some more
    4. Run git commit

    In this set of steps, users of other VCSes will be surprised that after step 4 the version of foo that was committed was the version that existed when step 2 was run, not the version that existed when step 4 was run. That’s because step 2 means “mark the changes currently in the file foo for commit”. (This can be avoided by running git add foo again before committing, or running git commit -a for step 4.)

  • Missing edits in the generated patch
    1. Edit the tracked file foo
    2. Run git add foo
    3. Edit foo some more
    4. Run git diff

    In this set of steps, users of other VCSes will be surprised that at step 4 they only get a list of changes to foo made in step 3. To get a list of changes to foo made since the last commit, run git diff HEAD instead.

  • Missing file in the generated patch
    1. Create a new file called foo
    2. Run git add foo
    3. Run git diff

    In this set of steps, users of other VCSes will be surprised that at step 3 the file foo is not included in the diff (unless changes have been made to foo since step 2, but then only those additional changes will be shown). To get foo included in the diff, run git diff HEAD instead.
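The first two scenarios above can be replayed against a toy model. The Python sketch below is not git’s implementation: the class ToyRepo and its methods are hypothetical, contents are whole strings rather than hunks, and deletions are not modeled. It only assumes that adding a file copies its current content into the index and that a plain commit saves the index.

```python
class ToyRepo:
    """Toy model with three snapshots, each mapping path -> content."""

    def __init__(self, committed=None):
        self.head = dict(committed or {})   # last commit
        self.index = dict(self.head)        # content marked for commit
        self.worktree = dict(self.head)     # files as they are on disk

    def edit(self, path, content):
        self.worktree[path] = content

    def add(self, path):
        # Marks the content as it is *right now*; later edits are not included.
        self.index[path] = self.worktree[path]

    def commit(self, all_tracked=False):
        if all_tracked:
            # Modeled on git commit -a: refresh every path already in the index.
            for path in self.index:
                self.index[path] = self.worktree[path]
        self.head = dict(self.index)


# Scenario "Only new files included in the commit"
repo = ToyRepo({"bar": "old"})
repo.edit("bar", "new")
repo.edit("foo", "created")
repo.add("foo")
repo.commit()
only_new = dict(repo.head)          # bar is still "old"

# Scenario "Missing changes in the commit"
repo = ToyRepo()
repo.edit("foo", "first draft")
repo.add("foo")
repo.edit("foo", "second draft")
repo.commit()
missing_edit = dict(repo.head)      # foo is still "first draft"

# Same steps, but committing the way git commit -a would
repo = ToyRepo()
repo.edit("foo", "first draft")
repo.add("foo")
repo.edit("foo", "second draft")
repo.commit(all_tracked=True)
with_all = dict(repo.head)          # foo is "second draft"

print(only_new, missing_edit, with_all)
```

The design choice worth noticing is that add copies content rather than remembering a file name, which is what moves the boundary between the index and limbo from file boundaries to content boundaries.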

These gotchas are there in addition to the standard gotcha exhibited in all the major VCSes:

How all the major VCSes are problematic

  • Missing file in the commit
    1. Edit bar
    2. Create a new file called foo
    3. Run vcs commit (where vcs is cvs, svn, hg, bzr…see below about git)

    In this set of steps, the edits in step 1 will be included in the commit, but the file foo will not be. The user must first run vcs add foo (again, replacing vcs with the relevant VCS being used) before committing in order to get foo included in the commit.

    It turns out that git actually can help the user in this case because by default it only commits what it is explicitly told to commit. In this case it won’t commit anything and will tell the user that it wasn’t told to commit anything. However, since nearly every tutorial on git[*] says to use git commit -a, users include that flag most of the time (60% of the time? 98%?). Due to that training, they’ll still get this nasty bug. However, they’re going to forget or neglect this flag sometimes, so they also get the new gotchas above.

[*] Recent versions of the official git tutorial being the only exception I’ve run across. It’s fairly thorough (make sure to also read part two), though it isn’t quite as explicit about the potential gotchas in certain situations.

How bzr, hg, and git mitigate these gotchas (and cvs and svn don’t)

These gotchas can be avoided by always running vcs status (again, replace vcs with the relevant VCS being used) and looking closely at the states the VCS lists files in. It turns out bzr, hg, and git are smart here and try to help the user avoid problems by showing the output of the status command when running a plain vcs commit (at the end of the commit message they are given to edit). This helps, but isn’t foolproof; I’ve somehow glossed over this extra bit of info in the past and still been bitten. Also, I’ll often bypass the editor entirely, either by using the -m flag to specify the commit message on the command line (for tiny personal projects) or by using a flag that takes the commit message from a file (i.e. -F in most VCSes, -l in hg).
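The habit of checking status before committing can be pictured as a guard in front of the commit. The Python sketch below is a hypothetical illustration, not a feature of any of these VCSes: limbo_paths and checked_commit are made-up names, ignore rules are assumed to be simple glob patterns, and the commit is modeled at file level the way cvs, svn, hg, and bzr behave.

```python
from fnmatch import fnmatch

def limbo_paths(worktree, tracked, ignore_patterns):
    """Paths that are neither added nor explicitly ignored."""
    return sorted(
        path for path in worktree
        if path not in tracked
        and not any(fnmatch(path, pat) for pat in ignore_patterns)
    )

def checked_commit(worktree, tracked, ignore_patterns):
    """File-level commit that refuses to proceed while anything is in limbo."""
    stray = limbo_paths(worktree, tracked, ignore_patterns)
    if stray:
        raise RuntimeError("in limbo (add or ignore first): " + ", ".join(stray))
    return {path: worktree[path] for path in tracked if path in worktree}

# Hypothetical working copy: bar is tracked, foo is new, bar.o is ignored.
worktree = {"bar": "edited", "foo": "brand new", "bar.o": "object code"}
try:
    checked_commit(worktree, tracked={"bar"}, ignore_patterns=["*.o"])
    outcome = "committed"
except RuntimeError as err:
    outcome = str(err)
print(outcome)
```

A hard failure like this would be too strict for everyday use, which is presumably part of why the real tools only show the status output and leave the decision to the user.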

The concepts a user must learn to understand existing VCSes

Note: Most will not find this post as interesting as my previous posts or my next one. It was intended to help explain questions like “How much knowledge transfers to a new VCS if you’ve learned another?” and “Why do some claim that certain *types* of VCSes are easier to learn than others, while others claim that they are all pretty much equal?”, questions I mentioned in my first post. Most probably aren’t interested in those questions and thus not this entry. I’m including it anyway.

Editors as an analogy

I sometimes see people arguing about whether text editors and word processors ought to automatically save with every change. While almost every existing editor has two states (the version being edited, plus the version on disk when last saved), some argue that it would simplify things to save on every change. Most editors stick with the two-state model, which from a Darwinian point of view would suggest it is the superior model overall. However, it is interesting to note that the multi-state model does come with its complications even for simple cases like this. It has stung just about everyone at least once before they learned the appropriate habits. For example, many have lost data by exiting the app before saving, or through power outages, application crashes, or even OS and hardware failures. (These days, most editors have workarounds which mitigate these problems.) Also, users can’t use separate programs to copy, print, or import the file on disk unless they remembered to first save their latest changes. And users may be confused at first by extra files (foo.autosave, foo.bak, foo~, .#foo, etc.) that show up on their hard disk.

Virtually all editors use this two-state model (current edits, plus last version saved on disk), and nearly all computer users seem to have mastered it. At a basic level, VCSes use a similar model.

The multiple states of all major VCSes

All the major VCSes provide developers with their own little sandbox, or working copy, as well as a place for changes that are ready to be saved, called a repository. This maps almost directly to the concepts of standard editors: changes you make locally, and what version you last saved. Almost any VCS guide will say that these are the two states you need to learn. (I particularly remember reading several about CVS which said this.) It’s a convenient lie though. There are more than two states.

Compiling source code can create files that don’t need to be saved in the repository (others can regenerate them with the source). So, all the major VCSes have the concept of an ignore list; any files in the ignore list will not be saved in the repository. So we have three states so far: ignored files, local changes, and the repository.

Sadly, there’s another state that the working copy is split into. The major VCSes seem to have decided that developers may be too lazy to add files that shouldn’t be saved to the ignore list…or that they may be unaware of such files (editor autosave files, for example), or that developers want to have shared ignore lists but don’t want to add some personal files to such shared lists. Whatever the reason, the major VCSes have another state which I call “limbo”, whose existence everyone seems to forget about. This state consists of changes which aren’t explicitly added to the index (think CVS/Entries files in CVS) and thus will not be saved, but are not explicitly ignored either. This state causes the most bugs in my experience, even with advanced users, because people simply forget to explicitly add new files to the index and thus they don’t get saved with the rest of the changes. So we have four states so far (three being subsets of the working copy): explicitly ignored files, limbo, local changes that will be saved in the next commit, and the repository.

It turns out that the repository side also is split into multiple states. Developers want to be able to track what changes they themselves have made to their working copy, regardless of commits that have since been recorded in the (remote) repository. So, if you want to get a list of changes you’ve made, or the history that led up to your current working copy, it needs to be relative to the version of the repository that existed when you got your copy. That may not be the current version, because other developers could have recorded their changes in the (remote) repository. This also affects your ability to push your changes to the (remote) repository (by a “commit” in cvs or svn terminology), potentially requiring you to merge the various changes together. So, we have five states:

  • Substates of the working copy:
    1. (explicitly) ignored: files inside your working copy that you explicitly told the VCS not to track
    2. index: the content in your working copy that you asked the VCS to track; this is the portion of your working copy that will be saved when you commit (in CVS, this is done using the CVS/Entries files)
    3. limbo: not explicitly ignored, and not explicitly added. This is stuff in your working copy that won’t be checked in when you commit but that you haven’t told the VCS to ignore, which typically includes newly created files.
  • Substates of the repository:
    1. “checkout” version — the version of the code in your working copy before you started modifying it
    2. remote version — the version of the code currently saved in the remote repository

Not understanding these multiple states and the differences between them for the VCS you are using has varying consequences: not being able to take full advantage of your system, being unable to do some basic operations, or (worst case) introducing erroneous or incomplete changes.

Similarities and differences between the major VCSes with these states

Most of these five states are similar between the major VCSes. State 1 (ignored files) is essentially identical between the systems. (The only difference is in the details of setting it up; for cvs it means editing .cvsignore, for svn it means modifying the svn:ignore property, for mercurial it means editing .hgignore, etc.) State 5 is also essentially identical. States 2 and 3 always sum up to everything in the working copy other than explicitly ignored files, so extending state 3 means shrinking state 2. Thus, we can get a feel for the differences between VCSes by looking at their differences in states 3 and 4.

The major distinction between inherently centralized VCSes (e.g. cvs and svn) and so-called distributed (I prefer the term “multi-centered”) VCSes comes with state 4. The differences in this state can be thought of as different choices along a continuum rather than a binary difference, however. The difference here is in how much information gets cached when one gets a copy. With CVS, you only get a working copy plus info about what version you checked out and where the repository is located. With SVN, you get the same as with CVS, but also an extra copy of all the files. Most distributed systems go a few steps farther than svn and by default cache a copy of all the versions on the given branch. git, by default, caches a copy of all versions on all branches.

Caching extra information as part of state 4 can allow additional work to be done offline. cvs and svn are very limited in this respect, but the additional offline capabilities of the other systems come with the understanding that the local cache itself is a repository and thus users need to understand both how to sync changes with the local repository as well as between the local repository and the remote one(s). In cvs and svn, it’s not useful to “sync with the local cache”; instead those systems just automatically synchronize the local cache and the remote repository to the indexed local changes all at once. Thus, cvs and svn users only need to learn a smaller set of “synchronization” commands (limited to “commit” and “update”).

There is also a potential difference between VCSes in state 3. Having changes in state 3 is, in my experience, what causes the most errors. Users simply forget that their changes are in this state and forget to add them. Now, it turns out that all VCSes I’ve looked at closely enough are identical here, except for git. (So if you know one of them, you already understand this aspect of all the other VCSes except git.) git extends the concept of limbo, turning the index into a high-level (and in your face) concept with some really cool features, but unfortunately it has the side-effect of making git even more error-prone for users. I’ll discuss this in more detail in my next post.