Note: Most will not find this post as interesting as my previous posts or my next one. It was intended to help answer questions like “How much knowledge transfers to a new VCS if you’ve learned another?” and “Why do some claim that certain *types* of VCSes are easier to learn than others, while others claim that they are all pretty much equal?”, questions I mentioned in my first post. Most readers probably aren’t interested in those questions, and thus not in this entry either. I’m including it anyway.
Editors as an analogy
I sometimes see people arguing about whether text editors and word processors ought to save automatically with every change. Almost every existing editor has two states (the version being edited, plus the version on disk when last saved), but some argue that saving on every change would simplify things. Most editors stick with the two-state model, which from a Darwinian point of view suggests that it is the better model overall. However, it is interesting to note that having more than one state brings complications even in a case as simple as this. The two-state model has stung just about everyone at least once before they learned the appropriate habits. For example, many have lost data by exiting the app before saving, or through power outages, application crashes, or even OS and hardware failures. (These days, most editors have workarounds which mitigate these problems.) Also, users can’t use separate programs to copy, print, or import the file on disk unless they remembered to save their latest changes first. And users may be confused at first by extra files (foo.autosave, foo.bak, foo~, .#foo, etc.) that show up on their hard disk.
Virtually all editors use this two-state model (current edits, plus last version saved on disk), and nearly all computer users seem to have mastered it. At a basic level, VCSes use a similar model.
The multiple states of all major VCSes
All the major VCSes provide developers with their own little sandbox, or working copy, as well as a place for changes that are ready to be saved, called a repository. This maps almost directly to the concepts of standard editors — changes you make locally, and what version you last saved. Most any VCS guide will say that these are the two states you need to learn (I particularly remember reading several about CVS which said this.) It’s a convenient lie though. There are more than two states.
Compiling source code can create files that don’t need to be saved in the repository (others can regenerate them with the source). So, all the major VCSes have the concept of an ignore list; any files in the ignore list will not be saved in the repository. So we have three states so far: ignored files, local changes, and the repository.
Sadly, there’s another state that the working copy is split into. The major VCSes seem to have decided that developers may be too lazy to add files that shouldn’t be saved to the ignore list…or that they may be unaware of such files (editor autosave files, for example), or that developers want to have shared ignore lists but don’t want to add some personal files to such shared lists. Whatever the reason, the major VCSes have another state which I call “limbo”, whose existence everyone seems to forget about. This state consists of changes which aren’t explicitly added to the index (think CVS/Entries files in CVS) and thus will not be saved, but which are not explicitly ignored either. This state causes the most bugs in my experience, even with advanced users, because people simply forget to explicitly add new files to the index and thus they don’t get saved with the rest of the changes. So we have four states so far (three being subsets of the working copy): explicitly ignored files, limbo, local changes that will be saved in the next commit, and the repository.
It turns out that the repository side also is split into multiple states. Developers want to be able to track what changes they themselves have made to their working copy, regardless of commits that have since been recorded in the (remote) repository. So, if you want to get a list of changes you’ve made, or the history that led up to your current working copy, it needs to be relative to the version of the repository that existed when you got your copy. That may not be the current version, because other developers could have recorded their changes in the (remote) repository. This also affects your ability to push your changes to the (remote) repository (by a “commit” in cvs or svn terminology), potentially requiring you to merge the various changes together. So, we have five states:
- Substates of the working copy:
  1. (explicitly) ignored: files inside your working copy that you explicitly told the VCS system to not track
  2. index: the content in your working copy that you asked the VCS to track; this is the portion of your working copy that will be saved when you commit (in CVS, this is done using the CVS/Entries files)
  3. limbo: not explicitly ignored, and not explicitly added. This is stuff in your working copy that won’t be checked in when you commit but you haven’t told the VCS to ignore, which typically includes newly created files.
- Substates of the repository:
  4. “checkout” version: the version of the code in your working copy before you started modifying it
  5. remote version: the version of the code currently saved in the remote repository
Not understanding these multiple states and the differences between them for the VCS you are using has varying consequences: not being able to take full advantage of your system, being unable to do some basic operations, or (worst case) introducing erroneous or incomplete changes.
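To make the three working-copy substates concrete, here is a minimal sketch in Python. It is not how any particular VCS implements things; the function name `classify`, the example file names, and the rule that a tracked file stays in the index even if it matches an ignore pattern are all assumptions made for illustration.

```python
from fnmatch import fnmatch


def classify(working_files, tracked, ignore_patterns):
    """Split working-copy paths into ignored / index / limbo (hypothetical model)."""
    states = {"ignored": [], "index": [], "limbo": []}
    for path in sorted(working_files):
        if path in tracked:
            # Explicitly added: will be saved on the next commit.
            states["index"].append(path)
        elif any(fnmatch(path, pattern) for pattern in ignore_patterns):
            # Explicitly ignored: never saved, never reported.
            states["ignored"].append(path)
        else:
            # Neither added nor ignored: silently left out of the commit.
            states["limbo"].append(path)
    return states


states = classify(
    working_files={"main.c", "main.o", "util.c", "notes.txt"},
    tracked={"main.c"},
    ignore_patterns=["*.o"],
)
print(states)
```

In this made-up example, util.c is a new source file that the developer forgot to add, so it lands in limbo next to notes.txt and would be missing from the commit.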
Similarities and differences between the major VCSes with these states
Most of these five states are similar between the major VCSes. State 1 (ignored files) is essentially identical between the systems. (The only difference is in the details of setting it up; for cvs it means editing .cvsignore, for svn it means modifying the svn:ignore property, for mercurial it means editing .hgignore, etc.) State 5 is also essentially identical. States 2 and 3 always sum up to everything in the working copy other than explicitly ignored files, so extending state 3 means shrinking state 2. Thus, we can get a feel for the differences between VCSes by looking at their differences in states 3 and 4.
The major distinction between inherently centralized VCSes (e.g. cvs and svn) and so-called distributed (I prefer the term “multi-centered”) VCSes comes with state 4. The differences in this state can be thought of as different choices along a continuum rather than a binary difference, however. The difference here is in how much information gets cached when one gets a copy. With CVS, you only get a working copy plus info about what version you checked out and where the repository is located. With SVN, you get the same as with CVS, but also an extra copy of all the files. Most distributed systems go a few steps further than svn and by default cache a copy of all the versions on the given branch. git, by default, caches a copy of all versions on all branches.
Caching extra information as part of state 4 can allow additional work to be done offline. cvs and svn are very limited in this respect, but the additional offline capabilities of the other systems come with the understanding that the local cache itself is a repository, and thus users need to understand how to sync changes both with the local repository and between the local repository and the remote one(s). In cvs and svn, it’s not useful to “sync with the local cache”; instead those systems just automatically synchronize the local cache and the remote repository to the indexed local changes all at once. Thus, cvs and svn users only need to learn a smaller set of “synchronization” commands (limited to “commit” and “update”).
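The interplay between the “checkout” version (state 4) and the remote version (state 5) in a centralized system can be sketched as a toy model. Everything here is hypothetical: the class names `RemoteRepo` and `WorkingCopy`, the whole-tree revision check (real cvs and svn track this per file), and the naive merge in which local changes simply win and deletions are not handled.

```python
class RemoteRepo:
    """Toy remote repository: a list of whole-tree snapshots."""

    def __init__(self):
        self.revisions = [{}]  # revision 0 is an empty tree

    @property
    def head(self):
        return len(self.revisions) - 1  # state 5: the remote version


class WorkingCopy:
    """Toy working copy that remembers which revision it was checked out from."""

    def __init__(self, remote):
        self.remote = remote
        self.base = remote.head  # state 4: the "checkout" version
        self.files = dict(remote.revisions[self.base])

    def commit(self):
        # Simplification: refuse whenever the remote has moved at all.
        if self.base != self.remote.head:
            raise RuntimeError("out of date: update and merge first")
        self.remote.revisions.append(dict(self.files))
        self.base = self.remote.head

    def update(self):
        # Naive merge: start from the latest remote tree, then reapply
        # whatever differs locally from the checkout version.
        base_tree = self.remote.revisions[self.base]
        latest = self.remote.revisions[self.remote.head]
        local_changes = {
            path: content
            for path, content in self.files.items()
            if base_tree.get(path) != content
        }
        self.files = {**latest, **local_changes}
        self.base = self.remote.head


remote = RemoteRepo()
alice = WorkingCopy(remote)
bob = WorkingCopy(remote)

alice.files["a.txt"] = "alice"
alice.commit()

bob.files["b.txt"] = "bob"
try:
    bob.commit()
    rejected = False
except RuntimeError:
    rejected = True  # bob's checkout version is older than the remote version

bob.update()
bob.commit()
```

In this sketch, bob’s first commit is rejected because his checkout version no longer matches the remote version; after an update, the same commit goes through with both sets of changes.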
There is also a potential difference between VCSes in state 3. Having changes in state 3 is what, in my experience, causes the most errors. Users simply forget that their changes are in this state and forget to add them. Now, it turns out that all the VCSes I’ve looked at closely enough are identical here, except for git. (So if you know one of them, you already understand this aspect of every other VCS except git.) git extends the concept of limbo, turning the index into a high-level (and in your face) concept with some really cool features, but unfortunately it has the side-effect of making git even more error-prone for users. I’ll discuss this in more detail in my next post.
Not to bash your effort, but I wonder what the point of all this work is. I can’t see how free software developers will gain anything from all of this information. The VCS debate is a moot point. There will be no convergence of the VCS market. So, if a new VCS were developed, it would be relegated to a niche market (bzr, hg, monotone, etc.). It’s unlikely that there will be a repeat of git.
True. Git’s index really fights such limbos.
I hadn’t thought of the index in any positive way at first, but now the “git add && git commit” instead of a simple “svn commit” makes sense.
John: If you read my first post on the subject at http://blogs.gnome.org/newren/2007/11/15/starting-to-compare-version-control-systems/, you’ll see that this was about a few of the questions I posted there. Most people probably aren’t interested in those same questions as I was, making this post irrelevant to them. But it may be useful if they shared the same questions I had (how much knowledge transfers from learning one system to learning another? why do people argue that centralized systems are easier to learn? why do distributed users argue that they are just as easy to learn? why do people claim git is harder to learn, but some people claim it’s not?) To me at least, this post is a good way of framing and answering those questions.
If you’re talking about the other posts, I’ve had feedback from other people that they were very useful to them in understanding the various systems. I guess my goals are just different than you seem to be assuming. I’m not trying to encourage convergence of the VCS market (sure it’d be nice, but I agree it won’t happen). And I don’t understand your comment that the VCS debate is a moot point, though I really wasn’t trying to get involved in a debate–I was trying to provide information useful for understanding various systems and help individuals compare and contrast them on their own.
“I’ve had feedback from other people that they were very useful to them in understanding the various systems.”
Indeed, and I really appreciate your articles on the matter. A lot of people really misuse or misunderstand VCSes. The concept is not that easy to grasp for everyone. Keep up this really good series.
Thanks.
newren: Your SCM posts have been quite interesting so far, but this one somehow didn’t manage to get to the point. A lot of definitions, few conclusions, although I have to admit that I might have skimmed over it by accident, because this post, in contrast to your other, interesting posts, urged me to fast-forward when reading.
Greetz Mathias,
Awaiting insightful follow-ups.
Elijah: Please ignore John and continue this series.
John: Think outside of the box.
Elijah: maybe it is just me, but it seems a lot more confusing to talk about the “five states” for files when you have two orthogonal concepts: file states in the working tree, and tree versions.
The tree version idea maps quite well to the model that many text editors use: there is the previously recorded version of the tree, and the unsaved next version of the tree that you are currently working on. As most text editors don’t allow you to selectively save changes to a file, I don’t think the analogy fits the other states.
For working tree states, I think it helps to think of a hierarchy of states rather than a flat list. The most basic is “versioned vs. unversioned”, where only versioned files make it into the next version of the tree.
With this categorisation, “versioned” corresponds to your “index” state. I’d then add the subcategories of “unchanged”, “modified”, “added” and “deleted”. These subcategories are all relative to the previous tree versions.
Under “unversioned”, there are “ignored” and “unknown” (or limbo if you want). By definition, there is no information about unversioned files in previous tree versions.