As I blogged about some time ago, I decided to spend some time learning and comparing various version control systems (VCSes for short). Of course, there are many version control system comparisons out there already, and I’ve read countless other sources as well (blogs, articles + comments, archived mailing list messages found in random Google searches, etc.). While some of these sources have very interesting information, they still don’t answer all the questions I had; in fact, even the kinds of tests typically performed in these comparisons don’t cover everything I wanted to see. Here are some of the questions I have been considering:
- What are the most important VCSes to consider?
- Why are VCSes hard to learn? If someone learns one VCS, how much lower is their learning curve for switching to another?
- What are the most common pitfalls that users experience with each of the major VCSes? Are there similarities across systems in the mistakes that users make?
- Why are some systems more widely adopted than others? Are there certain qualities that make some systems more likely to be adopted by certain groups and less likely by others?
- Why do some users of inherently centralized systems claim that “distributed”[1] systems are harder to learn? Why do users of distributed systems claim that they are *not* harder to learn? Why are there similar questions between the various “distributed” systems?
- Which VCS is the “best” for a given individual/group? More importantly, what are the important criteria and where do various VCSes shine?
- Why is there so much misunderstanding between users of different systems?
- To what extent does the truism that “all software sucks” apply to VCSes?
- Typical stuff: Which is the fastest at operation X (and by how much)? Which provides the most useful output (why is it more useful)? Which has the best add-ons? Which has the most relevant features? Which has the best documentation (how much better is it)? Which has killer features missing in others? etc.
I’m still far from answering all of them. However, I have learned a few things, and I figured it’d be a useful exercise to bore everyone to death by writing up some of my thoughts on the subject. So I’ll be doing that occasionally. Some of the things I write up will have comparisons similar to what you’d see elsewhere (but with my own slant which focuses on what seems relevant to me), while a few will analyze the subject from an angle different than what I have been able to find in other comparisons. I have a few posts mostly written up already, and may find time to write up a couple more after those.
Obvious Disclaimers: I’m no expert and am additionally error-prone, so I’ll likely make mistakes in my posts. I also won’t make any claims to be objective, as I don’t think it’s even possible to fully achieve. I will aim to be “reasonably” objective…but note that I have an opinion that placing too high a priority on objectivity makes it impossible to achieve much of the full usefulness of a comparison, limiting what “reasonable” can mean.
[1] As I have mentioned before, I think this is a somewhat misleading term; unfortunately, I don’t have a good replacement. Maybe something like “multi-centered”?
If you are looking for a term other than “Distributed”, “Decentralised” is another term often used to describe DVCS’s.
That said, I am not sure why you think it is misleading to describe them as distributed. If I branch off your work and start hacking, there now exists a distributed revision graph where some nodes only exist on my system and some only exist on yours. A distributed VCS allows both parties to extend the revision graph without causing problems should they decide to merge from one another.
Compare this with cloning a Subversion repository and making independent commits to the two copies. You will end up with revisions that have the same “name” but different content, so you can’t really think of the two repositories as a single revision graph.
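To make the naming difference concrete, here is a minimal sketch in Python. It is not how any of these tools is actually implemented: `sequential_id` and `content_id` are hypothetical helpers, and hashing the parent id together with the content is just one simple way to get names that cannot collide across copies (Bazaar, for instance, uses unique revision ids rather than content hashes).

```python
import hashlib

def sequential_id(history):
    """Subversion-style name: the next number in this repository's own sequence."""
    return len(history) + 1

def content_id(parent_id, content):
    """Hypothetical DVCS-style name: a hash of the parent's id plus the new content."""
    return hashlib.sha1((parent_id + "\n" + content).encode("utf-8")).hexdigest()

# Two copies of the same repository start from one shared initial revision.
base_content = "hello"
copy_a = [base_content]
copy_b = [base_content]
base = content_id("", base_content)

# Each side now commits independently.
seq_a = sequential_id(copy_a)
seq_b = sequential_id(copy_b)
hash_a = content_id(base, "change made on A")
hash_b = content_id(base, "change made on B")

# Sequential numbering gives both commits the same name despite different content.
names_collide_sequential = seq_a == seq_b
# Content-derived names differ, so both revisions can live in one graph.
names_collide_hashed = hash_a == hash_b
```

With sequential numbers the two commits share a name and cannot be told apart after the fact, while the derived names stay distinct, which is what lets the two histories be merged into a single revision graph.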
There are two things that came to my mind when reading this entry; I’ll just braindump them here.
The first is that your text focuses on “what VCS is best for a given group”. It might be more interesting to ask “what group is best for a given VCS”. Or maybe you should look at how changing a VCS transformed a group? I don’t know, it just seems to me that you see the group as static, even though a lot of its behavior may be influenced by its choice of VCS.
It might also be worth having a look at the unique abilities of VCSes, stuff people from one VCS love, but everyone else says is unnecessary. Two examples of what I mean are git bisect and keeping a ChangeLog.
* What are the most important VCSes to consider?
Subversion, Git, Mercurial, Bazaar (the “next-generation” version, not the tla-port)
* Why are VCSes hard to learn? If someone learns one VCS, how much lower is their learning curve for switching to another?
Subversion is centralised (CVCS for the rest of this post), and the rest are decentralised (DVCS for the rest of this post). If you have experience with one DVCS, working with another DVCS should be pretty simple. If you have CVCS experience and are trying to learn a DVCS, then on one hand it will help, because concepts like “diff” or “changeset” also exist in the DVCS, but on the other hand, you will have to learn a completely different workflow. Some people claim that it’s easier to learn a DVCS when you do not have any CVCS experience.
* What are the most common pitfalls that users experience with each of the major VCSes? Are there similarities across systems in the mistakes that users make?
Many times people migrate from a CVCS to a DVCS, and then try to use exactly the same workflow in the DVCS. This is nearly always a mistake.
* Why are some systems more widely adopted than others? Are there certain qualities that make some systems more likely to be adopted by certain groups and less likely by others?
Subversion was developed to be a “better CVS”. So it’s usually easily accepted by CVS users, because the command-set is very similar.
Git was developed and used for Linux kernel development, so of course it’s very popular with kernel hackers. On the other hand, Git is still mostly POSIX-only, so for Windows users that’s usually a no-go.
Mercurial and Bazaar, for example, work fine on Windows.
Also, on Windows it’s usually a “requirement” to have good GUI tools, and there Subversion has better tools than the DVCSs (this is changing slowly, GUI tools are getting written for the DVCSs too, but it’s not at the level of the Subversion-GUIs yet).
* Why do some users of inherently centralized systems claim that “distributed”[1] systems are harder to learn? Why do users of distributed systems claim that they are *not* harder to learn?
I probably answered this by the “Why are VCSes hard to learn” question.
* Why are there similar questions between the various “distributed” systems?
Because they work in a very similar way, probably.
* Which VCS is the “best” for a given individual/group? More importantly, what are the important criteria and where do various VCSes shine?
Criteria are, for example:
– Windows support: Git is not there yet, and Subversion has the best GUIs.
– rename-file-tracking: Bazaar seems to be the best at this, but Mercurial is also fine (Subversion can do it acceptably too). Git does it very differently from the other tools; some people claim it’s much worse, other people claim it’s much better than in other VCSs.
– performance: Subversion is the slowest. Git is probably the fastest, although Mercurial is at a similar speed (in some tests even faster than Git). Bazaar is the slowest of the DVCS group, or at least it was; lately the primary focus of the Bazaar hackers has been performance, so maybe they are faster now, but i don’t think they approach Git-like performance currently (i haven’t tested it lately). on the other hand, Bazaar is not THAT slow, so maybe its performance is acceptable for you.
– user-interface (user-friendliness, consistency of the ui): Bazaar/Mercurial are the best here, and Git is the worst, Subversion somewhere in the middle
* Why is there so much misunderstanding between users of different systems?
I don’t think there are many misunderstandings… could you tell an example?
* To what extent does the truism that “all software sucks” apply to VCSes?
No idea.
* Typical stuff: Which is the fastest at operation X (and by how much)? Which provides the most useful output (why is it more useful)? Which has the best add-ons? Which has the most relevant features? Which has the best documentation (how much better is it)? Which has killer features missing in others? etc.
I think i answered it above, in the “Which VCS is the “best”…” question.
p.s: personally i prefer Mercurial of all the VCSs out there. it does all that i need, with good performance and a nice user-interface.
James: It’s merely because it gets interpreted by cvs/svn users as “inherently distributed”; users of distributed systems have touted their ability to handle even pathologically distributed cases (e.g. the kernel) for so long but so strangely, that it comes across to cvs/svn users as though they have to become the pathological case in order to switch. I.e. the interpretation of “distributed” to them is “you *must* change your social structure in order to adopt our tool,” which tends to kill any consideration of switching. The interpretation isn’t remotely true, but it’s the perception that many cvs/svn users have when you use the word “distributed”.
Gabor: I saw one thing in your post that I have rarely seen elsewhere, which is intriguing to me. Why do you claim that bzr has the best rename support? Do you know of any comparison anywhere that backs this up? I’ve read an implication along those lines somewhere, but never as strong a claim as yours. And I’m not sure that I believe it; I’d have to test or see a good comparison.
Actually, I take my last comment back. I remember seeing a blog post a long time ago that did an hg/git/bzr comparison including renames that did show bzr as being the clearest, but the test was very old and very simplistic (e.g. it wouldn’t have caught the many known bugs in subversion rename handling, which I haven’t seen anyone test DVCSes for). Still, I’d be surprised if a single blog post was enough to give everyone the impression that bzr “had the best rename handling” and it wasn’t nearly enough information to believe that statement is really true. Are there other comparisons out there on this specific functionality?
Most important factor: being able to use it without ever seeing a terminal command or having to understand its quirks. Such tools must integrate perfectly and transparently into your IDE. This is not the 70s, damnit.
Git’s relatively interesting with rename support. It’s straightforward (git rename ), but what is interesting is that all this command really does is remove the old one and add the new one – git doesn’t store this information. Whether a file was renamed is calculated when you call git-blame. One upshot of this is that if you copy a bunch of code from one file to another, or split apart a file, git-blame also knows about this. You can even tell it to ignore whitespace changes when calculating the blame. The downside is that git-blame can be pretty slow compared to the rest of git (though in general I’ve found it quite a bit faster than svn blame).
Also, keep track of the filespace used vs. the size of the code base vs. the number of commits in the repository. Some of them can easily eat 100 GB with a large project, while others are more efficient.
newren: regarding the rename-handling… i don’t remember the exact articles i’ve read about this.
the only thing i can tell you is that rename-handling in GIT is very special, so it’s simply not comparable to bzr/hg (GIT basically does not track renames at all, but when you ask him for logs/history, he will try to detect renames in history, based on how file-content moved between filenames. this sometimes works nicely, sometimes not).
so regarding comparing bzr/mercurial rename-handling, i don’t remember. generally i got the feeling that rename-handling is more “integrated” in bazaar. but on the other hand, i’m quite sure that the level of rename-handling in mercurial is adequate (also, for example, the release notes for the latest mercurial (0.9.5) mention: Fixes for some file copy and rename corner cases).
also, when hunting for comparisons, be careful, because these programs are changing very fast. so it’s quite possible that a comparison that’s, let’s say, 6 months old is now obsolete.
Git does rename detection when looking through history, rather than recording renames. It does this for several specific reasons:
* It can detect copies, moves, partial copies, and partial moves. Yes, it can tell you “half this file moved to that file”.
* If it recorded such things when originally done, then they’d form part of the history, and even if Git grew better rename detection algorithms later, the history wouldn’t change.
* By detecting renames later, when reading the history, newer versions of Git with better rename detection can do rename detection better on the old history.
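The idea of detecting renames at read time can be sketched roughly as follows. This is only an illustration in Python, not Git’s actual algorithm: `detect_renames` is a hypothetical function, `difflib` stands in for Git’s own similarity measure, and the 0.5 threshold is an assumption chosen for the example.

```python
from difflib import SequenceMatcher

def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair each deleted path with the most similar added path.

    old_tree and new_tree map path -> file content (str).
    Returns a list of (old_path, new_path, similarity) tuples.
    Nothing about renames is stored; everything is inferred from content.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}
    renames = []
    for old_path, old_content in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_content in added.items():
            # Line-based similarity between the vanished and the new file.
            score = SequenceMatcher(
                None, old_content.splitlines(), new_content.splitlines()
            ).ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, best_score))
            del added[best_path]  # each added file explains at most one rename
    return renames

# A file that was renamed and lightly edited in the same snapshot.
before = {"util.py": "a\nb\nc\nd\n", "README": "docs\n"}
after = {"helpers.py": "a\nb\nc\nd\ne\n", "README": "docs\n"}
found = detect_renames(before, after)

# Two unrelated files: a delete plus an add, not a rename.
unrelated = detect_renames({"a.txt": "x\ny\n"}, {"b.txt": "p\nq\n"})
```

Because the pairing is computed from the two snapshots alone, swapping in a better similarity function later would change the answers for old history too, which is the point made in the last two bullets.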
@Gábor: First you said:
“I don’t think there are many misunderstandings… could you tell an example?”
and later said:
“also, when hunting for comparisons, be careful, because these programs are changing very fast. so it’s quite possible that a comparison that’s, let’s say, 6 months old is now obsolete.”
I think many misunderstandings arise for this very reason.
As to why people choose different VCSs, I think it has a lot to do with examples set and who wants to emulate whom.
That is, I think many people say, “Linus wrote git, so it must be great” or “Canonical sponsors bzr, so it is surely awesome.” and only later justify it by saying “I use git because it’s really fast” or “I use bzr so I don’t feel like I’m trapped inside a Kafka story.” (The middle ground, of course, is currently occupied by hg — reasonably fast and reasonably usable.)
Anonymous: what you’ve said does not make sense.
If I am using a version control system that tracks renames or copies, there is nothing stopping me from ignoring that tracking data and performing GIT-style inference. Such a VCS is storing a superset of the information needed for such a merge.
If a VCS has the opportunity to record developer intent w.r.t. moves and other operations, it seems like a no-brainer to do so. Not doing so limits your options in the future.
James: exactly.
as far as i understand, Git’s storage model is simply unable to record file-renames, and there’s nothing wrong with that. Linus probably made a decision that other properties were more important.
what i don’t like is when Git proponents start to claim that this is in fact an advantage. that’s completely incorrect, because, as you said, if your VCS stores more than what GIT stores, then it can still do GIT-style inference.
also, to all git-rename-zealots:
http://kerneltrap.org/node/11765
James, Gabor – Could you describe exactly what the benefit is of burdening developers with always explicitly describing what they did when renaming files? AFAICT, it’s just to add some metadata to the repository for efficiency in describing renames, though actually in git it couldn’t get any faster for detecting a direct (no source changes) rename – it’s just comparing sha1 hashes of the file blobs.
I guess you could argue that people are used to having to tell their source control systems this information and so it’s confusing that they don’t have to, but that’s the reason there’s a git-rename command.
Gabor, on the kerneltrap discussion, if you read the comments you’ll understand that these issues are the same whether you have ‘automatic’ move discovery or explicit description. It just comes down to how much information one developer leaves another. At least in git you can find out what happened if someone messed up, which isn’t true (at this moment) for any other scm as far as I know.
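The “just comparing sha1 hashes of the file blobs” case can be sketched in a few lines of Python. The blob id computation (SHA-1 over a `blob <size>` header, a NUL byte and the content) follows Git’s object format; `exact_renames` itself is a hypothetical helper written for this illustration and is not part of Git.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the id Git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

def exact_renames(old_tree, new_tree):
    """Find paths that vanished and reappeared elsewhere with identical content.

    old_tree and new_tree map path -> bytes.
    Returns a list of (old_path, new_path) pairs.
    """
    # Index newly added paths by their blob id.
    added_by_id = {}
    for path, data in new_tree.items():
        if path not in old_tree:
            added_by_id.setdefault(git_blob_id(data), []).append(path)
    pairs = []
    for path, data in sorted(old_tree.items()):
        if path in new_tree:
            continue  # still present, so not a rename candidate
        candidates = added_by_id.get(git_blob_id(data))
        if candidates:
            pairs.append((path, candidates.pop(0)))
    return pairs

moved = exact_renames({"a.c": b"int x;\n"}, {"b.c": b"int x;\n"})
moved_and_edited = exact_renames({"a.c": b"int x;\n"}, {"b.c": b"int y;\n"})
```

An unchanged file keeps its blob id under any name, so the match is a dictionary lookup. Once the content changes in the same commit, the ids differ and this exact check finds nothing, so some similarity-based inference is needed instead.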
Rob: My problem with git is this:
> people are used to having to tell their source control systems
> this information and so it’s confusing that they don’t have to,
> but that’s the reason there’s a git-rename command,
and then:
> these issues are equal whether you have ‘automatic’
> move discovery or explicit description.
imho this is not consistent.
on one hand, git-zealots say that you do not have to tell git that you’re renaming, because git is able to figure it out automatically,
on the other hand, when confronted with a situation where you EXPLICITLY HAVE TO TELL git that you’re renaming, otherwise it does not work, they say: but that’s the same with other VCSs.
but doesn’t that mean that their previous sentence is not correct?
please note, i’m not saying what git does is wrong. i just do not agree with the opinion, that git’s rename-handling is BETTER IN EVERY WAY.
in my opinion it’s a DIFFERENT approach. better in some ways, and worse in other ways.
I agree with Gábor: the git way has the benefit of being implicit, but this involves a bit of magic.
For example if I rename a file and modify it in the same commit there is no way to detect the renaming if not told explicitly.
Gábor: I tend to avoid defining people as ‘zealots’ even if they somewhat deserve it, as this puts the interlocutor in a defensive position and often leads to flames. Just say GIT-fans. 😉
Gabor, I’ve taken some time to think about it a bit more, and you’re quite right! The main problem is if you move, you must commit before making any changes otherwise you’ll have a broken history. That’s pretty far from ideal. One possible solution would be to have git mv always create an automatic commit object.
Thanks,
Rob
Hmm, I just tested this, and moving (*with git mv*), editing and committing a file seems to work just fine in git 1.5.3, so ignore my last comment. I’m not sure how, though!