## Estimating merge costs

September 28, 2009 community, freesoftware, General, maemo 2 CommentsAfter commenting on Mal Minhas’s “cost of non-participation” paper (PDF), I’ve been thinking about the cost of performing a merge back to a baseline, and I think I have something to work with.

First, this might be obvious, but worth stating: Merging a branch which has changed and a branch which has not changed is trivial, and has zero cost.

So merging only has a cost if we have a situation where the two trees concerned with the merge have changed.

We can also make another observation: If we are only adding new function points to a branch, and the mainline branch does not change the API, there is a very small cost to merging (almost zero). There may be some cost if functions with similar names, performing similar functions, have been added to the mainline branch, but we can trivially merge even a large diff if we are not touching any of the baseline code, and only adding new files, objects, or functions.

With that said, let’s get to the nuts & bolts of the analysis:

Let’s say that a code tree has n function points. A vendor takes a branch and makes a series of modifications which affects x function points in the program. The community develops the mainline, and changes y function points in the original program. Both vendor and community add new function points to extend functionality, but we’re assuming that merging these is an almost zero cost.

The probability of conflicts is obviously greater the bigger x and y are. This probability increases very fast the bigger the numbers. Let’s assume that every time that a given function point has been modified by both the vendor and the community that there is a conflict which must be manually resolved (1). If we assume that changes are independently distributed across the codebase (2), we can work out that the probability of at least one conflict is 1 – (n-x)!(n-y)!/n!(n-x-y)! if I haven’t messed up my maths (thanks to derf on #maemo for the help!).

So if we have 20 functions, and one function gets modified on the mainline and another on the vendor branch, we have a 5% chance of a conflict, but if we modify 5 each, the probability goes up to over 80%. This is the same phenomenon which lets you show that if you have 23 people in a room, chances are that at least two of them will share a birthday.

We can also calculate the expected number of conflicts, and thus the expected cost of the merge, if we assume the cost of each of these conflicts is a constant cost C (3). However, the maths to do that is outside the scope of my skillz right now Anyone else care to give it a go & put it in the comments?

We have a bunch of data we can analyse to calculate the cost of merges in quantitative terms (for example, Nokia’s merge of Hildon work from GTK+ 2.6 to 2.10), to estimate C, and of course we can quite easily measure n and y over time from the database of source code we have available to us, so it should be possible to give a very basic estimate metric for cost of merge with the public data.

Footnotes:

(1) It’s entirely possible to have automatic merges happen within a single function, and the longer the function, the more likely this is to happen if the patches are short.

(2) A poor assumption, since changes tend to be disproportionately concentrated in a few key functions.

(3) I would guess that the cost is usually proportional to the number of lines in the function, perhaps by the square of the number of lines – resolving a conflict in a 40 line function os probably more than twice as easy as resolving a conflict in an 80 line function. This is slightly at odds with footnote (1), so overall the assumption of constant cost seems reasonable to me.