Spreadsheet Function Semantics

Anyone who has spent time with Excel spreadsheets knows that Excel has a number of really strange behaviours. I am having a look at criteria functions.

Criteria functions come in two flavours: DCOUNT/DSUM/etc and COUNTIF/SUMIF/etc. The former lets you treat a range as a kind of database from which you can filter rows based on whatever criteria you have in mind and then compute some aggregate function on the filtered rows. For example, compute the average of the “Age” column for those records where the “State” column is either Maine or Texas. The COUNTIF group is a much older set of functions that more or less the same thing, but restricted to a single column. For example, count all positive entries in a range.

In either case, criteria are in play. 12, “>0”, “<=12.5", "=Oink", and "Foo*bar" are examples. The quotes here denote strings. This is already messed up. A syntax like “>0” is fine because the value is an integer. It is fine for a string too. However, the syntax is really crazy when the value is a floating-point number, a boolean or a date because now you just introduced a locale dependency for no good reason — mail the spreadsheet to Germany and get different results. Bravo. And for floating-point there is the question of whether precision was lost in turning the number into a string and back.

Excel being Excel there are, of course, special cases. “=” does not mean to look for empty strings. Instead it means to look for blank cells. And strings that can be parsed as numbers, dates, booleans, or whatever are equivalent to searching for such values. These are all just examples of run-of-the-mill Excel weirdness.

The thing that really makes me suspect that Excel designers were under the influence of potent psycho-active substances is that, for no good reason, pattern matching criteria like “foo*bar” mean something different for the two flavours of functions. For the “D” functions it means /^foo.*bar/ in grep terms, whereas for the “if” functions it means /^foo.*bar$/. Was that really necessary?

The thing is that there really is no good alternative to implementing the weird behaviour in any spreadsheet program that has similarly named functions. People have come to rely of the details and changing the semantics just means 3 or 4 sets of arbitrary rules instead of 2. That is not progress.

I noticed this while writing tests for Gnumeric. We now pass those tests, although I suspect there are more problems waiting there as I extend the test file. I do not know if LibreOffice has the intent of matching Excel with respect to these functions but, for the record, it does not. In fact, it fails in a handful of different ways: anchoring for “D” functions, strictness for DCOUNT, wildcards in general, and the array formula used in my sheet to count failures. (As well as anything having to do with booleans which localc does not support.)