Common Subexpressions

It turns out that it is moderately common to have large number of
VLOOKUPs (or HLOOKUPs) in the same spreadsheet. Gnumeric is embarrassingly
slow
for this. There are several reasons for this.

Profiling where the time
is spent points the blame at g_utf8_collate.

Thinking about the problem, however, suggests a different cause, namely that we are evaluating collate keys for the table every
once for every VLOOKUP. That is simple, easy to understand, and not
prone to obscure problems, but evidently it is not good enough.
Luckily it should be quite easy to add some kind of cache for this.

If I was redesigning the evaluation engine from the ground up, I would
probably compile expressions into some kind of byte code with common
subexpressions explicitly taken care of. But I am not, so the above
cache will have to do for now. That should also handle the case where
the subexpressions are not statically common, but the result of
something like INDIRECT.

INDIRECT, btw., is the single most ugly
feature of spreadsheet semantics. It turns the result of an expression into a cell or name reference and if I was designing
a proposed standard
formula syntax and semantics
for spreadsheets I would think
long and hard about INDIRECT and its consequences. But I am not.
(Interestingly, most uses of INDIRECT that I have seen would be
far better handled as INDEX calls.)

Back to g_utf8_collate. It works by converting
both strings, in their entirety, to a normalized format and then
comparing those. In a language like C, as opposed to Haskell, that
is quite wasteful in two ways:

  • The comparison is done character-by-character from the
    beginning on the strings. That means that it is very common to
    only look at the first few characters of the normalized format. In that case, why was the whole thing normalized?
  • The normalization process allocates space for the normalized format in the form of a GString. That is slow and not needed at
    all since the comparison just needs a single character at a time.

It gets even sillier if you want to do the comparison while ignoring
letter case. Then you first get to case fold the strings in their
entirety before you can call g_utf8_collate.