August 2005 – Morten Welinder

Namespaces

Gnumeric’s solver was broken in HEAD and while fixing it, I
updated to the latest version of lp_solve.

Let me tell you, lp_solve is a prime example of how not to make
a library! It looks like there used to be a program and that it
was made into a library by removing main.

There is no concept of namespaces there. When you include the
relevant header file, you get everything used anywhere internally:
EQ, gcd, MALLOC, TRUE, is_int, and about 400-600
other identifiers.

You cannot isolate that problem to just where you use the header,
by the way, as static is practically unused.

I decided to throw a perl script at the problem and combine everything into one
giant C file. All 44186 lines of it after pruning about 5000 lines.
The script adds tons of statics in the process,
renames the relevant part of the API, and extracts
that API. Extra points for you if you can read the perl script
without losing your breakfast.
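
For illustration only, here is roughly what the result of such a transformation looks like in miniature. This is a sketch and not output of the actual script: the mylp_ prefix and both function names are made up, and nothing here is taken from lp_solve itself.

```c
/* Hypothetical single-file library after the kind of cleanup described
 * above. All names are illustrative; none come from lp_solve or Gnumeric. */

/* Internal helper. 'static' gives it internal linkage, so a generic name
 * like gcd cannot clash with a gcd defined by the program using the library. */
static long gcd(long a, long b)
{
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a < 0 ? -a : a;
}

/* Public entry point: the only externally visible symbol, carrying a
 * library prefix (mylp_ is a made-up prefix) instead of a bare name. */
long mylp_lcm(long a, long b)
{
    long g = gcd(a, b);
    long r;

    if (g == 0)
        return 0; /* both arguments were zero */
    r = (a / g) * b;
    return r < 0 ? -r : r;
}
```

A matching header would then declare only mylp_lcm, so including it drags in one prefixed name rather than everything the implementation happens to use.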

The Cat’s Out…

…and so is Gnumeric 1.5.3. (Complete with a big ugly, but harmless,
error message on xls save — oops! At least we got it fixed in time
for Debian and the Win32 build.)

Home;
Source;
Release notes;
Changes.

Utility Functions

Dom,
it is probably not that they are being inefficient or behaving illogically. It is more that they are optimizing a utility
function somewhat different from the one you would naïvely expect.

(A mathematician takes a walk and comes by a house on fire; he calls
the fire department and they come and put out the fire. The next
day he comes by a house that is not on fire; he sets it on fire and
walks on after thus having reduced the problem to a previously solved
one.)

Common Subexpressions

It turns out that it is moderately common to have a large number of
VLOOKUPs (or HLOOKUPs) in the same spreadsheet. Gnumeric is embarrassingly
slow for this. There are several reasons for this.

Profiling where the time
is spent points the blame at g_utf8_collate.

Thinking about the problem, however, suggests a different cause, namely that we are evaluating collate keys for the table
once for every VLOOKUP. That is simple, easy to understand, and not
prone to obscure problems, but evidently it is not good enough.
Luckily it should be quite easy to add some kind of cache for this.
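
A minimal sketch of what such a cache might look like, under loud assumptions: it uses the standard C strxfrm as a stand-in for collate-key generation, it does exact-match lookup only, and KeyCache, key_cache_clear and cached_lookup are hypothetical names, not Gnumeric code. It also recognizes the table by pointer identity, which a real spreadsheet could not rely on without invalidating the cache whenever a cell in the table changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache holding the sort keys of one lookup table. */
typedef struct {
    const char *const *table; /* table the keys belong to */
    size_t n;                 /* number of entries in that table */
    char **keys;              /* strxfrm() keys, one per entry */
    int valid;                /* nonzero once keys have been built */
    unsigned long builds;     /* how often the keys were (re)built */
} KeyCache;

/* Return a freshly allocated strxfrm() key for s, or NULL on failure. */
static char *make_key(const char *s)
{
    size_t len = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(len);

    if (key != NULL)
        strxfrm(key, s, len);
    return key;
}

/* Drop all cached keys; call this whenever the table contents change. */
void key_cache_clear(KeyCache *c)
{
    assert(c != NULL);
    if (c->keys != NULL)
        for (size_t i = 0; i < c->n; i++)
            free(c->keys[i]);
    free(c->keys);
    c->keys = NULL;
    c->valid = 0;
}

/* Build the keys unless they are already cached for this table. */
static int key_cache_fill(KeyCache *c, const char *const *table, size_t n)
{
    if (c->valid && c->table == table && c->n == n)
        return 1; /* cache hit: nothing to recompute */
    key_cache_clear(c);
    c->table = table;
    c->n = n;
    c->keys = calloc(n ? n : 1, sizeof *c->keys);
    if (c->keys == NULL) {
        c->n = 0;
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        c->keys[i] = make_key(table[i]);
        if (c->keys[i] == NULL) {
            key_cache_clear(c);
            return 0;
        }
    }
    c->valid = 1;
    c->builds++;
    return 1;
}

/* Exact-match lookup; returns the index of needle in table, or -1. */
long cached_lookup(KeyCache *c, const char *const *table, size_t n,
                   const char *needle)
{
    long found = -1;
    char *nk;

    if (!key_cache_fill(c, table, n))
        return -1;
    nk = make_key(needle);
    if (nk == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(nk, c->keys[i]) == 0) {
            found = (long)i;
            break;
        }
    }
    free(nk);
    return found;
}
```

With this shape, a thousand lookups against the same table compute the table's keys once and one extra key per lookup, instead of recomputing every key every time. The builds counter is only there to make that visible.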

If I was redesigning the evaluation engine from the ground up, I would
probably compile expressions into some kind of byte code with common
subexpressions explicitly taken care of. But I am not, so the above
cache will have to do for now. That should also handle the case where
the subexpressions are not statically common, but the result of
something like INDIRECT.

INDIRECT, btw., is the single most ugly
feature of spreadsheet semantics. It turns the result of an expression into a cell or name reference and if I was designing
a proposed standard
formula syntax and semantics for spreadsheets I would think
long and hard about INDIRECT and its consequences. But I am not.
(Interestingly, most uses of INDIRECT that I have seen would be
far better handled as INDEX calls.)

Back to g_utf8_collate. It works by converting
both strings, in their entirety, to a normalized format and then
comparing those. In a language like C, as opposed to Haskell, that
is quite wasteful in two ways:

The comparison is done character-by-character from the
beginning of the strings. That means that it is very common to
only look at the first few characters of the normalized format. In that case, why was the whole thing normalized?

The normalization process allocates space for the normalized format in the form of a GString. That is slow and not needed at
all since the comparison just needs a single character at a time.

It gets even sillier if you want to do the comparison while ignoring
letter case. Then you first get to case fold the strings in their
entirety before you can call g_utf8_collate.