Testing is not an Option!

I released Gnumeric 1.7.3
only to discover that a little too much editing killed evaluation in
very common situations. Bad me! 1.7.4 is out.

That is not going to happen again.

I sat down and spent a few hours automating most of our tests. Then
I added a valgrind run and the beginning of tests of our importers.
It is part of “make distcheck”, so testing is now mandatory and automatic.

The workhorse of these tests is ssconvert, our handy little command-line utility that converts from one format to another.
By forcing evaluation of all cells between import and export,
we end up exercising quite a large part of the core. As well
as a few importers and exporters. No GUI tests are currently performed, but I suspect we can add that
too somehow.

A Bugfix A Day…

It has been a while since I have been poured some water out of my
ears here, but I have been busy. A couple of months ago I decided
to fix a Gnumeric bug per day. And I have by and large kept that
and our
NEWS has been growing like weeds.

Note, that there are huge differences in the amount of work behind
the items lists. “Allow ={+42}” was a trivial one-liner, while
“Introduce top-level expressions” was a massive and
intrusive patch.

But I am running out of little issues to fix on lazy days and I
rarely have any significant amount of time during the week.

Namespaces

Gnumeric’s solver was broken in HEAD and while fixing it, I
updated to the latest version of lp_solve.

Let me tell you, lp_solve is a prime example of how not to make
a library! It looks like there used to be a program and that it
was made into a library by removing main.

There is no concept of namespaces there. When you include the
relevant header file, you get everything used anywhere internally:
EQ, gcd, MALLOC, TRUE, is_int, and about 400-600
other identifiers.

You cannot isolate that problem to just where you use the header,
by the way, as static is practically usused.

I decided to throw a perl script at the problem and combine everything into one
gaint C file. All 44186 lines of it after pruning about 5000 lines.
The script adds tons of statics in the process,
renames the relevant part of the API, and extracts
that API. Extra points for you if you can read the perl script
without losing your breakfast.

Common Subexpressions

It turns out that it is moderately common to have large number of
VLOOKUPs (or HLOOKUPs) in the same spreadsheet. Gnumeric is embarrassingly
slow
for this. There are several reasons for this.

Profiling where the time
is spent points the blame at g_utf8_collate.

Thinking about the problem, however, suggests a different cause, namely that we are evaluating collate keys for the table every
once for every VLOOKUP. That is simple, easy to understand, and not
prone to obscure problems, but evidently it is not good enough.
Luckily it should be quite easy to add some kind of cache for this.

If I was redesigning the evaluation engine from the ground up, I would
probably compile expressions into some kind of byte code with common
subexpressions explicitly taken care of. But I am not, so the above
cache will have to do for now. That should also handle the case where
the subexpressions are not statically common, but the result of
something like INDIRECT.

INDIRECT, btw., is the single most ugly
feature of spreadsheet semantics. It turns the result of an expression into a cell or name reference and if I was designing
a proposed standard
formula syntax and semantics
for spreadsheets I would think
long and hard about INDIRECT and its consequences. But I am not.
(Interestingly, most uses of INDIRECT that I have seen would be
far better handled as INDEX calls.)

Back to g_utf8_collate. It works by converting
both strings, in their entirety, to a normalized format and then
comparing those. In a language like C, as opposed to Haskell, that
is quite wasteful in two ways:

  • The comparison is done character-by-character from the
    beginning on the strings. That means that it is very common to
    only look at the first few characters of the normalized format. In that case, why was the whole thing normalized?
  • The normalization process allocates space for the normalized format in the form of a GString. That is slow and not needed at
    all since the comparison just needs a single character at a time.

It gets even sillier if you want to do the comparison while ignoring
letter case. Then you first get to case fold the strings in their
entirety before you can call g_utf8_collate.

On Bugs Reports

For entertainment purposes only:

Evidently someone dumped a feature request for Gnumeric on
comp.os.linux.advocacy and is mildly upset that after a year it has not been implemented.
(Not even an interesting feature, mind you.)

In my humble opinion that is like tattooing an Excel feature request on
your butt, flying off to Redmond and mooning the Microsoft campus on
a dark and rainy winter evening. That is likely to work, for sure.

Bugzilla might not be the world’s best bug reporting interface,
but it sure beats mailing lists and irc in terms on probability
of the bug report not being forgotten forever.

OpenDocument for Spreadsheets

From time to time I am asked if Gnumeric supports or will support the OpenDocument standard. I am sad to report that the question is meaningless. The purpose of a standard ought to be interoperability; this standard is not a step in that direction.

Executive summary: The OpenDocument Standard v1.0 completely ignores semantics of spreadsheets, thus virtually ensuring that if two implementations of the standard existed then they could not use each other’s spreadsheets. That, in turn, renders the alleged standard pointless for spreadsheets.

The OpenDocument standard v1.0 is 708 pages long. You would think that in such an elephant they could find the space to hammer out a usable spec for spreadsheets. But no, they do not.

Formulas (or formulae if you are feeling learned) is an important part of what goes into spreadsheets. What does the alleged standard have to say about those? Basically you go to page 184-186 and find the two (equivalent) formula examples that it has:

  =sum([.A1:.A5])
  =sum([.A1];[.A2];[.A3];[.A4];[.A5])

Pedantics — that would be me — would point out that both these fail to comply with the formula requirement “Every formula should begin with a namespace prefix specifying the syntax and semantics used within the formula.” Hmmm… Or maybe “=” is that prefix and anyone seeing it will magically understand the syntax and semantics.

Regardless, we learn

  • the existence of a function named sum. I hope you can make good use of it, because it is the only function we get to know about. (There is a handful more in contexts unrelated to formulas.)
  • that there exists operators. Which ones seems to be a secret, apart from the fact that some of them are logical.
  • that numbers can be part of formulas. We will have to assume that regular C-like syntax for them is used.
  • that strings can be part of formulas. There is no clue about how to write them; I will guess that you put them in quotes, but what about embedded quotes, then?

There is what appears to be the beginning of a value system on pages 184-185. You get numbers, dates, times, booleans, and strings. Notice the absence of error values and “empty”, let alone arrays. Excel has those, Gnumeric has them, and OOo-calc has at least error values. Thus there is no chance that someone following the OpenDocument standard can approximate Excel semantics, nor for that matter is there any chance that OOo-calc can implement the standard with its current value system.

It gets worse. Let us assume that a proper value system was in place. If you wanted to have any hope of achieving interoperability you would them have to define function (and operator) semantics for a set of functions large enough to be useful. Details matter. For example, for our friend sum above, Excel will completely ignore strings like “42” (with the quotes) in referenced cells, but it will convert them to numbers and include them in the sum when given as direct arguments to SUM. (Yes, that is sick.) In technical terms, Excel does not have referential integrity. OOo-calc lacks referential integrity too and in the direct-argument case produces an error, i.e., it is sick in a different way. (For completeness, Gnumeric ignores all strings for SUM; this might change.) As a cuter example, OOo-calc manages to evaluate “1”+1 to TRUE!

So there. As far as spreadsheets are concerned, the OpenDocument Standard v1.0 is the equivalent of giving precise punctuation rules for sentences without telling if it is for English, German, French, or something else.

Only vaguely related is the question of whether Gnumeric should be able to read OOo-calc files. The answer is yes, and we do have some support for it. In practice it is probably easier to interoperate via the xls format.

[Formatting updated 2015]

Gnumeric 1.5.2

The weekend saw the release of Gnumeric version 1.5.2.
It should probably have a catchy name, but it does not.

This is a lots-of-little-things release that ought to make for a smoother ride.