ODF vs OOX : Asking the wrong questions

Recent posts from Brian Jones (Microsoft) and Robert Weir (IBM) highlight the strange times we live in. A squadron of flying pigs are out doing loops. On one hand you have MS documenting their file format, and providing positive press for competitors. On the other you’ve got IBM paying full time staff to critique development versions of Gnumeric. I’d like to thank both authors for their publicity, and triaging, but I also want to point out that IMO their examples are looking at things the wrong way.

Brian’s example of Numbers reading an OOX file written by Gnumeric could just as easily been an XLS file. Indeed that would likely have had better support on both ends.   The binary format was poorly documented and a miserable pain to read, but it’s been around a long time as is the dominant interchange format. All spreadsheets needed to support it.

Rob heads off in another direction with his examination of Gnumeric’s limited support for xlsx, on both import and export. What he neglects to consider is the amount of effort required. The initial importer was written on the flight to London for the ECMA meeting, and export was added on the flight back. Toss in a few hours of debugging and the sample file produced by Brian’s example was under a week of effort to read and write. After reading his post I added basic chart import and export the following day.Basic XLSX chart import

This was several orders of magnitude simpler than writing the binary filters which required parsers for OLE2, BIFF, and binary expressions just to get far enough to start reading the format details. However, even that is not the most salient question to ask. The core of this argument is the OOX vs ODF showdown. Instead what implementers really want to know is :

How hard was it to implement ODF support compared to OOX ?

it was significantly more difficult. To be clear, ODF support was nowhere near as much work as the old binary filters, we are talking about XML here. However, while Import filters start with parsing the structure, in the end, extracting the basic state is no more than the ante for the real work. You need to handle the impedance mismatches between the concepts in the file format, and your implementation. ODF’s model of ‘chartness’ didn’t fit well with Gnumeric. In contrast XLSX may be ugly, but it”s concepts were very familiar from XLS. We already had much of the code required to handle it.

I suspect most spreadsheet implementers are in the same position.

Out of a rut and onto the trading floor

May was my last month with Novell, and the OpenOffice.org project. Working with Michael, and the rest of the Ximian/Novell team, since my son was born 6 years ago has been great. It doesn’t get much better than earning a living writing free software. However, while taking an unplanned hike through the Czech country side, discussing spreadsheet data-structures (isn’t that what everyone does while hiking?), Gnumeric came up, and I remembered how it felt ‘back in the day’. An important part of open source development had slipped away without my realizing it. This is supposed to be FUN!

OO.o is an important project for the free software cause, but after 2 years, I was still 1-2 orders of magnitude more productive in Gnumeric. Spelunking can be interesting, but it wears thin after a while, and when something isn’t fun, it becomes work. When an old colleague from the investment world called with a job offer, I accepted. It was time for change.

  • Time to leave my basement (at least for a while) and commute into downtown.
  • Time to play with C#, ASP.NET and some nifty toys (you can make almost windows feel like Linux now)
  • Time to eat sushi and decent Thai food on the patio of BCE place.
  • Time to sit on the trading floor and interact with live people, with the latest financial news (eg Paris Hilton leaving prison stay)
  • I’m even contemplating making the jump from vim to emacs (only very briefly, vim7 drew me back)

Change is billowing through my daily routine, and with a clear separation of work and fun my Gnumeric time is back.

Shooting Fish in a Barrel

Back at Linuxworld Boston Michael

mentioned a teensy performance problem
with an internal spreadsheet (sorry,
it’s confidential and can’t be posted).

OOo Pivot Sample

This partially autogenerated 50M xls monster has been chock full of useful
compatibility tests for OOo. Unfortunately, one of my recent patches was
forcing the pivot tables to regenerate on load, rather than only on demand
later, and drove load time up
into the 3 hour range. MS Excel could load it < 1 minute.

The first step was to throw speedprof (properly patched for OOo) at it. Why
not use a sexier tool like cachegrind or oprofile you wonder ? The short
answer is simplicity. For a rough first cut speedprof is good enough and
doesn’t have much time/space overhead. The result showed a hotspot in the xls
importer itself with lots of code of the form

  long nCount = aMemberList.Count();
  for (i=0; i<nCount; i++) {
    const ScDPSaveMember* pMember = (const ScDPSaveMember*)aMemberList.GetObject(i);
    if ( pMember->GetName() == rName )
      return pMember;

A quick check showed that ‘aMemberList’ really was a list. Once I’d bandaged
my forhead and checked the monitor for damage the first patch was obvious.
This code was wrong on several levels. Let’s count the complexity.

1) List::Count : Why not just iterate on the list directly and save the lookup ?

2) List::GetObject(i) : Again, why start from the begining of the list each time when you just what to iterate through each element ?

3) if (pMember->GetName() == rName) : Why look things up in order when what you want to look them up by name ?

The first patch was conceptually simple, get a hash in place of that list.
It took some spelunking into the data structures to make that possible but in
the end
Patch1
brought us down to 45 minutes without bloating the memory usage much.

The next speedprof run seemed as if the construction of the datapilots was uniformly
slow, but a bit of digging showed that one particular pivot tables was dominating.
It had a field with 30,000 unique strings. The code used similar idioms previous block.

  ScDPItemData aMemberData;
  long nCount = aMembers.Count();
  for (long i=0; i<nCount; i++) {
    ScDPResultMember* pMember = aMembers[(USHORT)i];
    target->FillItemData( aMemberData );
    if ( bIsDataLayout || aMemberData->IsNamedItem( target ) )

Thankfully it used an array in place of a list, but it threw an extra object
copy in the heart of the loop to keep things comfortably inefficient. One more
patch
and we were down to 10 minutes. Still not good, but it’s an improvement. The
next steps will be to see why OOo is using 900M vs 90M for XL (and that’s
with wine), and to see about using a set of indicies for the pivot data,
rather than a set of strings.

OOo Pivot Performance

Joining OO.o too

I’ve been quiet for too long now. It’s time to say hi, re-join the
community, and do a bit of spreadsheet blogging.

For the last few months I’ve been working on OOo’s spreadsheet. Given the
choice between working on OOo and leaving free software I swallowed my pride
and made the leap.

To paraphrase the late Douglas Adams

‘OOo, is big. Really big. You just won’t believe how vastly
hugely mindbogglingly big it is. I mean you may think there’s a lot of code
in emacs, but that’s just peanuts compared to OOo.”

There’s lot’s of neat stuff in here, and Michael has done some amazing work
getting it building mostly painlessly.

Gnumeric is still alive and well. The team is on track to release 1.6.0 (with
several nice improvements) along with gnome 2.12 in a few weeks. With luck
I’ll be able to cross-pollinate the projects.

My current project in OOo is to add support for R1C1 style references. The
core of the patch was simple. I was able to lift a blob of Gnumeric code I’d
written a few months back and dual license for inclusion in OOo. The tricky
bit is turning out to be the interface change that propagates the choice of
address style.