Someone threw an 8-million-cell CSV file at Gnumeric. We handle it, but barely. Barely is still better than LibreOffice and Excel manage, if you don’t mind that it takes 10 minutes to load. And if you have lots of memory.
I looked at the memory consumption and, quite surprisingly, the GHashTable that we use for the cells in a sheet is at the top of the list: a GHashTable with 8 million entries uses about 600MB!
Here’s why:
- We are on a 64-bit platform, so each GHashNode takes 24 bytes, four of which are padding.
- At a little less than 8 million entries the hash table is resized to about 16 million slots.
- While resizing, we therefore have about 24 million GHashNodes around at the same time.
- 24 bytes times 24 million is around 600M, all of which is touched during the resize.
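To make the arithmetic concrete, here is a rough sketch of that per-entry layout and the peak-memory estimate. The struct is my approximation of what GLib stores per node on a 64-bit platform, not the literal GLib source:

```c
#include <stdio.h>

/* Approximation of the per-entry node described above. */
typedef struct {
  void        *key;       /* 8 bytes                             */
  void        *value;     /* 8 bytes                             */
  unsigned int key_hash;  /* 4 bytes + 4 bytes of struct padding */
} Node;                   /* sizeof (Node) == 24 on LP64         */

int
main (void)
{
  /* The old array (~8 million slots) and the new array (~16 million
   * slots) are both live while the table is being resized. */
  size_t nodes = 8000000 + 16000000;
  printf ("sizeof (Node) = %zu bytes\n", sizeof (Node));
  printf ("peak ~= %zu MB\n", nodes * sizeof (Node) / 1000000);
  return 0;
}
```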
So what can be done about it? Here are a few things:
- Some GHashTables have identical keys and values. For such tables, there’s no need to store both.
- If the hash function is cheap, there’s no need to keep the hash values around. This gets a little tricky with the special unused/tombstone pseudo-hash values that the current implementation relies on. It can be done, though.
I wrote a proof of concept that skips things like key destructors, because I don’t need them. It uses one third of the memory of GHashTable. It might be possible to lower this further if an in-place resize algorithm could be worked out.
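For the curious, here is a minimal sketch of the idea, not the actual proof of concept: an open-addressing set of pointers where the key is also the value, hashes are recomputed instead of stored, and two reserved pointer values (the names, sentinels and load factor here are mine) take over the role of the unused/tombstone pseudo-hashes. Each slot is then a single pointer, 8 bytes instead of 24, which is where a factor of three can come from.

```c
#include <stdlib.h>
#include <stdint.h>

#define SLOT_UNUSED    ((void *) 0)  /* empty slot; assumes NULL is never a key        */
#define SLOT_TOMBSTONE ((void *) 1)  /* deleted slot; assumes (void*)1 is never a key  */

typedef struct {
  void   **slots;    /* one pointer per slot: no separate value, no cached hash */
  size_t   n_slots;  /* zero, or a power of two                                 */
  size_t   n_used;   /* live entries plus tombstones                            */
} PtrSet;

static size_t
ptr_hash (const void *p)
{
  /* Cheap enough to recompute on every probe and on every resize. */
  return (size_t) ((uintptr_t) p >> 3);
}

static void ptr_set_add (PtrSet *set, void *p);

static void
ptr_set_resize (PtrSet *set)
{
  void **old_slots = set->slots;
  size_t old_n = set->n_slots, i;

  set->n_slots = old_n ? old_n * 2 : 16;
  set->slots = calloc (set->n_slots, sizeof (void *));  /* error checking omitted */
  set->n_used = 0;

  /* Both arrays are live here, just like during GHashTable's resize;
   * an in-place scheme would avoid exactly this peak. */
  for (i = 0; i < old_n; i++)
    if (old_slots[i] != SLOT_UNUSED && old_slots[i] != SLOT_TOMBSTONE)
      ptr_set_add (set, old_slots[i]);
  free (old_slots);
}

static void
ptr_set_add (PtrSet *set, void *p)
{
  size_t mask, i, reuse = (size_t) -1;

  if (set->n_used * 4 >= set->n_slots * 3)   /* keep load factor below 3/4 */
    ptr_set_resize (set);

  mask = set->n_slots - 1;
  for (i = ptr_hash (p) & mask; set->slots[i] != SLOT_UNUSED; i = (i + 1) & mask) {
    if (set->slots[i] == p)
      return;                                /* already present */
    if (set->slots[i] == SLOT_TOMBSTONE && reuse == (size_t) -1)
      reuse = i;                             /* first reusable deleted slot */
  }
  if (reuse != (size_t) -1)
    set->slots[reuse] = p;                   /* recycle a tombstone */
  else {
    set->slots[i] = p;
    set->n_used++;
  }
}

static void
ptr_set_remove (PtrSet *set, void *p)
{
  size_t mask, i;

  if (set->n_slots == 0)
    return;
  mask = set->n_slots - 1;
  for (i = ptr_hash (p) & mask; set->slots[i] != SLOT_UNUSED; i = (i + 1) & mask)
    if (set->slots[i] == p) {
      set->slots[i] = SLOT_TOMBSTONE;        /* keep later probe chains intact */
      return;
    }
}
```

The sentinel trick is the part that gets “a little tricky”: it only works if the reserved pointer values can never be real keys, which is true for heap-allocated cells but would have to be handled differently for arbitrary keys.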