Been busy. Had to go away for a week for work. Very strange experience, not sure I want to repeat that again any time soon. Oh well.
I had started looking into writing a vorbis decoder before that major interruption. Something a bit more challenging than normal. About all i’ve gathered so far is that they’ve made some strange decisions in the stream and the huffman coding tables. e.g. a canonical huffman code can be stored very efficiently and reconstructed trivially – but instead they came up with some first-fit mechanism which adds unnecessary complexity and overhead. Still, plenty to learn about and play with. I guess I may get back into it when my mind decides to look into it again.
Meanwhile i’ve been sidetracked at home trying to work out how to improve the way we’re storing trees in a database at work. Relational databases don’t really do trees all that well directly, but with a bit of work you get an ok solution. There is some direct support in Oracle and the SQL standard apparently, but we’re using PostgreSQL, so you have to do things a bit more manually. There are a few ways to do it, but only a couple which are suitable for us – inserts need to be interactively-fast, not just lookups.
Currently we use an auxiliary table which holds the parent-child relationships and so with a simple join you can easily get all children or all ancestors of a given node. The only problem is that it requires some house-keeping using triggers, and also adds some additional restrictions on insertion order and the like (which creates issues with editing functions). About all I could think of is to remove the auxiliary table and do the same job using stored procedures on the fly. Not as simple as you’d think …
Given a structure such as:
create table tree ( id serial primary key, parentid int4 not null references tree(id) on delete cascade on update cascade, childcount int4 not null default 0, title text ); create type parent_link as (depth int, id int);
(note that this still requires triggers to maintain the childcount – for a ui app the effort is worth it, and it allows us to make a major optimisation later on).
Here’s a function to find the ancestors. It uses parent_link so it can also return the depth – which is very handy in many algorithms.
CREATE or replace FUNCTION find_ancestors(intab text, inid integer) RETURNS setof parent_link AS $$ DECLARE tid integer; pid integer; res integer[]; count integer := 1; i integer := 0; r parent_link; BEGIN tid := inid; execute 'select parentid from ' || quote_ident(intab) || ' where id = ' || quote_literal(tid) into pid; res[0] := tid; while pid <> tid loop res[count] := pid; count := count + 1; tid := pid; execute 'select parentid from ' || quote_ident(intab) || ' where id = ' || quote_literal(tid) into pid; end loop; while count > 0 loop count := count - 1; r.depth = i; i := i + 1; r.id = res[count]; return next r; end loop; END; $$ LANGUAGE plpgsql;
It also returns the values in sorted order – not sure that is any use for SQL as it may not guarantee the order once you join without an explicit sort, but it doesn’t really cost any extra processing since we need to keep the whole stack around to calculate the depths. I also tried a recursive algorithm, but it made no real difference.
It is used just like a table:
select * from find_ancestors('tree', 4);
Ok, now for a utility function – normally this isn’t needed unless you want it for sparse or one-off lookups. Nothing special here, apart from not being recursive.
CREATE or replace function tree_node_depth(inid integer, tab text) returns integer as $$ DECLARE wid int := -1; pid int := inid; depth int := -1; data record; BEGIN while pid <> wid loop wid := pid; execute 'select parentid from ' || quote_ident(tab) || ' where id = ' || quote_literal(wid) into data; pid = data.parentid; depth := depth + 1; end loop; return depth; END; $$ LANGUAGE plpgsql stable;
Now comes the tricky one. How to list all of the children of a tree without needing large amounts of memory or temporary tables. The obvious is just to use a recursive function – but for the life of me I couldn’t get a recursive function which could return a row iteratively (i.e. using return next). I could easily get it to generate a temporary table, but that isn’t what I wanted. My first idea was to use a queue of outstanding rows to examine for children. It worked just fine, but requires lots of memory – in worst case nearly as many elements in the queue as rows in the table (although in reality it is probably a small fraction of the memory used by the application/other end of the query, so it would probably be just fine for all practical purposes for our application).
So I came up with something that just seems a bit hacky. I maintain a stack of cursors, and use some assignment tricks to get around the issue that pl/pgSQL doesn’t seem to like you using cursors from an array (you can copy them, but you can’t use them directly?). I dunno if it’s ‘proper coding’, but it works …
CREATE or replace FUNCTION find_children(inid integer, tab text) RETURNS setof parent_link AS $$ DECLARE r parent_link; data record; children refcursor; stack refcursor[]; depth int := 0; depthoffset int := 0; BEGIN depthoffset := tree_node_depth(inid, tab); -- first node r.depth = depthoffset; r.id = inid; return next r; open children for execute 'select id,parentid,childcount from ' || quote_ident(tab) || ' tree where parentid <> id and parentid = ' || inid; stack[depth] := children; while depth >= 0 loop children = stack[depth]; fetch children into data; if not found then close children; stack[depth] = null; depth := depth - 1; else r.depth = depth + 1 + depthoffset; r.id = data.id; return next r; if data.childcount > 0 then depth := depth + 1; children = null; open children for execute 'select id,parentid,childcount from ' || quote_ident(tab) || ' tree where parentid <> id and parentid = ' || data.id; stack[depth] = children; end if; end if; end loop; END; $$ STABLE LANGUAGE plpgsql;
Note that since we have childcount maintained elsewhere, we can optimise the leaf search – saving many sub-queries looking for linked children we would have to execute otherwise.
Now I can’t really tell if this is going to run very quickly, or if it saves much memory – I guess it should – a limited stack of cursors vs an extra or temporary table or just storing the id’s in an array. I don’t have enough data or the energy to set it up a test case, and every time I run a query in pgadmin I get wildly different timings. Gut feeling is that with decent indexes queries should be in a similar order of magnitude to the auxiliary table, but without most of the overhead of keeping it intact – speeding up and simplifying insertions and updates. It might also benefit from reducing cache load on redundant data and increasing locality of reference for pages i’m probably going to load anyway. And finally, although it means there is a black-box for the query planner, thus reducing potential optimisations – there may not many opportunities for it do much anyway.
Hmmm, you know, all these hassles to use a relational database and shoe-horn structured information into it. I’m just not sure it’s worth it. Yes it is very handy to have an sql window to run queries and test ideas – but most of the work could be done simpler and with less developer time using code and some sort of persistent object store. I guess it depends on whether we move to a multi-user model and take a short-cut by using a database to do it rather than implementing a proper n-tier architecture.
a nice way to get hierarchical data into RDBMS can be found here:
http://iamcam.wordpress.com/2006/03/24/storing-hierarchical-data-in-a-database-part-2a-modified-preorder-tree-traversal/
The Adjacency List model is a good alternative for fast access to Trees in SQL:
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
Even though the article above is a MySQL article, it’s general enough that you can implement the same solution under PostgreSQL.
Hope that helps.
Thanks for posting this. I found it very educational. I thought the RDB was the world’s perfect hammer to solve its myriad nails :)