Not Zed

oops

Ok so i stuffed up in the last post. Recursive set-of returning functions are just fine in postgresql. I was just using the wrong return function – you need to use RETURN QUERY. Also, the functions I wrote although useful for part of the problem, don’t help us in others. e.g. we need to know things like ‘give me all of the children of a and b from table x as well as all the children of f, g, and h from table y’. So what was once a couple of simple joins is now turning into more custom functions …

But I guess it’s all a learning experience, haven’t done much pl/pgSQL before.

squelch

Been busy. Had to go away for a week for work. Very strange experience, not sure I want to repeat that again any time soon. Oh well.

I had started looking into writing a vorbis decoder before that major interruption. Something a bit more challenging than normal. About all i’ve gathered so far is that they’ve made some strange decisions in the stream and the huffman coding tables. e.g. a canonical huffman code can be stored very efficiently and reconstructed trivially – but instead they came up with some first-fit mechanism which adds unnecessary complexity and overhead. Still, plenty to learn about and play with. I guess I may get back into it when my mind decides to look into it again.

Meanwhile i’ve been sidetracked at home trying to work out how to improve the way we’re storing trees in a database at work. Relational databases don’t really do trees all that well directly, but with a bit of work you get an ok solution. There is some direct support in Oracle and the SQL standard apparently, but we’re using PostgreSQL, so you have to do things a bit more manually. There are a few ways to do it, but only a couple which are suitable for us – inserts need to be interactively-fast, not just lookups.

Currently we use an auxiliary table which holds the parent-child relationships and so with a simple join you can easily get all children or all ancestors of a given node. The only problem is that it requires some house-keeping using triggers, and also adds some additional restrictions on insertion order and the like (which creates issues with editing functions). About all I could think of is to remove the auxiliary table and do the same job using stored procedures on the fly. Not as simple as you’d think …

Given a structure such as:

create table tree (
 id serial primary key,
 parentid int4 not null references tree(id) on delete cascade on update cascade,
 childcount int4 not null default 0,
 title text
);

create type parent_link as (depth int, id int);

(note that this still requires triggers to maintain the childcount – for a ui app the effort is worth it, and it allows us to make a major optimisation later on).

Here’s a function to find the ancestors. It uses parent_link so it can also return the depth – which is very handy in many algorithms.

CREATE or replace FUNCTION find_ancestors(intab text, inid integer) RETURNS setof parent_link AS $$
DECLARE
    tid integer;
    pid integer;
    res integer[];
    count integer := 1;
    i integer := 0;
    r parent_link;
BEGIN
    tid := inid;
    execute 'select parentid from ' || quote_ident(intab) || ' where id = ' || quote_literal(tid) into pid;
    res[0] := tid;
    while pid <> tid loop
         res[count] := pid;
         count := count + 1;
         tid := pid;
         execute 'select parentid from ' || quote_ident(intab) || ' where id = ' || quote_literal(tid) into pid;
    end loop;

    while count > 0 loop
       count := count - 1;
       r.depth = i;
       i := i + 1;
       r.id = res[count];
       return next r;
    end loop;

END;
$$
 LANGUAGE plpgsql;

It also returns the values in sorted order – not sure that is any use for SQL as it may not guarantee the order once you join without an explicit sort, but it doesn’t really cost any extra processing since we need to keep the whole stack around to calculate the depths. I also tried a recursive algorithm, but it made no real difference.

It is used just like a table:

 select * from find_ancestors('tree', 4);

Ok, now for a utility function – normally this isn’t needed unless you want it for sparse or one-off lookups. Nothing special here, apart from not being recursive.

CREATE or replace
  function tree_node_depth(inid integer, tab text)
  returns integer as $$
DECLARE
  wid int := -1;
  pid int := inid;
  depth int := -1;
  data record;
BEGIN
  while pid <> wid loop
    wid := pid;
    execute 'select parentid from ' || quote_ident(tab) || ' where id = ' || quote_literal(wid) into data;
    pid = data.parentid;
    depth := depth + 1;
  end loop;
  return depth;
END;
$$
 LANGUAGE plpgsql stable;

Now comes the tricky one. How to list all of the children of a tree without needing large amounts of memory or temporary tables. The obvious is just to use a recursive function – but for the life of me I couldn’t get a recursive function which could return a row iteratively (i.e. using return next). I could easily get it to generate a temporary table, but that isn’t what I wanted. My first idea was to use a queue of outstanding rows to examine for children. It worked just fine, but requires lots of memory – in worst case nearly as many elements in the queue as rows in the table (although in reality it is probably a small fraction of the memory used by the application/other end of the query, so it would probably be just fine for all practical purposes for our application).

So I came up with something that just seems a bit hacky. I maintain a stack of cursors, and use some assignment tricks to get around the issue that pl/pgSQL doesn’t seem to like you using cursors from an array (you can copy them, but you can’t use them directly?). I dunno if it’s ‘proper coding’, but it works …

CREATE or replace
  FUNCTION find_children(inid integer, tab text)
  RETURNS setof parent_link AS $$
DECLARE
    r parent_link;
    data record;
    children refcursor;
    stack refcursor[];
    depth int := 0;
    depthoffset int := 0;
BEGIN

    depthoffset := tree_node_depth(inid, tab);

    -- first node
    r.depth = depthoffset;
    r.id = inid;
    return next r;

    open children for execute 'select id,parentid,childcount from ' || quote_ident(tab) || ' tree where parentid <> id and parentid = ' || inid;
    stack[depth] := children;

    while depth >= 0 loop
      children = stack[depth];
      fetch children into data;

      if not found then
        close children;
        stack[depth] = null;
        depth := depth - 1;
      else

        r.depth = depth + 1 + depthoffset;
        r.id = data.id;
        return next r;

        if data.childcount > 0 then
           depth := depth + 1;
           children = null;
           open children for execute 'select id,parentid,childcount from ' || quote_ident(tab) || ' tree where parentid <> id and parentid = ' || data.id;
           stack[depth] = children;
        end if;
      end if;

    end loop;
END;
$$
 STABLE
 LANGUAGE plpgsql;

Note that since we have childcount maintained elsewhere, we can optimise the leaf search – saving many sub-queries looking for linked children we would have to execute otherwise.

Now I can’t really tell if this is going to run very quickly, or if it saves much memory – I guess it should – a limited stack of cursors vs an extra or temporary table or just storing the id’s in an array. I don’t have enough data or the energy to set it up a test case, and every time I run a query in pgadmin I get wildly different timings. Gut feeling is that with decent indexes queries should be in a similar order of magnitude to the auxiliary table, but without most of the overhead of keeping it intact – speeding up and simplifying insertions and updates. It might also benefit from reducing cache load on redundant data and increasing locality of reference for pages i’m probably going to load anyway. And finally, although it means there is a black-box for the query planner, thus reducing potential optimisations – there may not many opportunities for it do much anyway.

Hmmm, you know, all these hassles to use a relational database and shoe-horn structured information into it. I’m just not sure it’s worth it. Yes it is very handy to have an sql window to run queries and test ideas – but most of the work could be done simpler and with less developer time using code and some sort of persistent object store. I guess it depends on whether we move to a multi-user model and take a short-cut by using a database to do it rather than implementing a proper n-tier architecture.

Spewing bits

A followup to the last post about bit manipulation. I ran the same code on a SPU on a PS3. As expected, rather different results.

The branchelss code (fls()) compiled without branches, and was reasonably fast still – 8s or 6.2s inline. Still, my lowly 1.5Ghz Pentium M creamed it in performance (well a little under 2x faster). The loop code (fls2()) which was the fastest on the pentium was dog-slow on a SPU. 10s or very strangely – 15s when inlined. It actually got a lot slower! The shifting version (fls4() – not included last time – rather slow – 15s on pentium) was atrocious – 46s. Which is quite interesting since something like it is often used in examples of implementing this function manually. Also remember my test always has the top 8 bits unset, which should improve the average case of the shift version.

unsigned int fls4(unsigned int v)
{
  register int n;

  for (n =0; v && n<31;n++)
    v >>= 1;

  return n;
}

The spu-specific instruction clz however (as expected) was very fast – once inlined. 5.0s as a function, and 0.43s inlined – beating the pentium but only by ~40%, and only when inlined. Not surprising – branches are a big hit with the SPU and the test loop is too small for software branch prediction to cut in properly. I was also timing the overhead of launching the spu task, which is hard to measure on its own.

As an exercise in compiler bashing, I tried using the SPU intrinsics to hand-code the same algorithm as used in fls() – see fls6(). This came out at a fairly respectable 6.0s or 2.8s inlined. So even using an identical algorithm, and still writing scalar code, it is pretty easy to achieve a 33% to 220% speedup over the compiler-only version.

unsigned int fls6(unsigned int vi)
{
  vec_uint4 v = spu_promote(vi, 0);
  vec_uint4 n = spu_splats((unsigned int)0);
  vec_uint4 cmp;
  vec_uint4 non;
  vec_uint4 von;

  non = spu_add(n, 16);
  von = spu_rlmask(v, -16);
  cmp = spu_cmpgt(v, (1<<16)-1);
  n = spu_sel(n, non, cmp);
  v = spu_sel(v, von, cmp);

  non = spu_add(n, 8);
  von = spu_rlmask(v, -8);
  cmp = spu_cmpgt(v, (1<<8)-1);
  n = spu_sel(n, non, cmp);
  v = spu_sel(v, von, cmp);

  non = spu_add(n, 4);
  von = spu_rlmask(v, -4);
  cmp = spu_cmpgt(v, (1<<4)-1);
  n = spu_sel(n, non, cmp);
  v = spu_sel(v, von, cmp);

  non = spu_add(n, 2);
  von = spu_rlmask(v, -2);
  cmp = spu_cmpgt(v, (1<<2)-1);
  n = spu_sel(n, non, cmp);
  v = spu_sel(v, von, cmp);

  non = spu_add(n, 1);
  cmp = spu_cmpgt(v, (1<<2)-1);
  n = spu_sel(n, non, cmp);

  return spu_extract(n, 0);
}

(PS This has not been fully tested to confirm that it returns the correct result).

The compiler didn’t even schedule the instructions ideally – at least according to spu_timing. It was quite good, but you could easily still get another 3 or 4 cycles (out about 50 iirc – so say >5%) just by re-scheduling a couple of instructions.

I also tried a slightly different SPU version, taking advantage of the shuffle bytes instruction. This version only requires 2 compares and does much of the work as a lookup table. However the overhead of getting things setup or preparing the shuffle tables meant it was only about 1/2 the speed of the one above. It’s timing is shown as fls7().

Summary of results (inline only)

                   fls()  fls2() fls3() fls4() fls5() fls6() fls7()
Pentium M (1.5)     3.5    4.0    0.6    15     9.3    -      -
SPU                 6.2   15      0.43   46     7.2    2.8    4.3

 * fls3() is the single-instruction that does the same thing, on each cpu

Anyway, what did I learn from this:

Different architectures can have very different performance characterstics (no surprise there – particularly on CELL). What works well on one might not work so well on another (e.g. small loops – fls2())
A generally good algorithm is probably generally good however. e.g. fls() uses a binary search and potentially no branches. If branchless it also has a fixed execution time.
Even with the same basic algorithm, implementation details matter. e.g. comparisons vs bit masking (fls() vs fls5()). Large bit masks may require expensive setup though non-immediate operands and/or physically larger instructions.
Compilers are NOT ‘nearly always smarter than you’ – despite the tired line being repeated many times by textbook writers and bloggers. Yes choosing the right algorithm is more important than the language, but conversely you can implement the same improved algorithm in any language. fls() vs fls6(). Or the same bad algorithm in any language.
A commonly used example implementation – one using shifts seems to actually run very poorly on modern hardware (fls4())
High level languages like C can still do with some help from specific CPU support not exposed or accessible directly from the language, particularly with bit operations. You’d have to count builtin functions like __builtin__clz() in this instance – they are not part of C, but extensions (I didn’t try it, but I presume it would be the same as the inline asm version at least). fls3().

Bits of lists

As I suggested I might, I’ve written up a little about some of the list sorting I was playing with lately.

It compares a non-recursive sorting algorithm to GList’s recursive algorithm, it compares an embedded node structure to GList’s internal node/external data structure, and it throws in an array sort for good measure. I try to analyse the results (often incorrectly or inconclusively) where they don’t match expectations in terms of memory access patterns (e.g. cache effects).

I guess it is work in progress – I’ve hacked it up a bit a few times so it isn’t as readable as it might be, and a few things are missing (like some definitions and the GList non-recursive sort code), but the current version can be found here.

unexpected bits of bits

After years of mind-numbing .net stuff I’ve started getting into hacking C again. Sorting, balanced trees, memory allocators, hash tables, playing with literate programming, all sorts of things. Mostly just for the intellectual challenge of it all too – nice change from tweaking buttons in a shitty ui.

This morning I was trying to write a bit of code to find the first bit set (from MSB) in a word and came up with some interesting and unexpected results with gcc on an x86.

First, there’s the simple way you’d think of doing it in c:
int fls2(unsigned int v) { register int n = 0;

for (n = 31; n>0;n–)
if (v & (1<<n))
return n;

return 0;
}
This turned out to be almost as fast as c code could go. All the timings I did were running the function against every number from 1 to 16 million – which isn’t even the average case for this function, and yet it outperformed everything except the x86 instruction which does the same thing. As a function it ran in 4.s, and inlined it made it in 4.0s.

My first go at ‘optimising’ it was to get rid of the branches and loops. So I wrote some code which does a binary search basically. But even though I believe this could be written branchless easily in x86 assembly, unfortunately I had no luck with gcc and it added branches for the comparison tests instead. Still, this ran ok – and for some bizarre reason, when it was inlined, it ran faster than the first one (fls2()) above. Normally 5.5s, but 3.5s when inlined. I couldn’t see any obvious reason why this version would perform so much differently when inlined to the previous one from the generated code.

int fls(unsigned int v) { register int n = 0; register unsigned int off;

off = (v >= (1<<16)) * 16;
n += off;
v >>= off;

off = (v >= (1<<8)) * 8;
n += off;
v >>= off;

off = (v >= (1<<4)) * 4;
n += off;
v >>= off;

off = (v >= (1<<2)) * 2;
n += off;
v >>= off;

off = (v >= (1<<1)) * 1;
n += off;

return n;
}
So I had another go at fooling the compiler into getting rid of the branches. And came up with the one below – it basically does exactly the same thing as the comparison version but encodes the comparisons as bit masks. But although it generated branchless code, it was much larger, and executed significantly slower than the others. Around 11s or 9s inlined.
static inline int fls5(unsigned int v) { register int n = 0; register int off;

off = ((v & 0xffff0000) != 0) * 16;
n += off;
v >>= off;
off = ((v & 0xff00) != 0) * 8;
n += off;
v >>= off;
off = ((v & 0xf0) != 0) * 4;
n += off;
v >>= off;
off = ((v & 0xc) != 0) * 2;
n += off;
v >>= off;
off = ((v & 0x2) != 0) * 1;
n += off;

return n;
}
But none of them beat the raw x86 instruction, which is of course what i’d probably have used.

unsigned int fls3(unsigned int v) { asm("bsrl %1, %0" : "=r" (v) : "r" (v));

return v;
}
1.3s or 0.6s inline. It is still interesting to see that a loop that is iterating on average (32-8)/2+8 = 20 loops is only 5x slower than a single instruction. I guess shifts on x86 are very slow (I don’t know much about x86 and never want to).

I then ‘discovered’, the __builtin_clz() function – and threw away all the tinkering above. But it was still interesting to see how many different ways of doing the same thing, and how much difference it makes to performance. And it shows that C can be lacking.

And i’ll probably never look at it again – ahh the freedom of a hobby.

(boy this editor still sucks for pasting code – what a stupid piece of software for a blog for coders)

Blah.

Why do so many car drivers suck? Selfish probably. Australia has become so selfish and nasty. Full of racists and bigots and crazy religious people. My days lately start pleasantly enough – it’s been warmer than it has for a while, I feed the cat, have a shower, water some pot plants usually. But all that serenity is destroyed by the short ride to work. Blah.

In unrelated news, i’ve still been hacking slowly away on the PS3. I’ve written a flat-shaded triangle and quadratic renderer, or at least played with writing one. Tile based thingy. Then i’ve been writing a minimal OpenGL call set which can be used to call it. I wrote a whole bunch of stuff one day after work but then I didn’t get enough sleep so I haven’t even tried to compile it yet. The PPU/SPU split really makes for some interesting design challenges – that’s on-top of the whole vectorising of code thing. Job queues, distribution, which bit of code does what, and so on.

For starters I thought i’d implement glDrawRangeElements – this lets the code easily do a minimal translation of points, and then convert these into primitives using indices. Basically I have some PPU code to setup and manage the various matrix stacks, vertex arrays and so forth, it then generates a job packet which contains the current state, and drops it to an SPU. Fortunately the SDK comes with some helper functions for the matrix setup and maniuplation. The intention is the SPU will load this packet in, then load in the vertex/etc array(s), and start number crunching while the index array is being DMA’d in. The vertices will be stored in vector float’s, in homogonised coordinates. The index array is ints.

So after i’ve done the vertex calculation, I then process 12 indices at a time to spit out sets of 4 triangles, performing 3d-2d conversion and whatnot. This is because you can only load quad-words from the SPU memory – I load 3 lots of indices so I can do aligned loads for each loop. But I need to work out what to do with clipping at some point – but that can wait for now. The other rasterisation algorithm I found didn’t need clipping at this stage. I then convert the data from array of structures (i.e. one X,Y,Z,W vertex per quad-word) to a structure of arrays – well I have 2 arrays, one for X and one for Y (in 2d coordinates), which simplifies the loads and calculations in the triangle rasteriser. I also store them in sets of 3 (i.e. each triangle), but aligned on quad-word boundaries, so I can easily load each triangle directly, and pre-convert them to fixed-point integers – since that is essentially free at this stage (due to the dual-issue pipeline). So many ways to do things, and each with their own trade-offs …

The vertex calculations are simple, it just goes through the quad-word aligned homogenous coordinates, multiplying them against the matrix as it goes. I unroll the loop to get some scheduling efficiencies, but it’s nothing complicated.

The triangle formulation is a bit trickier, because you effectively have 4 vertex indices with each load, but you only need 3 to form your triangle. So rather than have to do lots of nasty shifting and aligning, it’s easier to unroll the loop a few times until it re-aligns with memory. So the input indices are basically (aligned to quad-words):

 [ triangle 1 vertex 1 ] [ t1 v2 ] [t2 v3] [trinagle 2 vertex 1]
 [ triangle 2 vertex 2] [ t2 v3] [triangle 3 vertex 1] [t2 v2]
 [t3 v3] [ triangle 4 vertex 1] [ t4 v1] [ t4 v2]

So it needs to process 3×4 vertices to re-align the loop.

After the lookup and conversion it gets stored in the most efficient form for the triangle renderer – well so far, there are trade-offs there too. Sometimes the code needs to convert between AOS and SOA there too – but i’m still looking into that. But so far the idea is it gets converted to 2d and stored as 2x arrays:

x array:
 [triangle 1 x1] [ t1 x2] [t1 x3] [-unset-]
 [triangle 2 x1] [ t2 x2] [t2 x3] [-unset-]
 ...
y array:
 [triangle 1 y1] [ t1 y2] [t1 y3] [-unset-]
 [triangle 2 y1] [ t2 y2] [t2 y3] [-unset-]
 ...

And the same basic code can be used to do quads – actually since the whole quad-word is filled it makes some operations easier. e.g. to calculate x1-x2, x2-x3, x3-4, x4-x1 calculation, you just do a rotate-left and subtract. All 4 calculated in 2 instructions. For 3 vertices you need to do a vector permute which takes more than 1 instruction – although perhaps I could store ‘x1’ as the last element in the array – I already need to permute the data when I store it.

Ok, enough thinking out loud. Time to stop thinking and go back to work …

Job queue?

Well I was playing with the SPU code again tonight – trying to get it to spread the load across multiple SPUs. And how to get timing info out of it.

I dunno, about the best I could come up with was:

timing plot

i.e. not a particularly linear scaling. Although I have a feeling the kernel is doing some funky scheduling of the SPUs so I may not be getting real numbers after 3 active SPUs. This is for a 1920×1080 resolution version BTW.

I did a bunch of other things too – double buffered output, and implemented a ‘job queue’ mechanism. Each SPU calculates a pre-determined number of rows of the output, DMA’ing it back to the ‘frame buffer’ after each line is calculated – whilst calculating the next line. After it’s done all the lines in a ‘job’, it checks, using an atomic read/modify/write sequence, what the next line available is. If there are still lines to go, it then processes more. The ‘job queue’ is just this ‘next row to calculate’ number.

There is very little contention with ‘job queue’ – infact if I just check it after every line it’s only a bit slower than doing it after a larger number of lines, so that doesn’t appear to be a bottleneck. I dunno, I’ll have to keep playing I guess – it’s probably just kernel scheduling. Might have to try last night’s version again – that used static allocation.

My first CELL code

Well I finally wrote some code on the CELL. I originally thought I’d do a Mandelbrot calculator – it seemed a pretty simple thing to do, yet still demonstrate some vectorisation concepts. But when I was looking on the net – that seemed to be what everyone else thought too! So it put me off a bit … but then I thought I didn’t care, and I just did it anyway … and here’s the first output:

1/4 mandelbrot

(ok I cheated a little and post-processed it in gimp, but only ’cause I was too lazy to do it in the code).

The code was rather simple – I got stuck with converting the iteration counts to RGB values for a little while, and I forgot DMA was limited to 16K, but it’s all pretty straightforward after those silly mistakes. I don’t know how ‘fast’ a simple brute-force mandelbrot runs on a modern CPU – I know my old C=64 took most of a day to calculate 4x this size (320×200) in 1 bit glory :) A single SPU runs 128 copies of this 128×128 output in 1/3 of a second, including writing the 64K buffer back to main memory each time.

#define WIDTH (128)#define HEIGHT (128)unsigned char buffer[WIDTH * HEIGHT * 4] __attribute__((aligned(16)));

void mandelbrot() {
  vector unsigned char togrey = (vector unsigned char) { 0x80, 3, 3, 3, 0x80, 7, 7, 7, 0x80, 11, 11, 11, 0x80, 15, 15, 15 };
  vector float xnext = (vector float) { 4.0/WIDTH, 4.0/WIDTH, 4.0/WIDTH, 4.0/WIDTH };
  vector float ynext = (vector float) { 1.0/HEIGHT, 1.0/HEIGHT, 1.0/HEIGHT, 1.0/HEIGHT };
  vector float four = (vector float){4.0, 4.0, 4.0, 4.0};
  vector unsigned int iter;
  vector unsigned int maxiter;
  vector unsigned int one = (vector unsigned int){1, 1, 1, 1};

  maxiter = spu_splats((unsigned int)255);

  vector unsigned int * out = (vector unsigned int *)buffer;

  int ix, iy;
  vector float y0 = (vector float){ 0.0, 0.0, 0.0, 0.0 };

  for (iy = 0;iy < HEIGHT;iy += 1) {
    vector float x0 = (vector float){ 0.0, 1.0/WIDTH, 2.0/WIDTH, 3.0/WIDTH };

    for (ix = 0;ix<WIDTH;ix += 4) {
      vector float x = x0;
      vector float y = y0;
      vector unsigned int res0 = spu_splats((unsigned int)0);

      iter = spu_splats((unsigned int)0);
      x = x0;
      y = y0;
      do {
        vector float x2 = x*x;
        vector float y2 = y*y;
        vector float modulus = x2 + y2;
        vector unsigned int mask0 = spu_cmpgt(modulus, four);
        vector unsigned int mask1 = spu_cmpgt(iter, maxiter);

        if (spu_extract(spu_gather(spu_or(mask0, mask1)), 0) == 0x0f)
          break;

        vector float xy = x*y;
        vector float xtemp;

        res0 = spu_sel(iter, res0, mask0);
        xtemp = x2 - y2 + x0;
        y = xy + xy + y0;
        x = xtemp;
        iter = spu_add(iter, one);
      } while (1);

      x0 += xnext;

      // results in res0 as ints - convert to argb values, just use greyscale
      *out++ = spu_shuffle(res0, res0, togrey);
    }
    y0 += ynext;
  }
}

(boy this editor sucks for pasting code)

A few of the ‘funky’ vector lines which let this work on 4 pixels at once:

	if (spu_extract(spu_gather(spu_or(mask0, mask1)), 0) == 0x0f)   break;

This, together with the mask calculation, basically works out if all pixels have completed – i.e. all iterations are complete, or they have all reached their limit. The code basically always calculates the next value for ALL pixels, even those that have already reached their limit. But the next line means we just discard those we’ve already calculated:

	res0 = spu_sel(iter, res0, mask0);

It basically uses mask0 to select either the result we already have, or the next iteration value.

And finally there’s the output ‘function’ … It’s just doing a linear iteration count->greyscale value. But it converts the int (well, the lower-8 bits of the int) to an RGBA value, suitable for the framebuffer (actually i’m just dumping it to a pnm file).

      *out++ = spu_shuffle(res0, res0, togrey);

This uses the togrey vector to blat the result into 4 RGB values which can be written at once – the SPU only supports quad-word writes to the local memory.

The compiler doesn’t seem to do a terribly bad job at optimising this to registers and scheduling the instructions, although perhaps there is room for a small amount of improvement by unrolling the loop, or perhaps calculating more pixels at once – but either way you increase the number of potentially redundant calculations, so I wonder if it would really help in practice. I also noticed the spu_add(iter, one) is a bit silly – it could just be an ‘ai’ instruction. What probably could be done however is to reduce the output buffer size and DMA the result out in chunks asynchronously – so you don’t have the SPU idle just waiting for the DMA to finish. And of course using more than 1 SPU. The output may be 1 iteration out too, and I could try adding smooth-colouring – although with only single point precision … who cares – i’m testing concepts not writing fractint.

A few things to try I guess – maybe it was a good example to try first up after all.