Well I finally wrote some code on the CELL. I originally thought I’d do a Mandelbrot calculator – it seemed a pretty simple thing to do, yet still demonstrate some vectorisation concepts. But when I was looking on the net – that seemed to be what everyone else thought too! So it put me off a bit … but then I thought I didn’t care, and I just did it anyway … and here’s the first output:
(ok I cheated a little and post-processed it in gimp, but only ’cause I was too lazy to do it in the code).
The code was rather simple – I got stuck with converting the iteration counts to RGB values for a little while, and I forgot DMA was limited to 16K, but it’s all pretty straightforward after those silly mistakes. I don’t know how ‘fast’ a simple brute-force mandelbrot runs on a modern CPU – I know my old C=64 took most of a day to calculate 4x this size (320×200) in 1 bit glory :) A single SPU runs 128 copies of this 128×128 output in 1/3 of a second, including writing the 64K buffer back to main memory each time.
#define WIDTH (128)#define HEIGHT (128)unsigned char buffer[WIDTH * HEIGHT * 4] __attribute__((aligned(16))); void mandelbrot() { vector unsigned char togrey = (vector unsigned char) { 0x80, 3, 3, 3, 0x80, 7, 7, 7, 0x80, 11, 11, 11, 0x80, 15, 15, 15 }; vector float xnext = (vector float) { 4.0/WIDTH, 4.0/WIDTH, 4.0/WIDTH, 4.0/WIDTH }; vector float ynext = (vector float) { 1.0/HEIGHT, 1.0/HEIGHT, 1.0/HEIGHT, 1.0/HEIGHT }; vector float four = (vector float){4.0, 4.0, 4.0, 4.0}; vector unsigned int iter; vector unsigned int maxiter; vector unsigned int one = (vector unsigned int){1, 1, 1, 1}; maxiter = spu_splats((unsigned int)255); vector unsigned int * out = (vector unsigned int *)buffer; int ix, iy; vector float y0 = (vector float){ 0.0, 0.0, 0.0, 0.0 }; for (iy = 0;iy < HEIGHT;iy += 1) { vector float x0 = (vector float){ 0.0, 1.0/WIDTH, 2.0/WIDTH, 3.0/WIDTH }; for (ix = 0;ix<WIDTH;ix += 4) { vector float x = x0; vector float y = y0; vector unsigned int res0 = spu_splats((unsigned int)0); iter = spu_splats((unsigned int)0); x = x0; y = y0; do { vector float x2 = x*x; vector float y2 = y*y; vector float modulus = x2 + y2; vector unsigned int mask0 = spu_cmpgt(modulus, four); vector unsigned int mask1 = spu_cmpgt(iter, maxiter); if (spu_extract(spu_gather(spu_or(mask0, mask1)), 0) == 0x0f) break; vector float xy = x*y; vector float xtemp; res0 = spu_sel(iter, res0, mask0); xtemp = x2 - y2 + x0; y = xy + xy + y0; x = xtemp; iter = spu_add(iter, one); } while (1); x0 += xnext; // results in res0 as ints - convert to argb values, just use greyscale *out++ = spu_shuffle(res0, res0, togrey); } y0 += ynext; } }
(boy this editor sucks for pasting code)
A few of the ‘funky’ vector lines which let this work on 4 pixels at once:
if (spu_extract(spu_gather(spu_or(mask0, mask1)), 0) == 0x0f) break;
This, together with the mask calculation, basically works out if all pixels have completed – i.e. all iterations are complete, or they have all reached their limit. The code basically always calculates the next value for ALL pixels, even those that have already reached their limit. But the next line means we just discard those we’ve already calculated:
res0 = spu_sel(iter, res0, mask0);
It basically uses mask0 to select either the result we already have, or the next iteration value.
And finally there’s the output ‘function’ … It’s just doing a linear iteration count->greyscale value. But it converts the int (well, the lower-8 bits of the int) to an RGBA value, suitable for the framebuffer (actually i’m just dumping it to a pnm file).
*out++ = spu_shuffle(res0, res0, togrey);
This uses the togrey vector to blat the result into 4 RGB values which can be written at once – the SPU only supports quad-word writes to the local memory.
The compiler doesn’t seem to do a terribly bad job at optimising this to registers and scheduling the instructions, although perhaps there is room for a small amount of improvement by unrolling the loop, or perhaps calculating more pixels at once – but either way you increase the number of potentially redundant calculations, so I wonder if it would really help in practice. I also noticed the spu_add(iter, one) is a bit silly – it could just be an ‘ai’ instruction. What probably could be done however is to reduce the output buffer size and DMA the result out in chunks asynchronously – so you don’t have the SPU idle just waiting for the DMA to finish. And of course using more than 1 SPU. The output may be 1 iteration out too, and I could try adding smooth-colouring – although with only single point precision … who cares – i’m testing concepts not writing fractint.
A few things to try I guess – maybe it was a good example to try first up after all.