Blah.

Why do so many car drivers suck? Selfish probably. Australia has become so selfish and nasty. Full of racists and bigots and crazy religious people. My days lately start pleasantly enough – it’s been warmer than it has for a while, I feed the cat, have a shower, water some pot plants usually. But all that serenity is destroyed by the short ride to work. Blah.

In unrelated news, I’ve still been hacking slowly away on the PS3. I’ve written a flat-shaded triangle and quad renderer, or at least played with writing one. Tile based thingy. Then I’ve been writing a minimal OpenGL call set which can be used to call it. I wrote a whole bunch of stuff one day after work but then I didn’t get enough sleep so I haven’t even tried to compile it yet. The PPU/SPU split really makes for some interesting design challenges – that’s on top of the whole vectorising of code thing. Job queues, distribution, which bit of code does what, and so on.

For starters I thought I’d implement glDrawRangeElements – this lets the code easily do a minimal translation of points, and then convert these into primitives using indices. Basically I have some PPU code to set up and manage the various matrix stacks, vertex arrays and so forth; it then generates a job packet which contains the current state, and drops it to an SPU. Fortunately the SDK comes with some helper functions for the matrix setup and manipulation. The intention is the SPU will load this packet in, then load in the vertex/etc array(s), and start number crunching while the index array is being DMA’d in. The vertices will be stored in vector floats, in homogeneous coordinates. The index array is ints.

So after I’ve done the vertex calculation, I then process 12 indices at a time to spit out sets of 4 triangles, performing 3d-2d conversion and whatnot. This is because you can only load quad-words from the SPU memory – I load 3 lots of indices so I can do aligned loads for each loop. I need to work out what to do with clipping at some point, but that can wait for now. The other rasterisation algorithm I found didn’t need clipping at this stage. I then convert the data from array of structures (i.e. one X,Y,Z,W vertex per quad-word) to a structure of arrays – well, I have 2 arrays, one for X and one for Y (in 2d coordinates), which simplifies the loads and calculations in the triangle rasteriser. I also store them in sets of 3 (i.e. each triangle), but aligned on quad-word boundaries, so I can easily load each triangle directly, and pre-convert them to fixed-point integers – since that is essentially free at this stage (due to the dual-issue pipeline). So many ways to do things, and each with their own trade-offs …

The vertex calculations are simple: it just goes through the quad-word aligned homogeneous coordinates, multiplying them against the matrix as it goes. I unroll the loop to get some scheduling efficiencies, but it’s nothing complicated.
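A minimal sketch of that loop in plain scalar C, not SPU intrinsics. The names `vec4`, `mat4`, `transform_one` and `transform_vertices` are made up for illustration, and the column-major matrix layout and the unroll factor of four are assumptions rather than what the real code does.

```c
#include <assert.h>
#include <stddef.h>

/* One homogeneous vertex per 16-byte quad-word: x, y, z, w. */
typedef struct { float v[4]; } vec4;

/* Hypothetical column-major 4x4 matrix, indexed m[column][row]. */
typedef struct { float m[4][4]; } mat4;

/* Multiply one vertex by the matrix: out = M * in. */
static vec4 transform_one(const mat4 *mat, vec4 in)
{
    vec4 out;
    for (int row = 0; row < 4; row++) {
        out.v[row] = mat->m[0][row] * in.v[0]
                   + mat->m[1][row] * in.v[1]
                   + mat->m[2][row] * in.v[2]
                   + mat->m[3][row] * in.v[3];
    }
    return out;
}

/* Transform a whole vertex array, unrolled by four so a scheduler has
 * independent work to interleave. count must be a multiple of four;
 * a real version would pad the array or handle the tail separately. */
static void transform_vertices(const mat4 *mat, const vec4 *in,
                               vec4 *out, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        out[i + 0] = transform_one(mat, in[i + 0]);
        out[i + 1] = transform_one(mat, in[i + 1]);
        out[i + 2] = transform_one(mat, in[i + 2]);
        out[i + 3] = transform_one(mat, in[i + 3]);
    }
}
```

On the SPU each `vec4` would be a single vector float and the inner sum would become a handful of multiply-add instructions, but the shape of the loop is the same.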

The triangle formulation is a bit trickier, because you effectively have 4 vertex indices with each load, but you only need 3 to form your triangle. So rather than have to do lots of nasty shifting and aligning, it’s easier to unroll the loop a few times until it re-aligns with memory. So the input indices are basically (aligned to quad-words):

 [triangle 1 vertex 1] [t1 v2] [t1 v3] [triangle 2 vertex 1]
 [triangle 2 vertex 2] [t2 v3] [triangle 3 vertex 1] [t3 v2]
 [triangle 3 vertex 3] [triangle 4 vertex 1] [t4 v2] [t4 v3]

So it needs to process 3×4 indices (3 quad-words, i.e. 4 triangles) to re-align the loop.
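Roughly, the unrolled loop looks something like the following plain C sketch. `point2`, `triangle2`, `make_triangle` and `assemble_triangles` are hypothetical names, the vertices are assumed to be already projected to 2d, and the index count is assumed to be a multiple of 12.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } point2;       /* hypothetical projected vertex */
typedef struct { point2 p[3]; } triangle2;   /* three corners of one triangle */

/* Look up three vertices by index and bundle them into a triangle. */
static triangle2 make_triangle(const point2 *verts,
                               unsigned int a, unsigned int b, unsigned int c)
{
    triangle2 t;
    t.p[0] = verts[a];
    t.p[1] = verts[b];
    t.p[2] = verts[c];
    return t;
}

/* Consume the index array 12 entries at a time: three aligned
 * four-index loads (q0, q1, q2) yield exactly four triangles, so every
 * iteration starts on a quad-word boundary again.
 * Returns the number of triangles written to out. */
static size_t assemble_triangles(const point2 *verts,
                                 const unsigned int *indices,
                                 size_t index_count, triangle2 *out)
{
    assert(index_count % 12 == 0);
    size_t t = 0;
    for (size_t i = 0; i < index_count; i += 12) {
        const unsigned int *q0 = indices + i;   /* first quad-word  */
        const unsigned int *q1 = q0 + 4;        /* second quad-word */
        const unsigned int *q2 = q0 + 8;        /* third quad-word  */
        out[t + 0] = make_triangle(verts, q0[0], q0[1], q0[2]);
        out[t + 1] = make_triangle(verts, q0[3], q1[0], q1[1]);
        out[t + 2] = make_triangle(verts, q1[2], q1[3], q2[0]);
        out[t + 3] = make_triangle(verts, q2[1], q2[2], q2[3]);
        t += 4;
    }
    return t;
}
```

Only the second and third triangles straddle a quad-word boundary; the first and last each come out of a single load.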

After the lookup and conversion it gets stored in the most efficient form for the triangle renderer – well, so far; there are trade-offs there too. Sometimes the code needs to convert between AOS and SOA there as well, but I’m still looking into that. So far the idea is it gets converted to 2d and stored as two arrays:

x array:
 [triangle 1 x1] [ t1 x2] [t1 x3] [-unset-]
 [triangle 2 x1] [ t2 x2] [t2 x3] [-unset-]
 ...
y array:
 [triangle 1 y1] [ t1 y2] [t1 y3] [-unset-]
 [triangle 2 y1] [ t2 y2] [t2 y3] [-unset-]
 ...
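A sketch of how one triangle might be written into that layout, in plain C. The `quadword` struct, `to_fixed`, `store_triangle_soa` and the choice of 4 fractional bits for the fixed-point format are all assumptions for illustration; the post does not say which fixed-point format is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIXED_SHIFT 4   /* assumed: 4 fractional bits */

/* Stand-in for one 16-byte SPU quad-word holding four 32-bit lanes. */
typedef struct { int32_t lane[4]; } quadword;

/* Convert a float coordinate to fixed point (truncates toward zero). */
static int32_t to_fixed(float f)
{
    return (int32_t)(f * (float)(1 << FIXED_SHIFT));
}

/* Store triangle number t: its three x values go into one quad-word of
 * the x array, its three y values into one quad-word of the y array.
 * The fourth lane is unused; it is zeroed here only to keep the
 * contents deterministic. */
static void store_triangle_soa(quadword *xs, quadword *ys, size_t t,
                               const float x[3], const float y[3])
{
    assert(xs != NULL && ys != NULL);
    for (int k = 0; k < 3; k++) {
        xs[t].lane[k] = to_fixed(x[k]);
        ys[t].lane[k] = to_fixed(y[k]);
    }
    xs[t].lane[3] = 0;
    ys[t].lane[3] = 0;
}
```

The rasteriser can then fetch all the x (or y) values of a triangle with a single aligned load.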

And the same basic code can be used to do quads – actually, since the whole quad-word is filled, it makes some operations easier. E.g. to calculate x1-x2, x2-x3, x3-x4, x4-x1, you just do a rotate-left and subtract: all 4 calculated in 2 instructions. For 3 vertices you need to do a vector permute, which takes more than 1 instruction – although perhaps I could store ‘x1’ as the last element in the array, since I already need to permute the data when I store it.
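The rotate-and-subtract trick, emulated lane by lane in plain C. `quadword`, `rotate_left_words`, `sub_words` and `edge_deltas` are hypothetical names; each helper stands in for what would be one vector instruction on the SPU.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one 16-byte quad-word holding four 32-bit lanes. */
typedef struct { int32_t lane[4]; } quadword;

/* Rotate the four lanes left by n whole words:
 * with n = 1, [a b c d] becomes [b c d a]. */
static quadword rotate_left_words(quadword v, unsigned n)
{
    assert(n < 4);
    quadword r;
    for (unsigned i = 0; i < 4; i++)
        r.lane[i] = v.lane[(i + n) % 4];
    return r;
}

/* Lane-wise subtraction: a - b. */
static quadword sub_words(quadword a, quadword b)
{
    quadword r;
    for (unsigned i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* [x1 x2 x3 x4] -> [x1-x2, x2-x3, x3-x4, x4-x1] */
static quadword edge_deltas(quadword x)
{
    return sub_words(x, rotate_left_words(x, 1));
}
```

If a triangle is stored as [x1 x2 x3 x1], the same two steps give [x1-x2, x2-x3, x3-x1, 0], which is the "store x1 as the last element" idea.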

Ok, enough thinking out loud. Time to stop thinking and go back to work …
