Made good progress over the weekend on the mplayer vo module. Took a
bloody long time though – bugs in code, not understanding APIs, not
understanding how mplayer wants its vo drivers to work. But I got
there in the end – well, only just; it had me solidly occupied the whole
weekend, much to the chagrin of my housemates who wanted to play
games. Even managed to watch a couple of movies to test it out –
although I still can’t work out how to stop the fucking screen
blanking every 10 minutes under Ubuntu (or stop the log output to the
virtual console). Yeah, I googled, but couldn’t find any way to do it
programmatically – I can unblank but not disable it.
It can upscale an SD source to 1080p without problems (and without
tearing, oh happy day) – it is only bi-linear scaling, but that’s good
enough for now. There’s still plenty of scope for improvement –
e.g. now I need to work on a job queue mechanism rather than
loading/executing the whole spu programme each frame.
Although I knew DMAs needed to be 16-byte aligned, I kept forgetting
at first – the cause of much frustration and cursing. I also had
problems with the spu seeming to crash after a short time. After
about 200 frames it would just die but the rest of mplayer kept
running and executing the spu programme returned no error. I couldn’t
work out what was going on. It seems that once you load a programme
into the spu context, it isn’t really kept around forever – so it
would semi-randomly disappear at some future point in time. I think
this is something I’d previously ‘discovered’ but forgotten about
since I last did any spu coding.
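A minimal sketch of the alignment check that would have saved some of that cursing. It is plain portable C rather than SPU code, and the helper names (`is_dma_aligned`, `align_up_16`, `round_up_16`) are made up for illustration. For static buffers, gcc's `__attribute__((aligned(16)))` is the more usual route.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p sits on a 16-byte boundary (low 4 address bits are zero). */
static int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Rounds a pointer up to the next 16-byte boundary
   (returns p unchanged if it is already aligned). */
static void *align_up_16(void *p)
{
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Rounds a transfer size up to a multiple of 16 bytes. */
static size_t round_up_16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}
```

Asserting `is_dma_aligned()` on both ends of every transfer while debugging turns a silent failure into an obvious one.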
It was also a little difficult finding decent documentation on how to
embed spu binaries into ppe code: what I found often assumed you knew how,
didn’t go into that detail, or used some magic makefiles not specified
directly. I found make.footer in the sdk examples at last, and after
some manual runs distilled that into my own makefiles.
Although I had bugs in the spu code I had written/compiled before I
tried running it, at least they were not terminal issues. The
algorithms were sound, I was just out in little ways or just forgot to
finish bits of them. The YUV converter seems to need to do an awful
lot of work, but I guess that’s why it’s so slow without an spu. The
actual calculation is easy and simple, but clamping 64 values to
[0,255], for example, just takes a lot of instructions. So far I’m
sticking with a packed ARGB internal format in memory and 4 values of
each colour component (planar) when in registers – the loads vs the
loads/shuffles aren’t much different, but I’ll try an unpacked 32 bit
int/float planar/interleaved as well at some point. Took me some time
to get the nearest pixel horizontal scaler working – but that’s
because I tried to do too much in my head/by hit and miss without
writing it down and working out the bit numbers properly – I got the
main addressing working but stuffed up the sub-word addressing.
What’s nice is that all the calculations (and memory loads) are the
same for a linear scaler, it then just needs to add a pixel blending
step afterwards – so that fell out very easily once I got the
nearest-pixel scaler working.
The basic algorithm is a very simple one which uses fixed-point
integer arithmetic, so the inner loop is merely an add, with a shift
used to perform addressing.
void scalex_nearest(unsigned char *srcp, unsigned char *dstp, int inw, int outw)
{
    int scalexf = inw * (1<<SCALEBITS) / outw;  // fixed-point source step per output pixel
    int sxf = 0, dx = 0;

    while (dx < outw) {
        int sx = sxf >> SCALEBITS;              // integer part is the source index
        dstp[dx] = srcp[sx];
        sxf += scalexf;
        dx += 1;
    }
}
It may not be as accurate as it could be, but so far it’s accurate enough for what I need.
And it’s branchless which is important for vectorising.
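To see what "accurate enough" means in practice, here is the same loop as a self-contained sketch. `SCALEBITS` is assumed to be 14 (the value the worked example further down uses), and `last_source_index` is a hypothetical helper added only to show that the loop never reads past the input.

```c
#define SCALEBITS 14   /* assumption: 14 fraction bits, as in the worked example */

/* Nearest-pixel horizontal scale of one row of bytes. */
static void scalex_nearest(const unsigned char *srcp, unsigned char *dstp,
                           int inw, int outw)
{
    int scalexf = inw * (1 << SCALEBITS) / outw;   /* rounds down */
    int sxf = 0;

    for (int dx = 0; dx < outw; dx++) {
        dstp[dx] = srcp[sxf >> SCALEBITS];
        sxf += scalexf;
    }
}

/* Largest source index the loop reads for a given pair of widths. */
static int last_source_index(int inw, int outw)
{
    int scalexf = inw * (1 << SCALEBITS) / outw;
    return ((outw - 1) * scalexf) >> SCALEBITS;
}
```

Scaling 10 pixels to 15 gives a step of 10922. Output pixel 3 then lands on source pixel 1 rather than 2, because 3 * 10922 = 32766 falls just short of 2 << 14. That is the small inaccuracy you get from rounding the step down, and it is also why the last read stays inside the input.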
Converting this to vector/simd code is straightforward if not entirely trivial. And the spu has
some nice instructions to help – as you’d expect from a simd processor.
Basically the code calculates 4 adjacent pixels at once – the above
calculation is basically repeated 4 times, with a single pixel offset
for each column.
The basic scale value is the same, but it is a vector now:
vector int scalexf = spu_splats(inw * (1<<SCALEBITS) / outw);
// spu_splats - copy scalar to all vector elements
But the accumulator needs to be initialised with 4 different values –
each as if it were 1 further in the loop. i.e.
vector int sxf = (vector int) { 0, scalexf, scalexf*2, scalexf * 3 }; // scalexf here is the scalar scale value
And each loop we need to increment by 4 (this is a silly little detail
I knew about but forgot to put in the code assuming it was there –
much frustration).
scalexf = spu_sl(scalexf, 2); // spu_sl - element wise shift-left
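The accumulator set-up and the times-4 step can be checked off the Cell with a small portable stand-in. `vec4i` is a hypothetical struct standing in for `vector int`, and the two helpers are illustrative names, not anything from the SPU headers.

```c
typedef struct { int v[4]; } vec4i;   /* stand-in for the SPU's "vector int" */

/* Lane i starts i output pixels further along than lane 0. */
static vec4i sxf_init(int scalexf)
{
    vec4i r = { { 0, scalexf, scalexf * 2, scalexf * 3 } };
    return r;
}

/* One loop iteration produces 4 pixels, so every lane advances by
   4 steps: scalexf << 2. Forgetting this shift is the bug described above. */
static vec4i sxf_step(vec4i sxf, int scalexf)
{
    int step = scalexf << 2;
    for (int i = 0; i < 4; i++)
        sxf.v[i] += step;
    return sxf;
}
```

With a step of 10922 this yields { 0, 10922, 21844, 32766 } and then { 43688, 54610, 65532, 76454 }, the same values as the worked example that follows.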
It is in the inner loop where things get a bit more complicated. We
can only load single quadwords at a time from a single offset, so the
algorithm needs a bit of tweaking. Instead of loading 4 separate
addresses which the original algorithm would require, we can take
advantage of the fact we are only scaling up (i.e. we will never need
to access more words than we’re filling) and use the shuffle
instruction to perform sub-word addressing. The basic addressing thus
remains the same, and we just use the first value for this. This
gives us a base address from which we load 2 quadwords – the
subaddressing will always fit within these 8 values (we need 8 values
since we may have overlapping accesses). We are also referencing 4
values at once, so we have to multiply the address by 4 – or shift by
2 less (12 instead of 14).
Explanation of this by way of example:
example: scale from 10 to 15 in size, use 14 bits of scale
input  : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, a ] (in words)
scalexf will be 10 * (16384) / 15 = 10922
loop0:
    sxf     = [ 0, 10922, 21844, 32766 ]
    address = sxf[0] >> 12 = 0
    load0   = [ 1, 2, 3, 4 ]
    load1   = [ 5, 6, 7, 8 ]
loop1:
    sxf     = [ 43688, 54610, 65532, 76454 ]
    address = sxf[0] >> 12 = 10, which is 0 once quadword aligned
    load0   = [ 1, 2, 3, 4 ]
    load1   = [ 5, 6, 7, 8 ]
sxf needs to be remapped to a shuffle instruction to load the right pixels. If each element of sxf is shifted right by 14 we have:
offsets = sxf >> 14 = [ 0, 0, 1, 1 ]
This needs to be mapped into a shuffle instruction to access those words. The shuffle bytes instruction takes 2 registers and a control word. The registers are concatenated together to form a 32 byte array, and each byte of the control word selects one byte from this array, i.e. we need to get a shuffle pattern:
pattern = [ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7 ]
If result is the result vector represented as an array of 4-byte words, and source is the two source vectors represented as an 8 element array, then using this pattern is equivalent to:
result[0] = source[0]
result[1] = source[0]
result[2] = source[1]
result[3] = source[1]
The easiest way to do this is to duplicate each offset, stored in the lsb of each word, into all the other bytes of the same word. Shift everything left by 2 (*4) (or do it initially), and then add a previously initialised variable which holds the byte offsets within each word, i.e.
tmp     = offsets << 2 = int: [ 0, 0, 4, 4 ]
pattern = shuffle(tmp, tmp, ((vector unsigned char) { 3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15 })) = char: [ 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4 ]
pattern = pattern + spu_splats(0x00010203) = char: [ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7 ]
Which is the desired shuffle pattern. Getting the result is then a simple shuffle:
dstp[dx] = spu_shuffle(load0, load1, pattern);
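For anyone wanting to check a pattern without a PS3 to hand, the byte-select behaviour can be emulated in portable C. This is only a sketch: `vec16b`, `shuffle_bytes` and `make_pattern` are hypothetical names, and the special control-byte values that make the real instruction emit constants are not modelled.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16b;   /* stand-in for one 16-byte quadword */

/* Plain byte select: each control byte indexes the 32-byte
   concatenation of a and b. */
static vec16b shuffle_bytes(vec16b a, vec16b b, vec16b pattern)
{
    uint8_t both[32];
    vec16b r;

    memcpy(both, a.b, 16);
    memcpy(both + 16, b.b, 16);
    for (int i = 0; i < 16; i++)
        r.b[i] = both[pattern.b[i] & 31];
    return r;
}

/* Builds the control word for four word offsets (each 0..7):
   the word's byte offset copied into all 4 bytes, plus 0,1,2,3. */
static vec16b make_pattern(const int word_offset[4])
{
    vec16b p;

    for (int w = 0; w < 4; w++)
        for (int k = 0; k < 4; k++)
            p.b[w * 4 + k] = (uint8_t)((word_offset[w] << 2) + k);
    return p;
}
```

Feeding the offsets { 0, 0, 1, 1 } into `make_pattern` reproduces the pattern derived above, and shuffling two quadwords holding the bytes 0..31 with it returns the pattern itself, which makes a handy self-check.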
That is the basic algorithm anyway – there are some added complications in that the offsets aren’t going to be just sxf shifted except for the first column, and they need to be relative to the quadword addressing. So the offset calculation is a little more like:
addr = sxf >> (14-4+2)      // byte address of each pixel; addr[0] is the address of the quad-word we're to load (pass this directly to lqx instruction, it will ignore the lower 4 bits for free)
offset = addr & (-16)       // mask out the lower 4 bits
offset = addr - offset[0]   // find out the offsets of each member of the vector relative to the first one's quad-word address (shuffle is used to copy offset[0] to all members)
offset = offset >> 2        // drop the lower 2 bits to get word offsets, since we're addressing 4-byte words
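The same calculation as a scalar sketch in portable C, one lane at a time. `pattern_from_sxf` is a hypothetical helper and `SCALEBITS` is assumed to be 14. It folds the shift right by 2 and the later shift left by 2 into a single mask, which should be equivalent for the values involved here.

```c
#include <stdint.h>

#define SCALEBITS 14   /* assumption: 14 fraction bits, as in the example */

/* From four fixed-point positions, fills in the 16 shuffle control
   bytes and returns the quadword-aligned byte address to load from. */
static int pattern_from_sxf(const int sxf[4], uint8_t pattern[16])
{
    int shift = SCALEBITS - 2;            /* 12: pixel index * 4 bytes */
    int base = (sxf[0] >> shift) & -16;   /* aligned address of the first load */

    for (int lane = 0; lane < 4; lane++) {
        int addr = sxf[lane] >> shift;    /* low 2 bits are still fraction */
        int rel = (addr - base) & ~3;     /* word-aligned offset into the 32 loaded bytes */
        for (int k = 0; k < 4; k++)
            pattern[lane * 4 + k] = (uint8_t)(rel + k);
    }
    return base;
}
```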
So for both loops of the example:
loop0:
    addr    = sxf >> 12            = [ 0, 2, 5, 7 ]
    offset  = addr & (-16)         = [ 0, 0, 0, 0 ]
    offset  = addr - offset[0]     = [ 0, 2, 5, 7 ]
    offset  = offset >> 2          = [ 0, 0, 1, 1 ]
    tmp     = offset << 2          = [ 0, 0, 4, 4 ]
    pattern = shuffle(tmp)         = [ 0,0,0,0, 0,0,0,0, 4,4,4,4, 4,4,4,4 ]
    pattern += splats(0x00010203)  = [ 0,1,2,3, 0,1,2,3, 4,5,6,7, 4,5,6,7 ]
loop1:
    addr    = sxf >> 12            = [ 10, 13, 15, 18 ]
    offset  = addr & (-16)         = [ 0, 0, 0, 16 ]
    offset  = addr - offset[0]     = [ 10, 13, 15, 18 ]
    offset  = offset >> 2          = [ 2, 3, 3, 4 ]
    tmp     = offset << 2          = [ 8, 12, 12, 16 ]
    pattern = shuffle(tmp)         = [ 8 (*4), 12 (*4), 12 (*4), 16 (*4) ]
    pattern += splats(0x00010203)  = [ 8,9,a,b, c,d,e,f, c,d,e,f, 10,11,12,13 ] (in hex)
And writing this down I now see there’s a couple of redundant shifts happening,
so either my memory of the code is wrong, or I can simplify the code here.
And the result of 2 loops would be:
output = [ 1, 1, 2, 2, 3, 4, 4, 5 ] (in words = 4xbytes each)
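Pulling the steps so far into one place, a portable sketch of the 4-at-a-time nearest scaler on 32-bit pixels might look like the following. It is not the SPU code: the two quadword loads are stood in for by a pointer to the aligned group of words, the shuffle by a plain index, and `scalex_nearest4` is a made-up name. It assumes we are only scaling up and that the output width is a multiple of 4.

```c
#include <stdint.h>

#define SCALEBITS 14   /* assumption: 14 fraction bits, as in the example */

/* Upscales one row of 32-bit pixels, 4 outputs per iteration.
   Assumes outw >= inw and outw is a multiple of 4. */
static void scalex_nearest4(const uint32_t *src, uint32_t *dst, int inw, int outw)
{
    int scalexf = inw * (1 << SCALEBITS) / outw;
    int sxf[4] = { 0, scalexf, scalexf * 2, scalexf * 3 };

    for (int dx = 0; dx < outw; dx += 4) {
        /* First word of the aligned quadword holding lane 0's pixel. */
        int base = (sxf[0] >> SCALEBITS) & ~3;
        /* Stands in for load0 and load1. A real pair of quadword loads
           touches all 8 words, so the real buffer would need padding. */
        const uint32_t *loaded = src + base;

        for (int lane = 0; lane < 4; lane++) {
            int rel = (sxf[lane] >> SCALEBITS) - base;   /* 0..7 when scaling up */
            dst[dx + lane] = loaded[rel];                /* the shuffle's job */
            sxf[lane] += scalexf << 2;                   /* advance 4 pixels */
        }
    }
}
```

Scaling 8 pixels to 12 gives the same step of 10922 as the 10-to-15 example, so the first 8 outputs match the result shown above.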
So put it all together and that’s about it (ok, strictly you need to worry about
masking out the last results beyond what you want, but I’ll assume the input
and output size are both multiples of 4). Now the nice thing about
using the fixed point calculation is that it already gives you the
fractional part of the address as a by-product, and in a form suitable
for our calculations. So adding linear interpolation is at least in
this sense, free. All we need to do is load up a vector which has
every *next* pixel in it, and then interpolate based on the fractional
part in sxf. We can use the shuffle pattern previously obtained,
offset by 1, to get the second value – it will always address within
the 8 values already loaded so we needn’t even perform another memory
load.
patternnext = pattern + spu_splats(0x04040404)       // address the next word of every word
valuenext = spu_shuffle(load0, load1, patternnext)   // and we have it
// perform blending - valueout = value + ( valuenext - value ) * scale
diff = valuenext - value                             // get signed difference
scale = sxf & 16383                                  // mask out fractional part
offset = (diff * scale) >> 14                        // perform fixed-point fractional multiply
valueout = value + offset                            // and that's it
(In reality, since I’m storing a packed format, I need to first unpack
the argb values into separate planes of each component, perform the
diff/scale/offset/valueout calculation once for each component, and
then re-pack the data.)
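A scalar sketch of that blend, including the unpack/re-pack for a packed 32-bit ARGB pixel. `lerp_fixed` and `lerp_argb` are hypothetical names and `SCALEBITS` is assumed to be 14. The shift of a negative product relies on an arithmetic right shift, which mainstream compilers provide but the C standard leaves implementation-defined.

```c
#include <stdint.h>

#define SCALEBITS 14   /* assumption: 14 fraction bits, as in the example */

/* Blends one component: value + (valuenext - value) * fraction,
   where the fraction is the low SCALEBITS bits of sxf. */
static int lerp_fixed(int value, int valuenext, int sxf)
{
    int diff = valuenext - value;                 /* signed difference */
    int scale = sxf & ((1 << SCALEBITS) - 1);     /* fractional part */
    int offset = (diff * scale) >> SCALEBITS;     /* fixed-point multiply */
    return value + offset;
}

/* Unpacks the four 8-bit components, blends each, and re-packs. */
static uint32_t lerp_argb(uint32_t p, uint32_t pnext, int sxf)
{
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        int c = (int)((p >> shift) & 0xff);
        int cn = (int)((pnext >> shift) & 0xff);
        out |= (uint32_t)lerp_fixed(c, cn, sxf) << shift;
    }
    return out;
}
```

Since the fraction is always below 1, the result stays between the two inputs, so no clamping is needed at this step.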
Things to think about: If the code used shorts it could process 8
values at once instead of 4 – except the multiplies can only process 4
at once, so it would complicate these operations a little and require
additional packing instructions. It would probably be a win if the data was
stored as 16 bit fixed-point integer planes. Floats could be used – which
simplifies some of the work (but not much), but complicates the
addressing slightly, since it needs to be converted to integers for
that. Floats would only make sense if it were used as an intermediate
buffer format, and we were working on unpacked arrays of bitplanes.
Also note that the spu has no integer division, so the initial setup
of scalexf is messy – I did this using floats and then converted to
integers to do the work – either way it isn’t a big issue since it
only needs to be done once (using floats creates smaller code). Many
of the values could be pre-calculated with a single row of data,
although this then turns some instructions into loads, which may flood
the load/store pipeline. Maybe some combination could be used,
e.g. calculate the blending values/addressing manually and load the
shuffle values from a table.
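For the one-off setup, the float route can be sketched in portable C as below. `scale_factor_float` is a made-up name and `SCALEBITS` is assumed to be 14. Float rounding could in principle differ by one from true integer division for some sizes, which would be worth checking against the widths actually in use.

```c
#define SCALEBITS 14   /* assumption: 14 fraction bits, as in the example */

/* Computes the fixed-point step inw * 2^SCALEBITS / outw using a
   float divide and a truncating conversion back to int. */
static int scale_factor_float(int inw, int outw)
{
    return (int)((float)inw * (float)(1 << SCALEBITS) / (float)outw);
}
```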
Dunno what this fucked editor has done to the alignment – everything aligned when I typed it into emacs. And for some stupid reason, html input isn’t html – it honours carriage returns too. I give up.