Linux is bloated

I’ve literally spent all day trying to install Fedora Core 8 on my old laptop – the one I used to (fairly easily) develop Evolution on. The default install was too slow and bloated to be remotely usable (it took about 2 hours to boot the live-cd and get the installer running from the console). Got xfce installed (after some mucking about) – barely better. Got rid of the bloat of gdm in favour of xdm – and now the desktop is back to gnome. Sigh. Hacked around for ages to get xfce to run (there seems to be no obvious way to make it work – and while I was there I noticed the twm login won’t ever work either). What’s all this python shit sucking up 50mb of precious ram, doing stuff I don’t give a shit about? Sigh. chmod -x /usr/bin/python* … hoorah, finally I can log in in under 2 minutes. That cut my login overhead in half – the other half is still too much – xfce is rather bloated itself. Pity the python chmod broke yum.

Yum isn’t so yummy after all. Re-enabled python so I could run yum. Wow, 120mb of vm to install a couple of packages. Not bad considering the box only has 128mb. This is crap.

Hmm, should I try xubuntu – or will it be just as crappy and bloated and blighted by python poo?

Guys – python isn’t a fucking system language. Just like perl, it isn’t suitable for long-lived applications (or crapplets) either. Use it to write your shitty little throw-away perl scripts in a different syntax, but don’t fuck up linux desktops with this rubbish.

blast from the past

I thought I’d try something a bit different and recreate some Amiga demo routines I wrote 15 years ago (hmm, has it been that long?).

tunnel image

Hmm, still some way to go … The next bit of this routine is going to be tricky without the blitter and a bitplane oriented display …

sigh

I noticed emacs was a bit funny on my ps3 – it just didn’t look right and behaved oddly. The scrollbar didn’t work properly, and now you can scroll too far down. And what’s with this rubbish where text selected with the mouse suddenly vanishes when you hit delete? That isn’t emacs behaviour.

Oh. Gtk+. I see. Sigh. I guess its menu-bar can’t cope with emacs’ model, or whoever coded it decided it didn’t need changing.

Or maybe it’s just ubuntu. The delete thing seems to be an option that’s off by default according to the manual – although the setting seems to have no effect on the delete key on ubuntu’s version.

While searching around for a fix I found some idiot had come up with the idea of modernising emacs … If I wanted to use a modern editor I’d just use one – there are dozens of the bloody things out there. But if you want emacs, you bloody well want emacs.

I upgraded my laptop to 2008-April, and I had to fix a few other issues with firefox too – I can’t get the fonts to work the way I’d got them to work before, which is a pity. And the stupidly ugly animated sliding tab thing really pissed me off till I found the undocumented option to turn it off. Why anyone thinks that animating any part of the ui toolkit is ‘cool’ is beyond me. When I click on a button I want to see the results as instantly as possible, not after waiting to the end of some stupid and pointless animation that just gets in the way. It might look ‘cute’ once, but even that’s a maybe. Thankfully it still has an emacs with no gtk in it – although I wonder how long that will last – the comment on the emacs package I really wanted said I probably didn’t want it, and suggested I choose the gtk one instead.

gnu 1, windows 0

Sometimes it’s the little things.

I needed to scan a page into the computer to send in an email, and I thought ‘no worries, there’s a scanner at work doing nothing, I’ll give that a go’. Oops, we only have Windows XP x86_64 and the drivers won’t work. Fortunately I have a couple of linux distros installed on one of the machines which I never use, so I thought I’d give one a go. It didn’t work first time, and it wasn’t the easiest thing in the world – e.g. I had to run xsane as root, and I had to get some firmware from the windows installer (no problem actually, I just used wine to ‘install’ the 32 bit package). But once I did all that – worked fine. Installed the firmware, fired it up, got a scan. Email sent. Job done.

XP 64 bit? Up shit creek without a paddle.

spu jobs, more mplayer

Got my job queue stuff going. In the end I made a few changes, and it only works as a single reader/writer, but it is lockless, and I can interrupt-wait on the ppe.

SPE’s have atomic operations which go through the DMA controller – they work on 128 bytes of data atomically, which gives an implementation a lot of flexibility. Unfortunately, although the PPE’s atomic instructions are compatible, they only work on 4 bytes at a time (or 8 in 64 bit mode). So if you need to talk to the PPE too it limits the algorithm choice very much. On a side note, whilst searching on google for background information I noticed that Sony has a software patent on … get this … using a SPU to remotely perform larger atomic operations for the PPE. Pretty stupid eh? They build a platform which has intentional limitations and then patent the only obvious (and afaict only) way to get around its shortcomings. The last paragraph also states ‘this patent also covers the very idea, not just this implementation’ – which I find a little hard to swallow. Ahh software patents, crap one day, fucked the next.

Anyway, I didn’t want the overhead of signalling a SPE, and 32 bits was enough for what I did.

The SPE atomic operations can be waited on (or invoke an interrupt handler), so they are all that is needed to ‘signal’ the SPE. They can wait on a reservation-lost event, which lets them listen for writes to the cache line. The PPE is not so flexible. The reservation and release instructions only work on one address at a time and have no interrupt or wait support. So another mechanism must be used to signal the PPE of changes of state (other than using a busy loop). I chose to use an SPE output interrupting mailbox. This signals data on the mailbox using an interrupt, so it can be waited on without polling. Thus I use a mailbox just to signal to the PPE that some jobs have been completed, and it polls the control information to find out the detail.

I use a 32 bit control word consisting of 2 16 bit values. One is the count of jobs yet to be completed, and the other is the current index of the head of the queue – where in the rotating queue buffer the next job will be allocated from. This must be updated atomically to perform an enqueue operation. Another value required is the tail of the queue – where the consumer is at in processing the queue. Since this is only ever updated by the single consumer it doesn’t need to be atomic, but since we need to load/store 128 bytes anyway we may as well put them all together.

The base address of the queue is also handy for the SPE – where the job data is actually stored. And a pre-calculated mask is included which trivialises the address calculation as we loop through the rotational buffer.

This gives the following queue header:

struct _queue_header {
   // the queue base effective address - here for convenience
   unsigned long queue_ea;
   // control word - atomically updated by PPE for enqueue
   unsigned int control;
   // where the consumer is at in the queue
   unsigned short tail;
   // mask to wrap the queue easily
   unsigned short queue_size_mask;
};
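As a scalar sketch of how the control word splits into its two halves (the helper names here are mine, not from the original code):

```c
/* Hypothetical helpers - not from the original code - for the 32 bit
 * control word: jobs-yet-to-complete in the high 16 bits, queue head
 * index in the low 16 bits. */
static unsigned int control_pack(unsigned short jobcount, unsigned short head)
{
    return ((unsigned int)jobcount << 16) | head;
}

static unsigned short control_jobcount(unsigned int control)
{
    return (unsigned short)(control >> 16);
}

static unsigned short control_head(unsigned int control)
{
    return (unsigned short)(control & 0xffff);
}
```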

So we need a few basic operations:

  1. PPE side, enqueue a job – sleep if the queue is full
  2. PPE side, wait for any job to complete
  3. PPE side, wait for all jobs to complete
  4. SPE side, wait for a job to arrive
  5. SPE side, de-queue a job
  6. SPE side, indicate the job is completed

These can be implemented using the atomic operations on both the PPE and SPE pretty simply.

  1. Enqueue a job.

    loop:
      read control word with reservation
      extract jobcount (control >> 16)
      if jobcount == queue size
         wait for any job to finish
      else
         extract queue head (control & 0xffff)
         job record = queue address[head]
    
         fill in job record
    
         jobcount += 1
         head += 1
         write control word with release
      fi
    loop if write failed or we waited
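As a portable sketch of step 1 (the names and the C11 compare-exchange are my stand-ins – the real code uses the PPE’s load-reserved/store-conditional pair – but the retry structure is the same):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QUEUE_SIZE 16  /* must be a power of two for the mask trick */

struct job { int data; };

static _Atomic unsigned int control;   /* jobcount<<16 | head */
static struct job queue[QUEUE_SIZE];

/* One attempt at the enqueue loop above.  Returns false if the queue is
 * full (the caller would then wait for a job to finish and retry).  The
 * compare-exchange stands in for read-with-reservation /
 * write-with-release; losing the race is the same as losing the
 * reservation. */
static bool try_enqueue(struct job j)
{
    unsigned int old = atomic_load(&control);

    for (;;) {
        unsigned int jobcount = old >> 16;
        unsigned int head = old & 0xffff;

        if (jobcount == QUEUE_SIZE)
            return false;                     /* full - wait, then retry */

        queue[head & (QUEUE_SIZE - 1)] = j;   /* fill in the job record */

        unsigned int new = ((jobcount + 1) << 16) | ((head + 1) & 0xffff);
        if (atomic_compare_exchange_weak(&control, &old, new))
            return true;                      /* 'write with release' stuck */
        /* reservation lost - old has been reloaded, go around again */
    }
}
```

Note this is only safe for a single writer, as in the post – with multiple writers another producer could overwrite the job record between the fill and the compare-exchange.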
    
  2. Wait for any job to complete

    Just need to read the SPE’s output mailbox – it will block. By reading the output interrupting mailbox using libspe it will also yield the CPU if it needs to (at least I believe it should).

    read the output interrupting mailbox, discarding the result
    
  3. Wait for all jobs to complete

    loop:
      wait for any job to complete
      read the control word
    while job count > 0
    
  4. Wait for a job to arrive / dequeue a job

    This makes use of the reservation lost event (MFC_LLR_LOST_EVENT) to sleep if we have to wait for work to arrive.

    Setup to listen for reservation lost event
    loop:
      atomically read queue header with reservation
      if jobcount == 0 then
        wait for reservation lost event
      else
        job address = queue address + tail index
        update queue header tail index += 1
        atomically write queue header with release
      fi
    loop if we had to wait for event or the atomic write failed
    dma in job from job address
    
  5. Indicate job completion

    We only have to tell the PPE that any job has completed – so we don’t block if the mailbox already has an indicator in it.

    loop:
      atomically read the queue header with reservation
      reduce the jobcount by 1
      atomically write the queue header with release
    while atomic write failed
    
    if the interrupting output mailbox is empty
      write a dummy value to the mailbox
    fi
    

It’s just a very simple lockless queue implementation. But it only works for 1 reader and 1 writer. I think it could be problematic for multiple readers if they were otherwise idle, as they would end up spinning on the reserve/modify/release cycle, contending with each other without making progress – I could be wrong. Otherwise I think it should work. Multiple writers are harder since we’d need some sort of allocation mechanism so they didn’t overwrite the new head of the queue before it is enqueued.

Well I put this into mplayer and the problematic video file I had was still problematic – I guess the codec is doing the full decode, then writing the result to the video frame and immediately calling for a page flip.

So I wondered – since the yuv/scale step is now happening by itself, how about I just don’t bother waiting for it to complete before flipping the frame? Some success! Now the CPU could decode the video just fine, but the display was, not surprisingly, very messed up. Frames were jumping all over the place. So then I tried flipping the page flip logic around a little. Instead of swapping the frame and working on the ‘hidden’ buffer, what about swapping the frame but working on the ‘shown’ buffer – I had nothing to lose in trying. Total success! So long as the SPU can complete 1 whole frame before another one comes along, I get smooth clean video, with no tearing, and still some CPU headroom.

At first I thought this was a little surprising – I expected tearing at the least. But then I thought about it a bit more – although the PS3 is set up to ‘double buffer’ using 2 frame buffers in main memory, they are both actually hidden at all times – it takes a separate step to copy them to the live framebuffer(s). So in effect the PS3 provides triple (quad?) buffering for free – well so long as you get everything done in time – which is pretty easy to guarantee with video post-processing.

I got bored with playing that video over and over so I went back to some post-processing code. I wrote up a floating-point 3×3 convolution, and fiddled with the frontend logic till it worked. Ok – but it couldn’t keep up with the video at full resolution (actually for various reasons I’m only working at full X resolution, and before the Y scaler is applied). So I tried an integer version – a bit faster, because it doesn’t need to do the int-to-float and float-to-int conversions. I then tried hard-coding specific convolutions – using adds rather than multiplies. Not sure how fast this stuff is on other cpus, but I hard-coded a few – sobel edge detection for one. Then worked on unrolled versions, and fought with the compiler’s bad translation choices. Even dabbled a bit writing asm directly as a result. I need to read up more on the issue rules before I spend more time on that.
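For reference, the scalar form of the integer version is straightforward – this is just my own sketch of the idea, not the SPU code:

```c
/* Plain scalar integer 3x3 convolution over one output pixel - a sketch
 * of the idea only, not the SPU version.  k is the 3x3 kernel, div its
 * normalising divisor, and the result is clamped to [0,255]. */
static unsigned char conv3x3(const unsigned char *src, int stride,
                             const int k[9], int div)
{
    int sum = 0;

    for (int y = 0; y < 3; y++)
        for (int x = 0; x < 3; x++)
            sum += src[y * stride + x] * k[y * 3 + x];

    sum /= div;
    if (sum < 0) sum = 0;
    if (sum > 255) sum = 255;
    return (unsigned char)sum;
}
```

Hard-coding a kernel like sobel lets the multiplies collapse into adds and subtracts, which is where the hand-rolled versions gain.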

I was also trying to get non-direct rendering mode working for mplayer. For one this needs to cope with non-16-byte-aligned data. I managed to put the re-align logic into the yuv converter with almost no extra overhead – since it is already shuffling its input data, I just added it to the char-to-short converter and loaded the next value ahead of time. But it still didn’t work – it runs very slowly and stuffs up a lot – I don’t really know what mplayer is trying to do.

For multiple-pass post-processing, I’m coming to the conclusion that the best intermediate format is planes of short values of each component – probably packed into interleaved quadwords. Shorts can be processed natively by different multiplies, and they’re more compact than ints. Interleaving them means you only need 1 pointer. But again this seems to cause more compiler fights.

e.g. the compiler does some weird shit with code like:

void foo(char *a, char *b, int count)
{
    do {
      b[0] = a[0];
      b[1] = a[1];
      b[2] = a[2];
      a += 3;
      b += 3;
      count -= 3;
    } while (count > 0);
}

(i.e. a typical unrolled loop)

Instead of translating as is, it tries to do something like:

void foo(char *a, char *b, int count)
{
   int loopcount = count / 3;

   loopcount += 1;
   while (loopcount) {
      .. same logic
      loopcount -= 1;
   }
}

Maybe it’s something that works well on an architecture with a special decrement-and-branch instruction and integer division – but the SPE has neither! So it generates some nasty inline division code and redundant loop arithmetic.
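One way around it (my own workaround, not from the original code) is to phrase the loop with an end pointer, so there is no trip count for the compiler to derive – and hence no division:

```c
/* Same copy loop, but terminated by comparing pointers rather than
 * counting down - there is nothing for the compiler to divide. */
void foo_ptr(char *a, char *b, int count)
{
    char *end = a + count;

    do {
        b[0] = a[0];
        b[1] = a[1];
        b[2] = a[2];
        a += 3;
        b += 3;
    } while (a < end);
}
```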

And yeah it was a crappy wet/windy and cold weekend … sat around playing with code, watching some movies and getting fat for the most part (did a nice roast pork and baked some banana muffins too). Did manage to get out and clean a gutter though which was overflowing – and I noticed some rotting wood, which wasn’t a good thing to notice.

re: The comment below about insomniac, Yes, I try to check them every now and then and have a good read – it’s nice to see stuff like that from them, they are obviously enthusiastic about the technology.

queues, yuv

I finished off the yuv conversion document. Enough of that.

I started looking at implementing a job queue last night. I thought I’d use the atomic update functions to implement it. At least that way I can easily write a single-reader queue, and without too much hassle make it multiple-writer too, if I ever need that (which I don’t yet). After wasting a couple of hours with an incorrect parameter being passed to my test code I got something functional. I think there are some races with signalling the ppu yet, but I had the ppu queueing up jobs for an spe, the spe responding when it was done, and the ppe able to non-busy-wait for all jobs to be completed – which are all the primitives I need. I guess with a long wet weekend coming up I might get plenty of time to play with it!

windows suxors

I messed up some queries and sent postgresql into a spin yesterday, a couple of times. So I ended up with 2 processes cpu bound. Just about brought my whole system to a standstill – well, emacs still ran ok, and mozilla wasn’t too bad, but visual studio forgot it was running on a dual-core machine with 4g of memory and started crawling – it took minutes just to start a debugging session. Maybe the disk was i/o bound as well as the cpu – I didn’t check the hdd light – I just blamed windows.

Actually – ubuntu can fucking suxors at times as well. No problem has been insurmountable, but there have been too many little niggling hassles getting my ps3 (or laptop for that matter) set up as a cranking development environment. The docs packages for example are mostly fucked, or too fine grained – you have to install docs for everything, and even then several times – man and info for example. The other night I hit yet another snag – scrollkeeper interfering with installing updates to the point of them failing. Why distros keep shipping utterly useless snot like that which just kills your system every now and then (particularly during updates) is beyond me. Patch it away or stop including crap that depends on it – and you can’t uninstall it without losing half the desktop (and with ubuntu’s meta-packages, you can’t really be sure what’s going anyway). I fucked off its crontab entry and replaced scrollkeeper-update with an empty shell script – hopefully that’ll do for now. There are other annoying daily cron tasks which shouldn’t be there either – if you don’t have your machine on all the time you don’t want it running a damn ubuntu-check-update or updatedb every time just after you turn it on.

On a side note – why can’t anyone come up with a decent documentation package? Well, why they’re even trying is beyond me – man is pretty good and info is excellent. They both work, can easily be adapted to non-terminal use, and info even goes all the way to print. Why come up with all these shitty indexed crap systems that nobody uses? While I’m here, the idiot who moved the documentation for the command-line-only tools in netpbm to be web pages only, from the man pages it always had, should have his nuts cut off (then again, I rarely use them anymore – this is one reason).

Enough rant.

The cell coding is going well, if slowly. It takes a lot of time. I’m also trying lots of different options with the aim of achieving a higher level of grokness, and writing up some of my experiences as I go – yuv to rgb conversion so far. I have found a video that doesn’t play smoothly only because the screen decoding is too slow – so that should be a good test file for changing the way I call the SPU code. Put in a job queue, async notification, even splitting the task.

I also played with some preliminary filtering/effects code. Quantise is simple, although I had a bug in one run which gave a nice cartoony effect. Played with a horizontal-blur filter. Using the spu average-bytes instruction makes this simple if not terribly flexible. But working out how to load data streams offset from each other efficiently on the spu is useful – I’ll look at 2d convolution next, after I get the job queue stuff working.

mplayer

Made good progress over the weekend on the mplayer vo module. Took a
bloody long time though – bugs in code, not understanding api’s, not
understanding how mplayer wants its vo drivers to work. But I got
there in the end – well just, it had me solidly occupied the whole
weekend, much to the chagrin of my housemates who wanted to play
games. Even managed to watch a couple of movies to test it out –
although I still can’t work out how to stop the fucking screen
blanking every 10 minutes under ubuntu (or stop the log output to the
virtual console) (yeah I googled, but couldn’t find any way to do it
programmatically – I can unblank but not disable it).

It can upscale an SD source to 1080p without problems (and without
tearing, oh happy day) – it is only bi-linear scaling, but that’s good
enough for now. There’s still plenty of scope for improvement –
e.g. now I need to work on a job queue mechanism rather than
loading/executing the whole spu programme each frame.

Although I knew dma’s needed to be 16 byte aligned I kept forgetting
to start with – the cause of much frustration and cursing. I also had
problems with the spu seeming to crash after a short time. After
about 200 frames it would just die, but the rest of mplayer kept
running and executing the spu programme returned no error. I couldn’t
work out what was going on. It seems that once you load a programme
into the spu context, it isn’t really kept around forever – so it
would semi-randomly disappear at some future point in time. I think
this is something I’d previously ‘discovered’ but forgotten about
since I last did any spu coding.

It was also a little difficult finding decent documentation on how to
embed spu binaries into ppe code, it often assumed you knew how,
didn’t go into that detail, or used some magic makefiles not specified
directly. I found make.footer in the sdk examples at last, and after
some manual runs distilled that into my own makefiles.

Although I had bugs in the spu code I had written/compiled before I
tried running it, at least they were not terminal issues. The
algorithms were sound, I was just out in little ways or had forgotten
to finish bits of them. The YUV converter seems to need to do an awful
lot of work, but I guess that’s why it’s so slow without an spu. The
actual calculation is easy and simple, but clamping 64 values to
[0,255] just takes a lot of instructions, for example. So far I’m
sticking with a packed ARGB internal format in memory and 4x each
colour component/planar when in registers – the loads vs the
load/shuffles aren’t much different, but I’ll try an unpacked 32 bit
int/float planar/interleaved as well at some point. Took me some time
to get the nearest pixel horizontal scaler working – but that’s
because I tried to do too much in my head/by hit and miss without
writing it down and working out the bit numbers properly – I got the
main addressing working but stuffed up the sub-word addressing.
What’s nice is that all the calculations (and memory loads) are the
same for a linear scaler, it then just needs to add a pixel blending
step afterwards – so that fell out very easily once I got the
nearest-pixel scaler working.

The basic algorithm is a very simple one which uses fixed-point
integer arithmetic, so the inner loop is merely an add, with a shift
used to perform addressing.

#define SCALEBITS 14

void scalex_nearest(unsigned char *srcp, unsigned char *dstp, int inw, int outw)
{
  int scalexf = inw * (1<<SCALEBITS) / outw;
  int sxf = 0, dx = 0;

  while (dx < outw) {
    int sx = sxf >> SCALEBITS;

    dstp[dx] = srcp[sx];

    sxf += scalexf;
    dx += 1;
  }
}

It may not be as accurate as it could be, but so far it’s accurate enough for what I need.
And it’s branchless, which is important for vectorising.
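As a quick sanity check, running the routine over the 10-to-15 scale example used further down (with 14 bits of scale) reproduces the expected nearest-pixel output – this is the same function again, just packaged so it can be run standalone:

```c
#define SCALEBITS 14

static void scalex_nearest(unsigned char *srcp, unsigned char *dstp,
                           int inw, int outw)
{
    /* fixed-point step: source pixels per destination pixel */
    int scalexf = inw * (1 << SCALEBITS) / outw;
    int sxf = 0, dx = 0;

    while (dx < outw) {
        int sx = sxf >> SCALEBITS;   /* integer part = source index */

        dstp[dx] = srcp[sx];
        sxf += scalexf;
        dx += 1;
    }
}
```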

Converting this to vector/simd code is straightforward if not entirely trivial. And the spu has
some nice instructions to help – as you’d expect from a simd processor.

Basically the code calculates 4 adjacent pixels at once – the above
calculation is basically repeated 4 times, with a single pixel offset
for each column.

The basic scale value is the same, but it is a vector now:

  vector int scalexf = spu_splats(inw * (1<<SCALEBITS) / outw);
       // spu_splats - copy scalar to all vector elements

But the accumulator needs to be initialised with 4 different values –
each as if it were 1 further in the loop. i.e.

  int scalex = inw * (1<<SCALEBITS) / outw;   // the same scale as a scalar
  vector int sxf = (vector int) { 0, scalex, scalex*2, scalex*3 };

And each loop we need to increment by 4 (this is a silly little detail
I knew about but forgot to put in the code assuming it was there –
much frustration).

  scalexf = spu_sl(scalexf, 2);   // spu_sl - element wise shift-left

It is in the inner loop where things get a bit more complicated. We
can only load single quadwords at a time from a single offset, so the
algorithm needs a bit of tweaking. Instead of loading 4 separate
addresses which the original algorithm would require, we can take
advantage of the fact we are only scaling up (i.e. we will never need
to access more words than we’re filling) and use the shuffle
instruction to perform sub-word addressing. The basic addressing thus
remains the same, and we just use the first value for this. This
gives us a base address from which we load 2 quadwords – the
subaddressing will always fit within these 8 values (we need 8 values
since we may have overlapping accesses). We are also referencing 4
values at once, so we have to multiply the address by 4 – or shift by
2 less (12 vs 14).

Explanation of this by way of example:

  example: scale from 10 to 15 in size, use 14 bits of scale
  input  :  [ 1, 2, 3, 4, 5, 6, 7, 8, 9, a] (in words)

  scalexf will be 10 * (16384) / 15 = 10922

  loop0:
      sxf = [      0, 10922, 21844, 32766 ]

  address = sxf[0] >> 12 = 0
    load0 = [ 0, 1, 2, 3 ]
    load1 = [ 4, 5, 6, 7 ]

  loop1:
      sxf = [  43688, 54610, 65532, 76454 ]

  address = sxf[0] >> 12 = 10, masked to 0 (quadword aligned)
    load0 = [ 0, 1, 2, 3 ]
    load1 = [ 4, 5, 6, 7 ]

sxf needs to be remapped to a shuffle instruction to load the right pixels. If each element of sxf is shifted right by 14 we have:

   offsets = sxf >> 14 = [ 0, 0, 1, 1 ]

This needs to be mapped into a shuffle instruction to access those words. The shuffle bytes instruction takes 2 registers and a control word. The registers are concatenated together to form a 32 byte array, and the control word looks up each byte at a time into this array. i.e. we need to get a shuffle pattern:

  pattern = [ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7 ]

If result is the result vector represented as an array of 4-byte words, and source is the two source vectors represented as an 8 element array, then using this pattern is equivalent to:

   result[0] = source[0]
   result[1] = source[0]
   result[2] = source[1]
   result[3] = source[1]

The easiest way to do this is to duplicate each offset stored in the lsb of the words in the sxf register into every other byte in the same word. Shift everything by 2 (*4) (or do it initially), and then add a previously initialised variable which holds the byte offsets for each word. i.e.

   tmp =  offsets << 2   = int: [ 0, 0, 4, 4 ]
   pattern = shuffle(tmp, tmp, (vector unsigned char) { 3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15 }) = char: [ 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4 ]
   pattern = pattern + spu_splats(0x00010203)  = char: [ 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7];

Which is the desired shuffle pattern. Getting the result is then a simple shuffle:

  dstp[dx] = spu_shuffle(load0, load1, pattern);
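Modelled in scalar C (my own model of the pattern construction, not SPU code), the duplicate/shift/add sequence described above amounts to:

```c
/* Scalar model of building the shuffle pattern: take the per-column
 * word offsets, scale them to byte offsets (<<2), replicate each across
 * its 4-byte group, then add 0,1,2,3 within the group. */
static void make_pattern(const int offsets[4], unsigned char pattern[16])
{
    for (int w = 0; w < 4; w++) {
        unsigned char base = (unsigned char)(offsets[w] << 2);

        for (int b = 0; b < 4; b++)
            pattern[w * 4 + b] = base + (unsigned char)b;
    }
}
```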

That is the basic algorithm anyway – there are some added complications in that the offsets aren’t going to be just sxf shifted except for the first column, and they need to be relative to the quadword addressing. So the offset calculation is a little more like:

    addr = sxf [0] >> (14-4+2) // this is the byte offset address of the quad-word we're to load (pass this directly to lqx instruction, it will ignore the lower 4 bits for free)
  offset = addr & (-16)          // mask out the lower 4 bits
  offset = addr - offset[0]          // find out the offsets of each member of the vector relative to the first one's actual address (shuffle is used to copy offset[0] to all members)
  offset = offset >> 2         // only the upper 2 bits since we're addressing 4-member vectors

So for both loops of the example:

loop0:
    addr = sxf >> 12    = [  0,  2,  5,  7 ]
  offset = addr & (-16)   = [  0,  0,  0,  0 ]
  offset = addr - offset[0]   = [  0,  2,  5,  7 ]
  offset = offset >> 2;   = [  0,  0,  1,  1 ]

     tmp = offset << 2  = [  0,  0,  4,  4 ]
 pattern = shuffle(tmp)       = [ 0,0,0,0, 0,0,0,0, 4,4,4,4, 4,4,4,4 ]
pattern += splats(0x00010203) = [ 0,1,2,3, 0,1,2,3, 4,5,6,7, 4,5,6,7 ]

loop1:
    addr = sxf >> 12    = [ 10, 13, 15, 18 ]
  offset = addr & (-16)   = [  0,  0,  0, 16 ]
  offset = addr - offset[0]   = [ 10, 13, 15, 18 ]
  offset = offset >> 2;   = [  2,  3,  3,  4 ]

     tmp = offset << 2  = [  8, 12, 12, 16 ]
 pattern = shuffle(tmp)       = [  8 (*4), 12 (*4), 12 (*4), 16 (*4) ]
pattern += splats(0x00010203) = [ 8,9,a,b, c,d,e,f, c,d,e,f, 10,11,12,13 ]

And writing this down I now see there’s a couple of redundant shifts happening,
so either my memory of the code is wrong, or I can simplify the code here.

And the result of 2 loops would be:

  output = [ 1, 1, 2, 2, 3, 4, 4, 5 ] (in words = 4xbytes each)

So put it all together and that’s about it (ok, strictly you need to worry about
masking out the last results beyond what you want, but I’ll assume the input
and output size are both multiples of 4). Now the nice thing about
using the fixed point calculation is that it already gives you the
fractional part of the address as a by-product, and in a form suitable
for our calculations. So adding linear interpolation is, at least in
this sense, free. All we need to do is load up a vector which has
every *next* pixel in it, and then interpolate based on the fractional
part in sxf. We can use the shuffle pattern previously obtained,
offset by 1, to get the second value – it will always address within
the 8 values already loaded so we needn’t even perform another memory
load.

  patternnext = pattern + spu_splats(0x04040404)     // address the next word of every word
  valuenext = spu_shuffle(load0, load1, patternnext) // and we have it

  // perform blending - valueout = value0 + (value1 - value0) * scale
  diff = valuenext - value                           // get signed difference
  scale = sxf & 16383                                // keep only the fractional part
  offset = (diff * scale) >> 14                      // perform fixed-point fractional multiply
  valueout = value + offset                          // and that's it

(in reality since I’m storing packed format, I need to first unpack
the argb values into separate planes of each component, perform the
diff/scale/offset/valueout calculation once for each component, and
then re-pack the data).
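In scalar terms the blend step is just this (a model of the arithmetic only, not the SPU code – note it assumes an arithmetic right shift for negative differences, as the SPU provides):

```c
/* Scalar model of the fixed-point blend: frac is the low 14 bits of
 * the accumulator, i.e. the position between the two source pixels.
 * valueout = value + (valuenext - value) * frac / 2^14 */
static int blend(int value, int valuenext, int frac)
{
    int diff = valuenext - value;        /* signed difference */

    return value + ((diff * frac) >> 14);
}
```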

Things to think about: If the code used shorts it could process 8
values at once instead of 4 – except the multiplies can only process 4
at once, so it would complicate these operations a little and require
additional packing instructions. It would probably be a win if the data was
stored as 16 bit fixed-point integer planes. Floats could be used – which
simplifies some of the work (but not much), but complicates the
addressing slightly, since it needs to be converted to integers for
that. Floats would only make sense if it were used as an intermediate
buffer format, and we were working on unpacked arrays of bitplanes.
Also note that the spu has no integer division, so the initial setup
of scalexf is messy – I did this using floats and then converted to
integers to do the work – either way it isn’t a big issue since it
only needs to be done once (using floats creates smaller code). Many
of the values could be pre-calculated with a single row of data,
although this then turns some instructions into loads, which may flood
the load/store pipeline. Maybe some combination could be used,
e.g. calculate the blending values/addressing manually and load the
shuffle values from a table.

Dunno what this fucked editor has done to the alignment – everything was aligned when I typed it into emacs. And for some stupid reason, html input isn’t html – it honours carriage returns too. I give up.

ps3 stuff

Hmm, that was fun – a week of playing around with xcel to massage some data before it was loaded. I’m surprised how crap xcel is when you really have to use it to do something. All of its little quirks and funny ways of doing things – it’s something that just feels old and out of date. Well hopefully the data loads and I won’t have to worry about using it in anger again for some time.

I had been busy redesigning a lot of code in our application – slimming down the db layer, fixing a lot of outstanding problems, adding multiple select copy/paste/undo and the like, but I got a bit sidetracked by this utterly boring data load stuff. In the end trying to remove the aux table for trees in the database didn’t work too well – it slowed down some queries too much. Still the work wasn’t wasted – they are handy functions I needed anyway.

At home I finally booted ps3 linux again after a bit of a break. I’m working on yet-another-patch to mplayer to support the ps3 framebuffer. So far I’ve got a basic module which just copies the yuv data to the screen buffer, but I’ve already worked on some SPU code which will accelerate the process. Even without any rendering the PPE can’t decode a HD stream though – so no amount of video acceleration will help there, but it’s a start. Someone’s already done the same thing, but I really just want to get some experience coding with the SPU, so I’m basically doing the whole lot from scratch. mplayer dumps the raw YUV planes to some buffers and an SPU can load those, do YUV conversion and scaling – which provides a good speed boost. I guess if I get that to work nicely I can look into other stages to pipeline through an SPU, but I think that’ll be a long time coming – it’s a big codebase to grok and there’s a lot to learn about the CBE.

For starters I wrote a SIMD YUV converter – it maps nicely to SIMD. I’m trying fixed-point rather than floating point, although over time I hope to try several different variations to see what works and what doesn’t. I’ve also worked out some code to do software scaling. It’s an interesting problem to convert to SIMD – and it’s really nice to be able to do basically assembly language from within C, so you get access to all the nice shuffle/rotate/bit functions you can’t get to from plain C, which makes a real difference to speed. Although some of the intrinsics are a bit of a pain to use – lots of casting and crap that just makes it harder to do in C than it would be in assembly – but with so many registers and scheduling issues I just can’t be bothered approaching it from the assembly side.

Once I’ve tested that it actually works, I’ll have to post some code somewhere.