Got my job queue stuff going. In the end I made a few changes, and also it only works a as a single reader/writer, but it is lockless, and I can interrupt-wait on the ppe.
SPE’s have atomic operations which go through the DMA controller – they work on 128 bytes of data atomically which gives an implementation a lot of flexibility. Unfortunately, although the PPE’s atomic instructions are compatible, they only work on 4 bytes at a time (or 8 in 64 bit mode). So if you need to talk to the PPE too it limits the algorithm choice very much. On a side note whilst searching on google for background information I noticed that Sony have a software patent on … get this … using a SPU to remotely perform larger atomic operations for the PPE. Pretty stupid eh? They build a platform which has intentional limitations and then patent the only obvious (and afaict only) way to get around it’s shortcomings. The last paragraph also states ‘this patent also covers the very idea not just this implementation’ – which I find a little hard to swallow. Ahh software patents, crap one day, fucked the next.
Anyway I didn’t want the overhead of signalling a SPE anyway and 32 bits was enough for what I did.
The SPE atomic operations can be waited on (or invoke an interrupt handler), so they are all that is needed to ‘signal’ the SPE. They can wait on a reservation lost, which lets them listen for writes to the cache line. PPE’s are not so flexible. The reservation and release instructions only work on one address at a time and have no interrupt or wait support. So another mechanism must be used to signal the PPE of changes of state (other than using a busy loop). I chose to use an SPE output interrupting mailbox. This will signal data on the mailbox using an interrupt, so it can be waited on without polling. Thus I use a mailbox just to signal to the PPE that some jobs have been completed, and it polls the control information to find out the detail.
I use a 32 bit control word consisting of 2 16 bit values. One is the count of jobs yet to be completed, and the other is the current index of the head of the queue – where in the rotating queue buffer the next job will be allocated from. This must be updated atomically to perform an enqueue operation. Another value required is the tail of the queue – where the consumer is at processing the queue. Since this is only ever updated by the single consumer it doesn’t need to be atomic, but since we need to load/store 128 bytes anyway we may as well put them all together
The base address of the queue is also handy for the SPE – where the job data is actually stored. And a pre-calculated mask is included which trivialises the address calculation as we loop through the rotational buffer.
This gives the following queue header:
struct _queue_header { // the queue base effective address - here for convenience unsigned long queue_ea; // control word - atomically updated by PPE for enqueue unsigned int control; // where the consumer is at in the queue unsigned short tail; // mask to wrap the queue easily unsigned short queue_size_mask; };
So we need a few basic operations:
- PPE side, enqueue a job – sleep if the queue is full
- PPE side, wait for any job to complete
- PPE side, wait for all jobs to complete
- SPE side, wait for a job to arrive
- SPE side, de-queue a job
- SPE side, indicate the job is completed
These can be implemented using the atomic operations on both the PPE and SPE pretty simply.
- Enqueue a job.
loop: read control word with reservation extract jobcount (control >> 16) if jobcount == queue size wait for any job to finish else extract queue head (control & 0xffff) job record = queue address[head] fill in job record jobcount += 1 head += 1 write control word with release fi loop if write failed or we waited
- Wait for any job to complete
Just need to read the SPE’s output mailbox – it will block. By reading the output interrupting mailbox using libspe it will also yield the CPU if it needs to (at least I believe it should).
read the output interrupting mailbox, discarding the result
- Wait for all jobs to complete
loop: wait for any job to complete read the control word while job count > 0
- Wait for a job to arrive / dequeue a job
This makes use of the reservation lost event (MFC_LLR_LOST_EVENT) to sleep if we have to wait for work to arrive.
Setup to listen for reservation lost event loop: atomically read queue header with reservation if jobcount == 0 then wait for reservation lost event else job address = queue address + tail index update queue header tail index += 1 atomically write queue header with release fi loop if we had to wait for event or the atomic write failed dma in job from job address
- Indicate job completion
We only have to tell the PPE that any job has completed – so we don’t block if the mailbox already has an indicator in it.
loop: atomically read the queue header with reservation reduce the jobcount by 1 atomically write the queue header with release while atomic write failed if the interrupting output mailbox is empty write a dummy value to the mailbox fi
It’s just a very simple lockless queue implementation. But it only works for 1 reader and 1 writer. I think it could be problematic for multiple readers if they were otherwise idle, as they would end up spinning on the reserve/modify/release cycle without managing to pull it off without some contention – I could be wrong. Otherwise I think it should work. Multiple writers are harder since we’d need some sort of allocation mechanism so they didn’t overwrite the new head of the queue before it is enqueued.
Well I put this into mplayer and the problematic video file I had was still problematic – I guess the codec is doing the full decode, then writing the result to the video frame and immediately calling for a page flip.
So I wondered – since the yuv/scale step is now happening by itself, how about I just don’t bother waiting for it to complete before flipping the frame? Some success! Now the CPU could decode the video just fine, but the display was not surprisingly very messed up. Frames were jumping all over the place. So then I tried flipping the page flip logic around a little. Instead of swapping the frame and working on the ‘hidden’ buffer, what about swapping the frame but working on the ‘shown’ buffer – I had nothing to lose in trying. Total success! So long as the SPU can complete 1 whole frame before another one comes along, I get smooth clean video, with no tearing, and still some CPU headroom.
At first I thought this was a little surprising – I expected tearing at the least. But then I thought about it a bit more – although the PS3 is set up to ‘double buffer’ using 2 frame buffers in main memory, they are both actually hidden at all times – it takes a separate step to copy them to the live framebuffer(s). So in effect the PS3 provides triple (quad?) buffering for free – well so long as you get everything done in time – which is pretty easy to guarantee with video post-processing.
I got bored with playing that video over and over so I went back to some post-processing code. I wrote up a floating-point 3×3 convolution, and fiddled with the frontend logic till it worked. Ok – but it couldn’t keep up with the video at full resolution (actually for various reasons i’m only working at full X resolution, and before the Y scaler is applied). So I tried an integer version – bit faster, because it doesn’t need to do the int float and float int conversions. I then did tried hard-coding specific convolutions – using adds rather than multiplies. Not sure how fast this stuff is on another cpu, but some hard-coded convolutions – sobel edge detection. Then worked on unrolled versions, and fought with the compiler’s bad translation choices. Even dabbled a bit writing asm directly as a result. I need to read up more on the issue rules before I spend more time on that.
I was also trying to get non-direct rendering mode working for mplayer. For one this needs to cope with non-aligned 16-byte data. I managed to put the re-align logic into the yuv converter with almost no extra overhead – since it is already shuffling it’s input data, I just added it to the char-to-short converter and loaded the next value ahead of time so it worked. But it still didn’t work – it runs very slowly and stuffs up a lot – I don’t really know what mplayer is trying to do.
For multiple-pass post-processing, I’m coming to the conclusion that the best intermediate format is planes of short values of each component – probably packed into interleaved quadwords. Shorts can be processed natively by different multiplies, and they’re more compact than ints. Interleaving them means you only need 1 pointer. But again this seems to cause more compiler fights.
e.g. the compiler does some weird shit with code like:
foo(char *a, char *b, int count) { do { b[0] = a[1]; b[1] = a[1]; b[2] = a[2]; a += 3; b += 3; count -= 3; } while (count > 0); }
(i.e. a typical unrolled loop)
Instead of translating as is, it tries to do something like:
foo(char *a, char *b, int count) { int loopcount = count / 3; loopcount +=1 ; while (loopcount) { .. same logic loopcount -= 1; } }
Maybe it’s something that works well on an architecture with a special decrement-loop instruction, and integer division – but SPE’s have neither! So it generates some nasty inline division code and redundant loop arithmetic.
And yeah it was a crappy wet/windy and cold weekend … sat around playing with code, watching some movies and getting fat for the most part (did a nice roast pork and some baked some banana muffins too). Did manage to get out and clean a gutter though which was overflowing – and I noticed some rotting wood which wasn’t a good thing to notice.
re: The comment below about insomniac, Yes, I try to check them every now and then and have a good read – it’s nice to see stuff like that from them, they are obviously enthusiastic about the technology.
Interesting reading. I was just wondering if you have noticed that Insomniac Games has lots of interesting articles on programming the PS3:
http://www.insomniacgames.com/tech/techpage.php