A follow-up to the last post about bit manipulation. I ran the same code on an SPU on a PS3. As expected, the results were rather different.
The branchless code (fls()) compiled without branches, and was still reasonably fast – 8s as a function, or 6.2s inlined. Still, my lowly 1.5GHz Pentium M creamed it (well, it was a little under 2x faster). The loop code (fls2()), which was the fastest on the Pentium, was dog-slow on the SPU: 10s, or – very strangely – 15s when inlined. It actually got a lot slower! The shifting version (fls4() – not included last time – rather slow at 15s on the Pentium) was atrocious – 46s. Which is quite interesting, since something like it is often used in examples of implementing this function manually. Also remember my test always has the top 8 bits unset, which should improve the average case of the shift version.
unsigned int fls4(unsigned int v)
{
    register int n;
    /* shift right one bit at a time until nothing is left */
    for (n = 0; v && n < 31; n++)
        v >>= 1;
    return n;
}
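For reference, fls() itself is in the last post; reconstructed in scalar C from the intrinsic version fls6() below, it is a binary search that looks roughly like this (a sketch, not the exact original code):

unsigned int fls(unsigned int v)
{
    unsigned int n = 0;
    /* binary search: each step halves the range of possible bit positions,
       and each if can compile to a compare-and-select rather than a branch */
    if (v > 0xFFFF) { n += 16; v >>= 16; }
    if (v > 0xFF)   { n += 8;  v >>= 8;  }
    if (v > 0xF)    { n += 4;  v >>= 4;  }
    if (v > 0x3)    { n += 2;  v >>= 2;  }
    if (v > 0x1)    { n += 1; }
    return n;
}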
The SPU-specific instruction clz, however, was (as expected) very fast – once inlined. 5.0s as a function, and 0.43s inlined – beating the Pentium, but only by ~40%, and only when inlined. Not surprising – branches are a big hit on the SPU, and the test loop is too small for software branch prediction to kick in properly. I was also timing the overhead of launching the SPU task, which is hard to measure on its own.
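From C, one route to clz without inline asm is the spu_cntlz intrinsic; a minimal sketch (the fls3_sketch name and the 32 - clz convention are my assumptions here, not the version actually timed):

#include <spu_intrinsics.h>

/* hypothetical fls via the SPU count-leading-zeros instruction */
unsigned int fls3_sketch(unsigned int v)
{
    /* spu_cntlz counts leading zeros per element; 32 - clz gives the
       number of significant bits (use 31 - clz for a 0-based index) */
    return 32 - spu_extract(spu_cntlz(spu_promote(v, 0)), 0);
}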
As an exercise in compiler bashing, I tried using the SPU intrinsics to hand-code the same algorithm as used in fls() – see fls6(). This came out at a fairly respectable 6.0s as a function, or 2.8s inlined. So even using an identical algorithm, and still writing scalar code, it is pretty easy to achieve a 1.3x (as a function) to 2.2x (inlined) speedup over the compiler-only version.
#include <spu_intrinsics.h>

unsigned int fls6(unsigned int vi)
{
    vec_uint4 v = spu_promote(vi, 0);
    vec_uint4 n = spu_splats((unsigned int)0);
    vec_uint4 cmp;
    vec_uint4 non;  /* candidate new n */
    vec_uint4 von;  /* candidate new v */

    non = spu_add(n, 16);
    von = spu_rlmask(v, -16);          /* logical shift right by 16 */
    cmp = spu_cmpgt(v, (1 << 16) - 1);
    n   = spu_sel(n, non, cmp);
    v   = spu_sel(v, von, cmp);

    non = spu_add(n, 8);
    von = spu_rlmask(v, -8);
    cmp = spu_cmpgt(v, (1 << 8) - 1);
    n   = spu_sel(n, non, cmp);
    v   = spu_sel(v, von, cmp);

    non = spu_add(n, 4);
    von = spu_rlmask(v, -4);
    cmp = spu_cmpgt(v, (1 << 4) - 1);
    n   = spu_sel(n, non, cmp);
    v   = spu_sel(v, von, cmp);

    non = spu_add(n, 2);
    von = spu_rlmask(v, -2);
    cmp = spu_cmpgt(v, (1 << 2) - 1);
    n   = spu_sel(n, non, cmp);
    v   = spu_sel(v, von, cmp);

    non = spu_add(n, 1);
    cmp = spu_cmpgt(v, (1 << 1) - 1);  /* final step: is v > 1? */
    n   = spu_sel(n, non, cmp);
    return spu_extract(n, 0);
}
(PS: this has not been fully tested to confirm that it returns the correct result.)
The compiler didn’t even schedule the instructions ideally – at least according to spu_timing. It was quite good, but you could still easily get another 3 or 4 cycles (out of about 50, iirc – so say >5%) just by re-scheduling a couple of instructions.
I also tried a slightly different SPU version, taking advantage of the shuffle-bytes instruction. This version only requires 2 compares and does much of the work as a lookup table. However, the overhead of getting things set up and preparing the shuffle tables meant it was only about half the speed of the one above. Its timing is shown as fls7().
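The exact code isn’t reproduced here, but a rough sketch of the idea might look like the following. Note this is my reconstruction, not the original: two compare/select steps narrow v to its top byte, and spu_shuffle then acts as a 16-entry nibble lookup table. It still needs one extra select to pick between the two nibbles, so it likely differs in detail from what was actually timed.

#include <spu_intrinsics.h>

unsigned int fls7_sketch(unsigned int vi)
{
    vec_uint4 v = spu_promote(vi, 0);
    vec_uint4 n = spu_splats((unsigned int)0);
    vec_uint4 cmp, non, von;

    /* two compare/select steps narrow v down to its top byte */
    non = spu_add(n, 16);
    von = spu_rlmask(v, -16);
    cmp = spu_cmpgt(v, (unsigned int)0xFFFF);
    n = spu_sel(n, non, cmp);
    v = spu_sel(v, von, cmp);

    non = spu_add(n, 8);
    von = spu_rlmask(v, -8);
    cmp = spu_cmpgt(v, (unsigned int)0xFF);
    n = spu_sel(n, non, cmp);
    v = spu_sel(v, von, cmp);

    /* bits needed for each 4-bit value (count-of-bits convention,
       so fls(1) == 1); shuffle pattern bytes index into this table */
    vec_uchar16 tab = { 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };

    vec_uint4 hi = spu_rlmask(v, -4);
    vec_uint4 lo = spu_and(v, (unsigned int)0xF);

    /* table lookup via shuffle-bytes; tab[0] == 0 keeps the zero
       pattern bytes above the nibble from polluting the result */
    vec_uint4 hbits = (vec_uint4)spu_shuffle(tab, tab, (vec_uchar16)hi);
    vec_uint4 lbits = (vec_uint4)spu_shuffle(tab, tab, (vec_uchar16)lo);

    /* pick the high-nibble result (offset by 4) when it is non-zero */
    cmp = spu_cmpgt(hi, (unsigned int)0);
    n = spu_add(n, spu_sel(lbits, spu_add(hbits, (unsigned int)4), cmp));
    return spu_extract(n, 0);
}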
Summary of results (inlined versions only, times in seconds):

                    fls()   fls2()  fls3()*  fls4()  fls5()  fls6()  fls7()
Pentium M (1.5GHz)  3.5     4.0     0.6      15      9.3     -       -
SPU                 6.2     15      0.43     46      7.2     2.8     4.3

* fls3() is the single instruction that does the same thing on each CPU (e.g. clz on the SPU)
Anyway, what did I learn from this:
- Different architectures can have very different performance characteristics (no surprise there – particularly on CELL). What works well on one might not work so well on another (e.g. small loops – fls2()).
- A good algorithm on one architecture will probably still be good on another, however. e.g. fls() uses a binary search and can potentially compile without branches; if branchless, it also has a fixed execution time.
- Even with the same basic algorithm, implementation details matter. e.g. comparisons vs bit masking (fls() vs fls5()). Large bit masks may require expensive setup through non-immediate operands and/or physically larger instructions.
- Compilers are NOT ‘nearly always smarter than you’ – despite that tired line being repeated endlessly by textbook writers and bloggers. Yes, choosing the right algorithm matters more than the language, but conversely you can implement the same improved algorithm in any language (fls() vs fls6()) – or the same bad algorithm in any language.
- A commonly used example implementation – the one using shifts – actually seems to run very poorly on modern hardware (fls4()).
- High-level languages like C can still do with some help from CPU-specific support that isn’t exposed or accessible directly from the language, particularly for bit operations. You’d have to count builtin functions like GCC’s __builtin_clz() in this instance – they are not part of C, but extensions (I didn’t try it, but I presume it would perform the same as the inline asm version at least). fls3(). A sketch using the builtin follows this list.
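For the curious, a minimal sketch of the builtin route – the fls3_builtin name is made up here, and __builtin_clz() is undefined for an input of 0, so it needs a guard:

/* fls-style helper using GCC's count-leading-zeros extension */
unsigned int fls3_builtin(unsigned int v)
{
    /* __builtin_clz(0) is undefined behaviour, so handle 0 explicitly */
    return v ? 32 - __builtin_clz(v) : 0;
}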
Did you try adding some __builtin_expect to help branch prediction?
http://kerneltrap.org/node/4705
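(For context, __builtin_expect is another GCC extension: it marks which way a condition usually goes, so the compiler can lay out the branch – or insert a branch hint – accordingly. A hypothetical use on the first binary-search step, assuming large inputs dominate:)

/* hint that the top 16 bits are usually non-zero */
if (__builtin_expect(v > 0xFFFF, 1)) {
    n += 16;
    v >>= 16;
}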
Something was whacked with your commenting interface the last time you posted on this, so I couldn’t comment then, but I’d be curious to see the measured performance of this technique:
http://supertech.csail.mit.edu/papers/debruijn.pdf
This algorithm is constant in space and time in the theoretical sense. Its only costly operations are a memory access and a single multiplication; any variation will of course be entirely due to real-world effects, with caching probably dominating any discussion on modern hardware.
I haven’t seen this algorithm mentioned in very many places, probably because most CPUs (even many at the time it was published) make it irrelevant, but it’s an interesting idea nonetheless.
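(For concreteness, the 32-bit routine from that technique – in the form also popularized by the ‘Bit Twiddling Hacks’ page – looks like this; it returns the 0-based index of the highest set bit, one less than fls4()’s convention, and the input must be non-zero:)

#include <stdint.h>

/* index of the highest set bit via a de Bruijn multiply-and-lookup */
static const int debruijn_tab[32] = {
    0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
    8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};

uint32_t fls_debruijn(uint32_t v)
{
    /* smear the highest set bit into every lower position... */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    /* ...then the top 5 bits of the multiply uniquely identify it */
    return debruijn_tab[(v * 0x07C4ACDDu) >> 27];
}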