unexpected bits of bits

After years of mind-numbing .net stuff I’ve started getting into hacking C again. Sorting, balanced trees, memory allocators, hash tables, playing with literate programming, all sorts of things. Mostly just for the intellectual challenge of it all too – nice change from tweaking buttons in a shitty ui.

This morning I was trying to write a bit of code to find the first bit set (from MSB) in a word and came up with some interesting and unexpected results with gcc on an x86.

First, there’s the simple way you’d think of doing it in c:

int fls2(unsigned int v)
{
register int n = 0;

for (n = 31; n>0;n–)
if (v & (1<<n))
return n;

return 0;
}
This turned out to be almost as fast as c code could go. All the timings I did were running the function against every number from 1 to 16 million – which isn’t even the average case for this function, and yet it outperformed everything except the x86 instruction which does the same thing. As a function it ran in 4.s, and inlined it made it in 4.0s.

My first go at ‘optimising’ it was to get rid of the branches and loops. So I wrote some code which does a binary search basically. But even though I believe this could be written branchless easily in x86 assembly, unfortunately I had no luck with gcc and it added branches for the comparison tests instead. Still, this ran ok – and for some bizarre reason, when it was inlined, it ran faster than the first one (fls2()) above. Normally 5.5s, but 3.5s when inlined. I couldn’t see any obvious reason why this version would perform so much differently when inlined to the previous one from the generated code.


int fls(unsigned int v)
{
register int n = 0;
register unsigned int off;

off = (v >= (1<<16)) * 16;
n += off;
v >>= off;

off = (v >= (1<<8)) * 8;
n += off;
v >>= off;

off = (v >= (1<<4)) * 4;
n += off;
v >>= off;

off = (v >= (1<<2)) * 2;
n += off;
v >>= off;

off = (v >= (1<<1)) * 1;
n += off;

return n;
}
So I had another go at fooling the compiler into getting rid of the branches. And came up with the one below – it basically does exactly the same thing as the comparison version but encodes the comparisons as bit masks. But although it generated branchless code, it was much larger, and executed significantly slower than the others. Around 11s or 9s inlined.

static inline int fls5(unsigned int v)
{
register int n = 0;
register int off;

off = ((v & 0xffff0000) != 0) * 16;
n += off;
v >>= off;
off = ((v & 0xff00) != 0) * 8;
n += off;
v >>= off;
off = ((v & 0xf0) != 0) * 4;
n += off;
v >>= off;
off = ((v & 0xc) != 0) * 2;
n += off;
v >>= off;
off = ((v & 0x2) != 0) * 1;
n += off;

return n;
}
But none of them beat the raw x86 instruction, which is of course what i’d probably have used.


unsigned int fls3(unsigned int v)
{
asm("bsrl %1, %0" : "=r" (v) : "r" (v));

return v;
}
1.3s or 0.6s inline. It is still interesting to see that a loop that is iterating on average (32-8)/2+8 = 20 loops is only 5x slower than a single instruction. I guess shifts on x86 are very slow (I don’t know much about x86 and never want to).

I then ‘discovered’, the __builtin_clz() function – and threw away all the tinkering above. But it was still interesting to see how many different ways of doing the same thing, and how much difference it makes to performance. And it shows that C can be lacking.

And i’ll probably never look at it again – ahh the freedom of a hobby.

(boy this editor still sucks for pasting code – what a stupid piece of software for a blog for coders)

2 thoughts on “unexpected bits of bits”

  1. Rather than needlessly criticising the choice of tool and/or the software itself, why don’t you just turn off the GUI editor? You can change it on your profile page.

    Remember: Be excellent to each other.

  2. It isn’t needless criticism. It is a slow, clumsy, and confusing bit of software. And i’m sick of dealing with software like that.

    Anyway thanks for pointing out where the option is – it isn’t under any of the half a dozen ‘option’ tabs that made sense to me for it to be.

Leave a Reply

Your email address will not be published. Required fields are marked *