While working on the performance of SpectMorph, a project I’ll blog about later, using *oprofile* and *valgrind –tool=callgrind* both indicated that much time was spent in libm sincos() and/or libm sin(). Now the libm sincos() function uses the fsincos FPU instruction on x86 processors, and this instruction takes *about 100 cpu cycles*. If you need really really a lot of these sin values – like I do in SpectMorph – then this adds up to a significant percentage of all CPU time used.

So the first idea was to use a table and an interpolation scheme to implement an own sincos() function that is faster than the FPU version.

However, after thinking again, there is an even better way, if you need sincos (phase), sincos (phase + phase_inc), sincos (phase + 2 * phase_inc) … sincos (phase + n * phase_inc). So if what you really want is creating a sine/cosine wave, there is a trick:

Using a complex number for the start phase c_start, and a complex number for the phase increment c_inc, you can rewrite the above as c_start, c_start * c_inc^1, c_start * c_inc^2, c_start * c_inc^3 and so on. So the whole task is reduced to multiplying c_start with c_inc again and again. From my measurements, implementing this reduces the amount of work to *about 10 cpu cycles*. To get good precision, every once in a while, the code should do a real sincos() call to compute a new c_start.

Is that all that can be optimized? No. The math can be made even faster when using SSE instructions. Then you can compute four of these complex multiplications at once, so you effectively end up with *about 3.5 cpu cycles* per sincos() value, or *less than 2.5 cpu cycles* if you only need sin() values. In this case the imaginary part of the complex value needs to be computed, but not the real part.

Here is the source code for generating an array of sin()/sincos() values. Note that you need to compile with -msse for SSE instructions, and that all float arrays need to be 16-byte aligned.

Also note that if you need the *weighted sum of many sine waves*, you might be better off using an inverse FFT; I need that sometimes but not always, so this case can be optimized seperately.