11.04.2010 Dear lazyweb: Screencasting under Linux

I’d like to produce screencasts of BEAST, a sequencer/synthesizer under Linux. Ideally, I’d like to record my voice via headset, the screen with the mouse, and the audio output from BEAST. I tried to install Istanbul, mainly because it’s already packaged for Debian, but it didn’t work.

So if anyone has done screencasting of audio apps (or even normal apps) under Linux and can recommend something that just works out of the box, any suggestion is welcome.

19.12.2010 Switching from x2x to synergy

For some time now, I’ve been using x2x to combine two X servers into one virtual workspace. It’s nice, because you can move your mouse off the display (say, of your desktop machine) to the X server of another computer (in my use case: my laptop), and continue typing and clicking with your main mouse/keyboard. However, x2x gave me a few headaches, since copy-pasting (middle mouse button) would not always work correctly between the two displays of such a setup.

Then I found out that synergy does the right thing when it comes to copy-pasting, so I switched from x2x to synergy. If you’re seeing the same problems with x2x, I can only recommend trying synergy; most likely it will “just work” for you, too.

19.11.2010 SpectMorph: it’s fast now, too – and has many sound examples

I’ve finally managed to make a new release of SpectMorph, a C++ based project for creating and morphing sound models from samples; it still doesn’t have the morphing part, but at least it’s fast now, too. Depending on the CPU used, 100 to 300 simultaneous voices are realistic, which should be enough for almost any composition.

Since SpectMorph can now import SoundFont files, I used this to build many sound examples that compare how the SpectMorph models sound with how the original SoundFont sounds. Ideally they would be identical. After listening to quite a few of these files, I’d say that the SpectMorph approach in principle works for a wide variety of sounds, BUT that the encoding algorithm produces more or less audible artefacts for some sounds, which hopefully can be fixed by improving the encoder.

The SpectMorph Homepage has all the samples (FLAC, Ogg and MP3), so I’m just going to link one example here, to give you an idea of how closely SpectMorph and the original can match: Bach on a Church Organ using SpectMorph and using the Original Samples

05.11.2010 Profiling in the new millennium

In the last weeks, my top priority for SpectMorph development was performance optimization. I wanted to make sure that musicians will be able to use SpectMorph in highly polyphonic music, so the main goal was to get the per-voice CPU usage down to something close to the theoretical minimum.

In this blog post I’d like to describe the tools I used, because any developer will need to do performance optimizations every once in a while. There are three basic techniques that I found of great use. In fact, I believe using these three techniques is sufficient, and other tools (like gprof) may have been useful in the past, but are no longer ideal these days.

Direct time measurement is the perfect, undistorted view of what really happens. If a function is slow, running that function in a loop 100000 times (or so) and calling gettimeofday() before and after the loop will give you an idea of how much time is spent in that function per invocation. This is the obvious way to measure performance, and since it is not distorted by the measurement process itself, it’s the only “true” performance measure. There is no way around this technique, because anything else you do to measure performance will always affect execution time. So even if your profiling tool (discussed below) tells you that you have made some progress with a change in the code: do not believe it unless the actual execution time goes down even without the profiling tool. I’ve had my share of “good advice” given by some profiling tool which turned out to have absolutely no effect (or to make things slower) when applied to the real-world code.
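
To make this concrete, here is a minimal sketch of such a measurement loop (function_under_test() is just a placeholder workload I made up for this post; substitute whatever you actually want to benchmark):

    #include <sys/time.h>
    #include <cmath>
    #include <cstdio>

    // placeholder workload -- replace with the function you want to benchmark
    static double
    function_under_test()
    {
      volatile double x = 0;
      for (int i = 0; i < 100; i++)
        x = x + sin (i * 0.001);
      return x;
    }

    // wall clock time in seconds
    static double
    gettime()
    {
      timeval tv;
      gettimeofday (&tv, 0);
      return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    int
    main()
    {
      const int RUNS = 100000;
      double start = gettime();
      for (int i = 0; i < RUNS; i++)
        function_under_test();
      double end = gettime();
      printf ("%.3f microseconds per invocation\n", (end - start) / RUNS * 1e6);
      return 0;
    }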

Valgrind Callgrind + KCachegrind is an easy-to-use combination for getting an overview of which parts of a program are expensive. Using valgrind’s virtualization (--tool=callgrind), you can run your program and count the number of instructions that are executed in each function. This may be the time to talk about the “new millennium”. Back in 199x, optimization was a fairly easy task. Each processor instruction would take a more or less well-defined number of cycles, so minimizing the running time of a function was the same as minimizing the sum of the cycles of its instructions. You could know beforehand that replacing two instructions taking 3 cycles each with one instruction taking 4 cycles would result in a performance gain. If valgrind had been written back then, it probably would have had a table of instruction costs built into it, so it could tell you the precise running time of each function (instead of the number of instructions).

However, processors have changed. There are multiple units for integer and float computations, so sometimes two additions can be executed in parallel, which means that replacing two additions with one will not necessarily speed anything up. There is branch prediction, which means that a correctly predicted jump may take a lot less time than an incorrectly predicted jump. Modern processors have pipelining, so they are effectively not executing one instruction at a time, but many in parallel. Data dependencies between instructions will cause this process to be slower in some cases and faster in others. Caches further complicate the issue, because if something is not in the cache, one instruction might take forever, even though it is something simple (like an integer addition).

Why bother counting instructions with valgrind, then? Because the number of instructions is still often a good first approximation of what is going on. Rewriting a function that used to take 100 instructions so that it only takes 20 instructions is not guaranteed to speed things up (do a direct time measurement to find out), but it is likely. Still, you have to be careful when interpreting valgrind’s results: for instance, the FSIN instruction takes forever (90 cycles or so), so replacing one FSIN with ten other instructions will usually speed things up, even though valgrind tells you that the function now takes a bigger share of the total instruction count.

Finally, one last remark about why I like valgrind’s callgrind tool so much when visualized with kcachegrind: it gives you a good visualization of which functions call which functions, and how many instructions (or what percentage) are spent where. You do not need to recompile anything (like with gprof) to do this. You can inspect the assembler code along with the source code, to see exactly which instructions a line of source code produces. And it’s not a statistical tool: it counts exactly what would happen if the program were running without valgrind.

OProfile + KCachegrind is the most sophisticated combination I found for approximating what really happens while the program is executed. As I’ve described above, there are a lot of factors that influence how long something /really/ takes. The number of instructions is an approximation, but not the truth. There is no table you can use to find out how long something will take (like in 199x). However, OProfile simply uses statistics it collects while the program is running (on the real CPU, not a virtualized one like valgrind), and then assigns the costs to the source/assembler code. There is a nice tool called op2calltree (Debian has it in kcachegrind-converters) that allows browsing the source/assembler code of each function while displaying the OProfile costs.

It’s basically what you want, and I found it really useful. Just one word of warning: I’ve briefly described above that in reality, a processor does not execute one instruction after another; there are complex, parallel processes that execute many instructions at a time, on many units (like executing more than one integer multiplication simultaneously, maybe even in combination with FPU instructions). So any tool that assigns costs to individual instructions (and that’s what OProfile does) can never fully represent what is really happening. So even with this tool, you cannot see what truly happens when the program runs, but only get a simplified view of it.

One example: I’ve seen cases where a good share of the costs of computing a for() loop was assigned to the loop instructions (jump, add, …). This made me believe that if I unrolled the loop, the function would be faster. I unrolled the loop, and the function was not faster. Why? Probably because before unrolling, the instructions within the loop’s body were executed in parallel with the loop instructions (jump, add, …), and after unrolling there was no more parallel execution to gain. In both cases, the factor that determined how long the for() loop took were the instructions within the loop, not the jump, add, … .
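
To illustrate what I mean, here is a simplified example I made up for this post (not the actual SpectMorph loop). Whether the second version is faster depends entirely on how well the loop overhead of the first version overlaps with the additions – which is exactly why a direct time measurement is needed:

    // simplified illustration (made up for this post, not the actual SpectMorph code)
    float
    sum_plain (const float *v, int n)
    {
      float s = 0;
      for (int i = 0; i < n; i++)   // the jump/add loop overhead can execute in
        s += v[i];                  // parallel with the float addition
      return s;
    }

    // manually unrolled version; n is assumed to be a multiple of 4 here
    float
    sum_unrolled (const float *v, int n)
    {
      float s = 0;
      for (int i = 0; i < n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
      return s;
    }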

To sum it up: I think that these days, understanding why something is slow or how to make it faster is a little trickier than it used to be in 199x, but by combining the three methods I described, you can get the information you need to optimize the performance of your code.

This blog entry is already longer than what most people would like to read, and I’ve spent many weeks doing nothing but performance optimizations. If you’ve managed to read this far, here are a few bullet points on other issues that may be worth keeping in mind, but that would make this post much, much longer if described in detail:

* optimize only what you know to take a lot of time – if something is not really an issue, you need not bother spending time making it faster

* your intuition of why something is slow will often be wrong – you need to understand the reason why something is slow if you want to make it faster

* branch prediction is important for fast code – if you have a branch which the processor cannot predict, your code will be much slower than if you have a branch which the processor can predict (see the sketch after this list)

* SSE instructions are your friend when optimizing FPU code; if you know that some part of your algorithm is using lots of FPU instructions, then SSEifying may make it 2-4 times faster

* changing algorithms is often the most powerful optimization – if you can replace an O(N log N) algorithm with an O(N) algorithm, this may reduce the time that step takes much more dramatically than trying to squeeze instructions out of your original approach

* what is optimal depends on the machine that executes your code – for instance naive float->int conversion is notoriously slow on x86, but not so slow on AMD64; if you want good performance across both CPUs, you need to measure/profile on both targets
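
To make the branch prediction point a bit more concrete, here is a small, self-contained benchmark sketch (my own toy example, unrelated to SpectMorph): the same counting loop runs once over random data and once over sorted data, so only the predictability of the branch changes.

    // hypothetical benchmark: count the elements >= 128, once in random order
    // and once in sorted order -- only the branch predictability differs
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <sys/time.h>

    static double
    gettime()
    {
      timeval tv;
      gettimeofday (&tv, 0);
      return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    static long
    count_big (const std::vector<int>& data)
    {
      long count = 0;
      for (size_t i = 0; i < data.size(); i++)
        if (data[i] >= 128)         // this branch is what we are interested in
          count++;
      return count;
    }

    int
    main()
    {
      std::vector<int> data (1 << 22);
      for (size_t i = 0; i < data.size(); i++)
        data[i] = rand() % 256;

      double t = gettime();         // random data: the branch is unpredictable
      long c1 = count_big (data);
      printf ("unsorted: %f sec (count %ld)\n", gettime() - t, c1);

      std::sort (data.begin(), data.end());
      t = gettime();                // sorted data: the branch is predictable
      long c2 = count_big (data);
      printf ("sorted:   %f sec (count %ld)\n", gettime() - t, c2);
      return 0;
    }

Depending on the compiler and optimization flags, the branch may be turned into a conditional move or the loop may be vectorized, in which case both runs take about the same time – yet another reason to verify everything with direct time measurement.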

15.07.2010 gst123 – command line media player – interview

The editors at FamousWhy have conducted an interview about gst123, which has been published today on their site. So if you’re interested in gst123, this might be useful to read.

02.07.2010 Releases, releases

Release early, release often is one of the most valuable pieces of advice to keep in mind when doing free software development (if you haven’t heard it before, try reading “The Cathedral and the Bazaar”). So… I’ve spent the last couple of days making new releases of the software I maintain.

* The Foldroid mail folder selection tool should, as of the 0.0.2 release, build out of the box on OpenSUSE and other systems that don’t have the ncurses includes in /usr/include
* The gst123 media player has seen a lot of improvements since the initial release, thanks to community input. The gst123-0.1.1 release should be much better for watching videos, while it keeps the ability to run in text-only mode.
* SpectMorph 0.0.3 should sound much better than any previous release; it’s still work in progress, but if you listen to the piano samples for 0.0.2 and 0.0.3 (on the web page), you should hear a much better reproduction of the piano’s attack.

If you happen to find problems with any of those releases, let me know 🙂 For gst123 there are new fixes in the git repo that were too late for the release, so if you run into bugs there, try git first. I will make another gst123 release in the near future.

21.06.2010 foldroid – Mail Folder Selection Tool – released

I’ve been using mutt to manage my e-mail for many years now, and the one thing I’ve been missing is a nice overview of the folders that I have (since I read many mailing lists, I use procmail to put each list into a separate folder). So I’ve written a curses based tool that lets you choose which (mbox style) mailbox you want to read, and starts mutt with that mbox. It also shows the number of messages in each mbox, and the number of new messages. In my setup, I use it together with screen, so that each mutt is started in a new screen window, which allows me to have many folders open at once.

So I’ve now fixed a couple of annoying things in foldroid and made the initial public release. It’s LGPL-licensed and can be found on The Foldroid page, which also has a screenshot of foldroid at work.

17.05.2010 SpectMorph – or what makes a piano sound like a piano?

The open source project I’ve been working on for quite some time is called SpectMorph. So far it is a partial answer to the question of what makes a piano sound like a piano. The initial idea was that by analyzing single notes of a piano, it would be possible to decompose the samples into a representation that allows reproducing the original sound, yet also allows advanced operations (in a better way than operating on the wave files directly).

If the SpectMorph model of a piano could capture the piano-like aspects of piano samples, and if the SpectMorph model of a trumpet could capture the trumpet-like aspects of trumpet samples, then a morphing algorithm could blend these two models into a sound that lies somewhere between a piano and a trumpet. That’s the goal of SpectMorph in general. The model SpectMorph uses decomposes the original wave files into a sum of pure sine waves with time-varying magnitude and phase information. This is a much better representation for morphing samples – at least I think so – than the original wave files. Note that I haven’t invented anything new here; the basic technique is known as Spectral Modelling Synthesis.
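
To give a rough idea of what such a model looks like, here is a heavily simplified sketch (the data structures are made up for this post and are not SpectMorph’s actual API): synthesis boils down to summing sine waves whose magnitude and frequency change from analysis frame to analysis frame.

    #include <cmath>
    #include <vector>

    // hypothetical, heavily simplified data structures -- not SpectMorph's actual API
    struct Partial
    {
      std::vector<double> freq;   // frequency per analysis frame (Hz)
      std::vector<double> mag;    // magnitude per analysis frame
    };

    // synthesize one output sample for a given analysis frame by summing all
    // sine partials; `phases` carries the running phase of each partial
    double
    synth_sample (const std::vector<Partial>& partials, std::vector<double>& phases,
                  size_t frame, double sample_rate)
    {
      double out = 0;
      for (size_t p = 0; p < partials.size(); p++)
        {
          phases[p] += 2 * M_PI * partials[p].freq[frame] / sample_rate;
          out += partials[p].mag[frame] * sin (phases[p]);
        }
      return out;
    }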

Yesterday, I released spectmorph-0.0.2, which can analyze piano samples (and probably others, like trumpet or saxophone samples) and produce a fairly convincing imitation based on the models alone. The morphing remains to be implemented, and other problems which become apparent when listening to the original and SpectMorph versions of the sounds – I created a few listening examples – remain to be fixed; mostly, the attack of the original piano samples sounds harder than in the SpectMorph reproduction.

So this is not end-user digestible software yet, but if you’re interested in the current state of the code, you can find it here.

12.05.2010 Optimizing sincos() for consecutive phase values

While working on the performance of SpectMorph, a project I’ll blog about later, both oprofile and valgrind --tool=callgrind indicated that much time was spent in libm sincos() and/or libm sin(). Now the libm sincos() function uses the fsincos FPU instruction on x86 processors, and this instruction takes about 100 CPU cycles. If you need a really, really large number of these sine values – like I do in SpectMorph – this adds up to a significant percentage of all CPU time used.

So the first idea was to use a table and an interpolation scheme to implement my own sincos() function that is faster than the FPU version.

However, after thinking about it again, there is an even better way if you need sincos (phase), sincos (phase + phase_inc), sincos (phase + 2 * phase_inc) … sincos (phase + n * phase_inc). So if what you really want is to create a sine/cosine wave, there is a trick:

Using a complex number for the start phase, c_start, and a complex number for the phase increment, c_inc, you can rewrite the above as c_start, c_start * c_inc^1, c_start * c_inc^2, c_start * c_inc^3 and so on. So the whole task is reduced to multiplying c_start with c_inc again and again. From my measurements, implementing this reduces the amount of work to about 10 CPU cycles per value. To get good precision, the code should do a real sincos() call every once in a while to compute a new c_start.
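
A minimal sketch of the scalar version of this trick could look like this (the naming and the renormalization interval are my own choices for illustration, not necessarily what SpectMorph ended up using):

    #include <cmath>
    #include <complex>
    #include <vector>

    // fill out_sin/out_cos with sin/cos (phase + n * phase_inc) for n = 0 .. n_values - 1
    void
    fast_sincos_array (double phase, double phase_inc, size_t n_values,
                       std::vector<float>& out_sin, std::vector<float>& out_cos)
    {
      const size_t RENORM = 64;     // how often to recompute exactly (arbitrary choice)

      std::complex<double> c_start (cos (phase), sin (phase));
      const std::complex<double> c_inc (cos (phase_inc), sin (phase_inc));

      out_sin.resize (n_values);
      out_cos.resize (n_values);
      for (size_t n = 0; n < n_values; n++)
        {
          if (n % RENORM == 0)      // limit the accumulated rounding error
            c_start = std::complex<double> (cos (phase + n * phase_inc),
                                            sin (phase + n * phase_inc));
          out_sin[n] = c_start.imag();
          out_cos[n] = c_start.real();
          c_start *= c_inc;         // advance the phase by phase_inc
        }
    }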

Is that all that can be optimized? No. The math can be made even faster using SSE instructions. Then you can compute four of these complex multiplications at once, so you effectively end up with about 3.5 CPU cycles per sincos() value, or less than 2.5 CPU cycles if you only need sin() values. In that case only the imaginary part of the complex value needs to be computed, not the real part.

Here is the source code for generating an array of sin()/sincos() values. Note that you need to compile with -msse for SSE instructions, and that all float arrays need to be 16-byte aligned.
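
For illustration, here is a rough sketch of how such an SSE variant could look (again my own simplified version, not the actual SpectMorph code): four oscillators run in parallel, offset by one step each, and every iteration advances all of them by 4 * phase_inc. As with the scalar version, a real sincos() call should recompute the state every once in a while to keep the error small.

    #include <cmath>
    #include <cstddef>
    #include <xmmintrin.h>          // SSE intrinsics; compile with -msse

    // fill out_sin[0 .. n_values - 1] with sin (phase + n * phase_inc);
    // n_values is assumed to be a multiple of 4, out_sin must be 16-byte aligned
    void
    sse_sin_array (float phase, float phase_inc, size_t n_values, float *out_sin)
    {
      // lane i holds the oscillator for phase + i * phase_inc
      __m128 re = _mm_setr_ps (cosf (phase),
                               cosf (phase + phase_inc),
                               cosf (phase + 2 * phase_inc),
                               cosf (phase + 3 * phase_inc));
      __m128 im = _mm_setr_ps (sinf (phase),
                               sinf (phase + phase_inc),
                               sinf (phase + 2 * phase_inc),
                               sinf (phase + 3 * phase_inc));

      // every iteration advances all four lanes by 4 * phase_inc
      const __m128 inc_re = _mm_set1_ps (cosf (4 * phase_inc));
      const __m128 inc_im = _mm_set1_ps (sinf (4 * phase_inc));

      for (size_t n = 0; n < n_values; n += 4)
        {
          _mm_store_ps (out_sin + n, im);   // the sin values are the imaginary parts

          // complex multiplication: (re + i*im) *= (inc_re + i*inc_im)
          __m128 new_re = _mm_sub_ps (_mm_mul_ps (re, inc_re), _mm_mul_ps (im, inc_im));
          __m128 new_im = _mm_add_ps (_mm_mul_ps (re, inc_im), _mm_mul_ps (im, inc_re));
          re = new_re;
          im = new_im;
        }
    }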

Also note that if you need the weighted sum of many sine waves, you might be better off using an inverse FFT; I need that sometimes, but not always, so this case can be optimized separately.

27.03.2010 gst123

I’ve released gst123 to the public today. It’s a command line tool like ogg123 or mpg123, capable of playing MP3, Ogg, FLAC and whatever other audio files gstreamer supports. So it works well on music collections where not all files are encoded in the same format.

Since I’ve been using (and thus testing) it for quite some time now, it should just work. You can grab it here.