The world’s fastest VP8 decoder: FFmpeg

Performance graph for FFmpeg's VP8 decoder vs. libvpx

Performance chart for FFmpeg's VP8 decoder vs. libvpx

Jason does a great job explaining what we did and how we did it.

Posted in General | 4 Comments

Google’s VP8 video codec

Now that the hype is over, let’s talk the real deal. How good is Google’s VP8 video codec? Since “multiple independent implementations help a standard mature quicker and become more useful to its users”, I and others (David for the decoder core and PPC optimizations, Jason for x86 optimizations) decided that we should implement a native VP8 decoder in FFmpeg. This has several advantages over other approaches (e.g. linking to libvpx, which is Google’s decoder library for VP8):

  • we can share code (and more importantly: optimizations) between FFmpeg’s VP8 decoder and decoders for previous versions of the VPx codec series (e.g. the entropy coder is highly similar to the one in VP5/6). Thus, your phone’s future media player will be smaller and faster.
  • since H.264 (the current industry standard video codec) and VP8 are highly similar, we can share code (and more importantly: optimizations) between FFmpeg’s H.264 and VP8 decoders (e.g. intra prediction). Thus, again, your desktop computer’s future media player will be smaller and faster.
  • Since FFmpeg’s native VP3/Theora and Vorbis decoders (these are video/audio codecs praised by free software advocates) already perform better than the ones provided by Xiph (libvorbis/libtheora), it is highly likely that our native VP8 decoder will (once properly optimized) also perform better than Google’s libvpx. The pattern here is that since each libXYZ has to reinvent its own wheel, they’ll always fall short of reaching the top. FFmpeg comes closer simply because our existing wheels are like what you’d want on your next sports car.
  • Making a video decoder is fun!

In short, we wrote a video decoder that heavily reuses existing components in FFmpeg, leading to a vp8.c file that is a mere 1400 lines of code (including whitespace, comments and headers) and another 450 for the DSP functions (the actual math backend of the codec, which will be heavily optimized using SIMD). And it provides binary-identical output compared to libvpx for all files in the vector testsuite. libvpx’ vp8/decoder/*.c plus vp8/common/*.c alone is over 10,000 lines of code (i.e. this excludes optimizations), with another > 1000 lines of code in vpx/, which is the public API to actually access the decoder.

Current work is ongoing to optimize the decoder to outperform libvpx on a variety of computer devices (think beyond your desktop, it will crunch anything; performance becomes much more relevant on phones and such devices). More on that later.

Google's Test Suite, Vector 15 screenshot

Things to notice so far:

  • Google’s VP8 specs are not always equally useful. They only describe the baseline profile (0). Other profiles (including those part of the vector testsuite, i.e. 1-3) use features not described in the specifications, such as chroma fullpixel motion vector (MV) rounding and a bilinear motion compensation (MC) filter (instead of a subpixel six-tap MC filter). Several parts of the spec are incomplete (“what if a MV points outside the frame?”) or confusing (the MV reading is oddly spread through 3 sections in a chapter, where the code in each section specifically calls code from the previous section, i.e. they really are one section), which means that in the end, it’s much quicker to just read libvpx source code rather than depend on the spec. Most importantly, the spec really is a straight copy-paste of the decoder’s source code. As a specification, that’s not very useful or professional. We hope that over time, this will improve.
  • Google’s libvpx is full of (hopefully) well-performing assembly code, quite a lot of which isn’t actually compiled or used (e.g. the PPC code), which makes some of us wonder what the purpose of its presence is.
  • Now that VP8 is released, will Google release specifications for older (currently undocumented) media formats such as VP7?
Posted in General | 20 Comments

WMAVoice postfilter

I previously posted about my ongoing studies on the WMA Voice codec. A basic implementation of the actual codec was submitted and accepted/applied into FFmpeg SVN. Speech codecs work at ultra-low bitrates (~10kbps and lower) and suffer from obvious encoding artifacts, leading to “robotic” output sounds. Also, depending on the source (imagine a phone conversation in a mall), samples often have considerable levels of background noise. These types of artifacts are common to all speech codecs, and there are a variety of postfilters meant to reduce their effects. In fact, most speech codecs use the exact same filters. Imagine the smile on a developer’s face if a common proprietary postfilter can be implemented by calling no more than 3-4 already-implemented functions (as was the case with QCELP, another speech codec).

This was almost the case with WMAVoice, with one exception. This was the first time we saw an implementation of a Wiener filter. The purpose of the filter is noise reduction. Clearly, if noisy signal = signal + noise, then signal = noisy signal – noise. Sounds simple, right? The math is actually a little complex, but fortunately this is quite well-documented in the scientific literature of signal processing. The idea is that noise has lower signal strength than the intended signal. By increasing the contrast between the strength of these two, you decrease noise and thus enhance perception of the signal itself.

Here’s what the filter does:

  • Take FFT (“frequency distribution”) of the LPCs (“time-independent representation of signal”);
  • Calculate a power spectrum of these, which is basically a representation of the strongest power/frequency pairs versus the weakest ones, along with the desired level/strength of noise subtraction, as quasi-coefficients;
  • turn these into actual denoising filter coefficients using a Hilbert/Laplace transform;
  • apply these to the FFT of the “noisy” output of the speech synthesis filter.
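To make the “signal = noisy signal – noise” idea concrete, here is a minimal sketch of the textbook per-frequency Wiener gain, (P – N) / P, where P is the power of the noisy signal in one frequency bin and N is the estimated noise power. This is not the WMAVoice postfilter itself (which derives its coefficients from the LPCs, as listed above); the function names and the single flat noise estimate are assumptions made for illustration only.

```c
/* Textbook Wiener gain for a single frequency bin (illustrative only).
 * noisy_power: power of signal + noise in this bin
 * noise_power: estimated noise power (assumed known here)
 * Returns a factor in [0, 1) to multiply the bin with. */
static float wiener_gain(float noisy_power, float noise_power)
{
    /* Bins at or below the noise floor are treated as pure noise. */
    if (noisy_power <= noise_power)
        return 0.0f;
    return (noisy_power - noise_power) / noisy_power;
}

/* Compute one gain per bin of a power spectrum, using the same
 * (hypothetical) flat noise estimate for every bin. */
static void wiener_gains(const float *noisy_power, float noise_power,
                         float *gain, int n_bins)
{
    for (int k = 0; k < n_bins; k++)
        gain[k] = wiener_gain(noisy_power[k], noise_power);
}
```

Strong bins pass almost unchanged while weak bins are attenuated heavily, which is the “increase the contrast” effect described above: with a noise power of 1, a bin with power 100 gets a gain of 0.99, while a bin with power 2 gets only 0.5.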

The resulting patch was applied to SVN trunk last week. Thanks to Alex (hm, old…) and Vitor (hm, no blog…) for helping me understand! Time for something new, I guess…

Posted in General | 1 Comment

Google Summer-of-Code 2010 deadline nearing

I blogged about it before, but let’s remind all students that you can work on FFmpeg this summer, and earn money ($5000) while doing so. The deadline is this Friday, the 9th.

Google’s Summer-of-Code is a yearly recurring event where students spend their summer coding for free software projects, and make a buck. In the past few years, some of our much-valued contributions created during the Summer-of-Code have included a VC-1/WMV9, RealVideo3/4, WMAPro and AMR-NB decoder and an MPEG-4/AAC encoder/decoder (and many, many more!). This year, we have had several high-quality proposals from students wanting to work on network-related protocols or audio codecs, but are still looking for applications related to:

If you’re interested in learning more about the innermost workings of multimedia, you have good C-skills and are willing to learn a lot more about these, then send an email to the ffmpeg-soc mailinglist, or come to IRC (#ffmpeg-devel on Freenode) to find out more. Please apply before Friday!

Posted in General | 1 Comment

Google’s Summer of Code 2010

It’s that time again – the time when Google will announce their Summer of Code! In the Summer of Code, students can work on free software projects during their summer break, and make $4500 while they’re at it. FFmpeg has traditionally been a strong contender, and some of its highest profile code (VC-1, WMAPro and RealVideo4 decoders, just to name a few) was developed in part in the Summer of Code.

Are you a student, proficient in C, with excellent technical skills / insight (or you want to learn to develop these) and you want to contribute to one of the most exciting free software projects out there? Then apply for one of FFmpeg’s suggested projects for GSoC 2010!

Posted in General | 1 Comment

WMAVoice codec dissection

Since my last post, I’ve completed my first version of a free (as in speech) WMA-Voice/Speech codec. The patch is available on the FFmpeg mailinglist. As described previously, the codec has some very interesting behaviour, e.g. related to integer rounding. At unofficial forums, you will generally get the impression that use of WMAVoice is discouraged, because other codecs, notably WMAPro, offer better quality at the same bitrate.

This is essentially true. For low-bitrate streams, for which CELP-based voice codecs are optimized, WMAVoice has considerable noise compared to other voice codecs. Why is this? While studying the codec, I found one particularly interesting bug (?) in the codec. Most behaviour in CELP-based codecs is based on the concept of a pitch. A pitch is like the wavelength of a sine-like frequency curve. Usually, the bitstream will code an adaptive and a fixed codebook, where the adaptive codebook is basically some sort of a modified (e.g. different gain) repetition of a previous excitation signal, whereas the fixed codebook contains a completely new (“innovative”) excitation signal that was not based on any previous excitation signal. The excitation signal is a series of pulses (in an otherwise zero background) at a pitch-interval, and thus both codebooks are based on the pitch-value. After generation of the excitation signal using these two codebooks, the pulses from the two codebooks are then interleaved, so that LPCs can be used to synthesize the actual wave frequencies from the excitation pulses.
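As a rough illustration of how the two codebooks combine, the sketch below builds one block of excitation the way a generic CELP decoder typically does: each new sample is a scaled copy of the excitation one pitch period earlier (the adaptive part) plus a scaled fixed-codebook sample. The function name, the buffer layout and the simple integer pitch are assumptions for this example; it is not taken from the WMAVoice decoder.

```c
/* Generic CELP-style excitation for one block (illustrative sketch).
 * buf:    excitation buffer; buf[0..start-1] holds past excitation
 * start:  index of the first new sample (assumes pitch <= start)
 * n:      number of new samples to generate
 * pitch:  pitch lag in samples (integer, for simplicity)
 * gain_a: adaptive codebook gain (scales the repeated past signal)
 * fixed:  n fixed-codebook ("innovative") samples, mostly zero
 * gain_f: fixed codebook gain */
static void celp_excitation(float *buf, int start, int n, int pitch,
                            float gain_a, const float *fixed, float gain_f)
{
    for (int i = start; i < start + n; i++) {
        /* If pitch < n, buf[i - pitch] may already be a sample generated
         * in this block; that periodic extension is intentional. */
        buf[i] = gain_a * buf[i - pitch] + gain_f * fixed[i - start];
    }
}
```

Because every new sample depends on the sample one pitch period back, an error in the pitch value shifts the repeated pulses and feeds into all later blocks.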

So, pitches are important. Let’s see how WMAVoice calculates pitch: pitch is coded per-frame (a frame contains 160 audio samples). This code uses the frame-pitch (an actual integer, with no fractional bits) to calculate the pitch-per-block (which can be 20/40/80/160 samples, depending on the frame type) and the pitch-per-sample. These values are obviously fractional, e.g. if you go from a pitch of 40 to a pitch of 41, then somewhere halfway the frame you’ll have a pitch of “40.5”. In the next calculation, I’ve replaced the int-math by float-math to make it clearer what they’re doing.

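for (n = 0; n < s->n_blocks; n++) {
    /* f: relative position of the centre of block n within the frame */
    float f = (n + 0.5) * samples_per_block / samples_per_frame;
    /* note that f weighs the previous frame's pitch, not the current one */
    block_pitch[n] = round(f * prev_frame_pitch + (1 - f) * cur_frame_pitch);
}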

Why is this wrong? Well, look at what f means. f increments from (a bit above) zero to (a bit below) one for each block in the frame. If the frame has two blocks, f will subsequently have the values 0.25 and 0.75. However, block_pitch[n] receives a value for the first block of 0.25 * prev_frame_pitch + 0.75 * cur_frame_pitch. For the second block, we’ll get the reverse, i.e. 0.75 * prev_frame_pitch + 0.25 * cur_frame_pitch. Instead of creating an incremental array that slides us from the pitch of (the end of) the last frame towards the pitch of the (end of the) current frame, an array was created that slides back from the (end of the) current frame’s pitch to the (end of the) last frame’s pitch. The result is audible noise introduced in the decoded audio:

stddev: 1588.23 PSNR: 32.31 bytes: 264014/ 264014

Of course, the FFmpeg decoder does not have this bug.
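The difference is easy to see side by side. The sketch below puts the reversed weighting described above next to the expected forward interpolation; the function names are made up for this example and this is not the actual FFmpeg code.

```c
/* Relative position of the centre of block n within a frame,
 * e.g. 0.25 and 0.75 for a frame with two blocks. */
static float block_weight(int n, int n_blocks)
{
    return (n + 0.5f) / n_blocks;
}

/* Round a positive value to the nearest integer. */
static int round_pos(float x)
{
    return (int) (x + 0.5f);
}

/* Reversed weighting: slides from the current frame's pitch
 * back towards the previous frame's pitch. */
static int block_pitch_reversed(int n, int n_blocks, int prev_pitch, int cur_pitch)
{
    float f = block_weight(n, n_blocks);
    return round_pos(f * prev_pitch + (1.0f - f) * cur_pitch);
}

/* Forward weighting: slides from the previous frame's pitch
 * towards the current frame's pitch. */
static int block_pitch_forward(int n, int n_blocks, int prev_pitch, int cur_pitch)
{
    float f = block_weight(n, n_blocks);
    return round_pos((1.0f - f) * prev_pitch + f * cur_pitch);
}
```

For a frame with two blocks going from a pitch of 40 to a pitch of 48, the forward version yields 42 and then 46, whereas the reversed version yields 46 and then 42.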

Posted in General | 5 Comments

Codec woes

I’ve recently become interested in codecs. We all know the brilliant FFmpeg project, which has provided free (as per FSF definition) implementations of a variety of popular codecs, such as Windows Media or MPEG/H.26x.

I’ve started to study one member of the Windows Media Audio family, the WMAVoice codec. I’m studying an integer version of it, which means that multiplications are done by mult/shift pairs. For example, if you imagine a (32-bit) integer with 1 bit sign (+/-), 1 bit for pre-digit numbers and the other 30 bits being fractional, then the number 1.0 would be encoded as 0x3FFFFFFF. To multiply 1.0 * 1.0 (which should give 1.0 as result) in this integer scheme, you’d use:

#define mulFrac30(a, b) (((int64_t) (a) * (int64_t) (b)) >> 30)

By changing the “30” in this macro, you could multiply numbers with different amounts of fractional bits. Those trying it themselves will notice that mulFrac30(0x3FFFFFFF, 0x3FFFFFFF) results in 0x3FFFFFFE (i.e. 0.999999999), so the result is not precise. But it’s close enough to be useful on embedded devices where floating-point operations are expensive. (It’s true that I’m basically demonstrating (n * n) / ((n + 1) * (n + 1)) here, i.e. you can get around it for this particular example by encoding 1.0 as 0x40000000 instead of 0x3FFFFFFF, but similar behaviour would show up for other numbers.)
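For those who want to try it, here is a small sketch of the same multiplication as a function, with the number of fractional bits as a parameter instead of a hard-coded 30. The name mul_frac is invented for this example, and the code assumes non-negative inputs (right-shifting negative values is implementation-defined in C).

```c
#include <stdint.h>

/* Multiply two fixed-point numbers that each have frac_bits fractional
 * bits. The 64-bit intermediate holds the full product; shifting right
 * by frac_bits drops the extra fractional bits (truncating, not rounding). */
static int32_t mul_frac(int32_t a, int32_t b, int frac_bits)
{
    return (int32_t) (((int64_t) a * (int64_t) b) >> frac_bits);
}
```

With 30 fractional bits, mul_frac(0x3FFFFFFF, 0x3FFFFFFF, 30) gives 0x3FFFFFFE, while encoding 1.0 as 0x40000000 gives exactly 0x40000000 back.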

This “rounding error” becomes more prominent if you decrease the number of fractional bits. For example, 1.0 * 1.0 in frac-16 integers (0xFFFF) = 0xFFFE, which is 0.99997. It also gets worse as you convert between fractional types, e.g. going from a frac-16 1.0 (0xFFFF) to a frac-30 1.0 would be done by:

#define convertFrac16ToFrac30(x) ((x) << 14)

And the result of that on 1.0 in Frac-16 (0xFFFF) would be 0x3FFFC000. Why is that a problem? Well, look at this particular piece of code present in the codec:

void function(int array1[], int v, int size)
{
    /* frac-31 constants, rescaled to the range given by v */
    int stp = mulFrac31(v, 0xCCCCCC /* 0.00625 */),
        min = mulFrac31(v, 0x189374 /* 0.00075 */),
        max = mulFrac31(v, 0x3FE76C8B /* 0.49925 */), n;
    int array2[size], array3[size];

    /* convert the input to the scale of v */
    for (n = 0; n < size; n++)
        array2[n] = array3[n] =
            mulFrac31(array1[n], v);

    /* enforce a lower bound, a minimum spacing and an upper bound */
    array3[0] = MAX(array3[0], min);
    for (n = 1; n < size; n++)
        array3[n] = MAX(array3[n - 1] + stp, array3[n]);
    array3[size - 1] = MIN(array3[size - 1], max);

    /* convert back only the values that were changed */
    for (n = 0; n < size; n++)
        if (array2[n] != array3[n])
            array1[n] = divFrac31(array3[n], v);
}

This loosely translates to the simpler float version:

void function(float array1[], int v, int size)
{
    array1[0] = MAX(array1[0], 0.00075);
    for (int n = 1; n < size; n++)
        array1[n] = MAX(array1[n - 1] + 0.00625, array1[n]);
    array1[size - 1] = MIN(array1[size - 1], 0.49925);
}

But there's a catch. The original code converts from a scale as given by the input array1 to a scale provided by the magnitude of the "v" variable, whatever "v" may be. Turns out that "v" is the samplerate. For this type of codec, the samplerate is generally low, e.g. 8000Hz. Therefore, if the value in array1 is e.g. 0.25 (0x1FFFFFFF), then in array2/3, this same value would be mulFrac31(0x1FFFFFFF, 8000) = 1999. Converting this back would bring it to divFrac31(1999, 8000) = 0x1FFBE76E. We just chopped off 19 bits of our number! Needless to say, the quality of the audio is affected by this kind of behaviour (stddev is the root-mean-square difference between samples decoded with and without this behaviour):

stddev:    0.15 PSNR:112.41 bytes:   264014/   264014
stddev:    0.56 PSNR:101.29 bytes:   264014/   264014
stddev:   13.73 PSNR: 73.57 bytes:   527906/   527906
stddev:    2.35 PSNR: 88.87 bytes:  2538240/  2538240

Which makes me wonder: why do they do that?
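The round trip can be reproduced with a few lines of C. The helpers below are a sketch: mul_frac31 follows the mult/shift pattern shown earlier, but div_frac31 is a guess at what the inverse operation looks like, since the codec's own divFrac31 is not shown here, so its exact low-order bits may differ from the value quoted above. Non-negative inputs are assumed.

```c
#include <stdint.h>

/* Multiply with 31 fractional bits (truncating). */
static int32_t mul_frac31(int32_t a, int32_t b)
{
    return (int32_t) (((int64_t) a * (int64_t) b) >> 31);
}

/* Hypothetical inverse of mul_frac31: scale a up by 2^31, then divide. */
static int32_t div_frac31(int32_t a, int32_t b)
{
    return (int32_t) (((int64_t) a * ((int64_t) 1 << 31)) / b);
}

/* Convert x to the scale of v and straight back again. */
static int32_t round_trip(int32_t x, int32_t v)
{
    return div_frac31(mul_frac31(x, v), v);
}
```

With v = 8000, the intermediate value for 0x1FFFFFFF is 1999, an integer with only about 11 significant bits, so the value that comes back is noticeably smaller than the one that went in.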

Posted in General | 2 Comments

My first publications

In daily life, I’m a Ph.D. student studying the development of the central nervous system. A few months ago, the first paper was released with my name on it (as a second author), and just yesterday, I received notification of publication of my first paper as a first author.

  • Yu YC, Bultje RS, Wang X & Shi SH. Specific synapses develop preferentially among sister excitatory neurons in the neocortex. Nature 458 (7237), pp. 501-4 (2009) [ abstract at Nature ];
  • Bultje RS, Castaneda-Castellanos DR, Jan LY, Jan YN, Kriegstein AR & Shi SH. Mammalian Par3 Regulates Progenitor Cell Asymmetric Division via Notch Signaling in the Developing Neocortex. Neuron 63 (2), pp. 189-202 (2009) [ explanation, abstract at Neuron ].

Needless to say, I’m very excited and hope for more novel findings in the future. In short, my publication describes the identification of a mechanism by which radial glial cells, the “stem cells” that give rise to excitatory neurons in the cerebral cortex (the brain region that handles most higher-level functions in mouse and man), divide “asymmetrically”. In this process, stem cells divide to give rise to two different kinds of daughter cells: one is another radial glial cell, which will undergo the same process again. The other daughter cell will be a neuron or a transit-amplifying cell (a committed neuronal precursor that will divide to give rise to two neurons). We identified and analyzed the protein mPar3, which segregates into one half of the dividing cell, and we identified two other molecules downstream of mPar3, through which mPar3 regulates cell fate (“stem cell” versus “neuron”) of the daughter cells.

Asymmetric Division of Radial Glial Cells

Posted in General | 2 Comments

A Real Shame

As Christian pointed out, Real Networks is an odd pea in our pot. They do Free Software at some level, but at other levels they appear to truly dislike “us” – the Free Software community. A friend of mine runs a company shipping a media product based on FFmpeg. This media product includes decoding capabilities for Microsoft’s and Real’s proprietary audio formats. When trying to buy patent licenses for their free software-based product (which can be legal; Google does this also in Chrome), this company received the following responses:

  • Microsoft: “Sure, no problem”
  • Real: “FFmpeg developers are thieves so we don’t want your money if you’ll use their product”

It’s interesting to point out here that the once-so-hated Microsoft has – especially after the EU antitrust case – done everything that we once asked for. They have published many of their protocols on MSDN. They might be corporate to the bone, but they play relatively fair and there’s will & potential for co-existence on both ends. Real wants no co-existence. They do not subscribe to our “Free” ideals.

Posted in General | 3 Comments

PowerPoint Remote for your iPhone

First, a short introduction to how this all came about. Lately, I’ve been interested in iPhone applications. The one thing that Apple did really well (after some initial crack) is to deliver an SDK and ecosystem to extend the iPhone with all the crap that you could come up with. Most of them are games, and I don’t really play games, but some of them are quite useful.

For example, consider that with minimal effort, you could write applications to make your iPhone act as:

  • a USB data stick
  • a remote control for tv, audio or computer
  • a notepad

This all comes in addition to basic features that already exist:

  • calendar
  • portable e-mail
  • SMS, IM, social networks

Suddenly the very thought of carrying 20 items with you to the office every day feels really tragic. Don’t get me started about all these devices that are lingering around at home in the living room. How many remote controls do you have? You can probably imagine that I would love the Blackberry a lot, too.

For my studies, I use PowerPoint. A lot. Imagine my excitement when I found that several applications existed to use your iPhone as a remote control for PowerPoint, such as iClickr and iPresenter. Both failed in the usability department, especially when compared to Apple’s Keynote iPhone Remote application, and therefore I decided that I would have to write my own. I present you: iPhone PowerPoint Remote (more screenshots available by clicking the link).

  • swipe and tap your finger to switch between slides
  • drag your finger to show an on-screen (in your actually running slideshow) arrow pointer
  • hold-and-drag your finger to make colored annotations


The application is available in the Apple iTunes store, and requires a small piece of software to run on your host Mac, available here.

Posted in General | 3 Comments