VP9 Analyzer

Almost a year ago, I decided to quit my job and start my own business. Video coding technology in general, and VP9 specifically, seemed interesting enough that I should be able to build a business on top of it, right? The company is called Two Orioles.

Two Orioles Logo

As a first product, I’ve created a VP9 bitstream analyzer. What’s a bitstream analyzer? It’s a tool to analyze the VP9 bitstream, of course! As such, it will visualize coding tools used for each VP9 frame, such as block/transform decompositions, intra/inter prediction modes used, segmentation maps; it also displays the frame buffer at each decoding stage (prediction, pre-loopfilter, final reconstruction), differences between each of these stages, and error between each stage and the source. It can also export block-, frame- and stream-level statistics  to external tools (e.g. Google Sheets or Microsoft Excel) for further analysis.

Screen Shot 2016-01-13 at 11.18.36 AM Screen Shot 2016-01-13 at 11.16.06 AM

I’m considering adding support for more codecs to it, let me know if you’re interested in that.

Posted in General | 5 Comments

VP9 encoding/decoding performance vs. HEVC/H.264

A while ago, I posted about ffvp9, FFmpeg‘s native decoder for the VP9 video codec, which significantly outperforms Google’s decoder (part of libvpx). We also talked about encoding performance (quality, mainly), and showed VP9 significantly outperformed H.264, although it was much slower. The elephant-in-the-room question since then has always been: what about HEVC? I couldn’t address this question back then, because the blog post was primarily about decoders, and FFmpeg’s decoder for HEVC was immature (from a performance perspective). Fortunately, that concern has been addressed! So here, I will compare encoding (quality+speed) and decoding (speed) performance of VP9 vs. HEVC/H.264. [I previously presented this at the Webm Summit and VDD15, and a Youtube version of that talk is available also.]

Encoding quality

The most important question for video codecs is quality. Scientifically, we typically encode one or more video clips using standard codec settings at various target bitrates, and then measure the objective quality of each output clip. The recommended objective metric for video quality is SSIM. By drawing these bitrate/quality value pairs in a graph, we can compare video codecs. Now, when I say “codecs”, I really mean “encoders”. For the purposes of this comparison, I compared libvpx (VP9), x264 (H.264) and x265 (HEVC), each using 2-pass encodes to a set of target bitrates (x264/x265: –bitrate=250-16000; libvpx: –target-bitrate=250-16000) with SSIM tuning (–tune=ssim) at the slowest (i.e. highest-quality) setting (x264/5: –preset=veryslow; libvpx: –cpu-used=0), all forms of threading/tiling/slicing/wpp disabled, and a 5-second keyframe interval. As test clip, I used a 2-minute fragment of Tears of Steel (1920×800).

vp9-x264-x265-encoding-quality

This is a typical quality/bitrate graph. Note that both axes are logarithmic. Let’s first compare our two next-gen codecs (libvpx/x265 as encoders for VP9/HEVC) with x264/H.264: they’re way better (green/ref is left of blue, which means “smaller filesize for same quality”, or alternatively you could say they’re above, which means “better quality for same filesize”). Either way, they’re better. This is expected. By how much? So, we typically try to estimate how much more bits “blue” needs to accomplish the same quality as (e.g.) “red”, by comparing an actual point of red to an interpolated point (at the same SSIM score) of the blue line. For example, the red point at 1960kbps has an SSIM score of 18.16. The blue line has two points at 17.52 (1950) and 18.63 (3900kbps). Interpolation gives an estimated point for SSIM=18.16 around 2920kbps, which is 49% larger. So, to accomplish the same SSIM score (quality), x264 needs 49% more bitrate than libvpx. Ergo, libvpx is 49% better than x264 at this bitrate, this is called the bitrate improvement (%). x265 gets approximately the same improvement over x264 as libvpx at this bitrate. The distance between the red/green lines and blue line get larger as the bitrate goes down, so the codecs have a higher bitrate improvement at low bitrates. As bitrates go up, the improvements go down. We can also see slight differences between x265/libvpx for this clip: at low bitrates, x265 slightly outperforms libvpx. At high bitrates, libvpx outperforms x265. These differences are small compared to the improvement of either encoder over x264, though.

Encoding speed

So, these next-gen codecs sound awesome. Now let’s talk speed. Encoder devs don’t like to talk speed and quality at the same time, because they don’t go well together. Let’s be honest here: x264 is an incredibly well-optimized encoder, and many people still use it. It’s not that they don’t want better bitrate/quality ratios, but rather, they complain that when they try to switch, it turns out these new codecs have much slower encoders, and when you increase their speed settings (which lowers their quality), the gains go away. Let’s measure that! So, I picked a target bitrate of 4000kbps for each encoder, using otherwise the same settings as earlier, but instead of using the slow presets, I used variable-speed presets (x265/x264: –preset=placebo-ultrafast; libvpx: –cpu-used=0-7).

vp9-x264-x265-encoding-speed

This is a graph people don’t talk about often, so let’s do exactly that. Horizontally, you see encoding time in seconds per frame. Vertically, we see bitrate improvement, the metric we introduced previously, basically a combination of the quality (SSIM) and bitrate, compared to a reference point (x264 @ veryslow is the reference point here, which is why the bitrate improvement over itself is 0%).

So what do these results mean? Well, first of all, yeah, sure, x265/libvpx are ~50% better than x264, as claimed. But, they are also 10-20x slower. That’s not good! If you normalize for equal CPU usage, you’ll notice that (again looking at the x264 point at 0%, 0.61 sec/frame), if you look at intersected points of the red line (libvpx) vertically above it, the bitrate improvement normalized for CPU usage is only 20-30%. For x265, it’s only 10%. What’s worse is that the x265 line actually intersects with the x264 line just left of that. In practice, that means that if your CPU usage target for x264 is anything faster than veryslow, you basically want to keep using x264, since at that same CPU usage target, x265 will give worse quality for the same bitrate than x264. The story for libvpx is slightly better than for x265, but it’s clear that these next-gen codecs have a lot of work left in this area. This isn’t surprising, x264 is a lot more mature software than x265/libvpx.

Decoding speed

Now let’s look at decoder performance. To test decoders, I picked the x265/libvpx-generates files at 4000kbps, and created an additional x264 file at 6500kbps, all of which have an approximately matching SSIM score of around 19.2 (PSNR=46.5). As decoders, I use FFmpeg’s native VP9/H264/HEVC decoders, libvpx, and openhevc. OpenHEVC is the “upstream” of FFmpeg’s native HEVC decoder, and has slightly better assembly optimizations (because they used intrinsics for their idct routines, whereas FFmpeg still runs C code in this place, because it doesn’t like intrinsics).

ffmpeg-libvpx-openhevc-decoder-speed

So, what does this mean? Let’s start by comparing ffh264 and ffvp9. These are FFmpeg’s native decoders for H.264 and VP9. They both get approximately the same decoding speed, ffvp9 is in fact slightly faster, by about 5%. Now, that’s interesting. When academics typically speak about next-gen codecs, they claim it will be 50% slower. Why don’t we see that here? The answer is quite simple: because we’re comparing same-quality (rather than same-bitrate) files. Decoders that are this well optimized and mature, tend to spend most of their time in decoding coefficients. If the bitrate is 50% larger, it means you’re spending 50% more time in coefficient decoding. So, although the codec tools in VP9 may be much more complex than in VP8/H.264, the bitrate savings cause us to not spend more time doing actual decoding tasks at the same quality.

Next, let’s compare ffvp9 with libvpx-vp9. The difference is pretty big: ffvp9 is 30% faster! But we already knew that. This is because FFmpeg’s codebase is better optimized than libvpx. This also introduces interesting concepts for potential encoder optimizations: apparently (in theory) we should be able to make encoders that are much better optimized (and thus much faster) than libvpx. Wouldn’t that be nice?

Lastly, let’s compare ffvp9 to ffhevc: VP9 is 55% faster. This is partially because HEVC is much, much, much more complex than VP9, and partially because of the C idct routines in ffhevc. To normalize, we also compare to openhevc (which has idct intrinsics). It’s still 35% slower, so the story for VP9 at this point seems more interesting than for HEVC. A lot of work is left to be done on FFmpeg’s HEVC decoder.

Multi-threaded decoding

Lastly, let’s look at multi-threaded decoding performance:

ffmpeg-libvpx-openhevc-decoder-mt

Again, let’s start by comparing ffvp9 with ffh264: ffh264 scales much better. This is expected, the backwards adaptivity feature in VP9 affects multithreaded scaling somewhat, and ffh264 doesn’t have such a feature. Next, ffvp9 versus ffhevc/openhevc: they both scale about the same. Lastly: libvpx-vp9. What happened? Well, when backwards adaptivity is enabled and tiling is disabled in the VP9 bitstream, libvpx doesn’t use multi-threading at all, so I’ll call it a TODO item in libvpx. There is no reason why this is the case, as is proven by ffvp9.

Conclusions

  • Next-gen codecs provide 50% bitrate improvements over x264, but are 10-20x as slow at the top settings required to accomplish such results.
  • Normalized for CPU usage, libvpx already has some selling points when compared to x264; x265 is still too slow to be useful in most practical scenarios except in very high-end scenarios.
  • ffvp9 is an incredibly awesome decoder that outperforms all other decoders.

Lastly, I was asked this question during my VDD15 talk, and it’s fair question so I want to address it here: why didn’t I talk about encoder multi-threading? There’s certainly a huge scope of discussion there (slicing, tiling, frame-multithreading, WPP).  The answer is that the primary target of my encoder portion was VOD (e.g. Youtube), and they don’t really care about multi-threading, since it doesn’t affect total workload. If you encode four files in parallel on a 4-core machine and each takes 1 minute, or you encode each of them serially using 4 threads, where each takes 15 seconds, you’re using the full machine for 1 minute either way. For clients of VOD streaming services, this is different, since you and I typically watch one Youtube video at a time.

Posted in General | 36 Comments

The world’s fastest VP9 decoder: ffvp9

As before, I was very excited when Google released VP9 – for one, because I was one of the people involved in creating it back when I worked for Google (I no longer do). How good is it, and how much better can it be? To evaluate that question, Clément Bœsch and I set out to write a VP9 decoder from scratch for FFmpeg. The goals never changed from the original ffvp8 situation (community-developed, fast, free from the beginning). We also wanted to answer new questions: how does a well-written decoder compare, speed-wise, with a well-written decoder for other codecs? TLDR (see rest of post for details):

  • as a codec, VP9 is quite impressive – it beats x264 in many cases. However, the encoder is slow, very slow. At higher speed settings, the quality gain melts away. This seems to be similar to what people report about HEVC (using e.g. x265 as an encoder).
  • single-threaded decoding speed of libvpx isn’t great. FFvp9 beats it by 25-50% on a variety of machines. FFvp9 is somewhat slower than ffvp8, and somewhat faster than ffh264 decoding speed (for files encoded to matching SSIM scores).
  • Multi-threading performance in libvpx is deplorable, it gains virtually nothing from its loopfilter-mt algorithm. FFvp9 multi-threading gains nearly as much as ffh264/ffvp8 multithreading, but there’s a cap (material-, settings- and resolution-dependent, we found it to be around 3 threads in one of our clips although it’s typically higher) after which further threads don’t cause any more gain.

The codec itself

To start, we did some tests on the encoder itself. The direct goal here was to identify bitrates at which encodings would give matching SSIM-scores so we could do same-quality decoder performance measurements. However, as such, it also allows us to compare encoder performance in itself. We used settings very close to recommended settings for VP8, VP9 and x264, optimized for SSIM as a metric. As source clips, we chose Sintel (1920×1080 CGI content, source), a 2-minute clip from Tears of Steel (1920×800 cinematic content, source), and a 3-minute clip from Enter the Void (1920×818 high-grain/noise content, screenshot). For each, we encoded at various bitrates and plotted effective bitrate versus SSIM.

sintel_ssimtos_ssim

 

You’ll notice that in most cases, VP9 can indeed beat x264, but, there’s some big caveats:

  • VP9 encoding (using libvpx) is horrendously slow – like, 50x slower than VP8/x264 encoding. This means that encoding a 3-minute 1080p clip takes several days on a high-end machine. Higher –cpu-used=X parameters make the quality gains melt away.
  • libvpx’ VP9 encodes miss the target bitrates by a long shot (100% off) for the ETV clip, possibly because of our use of –aq-mode=1.
  • libvpx tends to slowly decay towards normal at higher bitrates for hard content – again, look at the ETV clip, where x264 shows some serious mature killer instinct at the high bitrate end of things. [edit 6/3/’14: original results showed x264 beating libvpx by a lot at high bitrates, but the source had undergone double compression itself so we decided to re-do these experiments – thanks to Clement for picking up on this.]

Overall, these results are promising, although the lack-of-speed is a serious issue.

Decoder performance

For decoding performance measurements, we chose (Sintel) 500 (VP9)1200 (VP8) and 700 (x264) kbps (SSIM=19.8); Tears of Steel 4.0 (VP9)7.9 (VP8) and 6.3 (x264) mbps (SSIM=19.2); and Enter the Void 9.7 (VP9)16.6 (VP8) and 10.7 (x264) mbps (SSIM=16.2). We used FFmpeg to decode each of these files, either using the built-in decoder (to compare between codecs), or using libvpx-vp9 (to compare ffvp9 versus libvpx). Decoding time was measured in seconds using “time ffmpeg -threads 1 [-c:v libvpx-vp9] -i $file -f null -v 0 -nostats – 2>&1 | grep user”, with this FFmpeg and this libvpx revision (downloaded on Feb 20th, 2014).

sintel_archs

tos_archsetv_archs

 

A few notes on ffvp9 vs. libvpx-vp9 performance:

  • ffvp9 beats libvpx consistently by 25-50%. In practice, this means that typical middle- to high-end hardware will be able to playback 4K content using ffvp9, but not using libvpx. Low-end hardware will struggle to playback even 720p content using libvpx (but do so fine using ffvp9).
  • on Haswell, the difference is significantly smaller than on sandybridge, likely because libvpx has some AVX2 optimizations (e.g. for MC and loop filtering), whereas ffvp9 doesn’t have that yet; this means this difference might grow over time as ffvp9 gets AVX2 optimizations also.
  • on the Atom, the differences are significantly smaller than on other systems; the reason for this is likely that we haven’t done any significant work on Atom-performance yet. Atom has unusually large latencies between GPRs and XMM registers, which means you need to take special care in ordering your instructions to prevent unnecessary halts – we haven’t done anything in that area yet (for ffvp9).
  • Some users may find that ffvp9 is a lot slower than advertised on 32bit; this is correct, most of our SIMD only works on 64bit machines. If you have 32bit software, port it to 64bit. Can’t port it? Ditch it. Nobody owns 32bit x86 hardware anymore these days. [Edit: as of 12/27/2014, all ffvp9 optimizations work on 32-bit, and baseline has moved from SSSE3 to SSE2.]

So how does VP9 decoding performance compare to that of other codecs? There’s basically two ways to measure this: same-bitrate (e.g. a 500kbps VP8 file vs. a 500kbps VP9 file, where the VP9 file likely looks much better), or same-quality (e.g. a VP8 file with SSIM=19.2 vs. a VP9 file with SSIM=19.2, where the VP9 file likely has a much lower bitrate). We did same-quality measurements, and found:

  • ffvp9 tends to beat ffh264 by a tiny bit (10%), except on Atom (which is likely because ffh264 has received more Atom-specific attention than ffvp9).
  • ffvp9 tends to be quite a bit slower than ffvp8 (15%), although the massive bitrate differences in Enter the Void actually makes it win for that clip (by about 15%, except on Atom). Given that Google promised VP9 would be no more than 40% more complex than VP8, it seems they kept that promise.
  • we did some same-bitrate comparisons, and found that x264 and ffvp9 are essentially identical in that scenario (with x264 having slightly lower SSIM scores); vp8 tends to be about 50% faster, but looks significantly worse.

Multithreading

One of the killer-features in FFmpeg is frame-level multithreading, which allows multiple cores to decode different video frames in parallel. Libvpx also supports multithreading. So which is better?

sintel_decspeedtos_decspeedetv_decspeed

 

Some things to notice:

  • libvpx multithreading performance is deplorable. It gains virtually nothing. This is likely because libvpx’ VP9 decoder supports only loopfilter-multithreading (which is enabled here), or tile multithreading, which is only enabled if files are encoded with –frame-parallel (which disables backwards adaptivity, a major source of quality improvement in VP9 over VP8) and –tile-rows=0 –tile-cols=N for N>0 (i.e. only tile columns, but specifically no tile rows). It’s confusing why this combination of restriction exists before tile-multithreading is enabled (in theory, it could be enabled whenever –tile-cols=N for N>0, but for now it looks like libvpx’ decoding performance won’t gain anything from multithreading in most practical settings.
  • ffvp9 multithreading performance is mostly on-par with that of ffvp8/ffh264, although it scales slightly less well (i.e. the performance improvement is marginally worse for ffvp9 than for ffvp8/ffh264)…
  • … but you’ll notice a serious issue at 4 threads in Enter the Void – suddenly it stops improving. Why? Well, this clip is very noisy and encoded at a high bitrate, which effectively means that there will be many non-zero coefficients, and thus a dispropotionally high percentage of decoding time (as much as 30%) will be spent in coefficient decoding. Remember when I mentioned backwards adaptivity? A practical side-effect of this feature is that the next frame can only start decoding when the previous frame has finished decoding all coefficients (and modes), so that adaptivity updates can actually take place before the next thread starts decoding the next frame. If coefficient decoding takes 30% plus another 5-10% for mode decoding and other overhead, it means 35-40% of processing time is non-reconstruction-related and can’t be parallelized in VP9 – thus performance reaches a ceiling at 2.5-3 threads. The solution? –frame-parallel=1 in the encoder, but then quality will drop.

Next steps

So is ffvp9 “done” now? Well, it’s certainly usable, and has been fuzzed extensively, thus it should be relatively secure (so not to repeat this), but it’s nowhere near done:

  • many functions (idct16/32, iadst16, motion compensation, loopfilter) could benefit from AVX2 implementations.
  • there’s no SIMD optimizations for non-x86 platforms yet (e.g. arm-neon).
  • more special-use-cases like Atom have not been explored yet.
  • ffvp9 does not yet support SVC or 444. [Edit: as of 05/06/2015, SVC, profile 1 (4:2:2, 4:4:0 ad 4:4:4) and profile 2-3 (10-12 bpp support) are supported.]

But all of this is decoder-only, and the 800-pound gorilla issue for VP9 adoption – at this point – is encoder performance (i.e. speed).

What about HEVC?

Well, HEVC has no optimized, opensource decoder yet, so there’s nothing to measure. It’s coming, but not yet finished. We did briefly look into x265, one of the more popular HEVC encoders. Unfortunately, it suffers from the same basic issue as libvpx: it can be fast, and it can beat x264, but it can’t do both at the same time.

Raw data

See here. Also want to high-five Clément Bœsch for writing the decoder with me, and thank Clément Bœsch (again) and Hendrik Leppkes for helping out with the performance measurements.

Posted in General | 18 Comments

Brute-force thread-debugging

Thread debugging should be easy; there’s advanced tools like helgrind and chess, so it’s a solved problem, right?

Once upon a time, FFmpeg merged the mt-branch, which allowed frame-level multi-threading. While one CPU core decodes frame 1, the next CPU core will decode frame 2 in parallel (and so on for any other CPU cores you have). This might sound somewhat odd, because don’t most video codecs use motion vectors to access data in previously coded reference frames? Yes, they do, but we can simply add a condition variable so that thread 2 waits for the relevant data in the reference frame (concurrently decoded by thread 1) to have finished reconstructing that data, and all works fine. Although this might seem to destroy the whole point of concurrency, it works well in practice (because motion vectors tend to not cross a whole frame).

Heisenbugs and their tools

Like any other software feature, this feature contained bugs. Threading bugs have the funny name “heisenbugs”: by virtue of the scheduling of instructions on your 2 CPU cores not being identical between different runs, the interaction between 2 threads will not be identical between 2 runs of exact the same commandline. In FFmpeg, we use an elaborate framework knows as FATE to test for video decoder regressions, and we set up some stations to specifically test various multithreading configurations. As you’d expect with heisenbugs, some of these would occasionally fail a test, but otherwise run OK. So how do you debug this?

Let me start with chess. Chess is actually an extension to MSVC, so I actually first had to port FFmpeg to MSVC (which was also useful for Chrome). With that problem out of the way, this should be easy right? Last release 5 years ago, forum dead as of 2011, right… Anyway, what chess attempts to do, is to settle a fixed scheduling path between your different threads, such that they will interact in the same way between multiple runs, thus allowing you to consistently reproduce the same bug for debugging purposes. That’s incredibly helpful, but I never tried it out at the end. I’m looking forward to this appearing in some next version of MSVC.

So, helgrind. FATE actually has a helgrind station, and it sucks, reporting 1000s of potential races for files that have never failed decoding (that is, they are pixel-perfect every single time). Is there a race? Who knows, maybe. But I’m not interested in debugging theoretical races, I want a tool that helps me debug stuff that is happening. Imagine how infuriating asan, valgrind or gdb would be if they told us about stuff that might crash instead of the crash we’re investigating. (Now, post-hoc, it turns out that helgrind did indeed identify one of the bugs causing the heisenbugs in ffmpeg-mt, but it was lost in the noise.)

Brute-force heisen-debugging

So now that all our best tools are not all that helpful, what to do? I ended up doing it the brute-force way (In this example, I’m debugging the h264-conformance-cama2_vtc_b FATE test in FFmpeg):

$ make THREADS=2 V=1 fate-h264-conformance-cama2_vtc_b
[..]
ffmpeg -nostats -threads 2 -thread_type frame -i cama2_vtc_b.avc -f framecrc -

Note that it didn’t fail! So now that we know what commandline it’s executing, let’s change that into something that brute-forces a heisenbug out of its hiding. First, let’s generate a known-good reference:

$ ./ffmpeg -threads 1 -i cama2_vtc_b.avc -f md5 -nostats -v 0 -
MD5=ec33975ec4d2fccc55485da3f37a755b

Note that that used only 1 thread, since it serves as our known-good reference. Lastly, let’s see how (and how often) we can make that fail by running it as often as it takes until it fails:

$ cat test.sh
i=0
while [ true ]; do
  MD5=$(./ffmpeg -threads 2 -thread_type frame \
            -i cama2_vtc_b.avc -f md5 -nostats -v 0 -)
  if [ "$MD5" != "MD5=ec33975ec4d2fccc55485da3f37a755b" ]; then
    echo "$i failed! $MD5"
  else
    printf "$i\r"
  fi
  ((i++))
done
$ bash test.sh
2731 failed! MD5=9cdbf390e5aed1e723c7c3a2def96377
3681 failed! MD5=64a112a2cfc61610a5f75c65293bbbbc
5892 failed! MD5=10224e406d4a2451c60e642a24fc3dce

And we have a reproducible failing testcase! One problem with thread debugging is failures are hard to reproduce, and another is that we may be looking at different failures at the same time (as is demonstrated by the different outputs for the 2 shown failures). However, we’d like to focus on runs that fail in one particular type of way (assuming that the cause for identical-output failures is consistent), thus taking the heisen- out of the bug. We can adjust the script slightly to focus on any one of our choosing (it turned out that all failures for this particular FATE test were caused by the same bug, displaying itself in slightly different ways).

$ cat test2.sh
i=0
while [ true ]; do
  MD5=$(./ffmpeg -threads 2 -thread_type frame \
            -i cama2_vtc_b.avc -f md5 -nostats \
            -v 0 - -y -f yuv4mpegpipe out.y4m)
  if [ "$MD5" != "MD5=64a112a2cfc61610a5f75c65293bbbbc" ]; then
    echo "$i failed! $MD5"
    break
  elif [ "$MD5" != "MD5=ec33975ec4d2fccc55485da3f37a755b" ]; then
    echo "$i failed (the wrong way): $MD5"
  else
    printf "$i\r"
  fi
  ((i++))
done
$ bash test2.sh
2201 failed (the wrong way): MD5=9cdbf390e5aed1e723c7c3a2def96377
9587 failed! MD5=64a112a2cfc61610a5f75c65293bbbbc

And with the heisen-part out of the way, we can now start debugging this as any other bug (printf debugging is easy this way, but you could even get fancy and try to attach to gdb when a particular situation occurs). Below is a comparison of ref.y4m (left, decoded with -threads 1) and out.y4m (right, delta from left with enhanced contrast). The differences are the 3 thin horizontal black/white lines towards the top of the frame. Further research by focussing more narrowly on the decoding process for these specific blocks (using the same technique) led to this fix, and the same technique was also used to fix two other heisenbugs.

delta

 

Posted in General | 3 Comments

Microsoft Visual Studio support in FFmpeg and Libav

An often-requested feature for FFmpeg is to compile it using Microsoft Visual Studio’s C compiler (MSVC). The default (quite arrogant) answer used to be that this is not possible, because the godly FFmpeg code is too good for MSVC. Usually this will be followed by some list of C language features/extensions that GCC supports, but MSVC doesn’t (e.g. compound literals, designated initializers, GCC-style inline assembly). There are complete patches and forks related to this one single feature.

Reality is, many of these C language features are cosmetic extensions introduced in C99 that are trivially emulated using classic C89 syntax. Consider designated initializers:

struct {
    int a, b;
} var = { .b = 1, };

This can be trivially emulated in C89 by using the following syntax:

struct {
    int a, b;
} var = { 0, 1 };

For unions, you can change the initialization (as long as the size of the first field is large enough to hold the contents of any other field in the union) to do a binary translation of the initialized field type to the first field type:

union {
    unsigned int a;
    float b;
} var = { .b = 1.0, };

becomes:

union {
    unsigned int a;
    float b;
} var = { 0x3f800000, };

Here, 0x3f800000 is the binary representation of the floating point number 1.0. If the value to be converted is not static, the assignment can simply become a statement on its own:

union {
    unsigned int a;
    float b;
} var;
var.b = 1.0;

Other C99 language features (e.g. compound literals) can be translated in a similar manner:

struct {
    int *list;
} var = { (int *) { 0, 1 } };

becomes:

int *list = { 0, 1 };
struct {
    int *list;
} var = { list };

Two other Libav developers (Derek Buitenhuis and Martin Storsjo) and I wrote a conversion tool that automatically translates these C99 language features to C89-compatible equivalents. With this tool, the FFmpeg and Libav source trees can be translated and subsequently compiled with MSVC. A wrapper is provided so that you can tell the FFmpeg build script to use that as compiler. The wrapper will then (internally) call the conversion utility to convert the source file from C99 to C89, and then it calls the MSVC build tools to compile the resulting “C89’ified source file”. In the end, this effectively means FFmpeg and Libav can be compiled with MSVC, and the resulting binaries are capable of decoding all media types covered by the test suite (32bit, 64bit) and can be debugged using the Visual Studio debugger.

For the adventurous, here’s a quick guide (this is being added to the official Windows build documentation as-we-speak):

Requirements:

  • Microsoft Visual Studio 2010 or above (2008 may work, but is untested; 2005 won’t work);
  • msys (part of mingw or mingw-64);
  • yasm;
  • zlib, compiled with MSVC;
  • a recent version (e.g. current git master) of Libav or FFmpeg.

Build instructions:

  • from the Start menu, open a “Visual Studio Command Prompt” for whatever version of Visual Studio you want to use to compile FFmpeg/Libav;
  • from this DOS shell, open a msys shell;
  • first-time-only – build c99-to-c89 (this may be tricky for beginners):
    • you’ll need clang, compiled with MSVC, for this step;
    • check out the c99-to-c89 repository;
    • compile it with clang (this probably requires some manual Makefile hackery; good luck!);
    • at some point in the near future, we will provide pre-compiled static binaries to make this easier (then, you won’t need clang anymore);
  • get the C99 header file inttypes.h from code.google.com and place it in the root folder of your source tree;
  • use the configure option “–toolchain=msvc” to tell it to use the MSVC tools (rather than the default mingw tools) to compile FFmpeg/Libav. Ensure that the c99-to-c89 conversion tools (c99wrap.exe and c99conv.exe, generated two steps up) are in your $PATH;
  • now, “make” will generate the libraries and binaries for you.

If you want to run tests (“fate”), use the “–samples=/path/to/dir” configure option to tell it where the test suite files are located. You need bc.exe (not included in default msys install) in your $PATH to run the testsuite.

It’s probably possible to generate Visual Studio solutions (.sln files) to import this project in the actual Visual Studio user interface (e.g. libvpx does that) so you no longer need the msys shell for compilation (just for configure). Although we haven’t done that yet, we’re very interested in such a feature.

Posted in General | 46 Comments

Time for something new

In the beginning of December, Frederik was born. He’s growing up nicely.

At the end of December, I succesfully defended my PhD thesis (see earlier post) and was awarded a PhD for my research titled “Notch signaling in forebrain neurogenesis”. In January, the PhD was officially awarded.

So as my family expands and needs a bigger house, and my old way-to-spend-the-day came to an end, it was time for something new. Earlier this week, I started a new job as engineer at the big G. Rumor has it that I’ll be working on something related to video.

Posted in General | 2 Comments

Meet Frederik

The latest addition to our little sprouting family: Frederik Jie Bultje. Born December 12th, 2010 in New York.

Frederik Jie Bultje

Frederik Jie Bultje

Posted in General | 6 Comments

The world’s fastest VP8 decoder: FFmpeg

Performance graph for FFmpeg's VP8 decoder vs. libvpx

Performance chart for FFmpeg's VP8 decoder vs. libvpx

Jason does a great job explaining what we did and how we did it.

Posted in General | 4 Comments

Google’s VP8 video codec

Now that the hype is over, let’s talk the real deal. How good is Google’s VP8 video codec? Since “multiple independent implementations help a standard mature quicker and become more useful to its users”, me and others (David for the decoder core and PPC optimizations, Jason for x86 optimizations) decided that we should implement a native VP8 decoder in FFmpeg. This has several advantages from other approaches (e.g. linking to libvpx, which is Google’s decoder library for VP8):

  • we can share code (and more importantly: optimizations) between FFmpeg’s VP8 decoder and decoders for previous versions of the VPx codec series (e.g. the entropy coder is highly similar compared to VP5/6). Thus, your phone’s future media player will be smaller and faster.
  • since H.264 (the current industry standard video codec) and VP8 are highly similar, we can share code (and more importantly: optimizations) between FFmpeg’s H.264 and VP8 decoders (e.g. intra prediction). Thus, again, your desktop computer’s future media player will be smaller and faster.
  • Since FFmpeg’s native VP3/Theora and Vorbis decoders (these are video/audio codecs praised by free software advocates) already perform better than the ones provided by Xiph (libvorbis/libtheora), it is highly likely that our native VP8 decoder will (once properly optimized) also perform better than Google’s libvpx. The pattern here is that since each libXYZ has to reinvent its own wheel, they’ll always fall short of reaching the top. FFmpeg comes closer simply because our existing wheels are like what you’d want on your next sports car.
  • Making a video decoder is fun!

In short, we wrote a video decoder that heavily reuses existing components in FFmpeg, leading to a vp8.c file that is a mere 1400 lines of code (including whitespace, comments and headers) and another 450 for the DSP functions (the actual math backend of the codec, which will be heavily optimized using SIMD). And it provides binary-identical output compared to libvpx for all files in the vector testsuite. libvpx’ vp8/decoder/*.c plus vp8/common/*.c alone is over 10,000 lines of code (i.e. this excludes optimizations), with another > 1000 lines of code in vpx/, which is the public API to actually access the decoder.

Current work is ongoing to optimize the decoder to outperform libvpx on a variety of computer devices (think beyond your desktop, it will crunch anything; performance becomes much more relevant on phones and such devices). More on that later.

Google's Test Suite, Vector 15 screenshot

Google's Test Suite, Vector 15 screenshot

Things to notice so so far:

  • Google’s VP8 specs are not always equally useful. They only describe the baseline profile (0). Other profiles (including those part of the vector testsuite, i.e. 1-3) use features not described in the specifications, such as chroma fullpixel motion vector (MV) rounding, a bilinear motion compensation (MC) filter (instead of a subpixel six-tap MC filter). Several parts of the spec are incomplete (“what if a MV points outside the frame?”) or confusing (the MV reading is oddly spread through 3 sections in a chapter, where the code in each section specifically calls code from the previous section, i.e. they really are one section), which means that in the end, it’s much quicker to just read libvpx source code rather than depend on the spec. Most importantly, the spec really is a straight copypaste of the decoder’s source code. As a specification, that’s not very useful or professional. We hope that over time, this will improve.
  • Google’s libvpx is full of (hopefully) well-performing assembly code, quite some of which isn’t actually compiled or used (e.g. the PPC code), which makes some of us wonder what the purpose of its presence is.
  • Now that VP8 is released, will Google release specifications for older (currently undocumented) media formats such as VP7?
Posted in General | 20 Comments

WMAVoice postfilter

I previously posted about my ongoing studies on the WMA Voice codec. A basic implementation of the actual codec was submitted and accepted/applied into FFmpeg SVN. Speech codecs work at ultra-low bitrates (~10kbps and lower) and suffer from obvious encoding artifacts, leading to “robotic” output sounds. Also, depending on the source (imaging a phone conversation in a mall), samples often have considerable levels of background noise. These types of artifacts are common to all speech codecs, and there are a variety of postfilters meant to reduce their effects. In fact, most speech codecs use the exact same filters. Imagine the smile on a developer’s face if a common proprietary postfilter can be implemented by calling no more than 3-4 already-implemented functions (as was the case with QCELP, another speech codec).

This was almost the case with WMAVoice, with one exception. This was the first time we saw an implementation of a Wiener filter. The purpose of the filter is noise reduction. Clearly, if noisy signal = signal + noise, then signal = noisy signal – noise. Sounds simple, right? The math is actually a little complex, but fortunately this is quite well-documented in the scientific literature of signal processing. The idea is that noise has lower signal strength than the intended signal. By increasing the contrast between the strength of these two, you decrease noise and thus enhance perception of the signal itself.

Here’s what the filter does:

  • Take FFT (“frequency distribution”) of the LPCs (“time-independent representation of signal”);
  • Calculate a power spectrum of these, which is basically a representation of the strongest power/frequency pairs versus the weakest ones, along with the desired level/strength of noise subtraction, as quasi-coefficients;
  • turn these into actual denoising filter coefficients using a Hilbert/Laplace transform;
  • apply these to the FFT of the “noisy” output of the speech synthesis filter.

The resulting patch was applied to SVN trunk last week. Thanks to Alex (hm, old…) and Vitor (hm, no blog…) for helping me understand! Time for something new, I guess…

Posted in General | 1 Comment