Since my last post,I’ve completed my first version of a free (as in speech) WMA-Voice/Speech codec. The patch is available on the FFmpeg mailinglist. As described previously, the codec has some very interesting behaviour, e.g. related to integer rounding. At unofficial forums, you will generally get the impression that use of WMAVoice is discouraged, because other codecs, notable WMAPro, offer better quality at the same bitrate.
This is essentially true. For low-bitrate streams, for which CELP-based voice codecs are optimized, WMAVoice has considerable noise compared to other Voice codecs. Why is this? While studying the codec, I found one particularly interesting bug (?) in the codec. Most behaviour in CELP-based codecs is based on the concept of a pitch. A pitch is like the wavelength of a sine-like frequency curve. Usually, the bitstream will code an adaptive and a fixed codebook, where the adaptive codebook is basically some sort of a modified (e.g. different gain) repetition of a previous excitation signal, whereas the fixed codebook contains completely new (“innovative”) excitation signal that was not based on any previous excitation signal. The excitation signal is a series of pulses (in an otherwise zero background) at a pitch-interval, and thus both codebooks are based on the pitch-value. After generation of the excitation signal using these two codebooks, the pulses from the two codebooks are then interleaved, so that LPCs can be used to synthesize the actual wave frequencies from the excitation pulses.
So, pitches are important. Let’s see how WMAVoice calculates pitch: pitch is coded per-frame (a frame contains 160 audio samples). This code uses the frame-pitch (an actual integer, with no fractional bits) to calculate the pitch-per-block (which can be 20/40/80/160 samples, depending on the frame type) and the pitch-per-sample. These values are obviously fractional, e.g. if you go from a pitch of 40 to a pitch of 41, then somewhere halfway the frame you’ll have a pitch of “40.5”. In the next calculation, I’ve replaced the int-math by float-math to make it clearer what they’re doing.
for (n = 0; n < s->n_blocks; n++) {
float f = (samples_per_block + 0.5) / samples_per_frame;
block_pitch[n] = round(f * prev_frame_pitch + (1 - f) * cur_frame_pitch);
}
Why is this wrong? Well, look at what f means. f increments from (a bit above) zero to (a bit below) one for each block in the frame. If the frame has two blocks, f will subsequently have the values 0.25 and 0.75. However, block_pitch[n] receives a value for the first block of 0.25 * prev_frame_pitch + 0.75 * cur_frame_pitch. For the second block, we’ll get the reverse, i.e. 0.75 * prev_frame_pitch + 0.25 * cur_frame_pitch. Instead of creating an incremental array that slides us from the pitch of (the end of) the last frame towards the pitch of the (end of the) current frame, an array was created that slides back from the (end of the) current frame’s pitch to the (end of the) last frame’s pitch. The result is audible noise introduced in the decoded audio:
stddev: 1588.23 PSNR: 32.31 bytes: 264014/ 264014
Of course, the FFmpeg decoder does not have this bug.
Nice work!
Pingback: Hobo’s Codebook | Free Coder
Pingback: Codebreaker Codebook teacher’s Edition | Free Coder
Pingback: WMA Voice in FFmpeg | Breaking Eggs And Making Omelettes
Pingback: WMAVoice postfilter « Ronald S. Bultje