I previously posted about my ongoing studies on the WMA Voice codec. A basic implementation of the actual codec was submitted and accepted/applied into FFmpeg SVN. Speech codecs work at ultra-low bitrates (~10kbps and lower) and suffer from obvious encoding artifacts, leading to “robotic” output sounds. Also, depending on the source (imaging a phone conversation in a mall), samples often have considerable levels of background noise. These types of artifacts are common to all speech codecs, and there are a variety of postfilters meant to reduce their effects. In fact, most speech codecs use the exact same filters. Imagine the smile on a developer’s face if a common proprietary postfilter can be implemented by calling no more than 3-4 already-implemented functions (as was the case with QCELP, another speech codec).
This was almost the case with WMAVoice, with one exception. This was the first time we saw an implementation of a Wiener filter. The purpose of the filter is noise reduction. Clearly, if noisy signal = signal + noise, then signal = noisy signal – noise. Sounds simple, right? The math is actually a little complex, but fortunately this is quite well-documented in the scientific literature of signal processing. The idea is that noise has lower signal strength than the intended signal. By increasing the contrast between the strength of these two, you decrease noise and thus enhance perception of the signal itself.
Here’s what the filter does:
- Take FFT (“frequency distribution”) of the LPCs (“time-independent representation of signal”);
- Calculate a power spectrum of these, which is basically a representation of the strongest power/frequency pairs versus the weakest ones, along with the desired level/strength of noise subtraction, as quasi-coefficients;
- turn these into actual denoising filter coefficients using a Hilbert/Laplace transform;
- apply these to the FFT of the “noisy” output of the speech synthesis filter.