One project we have been working on here at Red Hat Brno is making sure we have nicely working voice and video calling with Empathy in Fedora 18. The project is being spearheaded by Debarshi Ray, with me trying to help out with the testing. We are still not there, but we are making good progress thanks to the help of people like Brian Pepple, Sjoerd Simons, Olivier Crete, Guillaume Desmottes and more.
But having been involved with open source multimedia for so long, I thought it could be interesting for people to know why free video calling has taken so long to get right and why we still have a little way to go. So I decided to do this write-up of some of the challenges involved. Be aware, though, that this article mostly discusses the general historical challenges of getting free VoIP up and running, but I will try to tie that into the specific issues we are currently trying to resolve where relevant.
Protocols
The first challenge that had to be overcome was the challenge of protocols. VoIP and video calling have been around for a while (which an application like Ekiga is proof of), but they have been hampered by a jungle of complex standards, closed protocols, lack of interoperability and so on. Some of the older standards also require non-free codecs to operate. The open standard that has started to turn this around is XMPP, the protocol that came out of the Jabber project. Originally it was just an open text chat network, but thanks to ongoing work it now features voice and video conferencing too. It also got a boost when Google chose it as the foundation for their GTalk offering, ensuring that anyone with a Gmail address was suddenly available to chat or call. That said, like any developing protocol it has its challenges, and some slight differences in behaviour between Google's Jabber server and most others are currently causing us some pain with video calls, which is one of the issues we are trying to figure out how to resolve.
Codecs and interoperability
The other thing that has hounded us is the combination of non-free codecs and the need for interoperability. For a video calling system to be interesting to use, you need to be able to use it to contact at least a substantial subset of your friends and family. For the longest time this meant using a non-free codec, because if you relied solely on free codecs no widely used client out there would be able to connect with you. But the effort of Xiph.org to create first the Speex audio codec and now most recently the Opus audio codec, together with the later adoption of Speex by Google, has at least mostly resolved things on the audio side. On the video side things are still not 100% there. We have the Theora video codec from Xiph.org, but unfortunately when the RTP specification for that codec was written, the primary use case in mind was RTSP streaming and not video conferencing, making the Theora RTP a bit hairy to use for video conferencing. The other, bigger issue with Theora is that outside the Linux world nobody adopted it for video calling, so once again you are not likely to be able to use it to call a very large subset of your friends and family unless they are all on Linux systems.
There might be a solution on the way, though, in the form of the new kid on the block, VP8. VP8 is a video codec that Google released as part of their WebM HTML5 video effort. The RTP specification for VP8 is still under development, so adoption is limited, but the hope and expectation is that Google will support VP8 in their GTalk client once the RTP specification is stable, and thus we should have a good set of free codecs for both audio and video available and in the hands of a large user base.
Frameworks
Video calling is quite a complex technical problem, with a lot of components needing to work together: audio and video acquisition on your local machine, integration with your address book, negotiating the call between the parties involved, putting everything into RTP packets on one side and unpacking and displaying them on the other side, all while taking into account the network, firewalls and audio and video sync. So in order for a call to work you will need (among others) ALSA, PulseAudio, V4L2, GStreamer, Evolution Data Server, Farstream, libnice, the XMPP server, Telepathy and Empathy to work together across two different systems. And if you want to interoperate with a 3rd party system like GTalk, the list of components that all need to work perfectly with each other grows further.
A lot of this software has been written in parallel with each other and in parallel with evolving codecs and standards, and it tries to interoperate with as many 3rd party systems as possible. This has come at the cost of stability, which of course has turned people off from using and testing the video call functionality of Empathy. But we believe that we have now reached a turning point where the pieces are in place, which is why we are trying to help stabilize and improve the experience so that VoIP and video conferencing calls work nicely out of the box on Fedora 18.
Missing pieces
In addition to the nitty-gritty of protocols and codecs, there are other pieces that have been lacking to give users a really good experience. The most critical one is good echo cancellation. This is required in order to avoid an ugly echo effect when trying to use your laptop's built-in speakers and microphone for a call. So people have been forced to use a headset to make things work reasonably well. This was quite a hard issue to solve, as there was neither any great open source code available that implemented echo cancellation nor a good way to hook it into the system. To start addressing this issue while I was working for Collabora Multimedia, we reached out to the Dutch non-profit NLnet Foundation, who sponsored us to have Wim Taymans work on creating an echo cancellation framework for PulseAudio. The goal was to create the framework within PulseAudio to support pluggable echo cancellation modules, turn two existing open source echo cancellation solutions into plugins for this framework as examples and proof of concept, and hope that the availability of such a framework would encourage other groups or individuals to release better echo cancellation modules going forward.
When we started this work, the best existing open source echo cancellation system was SpeexDSP. Unfortunately SpeexDSP had a lot of limitations; for instance, it could not work well with two sound cards, which meant using your laptop speakers for output and a USB microphone for input would not work. Although we can claim no direct connection, as things would have it Google ended up releasing a quite good echo cancellation algorithm as part of their WebRTC effort. This was quickly turned into a library and plugin for PulseAudio by Arun Raghavan. And this combined PulseAudio and WebRTC echo cancellation system is what we will have packaged and available in Fedora 18.
Summary
So I have outlined a few of the challenges around having a good quality VoIP and video conferencing solution shipping out of the box on a Linux distribution. Some of the items, like the video codec situation and general stack stability, are not 100% there yet. There are also quite a few bugs in Empathy in terms of behaviour, but Debarshi is already debugging those, and with the help of the Telepathy and Empathy teams we should hopefully get those issues patched and merged before Fedora 18 ships. Our goal is to get Empathy up to a level where people want to use it to make VoIP and video calls, as that is also the best way to ensure things stay working going forward.
In addition to Debarshi, another key person helping us with this effort in the Fedora community is Brian Pepple, who is making sure we are getting releases and updates of GStreamer, Telepathy, Farstream, libnice and so on packaged for Fedora 18 almost on the day. This is making testing and verifying bugfixes a lot easier for us.
Future plans
There are also some nice-to-have items we want to look at going forward after having stabilized the current functionality. For instance, Red Hat and Xiph.org codec guru Monty Montgomery suggested we add a video noise reduction plugin to the GStreamer pipeline inside Empathy in order to improve quality and performance when using a low quality built-in web camera. [Edit: Sjoerd just told me the Gst 0.10 version of the code had such a plugin available, so this might not be too hard to resolve.]
Debarshi is also interested in seeing if we can help move the multiparty chat feature forward. But we are not expecting to be able to work on these issues before Fedora 18 is released.
Thx. Good overview!
Besides: I was the one who added the noise filter to the 0.10 Empathy and pointed out the audio conversation/performance problems, which should be fixed in gst 1.0.
I said it because the stars like Monty already have their names. The people who do “small” work should have some chance too :)
Nice to see this focus – and progress.
I’ve had great success using linphone, which despite its name is available on platforms other than Linux as well, such as Windows and even Android phones. Since one can make calls to IP addresses directly, the calls stay within our VPN, which is a nice extra for corporate communication…
Thanks for the overview, and thanks to everybody involved!
I am already using Empathy on Ubuntu 12.04 for making video calls with those in my family who are also running Linux. It has a few issues, but in general, it works quite well!
The bigger issue seems to be interoperability with Windows clients. I am not aware of any Windows client that supports XMPP video chat, except for Jitsi and GMail, both of which have their issues as well.
IMHO, the one technology that has the best chance of winning out in this realm is WebRTC, esp. as it will be delivered to a huge number of machines by Firefox and Chrome (I guess Chromium as well) supporting it by default in upcoming versions.
Of course, having that technology in there doesn’t mean that everyone can use it right away – we still need people to create (web) apps using that technology and build actual conferencing systems based on it.
> The other thing that has hounded us is the combination of non-free
> codecs and the need for interoperability.
This is particularly important in the context of WebRTC. There is an upcoming debate at the Atlanta IETF meeting in a few weeks on picking a Mandatory To Implement (MTI) video codec for WebRTC, and there are some vocal people who think this should be H.264 instead of VP8. Several proposals were posted to the mailing list this week (one for VP8 and three for H.264):
https://www.ietf.org/mail-archive/web/rtcweb/current/msg05462.html
Even if you don’t plan to interoperate with WebRTC, getting VP8 made MTI can affect which other systems deploy it, and so increase the number of people we can call successfully. If this is important to you, you should weigh in on the list (unlike many SDOs, the IETF allows anyone to participate and all important decisions are confirmed on the mailing lists, even if they are debated at in-person meetings).
Another big challenge for video calls is doing dynamic bitrate adaptation. This normally requires both sides to cooperate, which means that you need a standard that both sides implement. And there is currently no such standard. All proprietary software vendors treat their adaptation techniques as their highly proprietary value add. And with good reason: it’s entirely non-trivial to get right. I hope that with the rtcweb/webrtc effort, such protocols can be standardised.
It also requires encoders that are flexible enough and the Open Source encoder implementations are often lacking in that regard, although libvpx (Google’s VP8 encoder/decoder) has quite a few video-calling oriented features.
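To make the "both sides cooperate" point concrete, here is a minimal sketch of a sender-side control loop. It assumes the receiver periodically reports a packet loss fraction and the sender reacts with additive increase and multiplicative decrease. This is a generic textbook pattern, not any vendor's algorithm and not part of any standard; the function name, thresholds and step sizes are all invented for illustration.

```python
def next_bitrate(current_kbps, loss_fraction, *,
                 floor_kbps=64, ceiling_kbps=2000,
                 step_kbps=50, backoff_percent=70,
                 loss_threshold=0.02):
    """Return the next target encoder bitrate in kbit/s.

    All default values are illustrative assumptions, not tuned numbers.
    """
    if loss_fraction > loss_threshold:
        # Loss reported by the far end: back off multiplicatively.
        proposed = current_kbps * backoff_percent // 100
    else:
        # Network looks healthy: probe upwards additively.
        proposed = current_kbps + step_kbps
    # Keep the target inside what the encoder can usefully produce.
    return max(floor_kbps, min(ceiling_kbps, proposed))


# Simulated receiver reports: two clean intervals, one lossy, one clean.
reports = [0.0, 0.0, 0.10, 0.0]
rate = 500
history = []
for loss in reports:
    rate = next_bitrate(rate, loss)
    history.append(rate)

print(history)
```

Even this toy version only works if the receiver sends loss reports the sender understands, and if the encoder can actually change its bitrate on the fly, which are exactly the two gaps described above.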
I know it’s probably not high on the list, but is interoperability with Microsoft Lync being considered?
I think at the moment the important thing is to get one protocol up and running really well, and only once we have achieved that can we consider other protocols.
“when the RTP specification for that codec was written, the primary usecase in mind was RTSP streaming and not video conferencing, making the Theora RTP a bit hairy to use for video conferencing.”
Can you elaborate a bit on this? Most effort has been going into vp8, but the Theora RTP draft hasn’t made it to rfc yet, so if there are issues we can fix them.
I’m the maintainer of Farstream. The main issue with Theora is the stupidly large “configuration” string. SIP normally operates over UDP, meaning that you need to fit the entire message, including the whole SDP, in under 1400 bytes. The Theora configuration string can often be something like 8k bytes. And if I understand correctly, it is more or less hardcoded in the encoder library and you just want the freedom to change it? I suggest you select one, and just forget about the configuration string entirely. Put the resolution in each packet’s header and tada, no need to carry that annoying configuration line anymore.
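The size problem is easy to check with a little arithmetic. The sketch below is only illustrative: the SDP layout, the `configuration=` parameter carrying base64 data, the addresses and port, and the zero-filled 8 KiB blob are all assumptions standing in for a real offer, and the helper names are hypothetical. Only the 1400 byte budget and the roughly 8k configuration size come from the comment above.

```python
import base64

# Practical ceiling for a SIP message over UDP, as quoted in the comment above.
UDP_BUDGET_BYTES = 1400


def sdp_with_config(config: bytes) -> str:
    """Build a minimal, illustrative SDP body carrying a base64 config blob."""
    encoded = base64.b64encode(config).decode("ascii")
    lines = [
        "v=0",
        "o=- 0 0 IN IP4 192.0.2.1",   # documentation address, not a real host
        "s=-",
        "c=IN IP4 192.0.2.1",
        "t=0 0",
        "m=video 5004 RTP/AVP 96",
        "a=rtpmap:96 THEORA/90000",
        f"a=fmtp:96 configuration={encoded}",
    ]
    return "\r\n".join(lines) + "\r\n"


def fits_in_udp(sdp: str, budget: int = UDP_BUDGET_BYTES) -> bool:
    """True if the SDP alone fits in the budget (SIP headers would add more)."""
    return len(sdp.encode("ascii")) <= budget


# Stand-in for the codec configuration: 8 KiB of zeros, not real Theora headers.
placeholder_config = bytes(8 * 1024)
sdp = sdp_with_config(placeholder_config)
sdp_size = len(sdp.encode("ascii"))

print(sdp_size, fits_in_udp(sdp))
```

Base64 adds about a third on top of the raw bytes, so the SDP body alone ends up several times larger than the budget before any SIP headers are counted.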
Thanks for clarifying. We put the configuration string in the SDP for cases where it does go over TCP. The more natural thing is to inline it in the RTP stream, which of course has the same problem with packet loss. I take it there’s no mechanism for continued packets in SIP’s SDP-over-UDP?
> it is more or less hardcoded in the encoder library and you just want the freedom to change it?
Yes, including the freedom to change it to something that’s not hard-coded.
We discussed enumerating some static configuration strings, but didn’t want to do so without feedback from implementors. Would you be willing to construct theora info and comment packets to pass to the encoder based on SDP parameters if we defined a table with the setup packet data? The info header has things like the frame size which should vary from stream to stream, but can easily be described by human-readable SDP. It’s the tables in the setup header which occupy most of the space, and which don’t change much for current encoder implementations.
Yes, we can definitely do that in the RTP payloader/depayloader. We have significantly more complicated payloaders for other common formats like H.264 and JPEG. It probably also makes sense to add functions to libtheora to convert those parameters to the headers.
Actually, it makes everyone's life much easier if you entirely avoid the stream description in the SDP, and put all of that in the RTP header. That said, the current draft is widely implemented enough that you probably need to come up with a different encoding-name than THEORA… maybe THEORA-SANE or something!
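As a rough sketch of what that could look like on the receiving side: parse a few human-readable SDP parameters, then look the bulky setup data up in a table both ends already have. Nothing here is from the Theora RTP draft or from libtheora. The parameter names (`width`, `height`, `setup-id`), the table contents and every function name are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical table of pre-agreed setup headers. The ids and the placeholder
# bytes are invented; a real table would hold actual setup packet data.
SETUP_TABLE = {
    1: b"<setup header, table 1>",
    2: b"<setup header, table 2>",
}


@dataclass(frozen=True)
class StreamConfig:
    width: int
    height: int
    setup: bytes


def parse_fmtp(value: str) -> dict:
    """Split 'key=value; key=value' style parameters into a dict."""
    params = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        params[key.strip()] = val.strip()
    return params


def config_from_fmtp(value: str) -> StreamConfig:
    """Rebuild the stream configuration from short SDP parameters."""
    params = parse_fmtp(value)
    setup_id = int(params["setup-id"])
    if setup_id not in SETUP_TABLE:
        raise ValueError(f"unknown setup table id: {setup_id}")
    return StreamConfig(
        width=int(params["width"]),
        height=int(params["height"]),
        setup=SETUP_TABLE[setup_id],
    )


cfg = config_from_fmtp("width=640; height=480; setup-id=1")
print(cfg.width, cfg.height, len(cfg.setup))
```

The point of the design is that the SDP only has to carry a few dozen bytes, while the large tables never travel over the signalling channel at all.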