drivers – Dan Williams’ blog

The Incredible Magical Pantech UML290

The Basics

Along with the LG VL600 this modem was the launch device for the Verizon 4G LTE network late last year. Despite being quite large (over twice the size of a normal 3G modem) it’s not a bad device and performs quite well in speed tests. Inside is a Qualcomm MDM9600 chipset providing both CDMA 1xRTT and EVDO on the standard North American 850 MHz Cellular and 1900 MHz PCS bands, and LTE on Verizon’s Upper 700 MHz C-block band. This device cannot roam internationally.

Linux Support

The UML290 exposes four USB interfaces: a standard CDC-ACM AT command port which supports PPP, a QCDM port, a WMC port, and a raw IP network port. Of these, only the AT command and the QCDM ports are really usable in Linux. You can connect to the LTE network using standard ETSI 27.007 GSM-style AT commands like AT+CGDCONT and ATD*99# and such. Connections to the 3G EVDO network can be made with the standard ATD#777 command. Unfortunately, the PPP functionality does not support data connection handoff between the EVDO and LTE networks, so you have to break the connection and reconnect with the appropriate ATD command when necessary. Why is that?

To allow seamless operation between the EVDO and LTE networks Verizon upgraded parts of their core network to eHRPD. HRPD (High Rate Packet Data) is the new name for HDR (High Data Rate) which was the old name for the IS-856 standard developed by Qualcomm ten years ago for high speed 3G packet data. EVDO (Evolution Data Only) is just the marketing name for all that. eHRPD stands for “evolved” or “enhanced” HRPD and essentially drops in pieces of the LTE core network modified to work with older EVDO protocols. Normally your device uses the eHRPD protocol when starting a data session since both the network and the modem support it. But when you use traditional CDMA PPP via ATD#777 the session is between pppd on your computer and the packet data gateway in the network, in contrast to GSM/WCDMA/LTE where the PPP session is only between pppd and the modem itself, not over the air. My theory here is that to maintain backwards compatibility or for some other reason, PPP data sessions using ATD#777 only allow HRPD, and thus handoffs between EVDO and LTE don’t work because the LTE side doesn’t like the older HRPD.

This leads to the problem where you, as the user, have to poke values into the NV_HDRSCP_FORCE_AT_CONFIG_I NVRAM item to manually switch between HRPD and eHRPD just to get connected. Why does this matter? Because the only way to connect to the EVDO network on Linux is with a direct PPP data session using ATD#777. That sucks.

All Hail WMC (wait, what?)

Hardware often makes me want to dress all in black, sit at the end of the bar, drink, and cry. Often Matthew Garrett is right there with me so at least I have company on my trip to black, black oblivion. The hope is that talking to the UML290 on the WMC port and using the modem’s native network interface makes this stupid handoff problem just go away because the modem firmware takes care of the data session protocols and handoffs when you’re not using direct PPP. But that means that we need to reverse engineer both the WMC protocol and the network interface. I’ll drink to that.

It turns out the network interface appears to just be passing raw IP packets over USB. At least that’s what the Windows USB traces tell me unless I’ve had too much Jacky D in which case they just look like Care Bears and rainbows. Qualcomm posted some driver patches for the “smd_rmnet” driver for Android devices that describe a “raw IP” mode for RMNET interfaces that lead me to believe I’m on the right track here. We’ll see.

The WMC bits are the best part though. This Pantech-specific (as far as I can tell) protocol has been around since at least 2005: I’ve got an Audiovox PC5740 that uses it and a Pantech PX-500 on Sprint that looks similar yet different. WMC is just another binary protocol; essentially encoding structs on the wire but with a bunch of stupid at the front and some idiot at the end. It’s got a frame start marker of 0xC8, except when there’s more shit at the front. It’s got a frame terminator of 0x7E, except when it doesn’t. It gets HDLC escaped, except when even control characters get escaped instead of just the escape characters. It’s got standard command numbers, except when it doesn’t.

The basic WMC frame starts with 0xC8. The PC5740 and the PX-500 both accept plain WMC requests like this. The UML290 on the other hand uses just about the most convoluted format I can think of. I’d really love to know why. I hope there’s a good reason. Instead the Verizon connection manager sends the WMC packet prefixed with “AT*WMC=”, then 0xC8, and then a bunch of binary data. And not only are the HDLC escape characters escaped, all control characters under 0x20 are escaped too. Even better, the request terminates with a 0x0D instead of the standard 0x7E. So you end up with something looking like this:

41542a574d433dc87d2a87b80d

and when all the framing and shit is removed, it comes down to a single byte: 0x0A. That’s it. Really. Why is this so hard? It’s USB for crying out loud. We’re not on serial links anymore where if somebody picks up the telephone downstairs you get a bunch of garbage in your XMODEM transfer.
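For illustration, here is a minimal Python sketch that takes the request above apart. It assumes the usual HDLC convention (escape byte 0x7D, escaped byte XORed with 0x20), which is consistent with the 7d 2a pair in the sample, and it assumes the two bytes before the terminator are the checksum. The function names are made up for this example and do not come from any Pantech or Verizon tool.

```python
AT_PREFIX = b"AT*WMC="   # ASCII prefix the connection manager sends first
FRAME_START = 0xC8       # WMC frame start marker
TERMINATOR = 0x0D        # the UML290 ends requests with CR instead of 0x7E
ESCAPE = 0x7D            # assumed: standard HDLC escape byte
ESCAPE_XOR = 0x20        # assumed: escaped bytes are XORed with 0x20


def unescape(data: bytes) -> bytes:
    """Undo HDLC-style escaping: 0x7D followed by (byte ^ 0x20)."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESCAPE:
            if i + 1 >= len(data):
                raise ValueError("dangling escape byte")
            i += 1
            byte = data[i] ^ ESCAPE_XOR
        out.append(byte)
        i += 1
    return bytes(out)


def decode_uml290_request(frame: bytes):
    """Return (payload, crc_bytes) for one AT*WMC= request."""
    if not frame.startswith(AT_PREFIX):
        raise ValueError("missing AT*WMC= prefix")
    if frame[-1] != TERMINATOR:
        raise ValueError("missing 0x0D terminator")
    body = unescape(frame[len(AT_PREFIX):-1])
    if len(body) < 3 or body[0] != FRAME_START:
        raise ValueError("missing 0xC8 frame start marker")
    # Assumed layout: start marker, payload, two trailing checksum bytes.
    return body[1:-2], body[-2:]


payload, crc = decode_uml290_request(
    bytes.fromhex("41542a574d433dc87d2a87b80d"))
print(payload.hex(), crc.hex())
```

Run against the sample request, this prints `0a 87b8`: the one-byte payload and the two checksum bytes that followed it on the wire.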

It gets better. There’s a CRC-16 at the end, which is pretty standard with these sorts of binary modem protocols. Qualcomm writes the original firmware for all these modems anyway and they all include a Qualcomm DIAG port which speaks a protocol using the standard HDLC framing with CRC-16 (polynomial 0x8408 and seed of 0xFFFF) and a frame terminator of 0x7E. So you’d think they’d re-use those bits. THINK AGAIN. Perhaps because they woke up one day and decided to make life hard for everyone on the planet, the Pantech engineers working on the UML290 decided to use a CRC-16 initial seed of 0xAAFE. What the fuck? Even the PC5740 and the PX-500 use a standard HDLC CRC-16 seed of 0xFFFF like just about everything else on the planet.
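To show how small the difference really is, here is a bitwise CRC-16 sketch in Python using the reflected polynomial 0x8408 and a selectable seed. The final bit inversion, the little-endian byte order, and the choice to checksum the 0xC8 marker together with the payload are assumptions rather than documented facts, but with them the sample request's checksum comes out as the 87 b8 seen in the hex dump earlier.

```python
CRC16_POLY = 0x8408  # reflected CCITT polynomial named in the text


def crc16(data: bytes, seed: int = 0xFFFF) -> int:
    """Bitwise CRC-16 (LSB first) with a configurable initial seed."""
    crc = seed
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC16_POLY
            else:
                crc >>= 1
    # Assumed: the result is inverted before being put on the wire.
    return crc ^ 0xFFFF


# Same two bytes (start marker + command), two different seeds.
request = bytes([0xC8, 0x0A])
standard = crc16(request)            # seed 0xFFFF, like the DIAG port
uml290 = crc16(request, 0xAAFE)      # the UML290's odd seed
print(uml290.to_bytes(2, "little").hex())
```

The only thing that changes between the two variants is one constant, which is what makes the non-standard seed so baffling.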

But it gets better. The responses from the UML290 don’t bother to include a valid CRC-16; instead it’s just 0x3030. Wow, class work guys. I’m sure there’s a good reason for that. Or not. At least the PC5740 and PX-500 get points for valid CRCs.

Which begs the question: why do people still use these serial protocols? Every other piece of USB-connected wireless hardware I’ve seen, from WiFi devices to WiMAX cards, doesn’t bother with this serial framing shit at all. Even for firmware uploads. They just push packed structs up and down the wire. USB already has a 16-bit CRC check for data packets. Let’s re-invent the wheel for no good reason just because it’s fun.

Why do mobile broadband modems have to be different? Why all the framing and escaping and general eye gouging with shards of broken glass? Why duplicate what USB already does? If your modem doesn’t use USB, doesn’t that protocol already have integrity protection and error checking? Cause if it doesn’t you’re already in for a world of hurt.

As an embedded engineer you just have to wake up one morning and say “This is fucking stupid.” But I suppose that’s not something a 6-month product cycle allows. Which is why, as open-source engineers that have to talk to hardware, we tend to drink. And then cry a lot.

I’ll Take All 4 Gs

And where did every single one of those Gs come from? WiMAX of course. If you’re one of the 13 million WiMAX users or you pack an EVO 4G you know about WiMAX, but you might still be interested to know how every one of those mind-blowing Gs showed up in your life. Yeah, there’s LTE too, and we’ll save that for another post as the networks and hardware are both quite new.

History to the MAX

wimax-area — Home, office, park, wherever... (credit: Intel)

The WiMAX committees got organized when your desktop looked like this or maybe even this. Seriously, that’s a GIF file and the dude is playing a MUSH. Yeah that’s right, around the time you were totally into Backstreet and there were only 1 or 2 Gs playing with your heart, not 4. Its rise to popularity began in 2005 with Mobile WiMAX (IEEE 802.16e-2005), a set of enhancements that allowed roaming and hand-off between base stations on the network. That’s when WiMAX became a viable cellular technology instead of one just for “Last Mile” wireless Internet access. While WiMAX is deployed by more than 200 operators, most people know it through Clear, Sprint, or Yota. Nokia even built WiMAX into the n810 two years ago (then canceled it 2 months later and pretended it didn’t exist). But in late 2008 (when your desktop may have looked like this) you could easily get Mobile WiMAX on your laptop thanks to…

Halfway to Open-Source Nirvana

intel5x50 — The 5150 and 5350 WiFi/WiMAX combo cards (credit: Intel)

Intel! Around then Intel busted out the 5150 and 5350 WiFi/WiMAX combo cards, letting you connect to 802.11n WiFi and WiMAX networks all using the same cheap add-in module. Tons of laptops started offering WiMAX as a build-to-order option. It costs about $40 to add WiMAX, without a contract commitment; contrast that to the couple hundred dollars a good 3G dongle costs in many countries. Not a Huawei or ZTE prepaid thing from 3UK, but a solid well-built-and-engineered Sierra, Novatel, Option, or Ericsson part.

Amazingly, long before the devices actually shipped Intel’s Inaky Perez-Gonzalez pushed the WiMAX driver for these cards into the kernel. The firmware was reasonably licensed too! Rock on Intel. But… the userspace parts necessary for performing authentication with the WiMAX network were a closed, binary-only hacked up copy of wpa_supplicant. Worse, it was 32-bit only, and since it was closed you couldn’t rebuild it natively for 64-bit systems. Yet worse, the code is complete crap and includes re-implementations of lists, queues, and crypto. But at least it existed.

After almost two years of whining from me and others, Intel finally rewrote the binary pieces and in June 2010 released the public, open authentication code, giving us a completely open and re-distributable WiMAX stack for Intel devices. I did half of the 64-bit architecture port last fall, and the other half was recently completed by other rocking devs. As of 2011, the full Intel WiMAX stack works on 32 and 64-bit x86-compatible machines. Big-endian support is in-progress.

Fashionably Late to the 4G Party but still Smoking Hot

Since I’ve been paying $45 a month to Clear for over a year, and my USB dongle decided to commit suicide, I spent some of the holidays finishing up the NetworkManager WiMAX support that Tambet Ingo started. Then I merged it to master because it worked and you know, that’s what we care about. NM 0.9 will ship with full support for Intel WiMAX devices. And it looks like this:

Just pick a network. Or maybe you’re already connected because NetworkManager did the right thing. It’s really that simple. The applet’s Connection Information dialog shows all the details, including dynamically updating signal strength and the connected base station ID:

applet-ci — I love me some CINR and BSID

Since everyone loves the command line, we’ve got nmcli and nm-tool support as well. You don’t ever have to leave the terminal. You even get more information via the terminal than you do the applet, because we know that’s what you like. If you were ever interested in the center frequency your WiMAX card is using, you’ll think these pics are hot:

So it’s almost all there. Yeah, there’s a bit of the connection editor to finish up, but that’s not hard and it’ll get done. But Intel devices aren’t the only ones out there, and not everyone can grab a 5350 and drop it into their machine.

Half Tragedy, Half Drama

Most external WiMAX USB devices sold in the US are based on the Beceem BCS250 chipset. For a long time the open-source community had nothing, but after almost two years of hiding in the shadows we’ve got a driver: Stephen Hemminger somehow found it and pushed it to Greg as a staging driver late last year. But it’s still ugly, needs help and cleanup, and ideally the Intel wimax network service can be modified to work with these devices so we don’t have to run different userspace daemons for different hardware. If you’re willing to help, most of these devices go for about $30 on Ebay.

Next Up

We’ll chat about LTE and the fun that is the Pantech UML290. It’s great. Really. Unless you’re a developer or a user, which is most of us. Just because it’s fast doesn’t mean it’s a pleasure to work with. Whee!

Not a Jackass Episode #1

Donkey with a circle and slash — Straight from the horse's mouth (via Lamerie)

Why WEXT Sucks Episode #52,334

The world only needs a few jackasses and I’d like to think I’m not one of them. So instead of being a jackass and making fun of people who bought the wrong hardware, tonight I’m going to throw a bone to everyone who mistakenly bought a Broadcom WiFi card thinking that Broadcom cares about open-source and that any bugs you had with their binary driver would be fixed in a timely manner.

In a great example of how WEXT is underspecified, the frequency returned from SIOCGIWFREQ has been interpreted to mean one of two things depending on the driver you have. Some drivers report the associated channel, while others report the tuned channel. Of course during a scan the card tunes to a bunch of different channels. So when you hit up SIOCGIWFREQ you have no idea what the card is going to report.

Some configurations use the same BSSID/SSID combination on different bands. Thus we need to know what the associated frequency is so we can match up the exact AP the card’s talking to with an entry in the scan list. Otherwise the scan list doesn’t represent any sort of reality, and that’s not a good thing. If the card reports the tuned frequency when it’s background scanning or finding a better roaming candidate then the match will fail.
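A minimal Python sketch of that matching problem follows. The `ScanEntry` type, the `find_associated_ap` helper, and the sample addresses are hypothetical and are not NetworkManager code; they only show why the lookup needs the associated frequency rather than whatever channel the radio happens to be tuned to.

```python
from typing import List, NamedTuple, Optional


class ScanEntry(NamedTuple):
    """One access point from the scan list (hypothetical structure)."""
    bssid: str
    ssid: str
    freq_mhz: int


def find_associated_ap(scan_list: List[ScanEntry], bssid: str,
                       freq_mhz: int) -> Optional[ScanEntry]:
    """Match on BSSID *and* frequency, since the BSSID alone is ambiguous."""
    for entry in scan_list:
        if entry.bssid.lower() == bssid.lower() and entry.freq_mhz == freq_mhz:
            return entry
    return None


# Same BSSID/SSID advertised on both bands.
scan_list = [
    ScanEntry("00:11:22:33:44:55", "cakehole", 2412),
    ScanEntry("00:11:22:33:44:55", "cakehole", 5180),
]

# Driver reports the associated frequency: the right entry is found.
good = find_associated_ap(scan_list, "00:11:22:33:44:55", 5180)

# Driver reports the channel it is currently scanning: nothing matches.
bad = find_associated_ap(scan_list, "00:11:22:33:44:55", 2437)
print(good, bad)
```

The second lookup returning nothing is how a perfectly healthy connection can end up looking like a roam to "(none)".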

Tossing the Bone

What’s the only thing more common than a dual-band single BSSID/SSID network configuration? If you guessed “drivers which make talking to that network hard” then you win a big wet donkey kiss from an ugly goddamn donkey. So in complete violation of my Fix the Stupid Drivers Instead of Hacking Around Them policy I’ve checked a fix into NetworkManager that handles this situation better. If you ever saw:

NetworkManager[666]: <info> (wlan0): roamed from BSSID 11:22:33:44:66 (cakehole) to (none) ((none))

then I just fixed 15% of your problem. You’re welcome. The other 85% is your proprietary driver. The real fix for this is to use the much more capable nl80211/cfg80211 kernel interfaces instead of WEXT. That still doesn’t help all you proprietary driver users out there, because Broadcom pretty much ignores upstream kernel wireless advances. So next time spend another $5 and make your life easier by getting an Intel or Atheros card instead.

Mobile Broadband and Qualcomm Proprietary Protocols

padlock — NO PROTOCOL FOR YOU (via bassclarinetist)

There are two major mobile broadband technology families: GSM/UMTS (which three quarters of the world uses) and CDMA/EVDO (used by the rest). Keep in mind that UMTS uses CDMA as the radio technology, but incompatibly from CDMA/EVDO.

Back to School

GSM is a TDMA (Time Division Multiple Access) technology; communication is divided into a number of slots in which specific devices talk. Each slot contains voice, data, or signalling information. When it’s not your turn, you can’t talk. Pretty simple, but given that it’s a TDMA technology, it’s prone to multipath interference and hard capacity limits. You also have to carefully plan out your cell layout to ensure that adjacent towers don’t use the same frequency.

CDMA, on the other hand, is an ingenious spread-spectrum technology. It’s got a great back story with movie stars and a war and stuff. In contrast to GSM, in a CDMA system every user talks at the same time. Each user is given a unique sequence of zeros and ones called a “spreading code” which is used to modulate the data stream over a certain frequency range (hence the spread-spectrum part). On the receive side, when you know a user’s spreading code you apply it to the RF signal and retrieve the original data. Each user in the cell just sees every other user’s signal as slightly increased background noise. This is why CDMA is extremely robust against snooping and multipath interference, and why its capacity gracefully degrades as cell utilization increases.
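The spreading-code idea can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: two users, perfectly orthogonal four-chip codes, bits mapped to +1/-1, and no noise. Real systems use much longer codes that are not perfectly orthogonal, which is where the "background noise" effect comes from. All names here are invented for the example.

```python
def spread(bits, code):
    """Map each bit to +1/-1 and multiply it by the user's chip sequence."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * chip for chip in code)
    return chips


def despread(signal, code):
    """Correlate each chip-length window with the code to recover the bits."""
    n = len(code)
    bits = []
    for start in range(0, len(signal), n):
        window = signal[start:start + n]
        correlation = sum(s * c for s, c in zip(window, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


# Two orthogonal codes: their element-wise product sums to zero.
CODE_A = [1, -1, 1, -1]
CODE_B = [1, 1, -1, -1]

bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]

# Both users transmit at the same time; the air simply adds the signals.
on_air = [a + b for a, b in zip(spread(bits_a, CODE_A), spread(bits_b, CODE_B))]

print(despread(on_air, CODE_A), despread(on_air, CODE_B))
```

Each receiver recovers its own user's bits from the combined signal because the other user's contribution cancels out in the correlation.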

What about Qualcomm?

Qualcomm holds many of the patents on CDMA since they spent a ton of time and money turning CDMA into a viable cellular radio technology 20 years ago. They are also one of the largest sellers of cellular chipsets in the world. We as open-source developers have to care, because their stuff shows up in tons of the devices we support. Users don’t like being told “no”.

Most mobile broadband devices (Qualcomm’s included) appear as USB interfaces providing two or more serial ports. One port is usually AT-command capable. If you’re lucky, you get a secondary AT-capable port to use for signal quality and status while the primary port is using PPP for data transmission. Most GSM/UMTS modems have a second AT port. Most CDMA modems do not.

So when your device only has one AT-capable port, what language do the other ports speak?

Proprietary Protocol #1: QMI

This protocol is found on newer Qualcomm chipsets like the MSM7k series that show up in Android handsets and Qualcomm Gobi data cards. Google exposed some of the QMI protocol in the Android drivers. Other details have recently turned up through the Gobi Linux driver sources, though Qualcomm doesn’t distribute sources for the “QCQMI DLKM” that probably contains the protocol mechanics. It shouldn’t be too hard to reverse-engineer most of the protocol given these sources and a USB sniffer, but nobody has had the time yet. QMI uses an HDLC-type framing which is quite common in proprietary mobile broadband protocols: a CRC-16 and 0x7E terminates a frame, and the frame is escaped such that 0x7E doesn’t show up in the data. But since we haven’t reverse-engineered QMI yet, it isn’t the main focus of this post.
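The HDLC-type framing described above can be sketched generically in Python. This is not taken from any Qualcomm source; it assumes the conventional details of this style of framing: a CRC-16 with reflected polynomial 0x8408 and seed 0xFFFF, inverted and appended low byte first, with 0x7D as the escape byte and escaped bytes XORed with 0x20. The function names are hypothetical.

```python
def crc16(data: bytes) -> int:
    """CRC-16, reflected polynomial 0x8408, seed 0xFFFF, inverted result."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


def hdlc_encapsulate(payload: bytes) -> bytes:
    """Append the CRC, escape reserved bytes, then add the 0x7E terminator."""
    raw = payload + crc16(payload).to_bytes(2, "little")
    out = bytearray()
    for byte in raw:
        if byte in (0x7D, 0x7E):
            # Reserved bytes never appear literally inside the frame.
            out += bytes([0x7D, byte ^ 0x20])
        else:
            out.append(byte)
    out.append(0x7E)
    return bytes(out)


print(hdlc_encapsulate(b"\x00").hex())
print(hdlc_encapsulate(b"\x7e\x01").hex())
```

Under these assumptions a single zero byte is framed as `0078f07e`, and a payload containing 0x7E goes out with that byte replaced by `7d5e`, so the terminator stays unambiguous.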

Proprietary Protocol #2: DM

Diagnostic Monitor is an older protocol found in most Qualcomm devices. I’ve been interested in QCDM for a while, since without it, you can’t get signal strength and status from most CDMA devices while connected. So I’ve been trawling the web for the past couple years looking for anything related to QCDM, and I finally hit the jackpot last fall: the GPL sources for the Sprint-branded Linksys WRT54G3G-V2 router, which have since disappeared. They include a GPL-licensed tool called ‘nvtlstatus’ which implements various pieces of the QCDM protocol. The code is complete junk (as you’d expect from many embedded device manufacturers with schedules to hit) but it worked.

There’s also a sketchy Chinese package called “CDMA_Test.rar” that includes lists of the NVRAM items and some of the DM command numbers. While not GPL, we can use the command numbering and structure definitions because it falls under the phonebook and interoperability copyright exceptions. Additionally, there’s the TCL-based (ick) “RTManager” tool that implements some interesting QCDM commands, which, while we can’t use any of the code, is useful for structure field names that I hadn’t already guessed. Finally, some guy did some reverse engineering of Novatel devices on Windows and built up a list of commands, subsystems, and NVRAM locations that were useful for confirming what I found in the other sources.

So through a combination of reverse engineering and these sources I wrote libqcdm, which we now use extensively in ModemManager for controlling CDMA devices.

DM Commands

Since DM is a pretty old protocol (2000 and possibly earlier), many of the commands are purely historical and currently unused. The most interesting ones are:

DIAG_CMD_VERSION_INFO: grabs firmware build dates and version information
DIAG_CMD_ESN: grabs the CDMA device’s ESN, which is essentially the IMEI of a CDMA device
DIAG_CMD_NV_READ and DIAG_CMD_NV_WRITE: NVRAM read/write commands, see below
DIAG_CMD_SUBSYS: subsystem commands; see below
DIAG_CMD_STATUS_SNAPSHOT: gives information about the current state and registration of the device on the CDMA 1x network

But given that many aren’t really used anymore, Qualcomm started running out of command IDs a long time ago…

Subsystems

So Qualcomm used command 75 (DIAG_CMD_SUBSYS) to extend the number of available commands; this command takes a subsystem selector and a subsystem command ID, thus getting around the original 8-bit command ID limitation.
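As a rough sketch, a subsystem request header could be packed like this in Python. Only the command number 75 comes from the text; the field widths (one byte for the subsystem selector, two little-endian bytes for the subsystem command ID), the helper name, and the example IDs are assumptions made for illustration.

```python
import struct

DIAG_CMD_SUBSYS = 75  # command number given in the text


def build_subsys_request(subsys_id: int, subsys_cmd: int,
                         payload: bytes = b"") -> bytes:
    """Pack a subsystem request (assumed layout, before CRC and framing).

    Assumed fields: u8 command, u8 subsystem selector,
    u16 little-endian subsystem command ID, then any payload.
    """
    header = struct.pack("<BBH", DIAG_CMD_SUBSYS, subsys_id, subsys_cmd)
    return header + payload


# Illustrative IDs only; they are not taken from the post.
request = build_subsys_request(subsys_id=0x0F, subsys_cmd=0x0102)
print(request.hex())
```

With a 16-bit command ID per subsystem, the command space grows from 256 entries to 65,536 per subsystem, which is the whole point of the trick.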

There are a number of standard subsystems (Call Manager, HDR Manager, WCDMA, GSM, GPS, etc) but each manufacturer generally implements their own subsystem too. In this way QCDM isn’t that different from AT commands; while supposedly standardized, each manufacturer inevitably implements a bunch of proprietary commands for their own device because the specs simply don’t cover everything. This just makes our life harder.

The currently identified subsystems are:

Call Manager: the most important command here reports the general state of the device, including the registered SID/NID, the terminal state (online/offline), the network mode (2G/3G), and various preferences that control which network the mobile registers with. This is what we use to determine online/offline mode for CDMA devices since there aren’t any “standard” AT commands we can use to detect both 1x and EVDO registration. Other commands start and end voice or data calls.
HDR (High Data Rate, ie EVDO): the most important command here provides EVDO state, which is mostly taken from the state machines specified in the IS-856 standard. This lets us figure out if the modem is registered on the EVDO network or the CDMA 1x network.
Novatel: only implemented on Novatel Wireless devices, obviously. But it provides access to a lot of stuff we want: the Extended Roaming Indicator (ERI) which shows detailed roaming state, the current access-technology the device is using (AMPS, digital, IS-95, CDMA 1x, EVDO r0, EVDO rA, etc), the voice mail and SMS indicators, and more.
ZTE: for ZTE devices, obviously. I actually did reverse engineer this one using a ZTE AC2726 kindly provided by Huzaifas S. from Red Hat India. All we’ve got so far is the signal strength; the other fields of the command are unknown.

There are also GSM and WCDMA subsystems used with Qualcomm UMTS chipsets, but since most UMTS devices have multiple AT-capable ports we’re less interested in using QCDM there.

NVRAM Locations

Each device has a number of NVRAM locations in which it stores various parameters like mode preference, roaming, home networks, radio parameters, and a whole bunch of other stuff. Not all devices implement every location. I’ve only included the locations that we actually use in libqcdm, but there are a couple thousand. The ones we currently use are:

DIAG_NV_MODE_PREF: sets the mode preference: analog (ie AMPS), digital (TDMA), CDMA 1x, or EVDO (HDR)
DIAG_NV_DIR_NUMBER: retrieves your Mobile Directory Number (MDN), aka your phone #
DIAG_NV_ROAM_PREF: controls whether your device will roam on a partner network or not

The values each contains took a bit of time to reverse-engineer using the Sprint connection manager, 3 different Sprint CDMA cards, and some USB traces, but now we’ve got the important parts.

Pulling It All Together

Earlier this year we had a number of bugs from Russian, Indian, and Czech Fedora users where ModemManager simply wouldn’t connect. MM is pretty clever (a good thing) but the IS-707 AT commands aren’t useful enough to tell us what we need (not good). The IS-707 standard AT+CAD? and AT+CSS commands really apply to the CDMA 1x network, not the EVDO network, and all these users had EVDO-only plans. So when ModemManager checked AT+CSS and found that the device wasn’t registered, we sat around polling the registration state for a while. The modem was already registered on the EVDO network, but not on a CDMA 1x network; of course AT+CSS doesn’t tell us that so MM got it wrong.

The real fix was to utilize QCDM and ask the Call Manager whether the modem was online or not, and if so, whether it had a 1x or an EVDO connection. Sounds simple, but it took a lot of work to get there.

Next, since most CDMA devices only expose one AT-capable port, we need a way to get signal strength from the device while it’s connected and the primary port is talking PPP. I’ll cover that in another blog post; stay tuned. We still don’t have a good way to figure out which EVDO revision (either 0 or A) we’re using, nor can we get a reliable roaming indicator yet.

All of this is built in Fedora 12, 13, and rawhide if you’d like to take it for a spin.

The Kernel Side

Many devices provide the AT port via the standard CDC-ACM serial mechanism, which is picked up automatically by the kernel drivers. But their QCDM-capable ports are only exposed via vendor-specific USB interfaces, so I created the qcaux driver to handle these ports; it’s in the 2.6.34 kernel. With qcaux.ko and a recent version of ModemManager stuff will Just Work.

Why You Care

First a big shout to Qualcomm for keeping this shit secret. NOT. Double-plus-shout-out for keeping QMI secret; it’s a pretty simple protocol and there’s not much there worth keeping under wraps. It might be nice to let open-source developers actually talk to your hardware.

With that out of the way, you care because we now have better support for a whole bunch of mobile broadband devices. We even have support for CDMA signal strength while connected for the vast majority of CDMA devices that only expose one AT port. I’ll talk about that later, since it’s quite an interesting story.

Why Sierra Wireless Rocks and Qualcomm Doesn’t

Buy Sierra stuff. It’s top quality and they actually care about open-source, unlike Qualcomm’s mobile broadband division. Last year I initiated a dialogue with Sierra about releasing some details of their proprietary Command and Status (CnS) protocol. Being able to talk CnS to their modems gets us a lot that AT commands and even QCDM don’t provide, like roaming indicator, access technology, and RSSI.

And guess what? They actually listened, did the work, and put the documentation under a Creative Commons license too. I hear it’ll show up soon on their support site if it’s not there already (document #2131024, “CDMA 1xEV-DO CnS Reference”).

Sierra rocks. Now if only Qualcomm would do it too…

Few Surprised at New Evidence of Staging Driver Suckage

Thomas Johnson (High School Janitor)

“Oh yeah, I’ve seen that code. It’s worse than what I clean up in the bathrooms after Prom or Homecoming. The kids get high and drunk and party too hard and puke all over the place. I deal with enough vomit from 7:30 to 6; I wouldn’t touch the staging drivers with a mop twice as long as the one I have at work.”

Just Say No

Thomas just found out that none of the “staging” wifi drivers will work with hidden access points because they don’t set the IW_SCAN_CAPA_ESSID capability bit. Furthermore, the most popular “staging” drivers (for the Ralink hardware used in many netbooks) don’t even have specific SSID scanning capability at all.

Why do you care? Hidden APs don’t broadcast their network ID, which misinformed people think is more secure (hint: it’s not). Before a driver can associate to the network, it needs to discover available APs and capabilities, which requires a probe-request, which exposes the network ID to everyone anyway. But that requires driver support which none of the staging drivers have.

I fixed this issue upstream two years ago by adding IW_SCAN_CAPA_ESSID to Wireless Extensions. Of course the staging WiFi drivers that many distros enable never got fixed because the vendor it came from didn’t bother to work with the community in the first place. And people wonder why they don’t work.
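The capability check itself is tiny. Here is a Python sketch of the decision a connection manager has to make; `IW_SCAN_CAPA_ESSID` is the bit defined as 0x01 in linux/wireless.h, while the function name and the `scan_capa` argument (standing in for the driver's reported scan capability mask) are hypothetical.

```python
IW_SCAN_CAPA_ESSID = 0x01  # bit value from linux/wireless.h


def can_find_hidden_ap(scan_capa: int) -> bool:
    """True if the driver advertises SSID-specific (probe) scanning.

    Without this bit there is no way to ask the driver to probe for a
    particular hidden network, so the AP never appears in scan results.
    """
    return bool(scan_capa & IW_SCAN_CAPA_ESSID)


# A driver that sets the bit (alongside other capability bits)...
print(can_find_hidden_ap(0x01 | 0x04))
# ...and one that reports no scan capabilities at all.
print(can_find_hidden_ap(0x00))
```

A driver reporting a mask without that bit gets treated as unable to find hidden networks, no matter what the hardware could do.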

Broadly speaking, staging WiFi drivers come in two flavors: (a) old dried gum from under the cafeteria table (drivers with a future), and (b) fresh vomit from the hung-over kid in your math class (those without a future).

The drivers with a future (winbond, rtl81xx) are or will be based on the kernel-standard mac80211 wireless stack, which implements the 802.11 WiFi specification in the kernel. Since they use the standard mac80211 stack, they get all these nice features like probe-scanning and the correct capability bits for free. All you have to do is work on supporting the hardware itself.

The drivers without a future (rt2860, rt2870, rt3070, rt3090, wlan-ng, vt665x) are based on forks of the ancient ieee80211 stack that Intel’s ipw2x00 drivers forked from the hostap driver. Each of these drivers includes their own copy of the core ieee80211 stack forked at different times and with different hacks. When a bug shows up, that means 4x the work, and 4x the chance for the fix to slip through the cracks. Which is why these drivers have no future. They are a maintenance nightmare. Besides, they have crap like this:

pAdapter->StaCfg.bScanReqIsFromWebUI = TRUE;

It just blows my mind why people think staging wifi drivers are a great idea. There’s a reason staging drivers set the TAINT_CRAP flag in your kernel; because that’s what they literally are.

So what’s the right thing to do?

There’s one huge reason why dead-end staging drivers are a bad idea: there aren’t enough developers. So do you spend that effort on maintaining unmaintainable shit code? Or do you spend it on fixing the code that has a future? Most of the time you can’t do both.

If you choose to maintain the staging drivers, then things become worse over time since the staging code is simply less tested and less maintainable. So you continue to drop hacks and fixes onto an ever-growing steaming pile of manure. Nobody cares much about the driver (because it doesn’t use the standard kernel interfaces and thus doesn’t have a future), so your staging driver never benefits from all the great feature work and bug fixing that the mac80211 and wireless developers are doing.

But if you choose to help fix the upstream drivers that do use mac80211 (like rt2x00), and thus have a future, maybe for a few months some users won’t have great wireless. But they didn’t before either. But then 6 months later, all the users get great wireless with features like power saving, background scanning, WiFi Direct, Bluetooth 3, access-point mode, etc. Those things will never be done to the staging drivers, because those drivers are a dead-end maintenance nightmare, because their code is awful, and because they don’t use the standard kernel wireless stack.

I know I’d invest the effort where it helps users the most, even if it means a few more months of subpar driver support while the official upstream drivers get fixed and the staging drivers go untouched. That’s how things actually get better when you can’t fix everything at once.

NetworkManager and ConnMan

Lately ofono and ConnMan have been in the news, and that’s sparked some discussion about how these two projects relate to NetworkManager. I’ve mostly just been ignoring that discussion and focusing on making NetworkManager better. But at some point the discussion needs to become informed and the facts need to be straightened out.

So what makes NetworkManager great?

Flexibility: NM’s D-Bus interface provides a ton of control and information about the network connections of your machine. Developers and applications simply don’t take enough advantage of this. Imagine mail automatically pulled whenever the corporate VPN is up. Or more restrictive firewalls when connected to public networks. Yeah, you can do that today with NetworkManager.
Works everywhere: from the mainframe to the power desktop to the netbook and lower. There’s nothing stopping you from running NetworkManager on an s390 or a Palm Pre.
Integration: most users like NetworkManager’s distro integration, so it’s on by default (but can be turned off for running bare-metal). NM will read your distro’s network config files: ifcfg on Fedora, /etc/network/interfaces on Debian, etc. It doesn’t pretend the rest of the world doesn’t exist, but it can if you tell it to.
Connection Sharing: you can share your 3G connection to the wired or the wifi interface, or the other way around. How you share is completely up to you.
VPN: it’s got plugins for Cisco (vpnc), openvpn, openconnect, and pptp. An ipsec/openswan plugin is being written. It’s just easy to use the VPN of your choice.
Makes Linux better: by not working around stupid vendor drivers or other broken components, NetworkManager drives many improvements in drivers, kernel APIs, the supplicant, and desktop applications. Five years ago I posted a list of wifi problems, many of which got fixed because NetworkManager users complained about them. Stuff like WPA capability fixes, hidden SSID fixes, suspend/resume improvements, Ad-Hoc mode fixes, and lots of improvements to wpa_supplicant to name just a few. By encouraging drivers to be open, by fixing bugs in the open drivers and the stack instead of hacking around them, and by encouraging vendors to work upstream, NetworkManager makes Linux better for you.

What great stuff is coming next?

Bluetooth
IPv6
Bridging
Getting rid of HAL
ModemManager and better mobile broadband

All in all, a lot of great stuff is on the plate. NetworkManager already works well for a ton of people, but we’d like to make it work better for a lot more people. And it will.

So what about ConnMan?

I recently came across a slide deck about ConnMan which makes both disappointing and inaccurate claims about NetworkManager. It’s also worth emphasizing the philosophical differences between the two projects.

First, ConnMan primarily targets embedded devices, netbooks, and MIDs (slide #1). When ConnMan was first released in early 2008, NetworkManager 0.7 was under heavy development, and NetworkManager 0.6 clearly did not meet the requirements. But 0.7, released in November 2008, works well for a wide range of use-cases and hardware platforms.

NetworkManager scales from netbooks, MIDs, and embedded devices with custom-written UIs to desktops to large systems like IBM’s s390. You get the best of both worlds: from phenomenal cosmic power down to itty-bitty living space.

ConnMan explicitly doesn’t try to integrate with existing distributions (slide #5), partly due to its requirement to be as light-weight as possible. But NetworkManager will use your distro’s normal network config and startup scripts if you tell it to (but you don’t have to). Early in NetworkManager days we tried to ignore the rest of the world too. Turns out that doesn’t work so well; users demand integration with their distribution. But ConnMan doesn’t pretend to be general purpose, and due to its embedded focus, it can wave this issue away.

Both ConnMan and ofono reject well-established technologies like GObject (but still use glib) in favor of re-implementing much of GObject internally anyway. This is a curious decision as GObject is not a memory hog and not a performance drag for these cases. The NIH syndrome continues with libgppp, libgdhcp, and libgdbus, where instead of improving existing, widely-used tools like dhclient/dhcpcd, pppd, and dbus-glib, ConnMan opts to re-implement them in the name of being more “lightweight”. With embedded projects that ConnMan targets (like Maemo and Moblin) already using GObject and dhcpcd, I don’t understand why this tradeoff was made. Perhaps this visceral dislike of GObject and dbus-glib was one reason the project’s creators decided to write their own connection manager instead of helping to improve existing ones.

NetworkManager in contrast re-uses and helps improve components all over the Linux stack. Because of that, more people benefit from the fixes and improvements that NetworkManager drives in projects like avahi, wpa_supplicant, the kernel, pppd, glib, dbus-glib, ModemManager, libnl, PolicyKit, udev, etc.

Taking a look at the deck

I have things to say about most of the slides, but I’ll concentrate on the most interesting and misinformed ones instead.

Very Complex Design: a complete strawman, because it doesn’t say anything. NetworkManager 0.7 is a mature project with many useful features. NM is based around a core of objects, each one performing actions based on signals and events from other objects. It’s modular and flexible. It’s just not a ConnMan-style box of lego blocks with a rigid plugin API and all the problems that causes.
Large Dependency List: NM requires things like wpa_supplicant, udev, dbus, glib, libuuid, libnl, and a crypto library. pppd and avahi are optional. This list is certainly not large. When you take ConnMan and its optional dependencies (most of which are needed in a useful system) the list is just about the same.
Too Much Decision-making in the UI: Completely bogus and frankly incomprehensible. The core NM daemon provides a default policy which is in no way connected to the UI, and the rest is up to the user. nm-applet contains no policy whatsoever. If the objection is to nm-applet’s desktop-centric interaction model, then it’s important to know there is no lack of applets for different use-cases.
Tries to work around distro problems: this is completely a matter of perspective. Since Intel was creating its own Linux distribution (moblin), they didn’t have to work around any existing issues; these were simply waved away. Unfortunately NetworkManager lives in the world of reality and not some universe full of ponies. For users that expect it, NetworkManager integrates with your distro’s existing network config, init scripts, and DNS resolution. For users that don’t care, NetworkManager can run bare-metal.
Too much GNOME-like source code: seriously, what the hell? I’m not sure where to begin with this one. The NetworkManager core does not depend on GNOME. At all. Yeah, the source-code is in the Gnome style, but is that seriously an issue?

Uninformed diagram of NetworkManager architecture

(Misinformation shaded blue for your protection)

The User Settings service is contained in the applet, and it’s completely optional. The System Settings service has been merged back into the NetworkManager core daemon and is no longer a separate process. That same commit ported NM from HAL to udev; thus HAL is no longer required. NetworkManager always used HAL/udev for device detection instead of RTNL (ie, netlink). NetworkManager also hasn’t used WEXT for a long time; wpa_supplicant handles kernel wireless configuration. NetworkManager uses distro networking scripts only for service control, as does ConnMan. The rest of the slide is quite petty and just splits hairs.

Where to?

It’s unlikely that either NetworkManager or ConnMan will disappear in the near future. That means we’ll all have to live with two mutually exclusive connection managers and two completely different network configuration systems. I think that’s pretty pointless, but I don’t get the last word anyway, since that’s not how Open Source works. The users will decide which solution works best for them. And that means NetworkManager will keep getting better, keep getting more useful, and will continue to be the easiest network management solution around.

I was told there’d be cookies?

Land of Confusion

Since NetworkManager 0.7 came out, there’s one issue that’s been causing confusion with lots of users: hashed network keys. That passphrase you type into the box when you connect to a WiFi network using any OS isn’t what actually gets used; instead it’s hashed to come up with the real key. There are a few different ways to enter an encryption key for a WiFi network, so bear with me:

Hex: works with both WEP and WPA, and is the most compatible since it actually is exactly what gets sent to the driver as the encryption key. For WEP, this is either a 10 character (for 40-bit WEP) or a 26 character (for 104-bit WEP) string composed of hexadecimal characters. For WPA it’s a 64 character hexadecimal string. Typing in 64 hexadecimal characters gets old pretty fast, which leads us to…
Passphrase: a string of arbitrary characters that is hashed into the actual key to be used. WEP passphrases have no real size restrictions, and are repeated into a 64-byte buffer before being hashed with MD5. At least the creators of WPA learned from experience, specifying that WPA passphrases are between 8 and 63 characters inclusive, which means you can actually autodetect whether it’s a passphrase or a hex key, unlike WEP passphrases. WPA passphrases are hashed into the real encryption key using SHA-1 (specifically PBKDF2 with HMAC-SHA1, salted with the network’s SSID).
ASCII key: Thanks, Lucent. The original WaveLAN cards used passphrases of 5 or 13 ASCII characters, which some drivers and people still use for God knows what reason. To “hash” it, take the two-digit hexadecimal ASCII value of each character and stuff them into a buffer. Not secure at all.
Apple passwords: in their infinite wisdom, Apple chose a completely different hashing mechanism for WEP. This means that to connect a non-Apple computer to an Airport WEP network, you need the “Compatible Network Password”, ie the hexadecimal WEP key. At least they stuck with the standard for WPA.
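To make the key types above concrete, here is a minimal Python sketch of the three conversions. The function names are made up for illustration and this is not NetworkManager’s code. The WPA function uses the standard passphrase-to-PSK mapping (PBKDF2 with HMAC-SHA1, the SSID as salt, 4096 iterations, 256-bit output). The WEP passphrase function assumes the common 104-bit convention of keeping the first 13 bytes of the MD5 digest; the 40-bit variant and Apple’s scheme are not shown.

```python
import hashlib


def wep_ascii_to_hex(key: str) -> str:
    """ASCII key: each character becomes its two-digit hex ASCII value."""
    if len(key) not in (5, 13):
        raise ValueError("WEP ASCII keys are 5 or 13 characters long")
    return key.encode("ascii").hex()


def wep104_passphrase_to_hex(passphrase: str) -> str:
    """WEP passphrase: repeat into a 64-byte buffer, then hash with MD5.

    Assumption: the 104-bit key is the first 13 bytes of the digest.
    """
    data = passphrase.encode("utf-8")
    if not data:
        raise ValueError("empty passphrase")
    buf = (data * (64 // len(data) + 1))[:64]
    return hashlib.md5(buf).digest()[:13].hex()


def wpa_passphrase_to_hex(passphrase: str, ssid: str) -> str:
    """WPA passphrase: PBKDF2-HMAC-SHA1 salted with the SSID, 64 hex digits out."""
    if not 8 <= len(passphrase) <= 63:
        raise ValueError("WPA passphrases are 8 to 63 characters long")
    psk = hashlib.pbkdf2_hmac(
        "sha1", passphrase.encode("utf-8"), ssid.encode("utf-8"), 4096, 32
    )
    return psk.hex()
```

Note that the WPA result depends on the SSID, so the same passphrase gives a different hex key on every network, while the WEP conversions depend only on what you typed.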

The huge pain with WEP is that you simply cannot autodetect what type of key the user has entered. Since WEP passphrases can also be composed of 10 or 26 hexadecimal characters, it’s impossible to differentiate between a WEP hex key, a WEP passphrase, or a WEP 104-bit ASCII key. Which means the user has to know what WEP key type they are using. FAIL. They also have to know whether the network uses 40-bit or 104-bit encryption, and whether it uses Shared Key authentication or Open System authentication. That’s 12 different possible WEP configurations.
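A small Python sketch of why autodetection works for WPA but not for WEP. The function names are hypothetical; the rules are just the length and character-set rules described above. For WPA the answer is unique, while for WEP the best you can do is return every interpretation the input is compatible with.

```python
import string
from itertools import product

HEX_DIGITS = set(string.hexdigits)


def is_hex(s: str) -> bool:
    return bool(s) and all(c in HEX_DIGITS for c in s)


def wpa_key_type(s: str):
    """WPA: 64 hex digits is a hex key, 8 to 63 characters is a passphrase."""
    if len(s) == 64 and is_hex(s):
        return "hex"
    if 8 <= len(s) <= 63:
        return "passphrase"
    return None


def wep_key_candidates(s: str):
    """WEP: list every key type the input could be; often more than one."""
    candidates = []
    if len(s) in (10, 26) and is_hex(s):
        candidates.append("hex")
    if len(s) in (5, 13) and s.isascii():
        candidates.append("ascii")
    if s:
        # Passphrases have no real size restrictions, so anything qualifies.
        candidates.append("passphrase")
    return candidates


# Key type x key length x authentication mode: the 12 configurations.
WEP_CONFIGS = list(
    product(("hex", "passphrase", "ascii"), (40, 104), ("open", "shared"))
)
```

For example, `wep_key_candidates("0123456789")` returns both `"hex"` and `"passphrase"`, which is exactly the ambiguity the user ends up having to resolve.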

WEP == MASSIVE USER FAIL

In any case, NetworkManager 0.7 required pre-hashed keys for reasons I don’t accurately remember, possibly related to bad trips from the NM 0.6 API that I mis-designed. So the applet hashed your passphrase right after you entered it and stored the hashed key in the keyring. Unfortunately, when the driver failed to connect and NetworkManager asked for your secrets again, all you saw was something you certainly don’t remember typing in. While this actually was your passphrase, and it would work when you hit OK, it certainly was confusing.

Change We Can Believe In

As of Saturday, you’ll always see what you typed in. The real fix is to simply connect the first time and never ask for your passphrase again, but when that doesn’t happen it’s almost always due to driver and supplicant bugs that can and should be fixed; I’ve spent weeks of my life doing just that. Of course, that can only reliably happen in open-source drivers; at least when we find the bugs we can fix them. Which is why you really don’t want any of these.

That’s when I reach for my revolver…

All your GSM are belong to me — A few of the 3G cards NetworkManager gets tested with

… to shoot myself in the head. Some mobile broadband cards are like a nice, quiet child that does everything you tell them to do; they’d even clean out your family’s slurry tank if you asked. Unfortunately, most of the cards you just want to throw right into the slurry tank strapped to the side of a large brick. Hopefully the tank is full, and the card doesn’t have a snorkel, not that a snorkel would help it much.

Yes, there are standards. But as we all know, given 10 people and a standard, you’ll end the day with 12 or 13 differently behaving “standards-compliant” implementations. People suck. You’d think it would be easy to agree on an AT command for “prefer 3G / prefer 2G / 3G only / 2G only”. NO SIMPLE FOR YOU. But NetworkManager has to work around huge amounts of stupid. Here’s a run-down of some of the mobile broadband hardware that’s available today and what about it sucks.

HUAWEI (PHEAR THE DRAGON)

Europe apparently got carpet-bombed with these things. They provide two usable serial ports; one for data and another for stuff like signal strength, mode switches, etc. But asking it anything on the second port makes the modem cry, grab its toys, and run home to tell mommy what you’ve done. This caused problems with the new modem capability probing code in NetworkManager 0.7.1. Thanks guys (not). Dropping unhandled input on the floor would apparently have been too easy. And, of course, they use AT^SYSCFG with some magic numbers to indicate 2G/3G preference. That said, Huawei does participate upstream and proactively adds IDs and support for their hardware.

Qualcomm Gobi (NEW HOTNESS ALERT)

Apparently now all the rage State-side (though they’ve been out for a while); even in the ultra-small Atom-based Poulsbong-smoking Sony Vaio P series. These parts can do GSM/HSPA and CDMA/EVDO, depending on the firmware they load. Now that’s pretty cool. There’s even a driver for them (qcserial) queued up in gregkh’s tree. Unfortunately, because Qualcomm still can’t get their head out of their ass you won’t get signal strength, cell tower reports, or mode change signals, since the driver only exposes one serial port which is used for PPP (it might support GSM multiplexing, in which case this rant is wrong). Everything else seems to get done from userspace with libusb and a proprietary protocol. WTF is so awesome about proprietary protocols? You get to sell people an SDK for $20,000 or something? Nice try.

Modern Sierra (MAGICALLY DELICIOUS)

Driven by the ‘sierra’ driver (surprise!), these cards expose multiple serial ports; two or more of which accept AT commands. Only one of these ports has a full AT interpreter and gets used for PPP, the other ports get used for signal strength and GPS during the PPP session. I hear new Sierra gear is switching to the tty+netdevice model though, so these will be old-but-not-busted soon. But of course, somebody took a huge pull off the bong, and came up with AT!SELRAT for 2G/3G preference. Yay! Variation #2!

Old-School Sierra (OLD BUT NOT BUSTED)

Sweet bliss. Works like a champ. 16-bit PCMCIA. HSDPA even. GSM multiplexing support makes up for the fact that it only exposes one serial port to the OS (even though we don’t support userspace muxing yet). It’s been supported in NetworkManager since, like, day #1. Like the newer Sierra cards, it also uses AT!SELRAT, so at least Sierra is consistent. Which is more than I can say for some other hardware I’ve seen.

Option “HSO” (THE NEED FOR SPEED)

PPP sucks; it’s only between you and the card, not over the air. So why bother? Which is why Option killed PPP dead. These devices expose multiple AT-capable ttys, and an ethernet network interface. Do the setup on the AT ports, do the data in high speed on the network interface. This is the current trend. Sierra is going to do it soon. So is Huawei. But unfortunately, everyone does the authentication and the IP configuration differently. And Option’s 2G/3G preference command is AT_OPSYS. Variant #3! Go to hell. In any case, big thanks to Option for providing me with hardware and also working with the upstream kernel community; you guys rock.

Ericsson F3507g (SWEDISH INVASION)

Dude, you got a Dell Mini? If you’ve coughed up for the 5530 Mobile Broadband option, it’s probably got one of these inside. The Sony Ericsson MD300 is the same hardware. For once, somebody uses standard interfaces too; these parts expose multiple cdc-acm serial ports (like most mobile phones), and one cdc-ether network device used for data. The interesting thing is that to get an IP address, you use DHCP on the ethernet interface. We don’t yet know how to set 2G/3G preference, but you can get it with AT*ERINFO. All hail variant #4. This is getting ridiculous. At least Ericsson pays people to make their stuff work with Linux, though the AT reference document is NDA-encumbered. Need to hit somebody with the cluebat for that.

BUSlink SCWi275u (DEAR GOD DON’T BUY THIS)

Really. If you find one, put it out of its misery by burning it alive. Yeah, it’s really old, and it’s only GPRS, and it’s from the land before time when they put WiFi into cellular modems because nobody had it onboard. And hey, its firmware is as clueless about standards as Qualcomm is about Open Source. But it works fine with NetworkManager 🙂

As you can see, nobody in this industry talks to each other, and none of the carriers care about making it easier to write software for the devices they sell. Everywhere you look there are silos, walled gardens, and revenue stream protection. But that’s where NetworkManager comes in.
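The 2G/3G preference mess from the run-down above boils down to a per-vendor lookup table. This Python sketch only records the command names mentioned in this post; the arguments (Huawei’s magic numbers and friends) are deliberately left out because they aren’t given here, and the vendor keys and function name are made up for illustration.

```python
# Command used to SET 2G/3G preference, per vendor (arguments omitted).
RAT_PREFERENCE_COMMANDS = {
    "huawei": "AT^SYSCFG",
    "sierra": "AT!SELRAT",  # both modern and old-school Sierra cards
    "option": "AT_OPSYS",
}

# Vendors where only a way to READ the current mode is known.
RAT_QUERY_ONLY_COMMANDS = {
    "ericsson": "AT*ERINFO",
}


def rat_preference_command(vendor: str) -> str:
    """Return the vendor's mode-preference command, or raise if unknown."""
    vendor = vendor.lower()
    if vendor in RAT_PREFERENCE_COMMANDS:
        return RAT_PREFERENCE_COMMANDS[vendor]
    if vendor in RAT_QUERY_ONLY_COMMANDS:
        raise LookupError(
            f"{vendor}: can only query the mode with "
            f"{RAT_QUERY_ONLY_COMMANDS[vendor]}"
        )
    raise LookupError(f"{vendor}: no known 2G/3G preference command")
```

Four vendors, four commands, and every new vendor means another entry plus its own argument format.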

The Bright, Shiny New Mobile Future

NM 0.7 delivered the promise of Mobile Broadband. We took a limited set of devices (ie, no phones) and made those work out of the box. Now it’s time to get bigger, faster, and stronger. We can’t support everything in the current architecture inside NetworkManager, so Tambet started a new project called ModemManager. All mobile broadband handling gets punted out to ModemManager (similar to how WiFi is handled with wpa_supplicant), making the NetworkManager core simpler, easier to maintain, and more robust. ModemManager provides a nice D-Bus API for everything modem-related; data connections, SMS, phonebook, signal strength, GPS, etc. It rocks. It’s more flexible. It spews out cute, cuddly kittens by the thousands. It’s definitely the right architecture and the way forward.

The Slightly Less-Bright Now

But until ModemManager drops some awesome on y’all, we need to better support modems in NetworkManager 0.7.x too. A few problems we’ve been tackling over the past few months:

multiple serial ports – most modems provide more than one port; but nothing tells you what that port gets used for. Sometimes asking the port what its purpose in life is doesn’t work either. So we have to special-case some modems in the udev prober, and some in NetworkManager. This gets as ugly as your first girlfriend/boyfriend.
modem capabilities – this is why your mobile phone didn’t work with NetworkManager 0.7.0. We need to know whether the modem is CDMA/EVDO or GSM/HSPA since the operation and UI needs to change based on which kind of modem it is. Previously we used hal-info’s 10-modem.fdi, which simply doesn’t scale. Asking the modem freaks some of them out (ie, Huawei) and others just lie for various reasons. So with NM 0.7.1, we probe serial ports with a udev helper and are careful not to touch things that shouldn’t be touched.
modem init strings – because, of course, consistent handling of initialization strings between devices would just be too easy. Some devices puke up half-eaten puppies when given the same init string that every other device on the planet supports. No standardization here. So NetworkManager 0.7.1 tries different init strings until one works.
registration commands – some Huawei modems want to use AT+CGREG instead of AT+CREG. Yeah, I know why it seems to think it can be special, but it’s not. It’s just plain stupid. And this seems to change based on firmware version of all things. Dear God, why do you toy with me? So in lieu of finding a Huawei engineer and asking them what the fuck they were thinking, we hacked around it for now.
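The “try things until one works” approach from the last two items can be sketched in a few lines of Python. This is an illustration, not NetworkManager’s code: the `send` callable stands in for whatever writes a command to the serial port and returns the reply, the init strings are made-up examples, and `picky_modem` is a fake device used only to show the fallback.

```python
from typing import Callable, Iterable, Optional

# Hypothetical init strings, most featureful first.
INIT_STRINGS = ["ATZ E0 V1 X4 &C1 +FCLASS=0", "ATZ E0 V1 &C1", "ATE0 V1"]

# Registration status queries: the usual one first, then the fallback.
REGISTRATION_QUERIES = ["AT+CREG?", "AT+CGREG?"]


def first_accepted(
    send: Callable[[str], str], commands: Iterable[str]
) -> Optional[str]:
    """Return the first command the modem answers with OK, else None."""
    for command in commands:
        reply = send(command)
        if reply.strip().endswith("OK"):
            return command
    return None


def picky_modem(command: str) -> str:
    """Fake modem that rejects everything except two specific commands."""
    accepted = {"ATE0 V1", "AT+CGREG?"}
    return "\r\nOK\r\n" if command in accepted else "\r\nERROR\r\n"
```

With the fake modem, `first_accepted(picky_modem, INIT_STRINGS)` falls through to the last, simplest init string, and the registration query falls back to `AT+CGREG?`.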

We’ve gotten most of this worked out in the NetworkManager 0.7.1 release candidate series. And all this crap is exactly why NM 0.7.1 isn’t out yet. Like when NM 0.7.1-rc1 broke people’s Huawei cards due to modem probing freaking out the firmware, I spent $100 for a Huawei E160G off eBay. It took a week to get here, and two days to fix it.

But that’s why NetworkManager rocks; we pony up the cash to make sure our shit works. Users appreciate that.

If you have a Sony Ericsson MD300…

Please send me the output of the “AT+CGMM” command using minicom or something like that. You won’t regret it.

Suspend/resume vs. NetworkManager

private-island

The other day while chilling beside the pool on my private island (A), I decided to head into Port Nelson (B) to check up on my various offshore accounts. Financial crisis and all you see; that Stanford thing last week really had me worried. A laptop hibernation and a short helicopter ride later, I’m in the branch office and need to look up a few things pertaining to my net worth. But upon resume, NetworkManager started reconnecting to my villa’s access point, which was all the way back on my island. WTH!!!??!?!

This problem has been around for a long time. Pretty much since the beginning of time. I looked at it last year and concluded that it wasn’t NetworkManager. This time it really annoyed me, so I made a bet with my porter that I’d figure it out by the time I left to hit up this party in Bailey Town. He’s cool like that. I got to keep my money. It still wasn’t NetworkManager.

See, drivers timestamp wifi networks they know about. That way you can figure out if the network was last seen a second ago, 7 seconds ago, or so long ago that it’s dead to me. But they all use a kernel counter called ‘jiffies’ to do that. And ‘jiffies’ doesn’t increment across suspend/resume. See where I’m going with this?

So the next scan after resume, all the old networks are mixed in with the new networks, and you simply can’t tell which ones are old and which ones are new. They all look like they were scanned within the past 10 seconds. The last AP you were connected to looks like a great candidate to try, no matter where it is.

Abusing people as a metaphor for scan results:

WANT
(with apologies to Imansyah)

old scan results suck

DO NOT WANT

The solution is to age the scan results with the amount of time spent in suspend. This keeps both normal laptops (where you’ll usually be suspended for a while) and OLPC-style laptops (where suspend can happen for sub-second durations) happy. The patches are queued for 2.6.30, and I’ve backported them to 2.6.27, 2.6.28, and 2.6.29. They are also a prerequisite for making NetworkManager just try harder to associate when the connection fails, which I know annoys a lot of people, including myself.
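Here is a toy Python model of the bug and the fix. It is not the kernel patch: the `Driver` class, the `ticks` counter standing in for jiffies, and the 10-second freshness cutoff are all assumptions made for illustration. The counter does not advance during suspend, so the fix pushes every cached result’s timestamp back by the time spent suspended.

```python
from dataclasses import dataclass

MAX_AGE = 10.0  # assumed cutoff (seconds) for a result to count as current


@dataclass
class ScanResult:
    ssid: str
    last_seen: float  # counter value when the network was last seen


class Driver:
    def __init__(self):
        self.ticks = 0.0  # stands in for jiffies
        self.results = {}

    def run(self, seconds):
        self.ticks += seconds

    def see(self, ssid):
        self.results[ssid] = ScanResult(ssid, self.ticks)

    def suspend(self, seconds, age_results):
        # The counter stays frozen while suspended. With the fix applied,
        # cached results are aged by the suspend duration instead.
        if age_results:
            for result in self.results.values():
                result.last_seen -= seconds

    def fresh(self):
        return sorted(
            r.ssid
            for r in self.results.values()
            if self.ticks - r.last_seen <= MAX_AGE
        )


def scenario(age_results):
    driver = Driver()
    driver.see("villa")                  # scanned at home
    driver.suspend(3600, age_results)    # an hour of hibernation
    driver.run(1)
    driver.see("bank")                   # scanned after resume
    return driver.fresh()
```

Without aging, `scenario(False)` reports both networks as fresh, so the villa’s access point still looks like a fine candidate. With aging, `scenario(True)` leaves only the network actually seen after resume.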

Problem solved, party attended.

The big lesson? When something is wrong with the drivers, fix the drivers. Don’t hack around it like a helpless tool. And if you can’t fix the driver, well… then why did you mindlessly stuff $50 bills into Broadcom’s thong in the first place?