urlparse considered harmful

Over the weekend, I spent a number of hours tracking down a bug caused by the cache in the Python urlparse module. The problem has already been reported as Python bug 1313119, but has not been fixed yet.

First a bit of background. The urlparse module does what you’d expect and parses a URL into its components:

>>> from urlparse import urlparse
>>> urlparse('http://www.gnome.org/')
('http', 'www.gnome.org', '/', '', '', '')

As well as accepting byte strings (which you’d be using at the HTTP protocol level), it also accepts Unicode strings (which you’d be using at the HTML or XML content level):

>>> urlparse(u'http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')

As the result is immutable, urlparse implements a cache of up to 20 previous results. Unfortunately, the cache does not distinguish between byte strings and Unicode strings, so parsing a byte string may return unicode components if the result is in the cache:

>>> urlparse('http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')

When you combine this with Python’s automatic promotion of byte strings to unicode when concatenating with a unicode string, can really screw things up when you do want to work with byte strings. If you hit such a problem, the code may all look correct but the problem was introduced 20 urlparse calls ago. Even if your own code never passes in Unicode strings, one of the libraries you use might be doing so.

The problem affects more than just the urlparse function. The urljoin function from the same module is also affected since it uses urlparse internally:

>>> from urlparse import urljoin
>>> urljoin('http://www.ubuntu.com/', '/news')
u'http://www.ubuntu.com/news'

It seems safest to avoid the module all together if possible, or at least until the underlying bug is fixed.

OpenID 2.0 Specification Approved

It looks like the OpenID Authentication 2.0 specification has finally been released, along with OpenID Attribute Exchange 1.0. While there are some questionable features in the new specification (namely XRIs), it seems like a worthwhile improvement over the previous specification. It will be interesting to see how quickly the new specification gains adoption.

While this is certainly an important milestone, there are still areas for improvement.

Best Practices For Managing Trust Relationships With OPs

The proposed Provider Authentication Policy Extension allows a Relying Party to specify what level of checking it wants the OpenID Provider to perform on the user (e.g. phishing resistant, multi factor, etc). The OP can then tell the RP what level of checking was actually performed.

What the specification doesn’t cover is why the RP should believe the OP. I can easily set up an OP that performs no checking on the user but claims that it performed “Physical Multi-Factor Authentication” in its responses. Any RP that acted on that assertion would be buggy.

This isn’t to say that the extension is useless. If the entity running the RP also runs the OP, then they might have good reason to believe the responses and act on them. Similarly, they might decide that JanRain are quite trustworthy so believe responses from myOpenID.

What is common in between these situations is that there is a trust relationship between the OP and RP that is outside of the protocol. As the specification gives no guidance on how to set up these relationships, they are likely to be ad-hoc and result in some OpenIDs being more useful than others.

At a minimum, it’d be good to see some best practices document on how to handle this.

Trusted Attribute Exchange

As mentioned in my previous article on OpenID Attribute Exchange, I mentioned that attribute values provided by the OP should be treated as being self asserted. So if the RP receives an email address or Jabber ID via attribute exchange, there is no guarantee that the user actually owns them. This is a problem if the RP wants to start emailing or instant messaging the user (e.g. OpenID enabled mailing list management software). Assuming the RP doesn’t want to get users to revalidate their email address, what can it do?

One of the simplest solutions is to use a trust relationship with the OP. If the RP knows that the OP will only transfer email addresses if the user has previously verified them, then they need not perform a second verification. This leaves us in the same situation as described in the previous situation.

Another solution that has been proposed by Sxip is to make the attribute values self-asserting. This entails making the attribute value contain both the desired information plus a digital signature. Using the email example, if the email address has a valid digital signature and the RP trusts the signer to perform email address verification, then it can accept the email address without further verification.

This means that the RP only needs to manage trust relationships with the attribute signers rather than every OP used by their user base. If there are fewer attribute signers than OPs then this is of obvious benefit to the RP. It also benefits the user since they no longer limited to one of the “approved” OPs.

Canonical IDs for URL Identifiers

I’ve stated previously that I think the support for identifier reuse with respect to URL identifiers is a bit lacking.  It’d be nice to see it expanded in a future specification revision.