urlparse considered harmful

Over the weekend, I spent a number of hours tracking down a bug caused by the cache in the Python urlparse module. The problem has already been reported as Python bug 1313119, but has not been fixed yet.

First a bit of background. The urlparse module does what you’d expect and parses a URL into its components:

>>> from urlparse import urlparse
>>> urlparse('http://www.gnome.org/')
('http', 'www.gnome.org', '/', '', '', '')

As well as accepting byte strings (which you’d be using at the HTTP protocol level), it also accepts Unicode strings (which you’d be using at the HTML or XML content level):

>>> urlparse(u'http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')

As the result is immutable, urlparse implements a cache of up to 20 previous results. Unfortunately, the cache does not distinguish between byte strings and Unicode strings, so parsing a byte string may return unicode components if the result is in the cache:

>>> urlparse('http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')
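
The failure mode can be modelled in a few lines. This is only a sketch, not the module's actual code: modern Python 3 strings and byte strings never compare equal, so the hypothetical class U below stands in for Python 2's unicode type (it compares and hashes equal to a plain str, just as u'x' and 'x' did). The make_parser helper and its type_aware flag are likewise made up for illustration.

```python
class U(str):
    """Hypothetical stand-in for Python 2's unicode: equal to, and hashes like, str."""


def make_parser(type_aware):
    cache = {}

    def parse(url):
        # The naive key ignores the string type; the type-aware key does not.
        key = (type(url), url) if type_aware else url
        if key not in cache:
            scheme, _, rest = url.partition("://")
            # Build the components with the same type as the input.
            cache[key] = (type(url)(scheme), type(url)(rest))
        return cache[key]

    return parse


naive = make_parser(type_aware=False)
naive(U("http://www.ubuntu.com/"))        # first call populates the cache
leaked = naive("http://www.ubuntu.com/")  # plain str in, U components out

fixed = make_parser(type_aware=True)
fixed(U("http://www.ubuntu.com/"))
clean = fixed("http://www.ubuntu.com/")   # plain str in, plain str out
```

Including the type of the argument in the cache key is one possible way to keep the two kinds of string apart.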

When you combine this with Python’s automatic promotion of byte strings to unicode when concatenating with a unicode string, it can really screw things up when you do want to work with byte strings. If you hit such a problem, the code may all look correct, but the problem could have been introduced up to 20 urlparse calls ago. Even if your own code never passes in Unicode strings, one of the libraries you use might be doing so.

The problem affects more than just the urlparse function. The urljoin function from the same module is also affected since it uses urlparse internally:

>>> from urlparse import urljoin
>>> urljoin('http://www.ubuntu.com/', '/news')
u'http://www.ubuntu.com/news'

It seems safest to avoid the module altogether if possible, or at least until the underlying bug is fixed.

OpenID 2.0 Specification Approved

It looks like the OpenID Authentication 2.0 specification has finally been released, along with OpenID Attribute Exchange 1.0. While there are some questionable features in the new specification (namely XRIs), it seems like a worthwhile improvement over the previous specification. It will be interesting to see how quickly the new specification gains adoption.

While this is certainly an important milestone, there are still areas for improvement.

Best Practices For Managing Trust Relationships With OPs

The proposed Provider Authentication Policy Extension allows a Relying Party to specify what level of checking it wants the OpenID Provider to perform on the user (e.g. phishing-resistant, multi-factor, etc). The OP can then tell the RP what level of checking was actually performed.

What the specification doesn’t cover is why the RP should believe the OP. I can easily set up an OP that performs no checking on the user but claims that it performed “Physical Multi-Factor Authentication” in its responses. Any RP that acted on that assertion would be buggy.

This isn’t to say that the extension is useless. If the entity running the RP also runs the OP, then they might have good reason to believe the responses and act on them. Similarly, they might decide that JanRain are quite trustworthy so believe responses from myOpenID.

What these situations have in common is that there is a trust relationship between the OP and RP that is outside of the protocol. As the specification gives no guidance on how to set up these relationships, they are likely to be ad-hoc and result in some OpenIDs being more useful than others.
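
In practice, an RP that wants to act on these claims ends up keeping some kind of allowlist. The following is a minimal sketch of that idea, assuming the RP identifies OPs by their endpoint URL. The endpoint URLs and the accepted_policies helper are hypothetical, and the policy URI is the one I understand the extension drafts to use for physical multi-factor authentication, so treat it as an illustrative value.

```python
# Hypothetical allowlist: OP endpoints this RP trusts through some
# out-of-band arrangement. Nothing in the protocol populates this.
TRUSTED_OP_ENDPOINTS = {
    "https://op.example.com/server",
}

# Illustrative policy URI (assumed from the extension drafts).
MULTI_FACTOR_PHYSICAL = (
    "http://schemas.openid.net/pape/policies/2007/06/multi-factor-physical"
)


def accepted_policies(op_endpoint, claimed_policies):
    """Return the policy claims the RP is prepared to act on."""
    if op_endpoint not in TRUSTED_OP_ENDPOINTS:
        # An unknown OP can claim anything, so treat its claims as absent.
        return set()
    return set(claimed_policies)


trusted = accepted_policies(
    "https://op.example.com/server", [MULTI_FACTOR_PHYSICAL])
untrusted = accepted_policies(
    "https://op.example.net/server", [MULTI_FACTOR_PHYSICAL])
```

How entries get onto the allowlist is exactly the part that the specification leaves open.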

At a minimum, it’d be good to see some best practices document on how to handle this.

Trusted Attribute Exchange

As mentioned in my previous article on OpenID Attribute Exchange, attribute values provided by the OP should be treated as being self-asserted. So if the RP receives an email address or Jabber ID via attribute exchange, there is no guarantee that the user actually owns them. This is a problem if the RP wants to start emailing or instant messaging the user (e.g. OpenID enabled mailing list management software). Assuming the RP doesn’t want to get users to revalidate their email address, what can it do?

One of the simplest solutions is to use a trust relationship with the OP. If the RP knows that the OP will only transfer email addresses if the user has previously verified them, then it need not perform a second verification. This leaves us in the same situation as described in the previous section.

Another solution that has been proposed by Sxip is to make the attribute values self-asserting. This entails making the attribute value contain both the desired information plus a digital signature. Using the email example, if the email address has a valid digital signature and the RP trusts the signer to perform email address verification, then it can accept the email address without further verification.

This means that the RP only needs to manage trust relationships with the attribute signers rather than every OP used by their user base. If there are fewer attribute signers than OPs then this is of obvious benefit to the RP. It also benefits the user since they are no longer limited to one of the “approved” OPs.
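
To make the shape of the idea concrete, here is a rough sketch of an RP checking a signed attribute value against a table of signers it trusts. It is not Sxip's proposal: to stay within the standard library it uses an HMAC with a shared key as a stand-in for a real public-key signature, and the signer name, key and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical: keys for the attribute signers this RP trusts to verify
# email addresses. A real scheme would use public-key signatures, so the
# RP would hold public keys rather than shared secrets.
TRUSTED_SIGNER_KEYS = {"verifier.example.org": b"shared-secret"}


def sign_value(signer, value):
    """What the signer would do after verifying the value (stand-in)."""
    key = TRUSTED_SIGNER_KEYS[signer]
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def is_verified(signer, value, signature):
    """RP side: accept the value only if a trusted signer vouches for it."""
    key = TRUSTED_SIGNER_KEYS.get(signer)
    if key is None:
        return False  # no trust relationship with this signer
    expected = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


signature = sign_value("verifier.example.org", "john@example.com")
```

Note that the OP that delivered the value does not appear anywhere in the check, which is the point of the approach.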

Canonical IDs for URL Identifiers

I’ve stated previously that I think the support for identifier reuse with respect to URL identifiers is a bit lacking. It’d be nice to see it expanded in a future specification revision.

States in Version Control Systems

Elijah has been writing an interesting series of articles comparing different version control systems. While the previous articles have been very informative, I think the latest one was a bit muddled. What follows is an expanded version of my comment on that article.

Elijah starts by making an analogy between text editors and version control systems, which I think is quite a useful analogy. When working with a text editor, there is a base version of the file on disk, and the version you are currently working on which will become the next saved version.

This does map quite well to the concepts of most VCS’s. You have a working copy that starts out identical to a base tree from the branch you are editing. You make local changes and eventually commit, creating a new base tree for future edits.

In addition to these two “states”, Elijah goes on to list three more states that are actually orthogonal to the original two. These additional states refer to certain categorisations of files within the working copy, rather than particular versions of files or trees. Rather than simplifying things, I believe that mingling the two concepts together is more likely to cause confusion. I think this is evident from the fact that the additional states do not fit the analogy we started with.

Versioned and Unversioned Files

If you are going to use a version control system seriously, it is worth understanding how files within a working copy are managed. Rather than thinking of a flat list of possible states, I think it is helpful to think of a hierarchy of categories. The most basic categorisation is whether a file is versioned or not.

Versioned files are those whose state will be saved when committing a new version of the tree. Conversely, unversioned files exist in the working copy but are not recorded when committing new versions of the tree.

This concept does not map very well to the original text editor analogy. If text editors did support such a feature, it would be the ability to add paragraphs to the document that do not get stored to disk when you save, but would persist inside the editor.

Types of Versioned Files

There are various ways to categorise versioned files, but here are some fairly generic ones that fit most VCS’s.

  1. unchanged
  2. modified
  3. added
  4. removed

Each of these categorisations is relative to the base tree for the working copy. The modified category contains both files whose contents have changed and whose metadata has changed (e.g. files that have been renamed).

The removed category is interesting because files in this category don’t actually exist in the working copy. That said, the VCS knows that such files did exist, so it knows to delete the files when committing the next version of the tree.
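
As a rough illustration of these four categories, the sketch below compares a base tree with the versioned part of a working copy. It assumes both are modelled as plain dictionaries mapping a path to file contents, which is far simpler than what any real VCS stores, and the function name is made up.

```python
def classify_versioned(base, working):
    """Classify versioned files relative to the base tree.

    base and working map path -> contents for versioned files only.
    """
    status = {}
    for path in sorted(set(base) | set(working)):
        if path not in base:
            status[path] = "added"
        elif path not in working:
            # Gone from the working copy, but the VCS still knows about it.
            status[path] = "removed"
        elif base[path] != working[path]:
            status[path] = "modified"
        else:
            status[path] = "unchanged"
    return status


base_tree = {"a.c": "1", "b.c": "2", "c.c": "3"}
working_tree = {"a.c": "1", "b.c": "changed", "d.c": "4"}
statuses = classify_versioned(base_tree, working_tree)
```

This toy model only tracks contents, so a renamed file shows up as one removed file plus one added file rather than as a single modified file.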

Types of Unversioned Files

There are two primary categories for unversioned files:

  1. ignored
  2. unknown

The ignored category consists of unversioned files that the VCS knows the user does not want added to the tree (either through a set of default file patterns, or because the user explicitly said the file should be ignored). Object files and executables built from source code in the tree are prime examples of files that the user would want to ignore.

The unknown category is a catch-all for any other unversioned file in the tree. This is what Elijah referred to as “limbo” in his article.
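
The split between the two categories can be sketched with shell-style patterns. This assumes the ignore rules are a flat list of glob patterns matched against the whole path. Real systems differ in their pattern syntax, and the names here are hypothetical.

```python
import fnmatch


def classify_unversioned(paths, ignore_patterns):
    """Label each unversioned path as "ignored" or "unknown"."""
    result = {}
    for path in paths:
        # fnmatchcase keeps matching case-sensitive on every platform.
        ignored = any(
            fnmatch.fnmatchcase(path, pattern) for pattern in ignore_patterns
        )
        result[path] = "ignored" if ignored else "unknown"
    return result


labels = classify_unversioned(
    ["main.o", "build/app", "notes.txt"],
    ["*.o", "build/*"],
)
```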

Differences between VCS’s

These concepts are roughly applicable to most version control systems, but there are differences in how the categories are handled. Some of the areas where they differ are:

  • Are newly created files in the working copy counted as added or unknown?
    Some VCS’s (or configurations of VCS’s) don’t have a concept of unknown files. In such a system, newly created files will be treated as added rather than unknown.
  • Are unknown files allowed in the working copy when committing?
    One of the issues Elijah brought up was forgetting to add new files before commit. Some VCS’s avoid this problem by not letting you commit a tree with unknown files.
  • When renaming a versioned file, does it count as a single modified file, or a removed file and an added file?
    This one is a basic question of whether the VCS supports renames or not.
  • If I delete a versioned file, is it put in the removed category automatically?
    With some VCS’s you need to explicitly tell them that you are removing a file. With others it is enough to delete the file on disk.

These differences are the sorts of things that affect the workflow for the VCS, so are worth investigating when comparing different systems.

Inkscape Migrated to Launchpad

Yesterday I performed the migration of Inkscape’s bugs from SourceForge.net to Launchpad. This was a full import of all their historic bug data – about 6900 bugs.

As the import only had access to the SF user names for bug reporters, commenters and assignees, it was not possible to link them up to existing Launchpad users in most cases. This means that duplicate person objects have been created with email addresses like $USERNAME@users.sourceforge.net.

If you are a Launchpad user and have previously filed or commented on Inkscape bugs, you can clean up the duplicate person object by going to the following URL and entering your $USERNAME@users.sourceforge.net address:

https://launchpad.net/people/+requestmerge

After following the instructions in the email confirmation, all references to the duplicate person will be fixed up to point at your primary account (so bug mail will go to your preferred email address rather than being redirected through SourceForge).

OpenID Attribute Exchange

In my previous article on OpenID 2.0, I mentioned the new Attribute Exchange extension. To me this is one of the more interesting benefits of moving to OpenID 2.0, so it deserves a more in-depth look.

As mentioned previously, the extension is a way of transferring information about the user between the OpenID provider and relying party.

Why use Attribute Exchange instead of FOAF or Microformats?

Before deciding to use OpenID for information exchange, it is worth looking at whether it is necessary at all.

There are existing solutions for transferring user data such as FOAF and the hCard microformat. As the relying party already has the user’s identity URL, it’d be trivial to discover a FOAF file or hCard content there. That said, there are some disadvantages to this method:

  1. Any information published in this way is available to everyone. This might be fine for some classes of information (your name, a picture, your favourite colour), but not for others (your email address, phone number or similar).
  2. The same information is provided to all parties. Perhaps you want to provide different email addresses to work related sites.
  3. The RP needs to make an additional request for the data. If we can provide the information as part of the OpenID authentication request, it will reduce the number of round trips that need to be made. In turn, this should reduce the amount of time it takes to log the user in.

Why use Attribute Exchange instead of the Simple Registration extension?

There already exists an OpenID extension for transferring user details to the RP, in the form of the Simple Registration extension. It has already been used in the field, and works with OpenID 1.1 too.

One big downside of SREG is that it only supports a limited number of attributes. If you need to transfer more attributes, you basically have two choices:

  1. use some other extension to transfer the remaining attributes
  2. make up some new attribute names to send with SREG and hope for the best.

The main problem with (2) is that there is no way to tell your own extensions to SREG apart from someone else’s, which will likely create interoperability problems when an attribute name conflict occurs. So this solution is not a good idea outside of closed systems. This leaves (1), for which Attribute Exchange is a decent choice.

What can I do with Attribute Exchange?

There are two primary operations that can be performed with the extension:

  1. fetch some attribute values
  2. store some attribute values

Both operations are performed as part of an OpenID authentication request. Among other things, this allows:

  • The OP to ask the user which requested attributes to send
  • If the OP has not stored values for the requested attributes, it could get the user to enter them in and store them for next time.
  • The OP could use a predefined policy to decide what to send the RP. One possibility would be to generate one-time email addresses specific to a particular RP.
  • For store requests, the OP can ask the user to confirm that they want to store the attributes.

Fetching Attributes

An attribute fetch request is a normal authentication request with a few additional fields:

  • openid.ax.mode: this needs to be set to “fetch_request”
  • openid.ax.required: a comma separated list of attribute aliases that the RP needs (note that this does not guarantee that the OP will return those attributes).
  • openid.ax.if_available: a comma separated list of attribute aliases that the RP would like returned if available.
  • openid.ax.type.alias: for each requested attribute alias, the URI identifying the attribute type
  • openid.ax.count.alias: the number of values the RP would like for the attribute.
  • openid.ax.update_url: a URL to send updates to (will be discussed later).

The use of URIs to identify attributes makes it trivial to define new attributes without conflicting with other people (and as with XML namespaces, the attribute aliases are arbitrary). However, the extension is only useful if the OP and RP can agree on attribute types. To help with this, there is a collection of community defined attribute types at axschema.org.

As an example, imagine a web log that uses OpenID to authenticate comment posts. Rather than just printing the OpenID URL for the commenter, it could use attribute exchange to request their name, email, website and hackergotchi. The authentication request might contain the following additional fields:

openid.ns.ax=http://openid.net/srv/ax/1.0
openid.ax.mode=fetch_request
openid.ax.required=name,hackergotchi
openid.ax.if_available=email,web
openid.ax.type.name=http://axschema.org/namePerson
openid.ax.type.email=http://axschema.org/contact/email
openid.ax.type.hackergotchi=http://axschema.org/media/image/default
openid.ax.type.web=http://axschema.org/contact/web/default
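
An RP could assemble those fields with something like the following. This is a minimal sketch: the build_fetch_request helper is hypothetical rather than part of any OpenID library, and it leaves out the count and update_url fields.

```python
from urllib.parse import urlencode

AX_NS = "http://openid.net/srv/ax/1.0"


def build_fetch_request(required, if_available):
    """Build the extra fields; both arguments map alias -> type URI."""
    fields = {"openid.ns.ax": AX_NS, "openid.ax.mode": "fetch_request"}
    if required:
        fields["openid.ax.required"] = ",".join(required)
    if if_available:
        fields["openid.ax.if_available"] = ",".join(if_available)
    # Every alias mentioned above needs a matching type URI.
    for alias, type_uri in {**required, **if_available}.items():
        fields["openid.ax.type." + alias] = type_uri
    return fields


fields = build_fetch_request(
    required={
        "name": "http://axschema.org/namePerson",
        "hackergotchi": "http://axschema.org/media/image/default",
    },
    if_available={
        "email": "http://axschema.org/contact/email",
        "web": "http://axschema.org/contact/web/default",
    },
)
# These would be appended to the other authentication request parameters.
query = urlencode(fields)
```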

In the successful authentication response, the following fields will be included (assuming the OP supports the extension):

  • openid.ax.mode: must be “fetch_response”
  • openid.ax.type.alias: specify the type URI for each attribute being returned.
  • openid.ax.count.alias: the number of values being returned for the given attribute alias (defaults to 1).
  • openid.ax.value.alias: the value for the given attribute alias, if no corresponding openid.ax.count.alias field was sent.
  • openid.ax.value.alias.n: the nth value for the given attribute alias, if a corresponding openid.ax.count.alias field was sent. The first attribute value is sent with n = 1.
  • openid.ax.update_url: to be discussed later.

For the web log example given above, the response might look like:

openid.ns.ax=http://openid.net/srv/ax/1.0
openid.ax.mode=fetch_response
openid.ax.type.name=http://axschema.org/namePerson
openid.ax.type.email=http://axschema.org/contact/email
openid.ax.type.hackergotchi=http://axschema.org/media/image/default
openid.ax.value.name=John Doe
openid.ax.value.email=john@example.com
openid.ax.count.hackergotchi=0

In this response, we can see the following:

  1. The user has provided their name and email
  2. They have not provided any information about their web site. Either the OP does not support the attribute or the user has declined to provide it.
  3. The user has explicitly stated that they have no hackergotchi (i.e. it is a zero-valued attribute).
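
Reading such a response on the RP side might look like the sketch below. It assumes the response fields have already been verified and collected into a dictionary, and the parse_fetch_response name is made up. Results are keyed by type URI rather than alias, since the aliases are arbitrary.

```python
def parse_fetch_response(fields):
    """Return {type URI: [values]} from the openid.ax.* response fields."""
    if fields.get("openid.ax.mode") != "fetch_response":
        raise ValueError("not a fetch response")
    prefix = "openid.ax.type."
    result = {}
    for key, type_uri in fields.items():
        if not key.startswith(prefix):
            continue
        alias = key[len(prefix):]
        count = fields.get("openid.ax.count." + alias)
        if count is None:
            # No count field: a single value sent without an index.
            value = fields.get("openid.ax.value." + alias)
            values = [] if value is None else [value]
        else:
            # Indexed values, numbered from 1. A count of 0 gives [].
            values = [
                fields["openid.ax.value.%s.%d" % (alias, n)]
                for n in range(1, int(count) + 1)
            ]
        result[type_uri] = values
    return result


response = {
    "openid.ns.ax": "http://openid.net/srv/ax/1.0",
    "openid.ax.mode": "fetch_response",
    "openid.ax.type.name": "http://axschema.org/namePerson",
    "openid.ax.type.email": "http://axschema.org/contact/email",
    "openid.ax.type.hackergotchi": "http://axschema.org/media/image/default",
    "openid.ax.value.name": "John Doe",
    "openid.ax.value.email": "john@example.com",
    "openid.ax.count.hackergotchi": "0",
}
attributes = parse_fetch_response(response)
```

With this shape, an attribute that was not returned at all is simply missing from the result, while a zero-valued attribute maps to an empty list.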

Storing Attributes

Using the Attribute Exchange fetch request, it is possible to outsource management of pretty much all the user’s profile information to the OP. That said, the user will still need to update their profile data occasionally. Telling them to go to their OP to change things and then log in again is not particularly user friendly though.

Using the store request, the RP can let the user update their profile on site and then transfer the changes back to the OP. Like the fetch request, a store request is performed as part of an OpenID authentication request. The additional request fields are pretty much identical to those of a fetch response, except that openid.ax.mode is set to “store_request”.

In the positive authentication response, the RP can see whether the data was successfully stored by checking the openid.ax.mode response field. If the data was stored, then it will be set to “store_response_success”. If the data was not stored it will be set to “store_response_failure” and an error message may be found in openid.ax.error.
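
That check is small enough to sketch directly. The store_outcome helper is a hypothetical name, and it assumes the response fields are available as a dictionary.

```python
def store_outcome(fields):
    """Return (stored, error_message) for a store response."""
    mode = fields.get("openid.ax.mode")
    if mode == "store_response_success":
        return True, None
    if mode == "store_response_failure":
        # The error message is optional, so this may be None.
        return False, fields.get("openid.ax.error")
    raise ValueError("not a store response: %r" % (mode,))


ok = store_outcome({"openid.ax.mode": "store_response_success"})
failed = store_outcome({
    "openid.ax.mode": "store_response_failure",
    "openid.ax.error": "quota exceeded",
})
```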

Asynchronous Attribute Updates

One downside of the Simple Registration extension is that it only transferred user details on login. This means that it is only possible to get updates to attribute values by asking the user to log in again. The Attribute Exchange extension provides a way to solve this problem in the form of the openid.ax.update_url request field.

When a “fetch_request” is issued with the openid.ax.update_url field set, a compliant OP will record the following:

  1. the claimed ID and local ID from the authentication request
  2. the list of requested attributes
  3. the update_url value (after verifying that it matches the openid.realm value of the authentication request).
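
The realm check in the last item could look roughly like this. It is a deliberately simplified sketch: it requires the same scheme and host and a path underneath the realm's path, and it does not handle wildcard realms or case differences in host names. The matches_realm name is hypothetical.

```python
from urllib.parse import urlsplit


def matches_realm(update_url, realm):
    """Simplified check that update_url falls under realm (no wildcards)."""
    url, base = urlsplit(update_url), urlsplit(realm)
    if (url.scheme, url.netloc) != (base.scheme, base.netloc):
        return False
    # Compare on directory boundaries so "/openid" does not match "/openid2".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    url_path = url.path if url.path.endswith("/") else url.path + "/"
    return url_path.startswith(base_path)
```

For example, an update URL of https://blog.example.com/openid/update would be accepted for the realm https://blog.example.com/, while a URL on a different host would not.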

The OP will then include openid.ax.update_url in the authentication response as an acknowledgement to the RP. When any of the given attributes are updated the OP will send an unsolicited positive authentication response to the given update URL. This will effectively be the same as the original authentication response (i.e. for the same claimed ID and local ID), but with new values for the changed attributes.

As there is no mention of unsolicited authentication responses in the main OpenID authentication specification, it is worth looking at what checking the RP should do. This includes:

  • Is this OP still authoritative for the claimed ID? This is checked by performing discovery on the claimed ID and verifying that it results in the same server URL and local ID as given in the response.
  • Did the message come from the OP? As with a standard response, there should be a signature for the fields. Since the OP does not know what association to use for the signature, a new private association will be used. By issuing a “check_authentication” request to the OP, the RP can verify that the message originated from the OP.

If these checks fail the RP should respond with a 404 HTTP error code, which tells the OP to stop sending updates. If the message is valid, the RP can update the user’s profile data.
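
Put together, an RP's update handler might be sketched as follows. The discover and check_authentication callables are hypothetical hooks standing in for whatever the RP's OpenID library provides, and the field names are the ones I would expect in an OpenID 2.0 positive assertion.

```python
def handle_update(fields, discover, check_authentication):
    """Return the HTTP status code to send for an unsolicited update.

    discover(claimed_id) -> (server_url, local_id) and
    check_authentication(server_url, fields) -> bool are hypothetical
    callbacks supplied by the RP.
    """
    claimed_id = fields.get("openid.claimed_id")
    asserted = (fields.get("openid.op_endpoint"), fields.get("openid.identity"))
    if claimed_id is None or discover(claimed_id) != asserted:
        return 404  # this OP is not authoritative for the claimed ID
    if not check_authentication(asserted[0], fields):
        return 404  # the OP did not confirm the signature
    return 200  # safe to update the stored profile data


update = {
    "openid.claimed_id": "https://user.example.org/",
    "openid.identity": "https://op.example.com/user",
    "openid.op_endpoint": "https://op.example.com/server",
}
good = handle_update(
    update,
    discover=lambda cid: ("https://op.example.com/server",
                          "https://op.example.com/user"),
    check_authentication=lambda server, f: True,
)
moved = handle_update(
    update,
    discover=lambda cid: ("https://other.example.net/server",
                          "https://other.example.net/user"),
    check_authentication=lambda server, f: True,
)
```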

Caveats

While the Attribute Exchange extension provides significant features above those provided by Simple Registration, it still has its limitations:

  1. Any attribute values provided to the RP are self-asserted.
  2. Related to the above, there is no way for a third party to make assertions about attribute values.

For (1), the solution is to perform the same level of verification on the attribute value as if the user had entered it directly. So an OpenID enabled mailing list manager should verify the email address provided by attribute exchange before subscribing the user. In contrast, an OpenID enabled shop probably doesn’t need to do further verification of the user’s shipping address (since it is in the user’s best interest to provide correct information).

The exception to this rule is when there is some other trust relationship between the OP and RP. For instance, if the RP knows that the OP will only send an email address if it has first been validated, then it may decide to trust the email address without performing its own validation checks. This is most likely to be useful in closed systems that happen to be using OpenID for single sign-on.