urlparse considered harmful
Over the weekend, I spent a number of hours tracking down a bug caused by the cache in the Python urlparse module. The problem has already been reported as Python bug 1313119, but has not been fixed yet.
First a bit of background. The urlparse module does what you’d expect and parses a URL into its components:
>>> from urlparse import urlparse
>>> urlparse('http://www.gnome.org/')
('http', 'www.gnome.org', '/', '', '', '')
As well as accepting byte strings (which you’d be using at the HTTP protocol level), it also accepts Unicode strings (which you’d be using at the HTML or XML content level):
>>> urlparse(u'http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')
As the result is immutable, urlparse implements a cache of up to 20 previous results. Unfortunately, the cache does not distinguish between byte strings and Unicode strings, so parsing a byte string may return unicode components if the result is in the cache:
>>> urlparse('http://www.ubuntu.com/')
(u'http', u'www.ubuntu.com', u'/', '', '', '')
When you combine this with Python’s automatic promotion of byte strings to unicode when the two are concatenated, it can really screw things up when you do want to work with byte strings. If you hit such a problem, the code in front of you may all look correct: the problem was introduced up to 20 urlparse calls earlier. Even if your own code never passes in Unicode strings, one of the libraries you use might be doing so.
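If you cannot avoid the module, one defensive measure is to check the result types yourself, so that a cached result of the wrong string type blows up immediately instead of twenty calls later. The following is only a sketch: parse_bytes_url is a hypothetical wrapper of my own, not part of the module, and the try/except import covers both the Python 2 urlparse module and its Python 3 home, urllib.parse:

```python
try:
    from urlparse import urlparse        # Python 2 location
except ImportError:
    from urllib.parse import urlparse    # Python 3 location


def parse_bytes_url(url):
    """Parse url, rejecting any component whose string type differs
    from the input's (e.g. a unicode result served from the cache
    for a byte-string argument)."""
    parts = urlparse(url)
    for part in parts:
        # Empty components are shared '' constants, so only check
        # the non-empty ones.
        if part and type(part) is not type(url):
            raise TypeError('urlparse returned a %s component for a '
                            '%s URL' % (type(part), type(url)))
    return parts
```

With a wrapper like this, a poisoned cache entry turns a silent change of string type into an immediate TypeError at the call site, which is far easier to track down.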
The problem affects more than just the urlparse function. The urljoin function from the same module is also affected since it uses urlparse internally:
>>> from urlparse import urljoin
>>> urljoin('http://www.ubuntu.com/', '/news')
u'http://www.ubuntu.com/news'
It seems safest to avoid the module altogether if possible, or at least until the underlying bug is fixed.
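Another stopgap, if you must keep using the module, is to flush its cache before every parse. The module ships a clear_cache() helper (present in both the Python 2 urlparse module and Python 3's urllib.parse, though not part of the documented API); the wrapper name below is my own:

```python
try:
    from urlparse import urlparse, clear_cache       # Python 2 location
except ImportError:
    from urllib.parse import urlparse, clear_cache   # Python 3 location


def urlparse_uncached(url):
    # Empty the module's internal result cache first, so a result of
    # the wrong string type left behind by an earlier caller (possibly
    # a library you don't control) can never be returned here.
    clear_cache()
    return urlparse(url)
```

The obvious cost is that clearing the cache defeats it for everybody, so code that parses the same URLs over and over will get a little slower, but at least the results will have the string type you asked for.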