urlencode and urldecode in sh

This is a fun piece of shell I thought I’d share. For gnome-doc-tool, I need to convert file paths into URLs and back. That means urlencoding and urldecoding them. I searched around and found a few solutions, mostly using a few dozen lines of awk. Now, I’ve been known to write some crazy stuff in awk (like an RNG compact syntax parser), but this seemed like too much work for a simple problem.

Then I remembered printf(1). It can do all the work of converting characters into hex byte representations and back. All you need to write is a loop to iterate over the string.
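As a quick illustration of the two conversions printf can do, here is a minimal sketch. The helper names char_to_hex and hex_to_char are made up for this example and are not part of the functions in this post. The second helper goes through an octal escape, on the assumption that only \ooo escapes can be relied on across printf implementations.

```sh
# Character to hex: a leading single quote in the argument makes printf
# use the numeric value of the character that follows it.
char_to_hex() {
    printf '%02X' "'$1"
}

# Hex to character: convert the hex pair to three octal digits first,
# then let printf expand the resulting \ooo escape in its format string.
hex_to_char() {
    printf "\\$(printf '%03o' "0x$1")"
}
```

With these, char_to_hex 'A' prints 41 and hex_to_char 41 prints A.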

# This is important to make sure string manipulation is handled
# byte-by-byte.
export LANG=C


urlencode() {
    arg="$1"
    i="0"
    while [ "$i" -lt ${#arg} ]; do
        c=${arg:$i:1}
        if echo "$c" | grep -q '[a-zA-Z/:_\.\-]'; then
            echo -n "$c"
        else
            echo -n "%"
            printf "%X" "'$c'"
        fi
        i=$((i+1))
    done
}

urldecode() {
    arg="$1"
    i="0"
    while [ "$i" -lt ${#arg} ]; do
        c0=${arg:$i:1}
        if [ "x$c0" = "x%" ]; then
            c1=${arg:$((i+1)):1}
            c2=${arg:$((i+2)):1}
            printf "\x$c1$c2"
            i=$((i+3))
        else
            echo -n "$c0"
            i=$((i+1))
        fi
    done
}

That’s it. If you use these functions on potentially garbage input, you might want to add some error checking. In particular, the decoder should probably check that there are two more characters, and that they are valid hex digits.
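One way to add that error checking might look like the sketch below. The name urldecode_checked is hypothetical. It is also a substitution rather than the decoder above with checks bolted on: it peels characters off the front of the string with ${var#?} instead of ${arg:$i:1}, and it emits each byte through an octal escape instead of \x. It assumes the locale is C so that ? matches a single byte, and it sets LC_ALL instead of LANG because LC_ALL takes precedence when both are set.

```sh
# Assumption: byte-by-byte processing, as in the original functions.
export LC_ALL=C

urldecode_checked() {
    rest=$1
    while [ -n "$rest" ]; do
        case $rest in
            %[0-9A-Fa-f][0-9A-Fa-f]*)
                hex=${rest#?}            # drop the leading %
                hex=${hex%"${hex#??}"}   # keep only the two hex digits
                # Hex pair -> octal escape -> raw byte.
                printf "\\$(printf '%03o' "0x$hex")"
                rest=${rest#???}
                ;;
            %*)
                # A % that is not followed by two hex digits.
                echo "urldecode_checked: malformed escape near '$rest'" >&2
                return 1
                ;;
            *)
                c=${rest%"${rest#?}"}    # first character of rest
                printf '%s' "$c"
                rest=${rest#?}
                ;;
        esac
    done
}
```

For example, urldecode_checked '/tmp/my%20file.txt' prints /tmp/my file.txt, while urldecode_checked '50%' prints 50, complains on stderr, and returns 1. Anything decoded before the bad escape has already been written by then, so callers that care should check the exit status.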

12 thoughts on “urlencode and urldecode in sh”

James Henstridge says:

2009-12-06 at 2:01

… but is it any faster than using awk?

Only needing one fork/exec might tip the scales in awk’s favour.
Stef70 says:

2009-12-06 at 2:20

Nice. I’ll keep them. Could be useful one day. I will also remember the trick of using printf %X with a ‘quoted’ character to obtain an ascii code.

If you are using bash, you can use the regexp comparison =~ to remove the call to the non built-in function grep

if [[ "$c" =~ [a-zA-Z/:_\.\-] ]] ; then
…

You should also accept the digits 0-9 in URLs.
Stef70 says:

2009-12-06 at 3:25

For the fun! Here is a version of urlencode making full use of bash regex.

urlencode() {
    local arg
    arg="$1"
    while [[ "$arg" =~ ^([0-9a-zA-Z/:_\.\-]*)([^0-9a-zA-Z/:_\.\-])(.*) ]] ; do
        echo -n "${BASH_REMATCH[1]}"
        printf "%%%X" "'${BASH_REMATCH[2]}'"
        arg="${BASH_REMATCH[3]}"
    done
    # the remaining part
    echo -n "$arg"
}
anon says:

2009-12-06 at 3:38

In contrast to a solution with (n)awk this code is non-portable and will most likely only work with bash and GNU userland.

* echo -n is platform dependent; why not use printf, which is specified by POSIX and portable?
* parameter expansion like ${var:i:j} is not in POSIX; it only works with bash/ksh
* \x works only with GNU printf
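For illustration, an encoder that avoids all three constructs might look like the sketch below. The name urlencode_posix is made up, and this is only one way to do it, not the post's function. It uses printf for all output, ${var#?} instead of ${var:i:j}, and a case pattern instead of grep. It assumes LC_ALL=C so that ? matches one byte and the ranges cover only ASCII. It also includes the digits 0-9 and pads with %02X so that bytes below 0x10 still produce two hex digits.

```sh
# Assumption: C locale, so strings are handled byte by byte.
export LC_ALL=C

urlencode_posix() {
    rest=$1
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}   # first character of rest
        rest=${rest#?}          # everything after it
        case $c in
            [a-zA-Z0-9/:_.-])
                printf '%s' "$c"
                ;;
            *)
                # Leading quote: printf uses the character's numeric value.
                printf '%%%02X' "'$c"
                ;;
        esac
    done
}
```

For example, urlencode_posix '/tmp/my file.txt' prints /tmp/my%20file.txt.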
Tethys says:

2009-12-06 at 4:54

You can remove a couple of the forks, as the shell has much of what you need built in (for bash, at least). In urldecode(), the printf line can be changed to:

eval echo -n "\$'\x$c1$c2'"

In urlencode(), the grep can be removed, since bash supports regexp matching:

if [[ "$c" =~ '[a-zA-Z/:_\.\-]' ]]; then

I can’t work out how to ditch the printf that gets from a character to its hex equivalent, though. The shell supports some limited base conversions. You can go from hex to dec with $((0xNN)), for example, but I can’t see a way to go from dec to hex, let alone going from a character to its hex equivalent, short of a lookup table (bash supports associative arrays, so it would be easy to implement, albeit somewhat clumsy).
Daenyth says:

2009-12-06 at 8:04

Does this work with multibyte characters in urls? At a glance it seems like it would break for them.
Paolo Bonzini says:

2009-12-06 at 8:05

No forks here, though… looks like it’s only builtins.
Michael R. Head says:

2009-12-06 at 10:33

There was a superuser.com question about this a little while ago. I posted your solution and a link to your blog as an answer (hope you don’t mind): http://superuser.com/questions/76612/linux-urldecode-filename

— mike
shaunm says:

2009-12-06 at 12:59

Daenyth, it works correctly because of “export LANG=C”. This ensures that the strings are processed byte-by-byte, so the escape sequences are always three characters: %XX. For URIs being sent to a remote system, I make no guarantee about the encodings working correctly. But for local files, it does correctly encode the on-disk filename encoding.
shaunm says:

2009-12-06 at 13:10

Tethys, Stef70: Nice tricks for people who only care about bash. For my purposes, I need to stay reasonably portable. Oh, and Stef70, thanks for pointing out 0-9. Don’t know why that slipped through my brain.

And anon, thanks for bringing the portability issues to my attention. Those are very important to me.
shaunm says:

2009-12-06 at 13:18

James, wow. I just did a little test. I sort of expected to see a negligible speed difference in favor of awk. But it was not negligible at all. For 1000 iterations of encoding the string “!@#$%”, mine took 7.6s, and the awk solution I found took 2.1s.
shaunm says:

2009-12-06 at 13:32

Using Stef70’s bash-dependent version takes only 0.6s. Beats the pants off of anything else. Yay for stagnating “standards” holding back superior implementations.

