Using moderated messages to train the Bayes classifier

This week I took a look at the moderation queue of a GNOME mailing list. There were loads of messages in it. There is a moderation team that looks at these queues and cleans them up by discarding the spam and accepting the valid messages. The moderation queue of the mailing list I looked at had lots of similar spam messages spread over various days. To catch newer types of spam, the SpamAssassin rules are updated every day/hour (forgot how often). These rules include the ones from SARE. There is a big anti-spam gap in this, as the new rules might not catch the things the moderators have already classified as spam/ham.

To make the process more intelligent, I’ve added a patch to our Mailman package that lets moderators use the discarded/accepted messages to train the Bayes classifier used by SpamAssassin. The way it works is hackish, but very simple to implement: the patch forwards all discarded and accepted messages to a special user. This user has a ~/.procmailrc file that divides these messages into two maildir folders. A script run from cron then trains sa-learn on the spam and ham folders. sa-learn understands directories, which avoids having to start sa-learn once per spam/ham message.
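The cron step could look roughly like the sketch below. This is not the script that actually runs on the server; the maildir locations (~/Maildir/.spam and ~/Maildir/.ham) and the function names are assumptions made up for illustration. The only things taken from SpamAssassin itself are the `--spam` and `--ham` options of sa-learn, each followed by a path.

```python
import subprocess
from pathlib import Path

# Hypothetical maildir folders, filled by the special user's ~/.procmailrc.
MAILDIR = Path.home() / "Maildir"
FOLDERS = {"--spam": MAILDIR / ".spam", "--ham": MAILDIR / ".ham"}


def build_commands(folders=FOLDERS):
    """Return one sa-learn command line per folder (spam first, then ham)."""
    return [["sa-learn", flag, str(path)] for flag, path in folders.items()]


def train(runner=subprocess.run):
    """Start sa-learn once per folder instead of once per message."""
    for command in build_commands():
        # check=True makes a failing sa-learn raise, so cron mails the error.
        runner(command, check=True)
```

A cron job would then only need to call `train()`. Passing a different `runner` makes it possible to try the logic without sa-learn installed.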

Hopefully this will result in fewer spam messages for the moderators to classify.

A screenshot of the new functionality:

New mailman version

I’ve upgraded the Mailman version on mail.gnome.org to 2.1.10. In this version I redid the post-only patch. Basically, if the default action for new subscribers is to moderate them (done on e.g. metacity-devel-list), then members subscribed to post-only won’t be automatically accepted. In 2.1.10 Mailman supports something like post-only by default (see NEWS). I found it easier to patch this logic in than to change existing lists and to ensure new lists would get the post-only setting as well.

If you see any problems, please email gnome-sysadmin. Note that yesterday’s email backlog wasn’t caused by the Mailman upgrade. I waited until it cleared.

SSH public keys

A while ago I added some basic SSH public key checking to Mango. This was for two reasons:

  • Avoid email back and forth in case of copy/paste error
  • To ensure the key is long enough

The key length checking was mostly a hack. I did it by checking whether the number of bytes was larger than the number of bytes in a test SSH public key. I never actually knew how many bits the key had. The added benefit was that I figured out how the fingerprint is generated. I always wanted to determine the key length, but couldn’t be bothered. Of course, I could’ve used ssh-keygen, but I don’t like to start another process. A full check isn’t actually needed, as the person providing the key wants it to work. We’ll probably get a mail if a public key is broken on purpose. The exception is the key length, as we don’t want keys that are too short (security issue).

Today I figured out how to determine the key length for RSA and DSA public keys. I’ll start with an example public key (wrapped for readability):

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAP8zwqYE675bpnYzui0pLNd2XyoB+
v4RtlK2QJ6+42w3VWREbDfeeUmvenLBzdcffs602WOuWB1DrbhjEv4CbABH/u
O89IMlC4h62wel7BfiQqEq6yKW0B+yqQxIsBQPhu8ID0gXrt0uhlPaHkqD1XR
WM9ywr5UP1K51cTPRZu8xQVtpCDMgppa/FwZTKY/+l3HXvu01/NAaNGMPOD3y
neIturzKi3x4f5Id65V1KD70B5YiCiJxFSevOcPx3yYYy+NQN52EBGLf76a78
1S2MiPcHxhoQtG8EPfAhWN3MmOnEd4iuy2IzHdyAK+LCp5Qtyy3mbKTBKSKQb
vrsm8jrjE= olav@bkor.dhs.org

The ssh-rsa is the key type; ssh-rsa for RSA and ssh-dss for DSA. After that there is a space and then the data (easily recognizable as base64 encoded). After the base64 data there is another space followed by a comment (its contents don’t matter to SSH; you can change it with a text editor). The fingerprint of a key is nothing more than the MD5 hash of the base64-decoded data. The MD5 hash of the example public key is 6f8c83c826ee51535a813756ff1bc9b5. The ssh-keygen program shows this with a colon between each pair of hex digits.
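The fingerprint calculation described above can be sketched in a few lines of Python. This is only an illustration, not the PHP code used in Mango; the function name is made up, and it assumes the key has been unwrapped onto a single line.

```python
import base64
import hashlib


def ssh_fingerprint(pubkey_line):
    """Return the MD5 fingerprint of a one-line SSH public key, with colons."""
    # Fields are: key type, base64 data, optional comment.
    fields = pubkey_line.split()
    blob = base64.b64decode(fields[1])
    digest = hashlib.md5(blob).hexdigest()
    # Insert a colon between each pair of hex digits, like ssh-keygen does.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Since the comment is not part of the base64 data, changing it does not change the fingerprint.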

If you’ve compared a few public SSH keys, you probably noticed that the start of the base64 data is always the same (for the same key type). This is because the data itself again contains the key type (‘ssh-rsa’). Each string in the data is preceded by a big endian number (4 bytes) giving the number of bytes of the string that follows. A hex editor will show 00 00 00 07 for the first 4 bytes. After these 4 bytes follows either ‘ssh-rsa’ or ‘ssh-dss’. The rest of the data is key type specific.
For RSA, the key type is followed by 2 strings encoded using the same method. These strings actually represent ‘bignum’s. However, I don’t care about that. For DSA, 4 strings (bignums) follow after the key type, also encoded using the same method. For RSA you’ll want the 2nd bignum (the modulus). For DSA the 1st bignum (the prime p).
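Splitting the decoded data into its length-prefixed strings could look like the sketch below. The function name is hypothetical, and the bounds checks are my own addition to reject truncated copy/paste data.

```python
import struct


def read_strings(blob):
    """Split decoded public key data into its length-prefixed strings."""
    strings = []
    offset = 0
    while offset < len(blob):
        if len(blob) - offset < 4:
            raise ValueError("truncated length field")
        # 4-byte big endian length, e.g. 00 00 00 07 in front of 'ssh-rsa'.
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("string runs past the end of the data")
        strings.append(blob[offset:offset + length])
        offset += length
    return strings
```

For an RSA key this should return three items (the key type plus 2 bignums), for a DSA key five (the key type plus 4 bignums).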

When you have such a bignum you’re very close to determining the key length. It is mostly just determining the number of bits needed for the bignum. You could do this by determining the number of bytes used by the bignum and multiplying it by 8. However, you have to deduct a few bits, as the first byte can contain leading zero bits that the bignum doesn’t actually need. For example, if the following are the first two bytes (shown as binary) out of the total 255 bytes used for the bignum:

00110011 11000010

You’ll note that the first two 0’s aren’t actually needed. So instead of taking 255*8 (2040) as the key length, you should subtract those 2 bits, resulting in a key length of 2038. This is what ssh-keygen will give you if you try the example public SSH key file shown above (make sure to unwrap the lines and remove the spaces in the base64 encoded data). Looking back, it actually is pretty easy.
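Putting the pieces together, a minimal Python sketch of the key length calculation follows. The function names are made up and this is not the Mango code. It relies on Python’s `int.bit_length()`, which skips leading zero bits, and therefore also a whole leading zero byte if the bignum has one.

```python
import base64
import struct


def _read_strings(blob):
    """Split the decoded data into strings with a 4-byte big endian length."""
    strings, offset = [], 0
    while offset + 4 <= len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        strings.append(blob[offset + 4:offset + 4 + length])
        offset += 4 + length
    return strings


def key_bits(pubkey_line):
    """Return the key length in bits of a one-line RSA or DSA public key."""
    fields = pubkey_line.split()
    parts = _read_strings(base64.b64decode(fields[1]))
    key_type = parts[0].decode("ascii")
    if key_type == "ssh-rsa":
        bignum = parts[2]  # 2nd bignum after the key type
    elif key_type == "ssh-dss":
        bignum = parts[1]  # 1st bignum after the key type
    else:
        raise ValueError("unsupported key type: " + key_type)
    # bit_length() ignores leading zero bits, so no manual subtraction needed.
    return int.from_bytes(bignum, "big").bit_length()
```

A 255-byte bignum starting with 00110011 gives 255*8 - 2 = 2038 here, matching the calculation above.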

Oh, and the reason why I want to know the exact key length is to add a check against blacklisted keys. For this I need to know the actual length of a key. I figured the above out partly by myself, partly by reading the python-paramiko source (unfortunately I overlooked the public key reading part) and partly by trying to understand the openssh code. Running ssh-keygen would’ve been loads easier, but IMO nasty (especially when someone has multiple keys) and also not as interesting. Further, if you use Python instead of PHP (blergh!), just use python-paramiko.

SVN.gnome.org downtime! Sun 8 Jun 7:00 – 10:00 UTC

As you could’ve read in my previous post, I plan to upgrade the svn.gnome.org server on:

Sun 8 Jun from 7:00 – 10:00 UTC

This means that svn.gnome.org is NOT available at that time.

I suggest doing one of the following:

  1. Use the Bzr mirror. I suggest reading the instructions on the wiki.
  2. Figure out Git. See the Git with GNOME instructions. Don’t forget: with Git you’ll need to do some work before the downtime (convert an SVN repository into a git-svn repository).

Note: Please *wait* with committing until *either* 10:00 UTC (hopefully;) *or* when it is announced as back up (#gnome-hackers on GIMPNet and devel-announce-list). This is because I obviously will need to test it after it has been upgraded.

For those using a DVCS, I’d love to see a repeat of http://tinyurl.com/5yvgc3.

PS: Please subscribe to devel-announce-list.

Update: It is 8 Jun, not 9 (subject was correct, just not the text itself).

Deciding when to upgrade svn.gnome.org

The server behind svn.gnome.org still runs the previous Ubuntu LTS (Dapper / 6.06). I want to upgrade this to the latest Ubuntu LTS, this being Hardy / 8.04. The upgrade itself should not take more than 30min, but the downtime will be longer than that (rsync’ing everything to another machine). I’ll set up another machine to handle SVN in case the distro upgrade fails in unexpected ways.

I’m currently thinking of Sat 7 or Sun 8 Jun 2008. I prefer Sunday mornings (CEST / UTC +2) as hopefully not too many people are online. Further, the most important users of svn.gnome.org are developers and I expect less usage in the weekend. The weekend is more popular for translators though. It also isn’t just before an important release.

The above date is not fixed; I still have to check if there is someone standing by with physical access.

Anyone think above is a bad idea? Better suggestions regarding when to schedule this? Note: During the week it has to be in the evening (UTC +2).

Note: After the upgrade, the repository format has to be changed, but I’ll do that later. It shouldn’t be more than a few minutes per repository (due to converting it twice, first the whole repository, then disallowing commits, then doing the commits that have been made when the conversion was running). The repository format should allow things like svnsync, plus repository size should be smaller. See http://subversion.tigris.org/svn_1.4_releasenotes.html.

Page loading problems

An answer regarding page loading problems:

Gave up on hoping that Gnome would serve my blog sanely – apparently asking for it is a denial of service attack. Duplicated it at http://www.go-oo.org/~michael/blog/index.atom.

This is not what I said or meant on IRC. There is a DDoS going on. Nothing too heavy. However, it really slows down the database, slowing Bugzilla down to a crawl. Due to the type of DDoS it is almost impossible to block (too many IPs, etc), and the workaround results in unintended side-effects. Initially I thought I could add the workaround pretty quickly. However, it was too difficult to do at UDS. I noted the delay in #sysadmin, so if you had asked there, someone hopefully would’ve repeated it in case I wasn’t around. I did inform the sysadmins (mailing list and IRC) regarding the config change causing these page load issues. To avoid abuse (often if you make it known, it seems others see a need to repeat/join) I did not inform anyone else. I do hope the DDoS doesn’t morph into something I cannot block.

Ubuntu Developer Summit

Canonical invited various people to attend FOSS Camp, others just for the Ubuntu Developer Summit. Some even had an invitation for both. I received an invite to attend the Ubuntu Developer Summit, which was held 19-23 May in Prague, Czech Republic. I’ve been to Prague twice before, but it was way too long ago for me to remember any specifics, apart from the Clock tower in the old city centre.
Apart from me, Vincent Untz, Andre Klapper, Christian Kellner, Ryan Lortie, Pedro Villavicencio Garrido, Sebastien Bacher, Murray Cumming (only a few days), David Zeuthen, Lennart Poettering and others also attended the Summit. During lunch we usually sat at the GNOME table. I really liked the food in the Czech Republic; my only complaint is that sometimes it was way too much (which on one hand doesn’t matter, but on the other hand makes it hard to stop if it is really good). Plus they could ease up on the sauce (too much). My roommate was Reed Loden, known firstly for being another Bugzilla developer, but he is also a Mozilla sysadmin. We weren’t the only Bugzilla developers who attended, as Christian Reis was also there. Unfortunately I did not meet him (although he should’ve put in some effort and also worn his Bugzilla tshirt 😉 ).

Aside from Ubuntu things, I also talked to various people regarding distributed version control systems. This included a few Bazaar developers, Reed (mostly regarding usage of hg at Mozilla) and a few Git users such as Christian Kellner. This was to prepare for the BoF at GUADEC.
Vincent discussed some (private) release team things, first with Andre, then with me. It hasn’t been discussed yet with the rest of the r-t. Face-to-face is so much faster than IRC/mailing lists. Fortunately the lack of openness (like e.g. public r-t archives + meetings) did not matter in this case.

Bought a second hand (but still fairly new) laptop a few weeks before UDS. As I find my desktop way better than some laptop, I did not pick it up until just before UDS. That meant the first thing I did at UDS was install Ubuntu. Sound unfortunately did not work; this was fixed later in the week (by nothing other than installing new updates). Btw, the ‘system restart required’ notice should tell me why it wants to restart, so I can understand the impact of not doing it right away and possibly work around it.
Having a laptop is pretty good for the downtime (either uninteresting talks or just breaks). Also, only having 1280Xsomething wasn’t as annoying as I assumed beforehand.

Now on to the bits and pieces I remember from UDS. Specifics might be (unintentionally) incorrect, feel free to leave a constructive comment for that. I have a feeling that this might be boring, too long, etc… but perhaps interesting for people who want to know what it is like at UDS.

The talks were divided into various topics such as qa, community, server, etc. Each topic was usually held in the same room. The desktop topics were held in a room where the air conditioning was working like mad; it was way too cold in there. One good thing was the availability of two extra rooms, one of which I used for a Bazaar discussion. As noted in the wrapup by someone, it would’ve been better if those rooms had shown up in the schedule, as I almost never looked at the white board. Ok, I did see a KDE group hug for 2 hours, which had me wondering for days.. 2 hour long hug?!? Maybe the KDE attendees were crazier than the GNOME cabal, at least I’ll try and do better 😉

On Monday morning at the ‘OpenChange Exchange integration’ talk, they introduced the concept of bug 0. This is in relation to bug 1 in Launchpad, which has as its goal to have more market share than Microsoft (for free software, not just Ubuntu). The bug 0 concept is about a possible decline in server market share. It was suggested that the reason behind the increase of Microsoft servers was the integration of various Microsoft components such as Active Directory (LDAP), Calendaring, Mail, etc. To overcome the increase in Microsoft servers, something has to be delivered which provides a similar experience to what Microsoft currently has. The talk suggested a combination of Samba, openldap, plus something which deals with Exchange (calendaring, etc). With the new protocol (etc) documentation, maybe we’ll have some good Exchange replacement available under Linux (finally!). Oh, and I mean something which integrates with both Outlook as well as Evolution.
Don’t recall much more from Monday. Some talks I was only physically around, as I was busy with the manual labour of replacing SSH keys (btw: that SSH key replacement was not just me. Lots of work done by Kjartan Maraas, Kurt von Finck and Christian Rose, way before I had time).

Don’t recall much from Tuesday apart from a Music Experience review. Celeste was commenting on various usability issues in rhythmbox. Oh, and SSH key replacements. At one point (maybe Tue, perhaps some other day) I enhanced Mango to automatically inform users when some sysadmin / accounts person adds a new SSH key to an account. Shows all SSH keys on the account, not just the newly added one. Reed was partly interested in a system like Mango for Mozilla. Hopefully he’ll either base something off Mango, or make something new which is usable for GNOME.

On Wednesday morning I attended automated desktop testing. Automated testing is hard: there are various tools which generate testing scripts, but the scripts are usually very detailed and break too often, resulting in lots of effort to (re)write such tests/scripts. Errors in such scripts are bad, as they lower the trust that developers as well as qa people have in them. In the talk someone explained a new testing tool which worked by using something like VNC (comparing graphics on the screen). It seems it will not correctly handle changes made to the theme, nor alpha (transparent) windows/themes.
Don’t remember, but guessing that on Wednesday afternoon (12.00) Celeste from KDE/openusability.org asked for some meta package which would include all tools needed to test usability, including remotely. This as basically everything is available, just not in one package. I’m interested if remote usability testing is possible. From what I remember, one of the most basic things that you have to do is to make the user comfortable. This aside from the way you ask questions. I wonder if a user can be made comfortable remotely. Hopefully progress can be followed via blogposts (Planet KDE). Note that remotely wasn’t the only thing covered. It was mostly to get friends and family to help hold a usability test, guided by an expert (e.g. Celeste).

Started Thursday (IIRC) morning off with a QA session. Suggested a few things which would get more triagers to help Ubuntu (by explaining what to my understanding drives triagers). I’ll refrain from being too specific as I don’t want too much competition and influence the strength of the excellent GNOME bugsquad team 😉
Went to a session regarding a common printing dialog for KDE as well as GNOME. I expected some proposal meant for ISVs, to allow non-KDE/GNOME software to show the appropriate dialog depending on which desktop environment the program is running under. However, it rather was about replacing the KDE and GNOME dialogs with just one dialog. I’m not sure about the feasibility and usefulness.
After that session I attended the input hotplug talk. This was about allowing e.g. two mouse pointers and multiple keyboards. It seems that gtk+ wouldn’t deal properly with two mouse enter events (no idea about Qt). It would also still be possible to have two mice control just one pointer on the screen. There was some discussion in which I learned that apparently you can have multiple (Wacom) tablet pens, each assigned to a different color. Pretty cool.

Every day after lunch various talks were given in the big room (Plenary). One interesting item was Wubi, which is an Ubuntu installer for Windows. On Thursday the Wubi developers started off by apologizing for the Windows machine (IMO nuts), then by having people raise their hands if they knew absolutely nothing about Wubi. From the people who raised their hands someone was picked to do the Wubi installation. According to the developers Wubi should work with all Ubuntu derivatives (Kubuntu, etc), and IIRC even different distributions.
There are some scripts available to increase the size of the Wubi Linux file, and even to transform the Wubi file into a real partition (or perhaps this is still planned, not sure). The nice thing about Wubi is that all the installer questions are already asked in Windows. The entire installation of Ubuntu itself is automatic.
In the afternoon I went to Client Drawn Decorations by Mirco Muller. Basically not for the talk, I just wanted to see bling. There was some discussion regarding the feasibility of this. I didn’t follow the whole discussion as discussion != bling.
Followed PackageKit after that talk. There was an agreement that all packages requiring stdin would be fixed, plus some discussion regarding the integration of debconf and license agreements. Richard Hughes called in for that talk using SIP. Worked pretty well overall; the only trouble was during one of his answers (he was breaking up).

On Friday I attended a talk about System-wide preference. Desrt explained the possibilities of gconf (you can have system defaults in gconf, just see gconf-editor). That as well as the plan for dconf (writing requires dbus, reading does not). Readonly dconf has very few deps (could be used for some stage in the bootloader, forget exactly where).

A party had been planned for Friday evening, mostly to continue celebrating the coolness of Daniel Holbach. Claire understood the laziness of the crowd, as 3 buses had been arranged. We went to a club called XT3. The party started off with an Ubuntu band. After they played all the numbers they knew (I would’ve enjoyed it if they knew more songs), some DJ started. After that DJ stopped, Daniel Holbach and someone else took over. From what I noticed, enough videos have been made. Hopefully you can get a good impression by searching for these videos on Youtube and Google.
At 1.00 AM they arranged a bus to bring everyone back to the hotel. Together with Andre Klapper I decided to stay and party on (after the bus left we noticed we had been almost the only ones to make that choice). We stayed until they kicked us out of the club. This might sound awesome, but unfortunately it happened around 1:30 AM. After that we decided to partly walk back to the hotel and take public transport as soon as we had had enough. We eventually arrived at the hotel at around 3:00 AM (taking a long time to get there was on purpose). Learned later that almost everyone was drunk there, according to a study done by the Shuttleworth Research Centre. Did not notice that such research was taking place; must have been all those people filming the party.
In the morning I woke and got up around 9.30, leaving just enough time to get down and eat some breakfast. I even saw Andre there. Walked to the subway station together with Christian Kellner, Pedro Villavicencio Garrido and Andre Klapper.

After landing at Amsterdam (Schiphol), I took a train which passed Amsterdam Arena. This was around the same time as the start of some sing-along concert. Meaning: fully packed train. I thought they were all there to accompany me 🙁

Bad things during UDS:

  • No icecream deathmatch
  • Unsuccessful in making the stewards laugh while explaining the security stuff. My count of successful laughs still stands at 2. I did hate them for cheating on the way back, as the instructions consisted of a movie.
  • Didn’t always understand what a talk would be about. This seemed to be mostly caused by my non-use of Debian/Ubuntu.
  • Way too much Mentos was available.
  • Vuntz had loads of green tshirts at his home, but brought none to UDS. Evil!
  • Forgot to make time for a GNOME release team interview (video)
  • Seeing the clock show 20+ degrees Celsius.. but that was the Netherlands, not Prague (rain, cold, etc). It seems that you can only show another weather location in the panel by changing the (system?) timezone, which I didn’t want to do.
  • Something to do with an ambush of out of context quotes, behind the scenes data gathering, etc. Didn’t attend, but consider such behaviour impolite (very difficult to respond to such things without having context or something other than stuff happening months ago). From what I heard it resulted in a very long discussion, instead of being rescheduled to some other time (meaning: not at UDS). Note: I wasn’t there, just my opinion based on hearsay. I’m on Planet GNOME, not Ubuntu, so adding a comment regarding the issue itself is not constructive.
  • Same for some other discussions (appeared to me as bike shed). Thankfully I had my laptop and working wifi.

Other random bits:

  • Reached a ratio of 32 for the KDE 4.1 alpha movies. Not much higher than before UDS, so I stopped seeding that
  • Have UDS t-shirt.
  • Vuntz is too skilled in kicking people from a channel
  • Liked the wrapup. Basically an honest discussion about things to improve to make the Summit better
  • Seems people believe whatever a channel topic says. This caused some harassment (joking!) towards desrt. I’m so going to love /topic 🙂

PS: Canonical asked if people would blog about the event. I’m doing this because I think it is a good idea, to remember all the things from last week and to let GNOME people know what I did there.

Bzr mirror of GNOME

John Carr has set up a Bzr mirror of all GNOME repositories. Details are available on the wiki. Most GNOME modules are available via Launchpad, but that one doesn’t allow you to commit to SVN (IIRC). The mirror by John Carr does allow commits. For this all to work you’ll need the latest Bzr and bzr-svn.

Copy/pasting the instructions:

Usage:
We’ll create a project folder for your module which can house multiple branches. The branches will share revisions to save disk space.

cd ~/
bzr init-repo --rich-root-pack cheese
cd cheese/
bzr branch http://gnome.unrouted.co.uk/cheese/trunk

If you want to get the latest stuff:

cd ~/cheese/trunk
bzr pull

If you need something that hasn’t made it to the mirror yet, you can pull directly from GNOME SVN:

cd ~/cheese/trunk
bzr pull http://svn.gnome.org/svn/cheese/trunk

When you have some changes that are in your trunk branch, but not in SVN, you can push:

cd ~/cheese/trunk
bzr push svn+ssh://username@svn.gnome.org/svn/cheese/trunk

Note: If someone wants to set up a git mirror, contact me. I can grant rsync access to speed up the conversion. Regarding the Bzr mirror: It is all John Carr. For praises and more, contact him.