Using moderated messages to train the bayes classifier

This week I took a look at the moderation queue of a GNOME mailing list. There were loads of messages in it. There is a moderation team who looks at these queues and cleans it up, by discarding the spam and accepting the valid messages. The moderation queue of the mailing list I looked at had lots of similar spam messages over various days. To avoid newer type of spam messages, every day/hour (forgot how often) the Spamassassin rules are updated. These rules includes the ones from Sare. There is a big anti-spam gap in this as the new rules might not catch the things the moderators have classified as spam/ham.

To make the process more intelligent, I’ve added a patch to Mailman to allow moderators to use the discarded/accepted messages to train the Bayes classifier used by Spamassassin. The way it works is hackish, but very simple to implement. I’ve added a patch to our Mailman package which forwards all discarded and accepted messages to a special user. This user has a ~/.procmailrc file to divide these messages in two maildir folders. A script runs via cron to train sa-learn on the spam and ham folders. Sa-learn understands directories, avoiding the need to start sa-learn per spam/ham message.

Hopefully this will result in less spam messages for the moderators to classify.

A screenshot of the new functionality:

New mailman version

I’ve upgraded the Mailman version on mail.gnome.org to 2.1.10. In this version I redid the post-only patch. Basically, if the default action for new subscribers is to moderate them (done on e.g. metacity-devel-list), then members subscribed to post-only won’t be automatically accepted. In 2.1.10 Mailman supports something like post-only by default (see NEWS). I found it easier to patch this logic in than to change existing lists and to ensure new lists would get the post-only setting as well.

If you see any problems, please email gnome-sysadmin. Note that yesterdays email backlog wasn’t caused by the Mailman upgrade. I waited until it cleared.