Using moderated messages to train the bayes classifier

This week I took a look at the moderation queue of a GNOME mailing list. There were loads of messages in it. A moderation team looks at these queues and cleans them up, discarding the spam and accepting the valid messages. The queue I looked at had lots of similar spam messages spread over several days. To catch newer types of spam, the SpamAssassin rules are updated every day or hour (I forgot how often). These rules include the ones from SARE. There is a big gap in this setup, though: the new rules take no account of what the moderators have already classified as spam or ham, so similar spam can keep landing in the queue.

To make the process more intelligent, I’ve patched our Mailman package so that the messages moderators discard or accept are used to train the Bayes classifier used by SpamAssassin. The way it works is hackish, but very simple to implement. The patch forwards all discarded and accepted messages to a special user. This user has a ~/.procmailrc file that sorts these messages into two Maildir folders, one for spam and one for ham. A script run from cron then calls sa-learn on the spam and ham folders. sa-learn accepts directories, which avoids starting it once per message.
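
The cron step can be sketched roughly as follows. This is not the actual script, only a minimal illustration: the folder names (`.spam`, `.ham`), the `MAIL_ROOT` location, and the `train_folder` helper are all assumptions made for the example. It relies only on the `--spam` and `--ham` options of sa-learn and on the fact that sa-learn takes a directory as its target.

```shell
#!/bin/sh
# Minimal sketch of a cron-driven training script.
# All paths and names below are hypothetical, not taken from the real setup.

SA_LEARN="${SA_LEARN:-sa-learn}"          # overridable, e.g. for a dry run
MAIL_ROOT="${MAIL_ROOT:-$HOME/Maildir}"   # assumed home of the two folders

# train_folder CLASS DIR
# Hands a whole directory to sa-learn in a single invocation,
# so sa-learn is not started once per message.
train_folder() {
    class="$1"   # "spam" or "ham"
    dir="$2"
    # Nothing to do if procmail has not created the folder yet.
    [ -d "$dir" ] || return 0
    "$SA_LEARN" "--$class" "$dir"
}

# Discarded messages are assumed to land in .spam, accepted ones in .ham.
train_folder spam "$MAIL_ROOT/.spam/cur"
train_folder ham  "$MAIL_ROOT/.ham/cur"
```

A crontab entry for the special user would then run this script at whatever interval suits the list traffic. Pointing at the `cur` subdirectory is a choice made for this sketch; messages that procmail has just delivered may still sit in `new`, so a real script might need to pass both.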

Hopefully this will result in fewer spam messages for the moderators to classify.

A screenshot of the new functionality:

4 Replies to “Using moderated messages to train the bayes classifier”

  1. I do something similar that doesn’t require a patch. I check the “Preserve messages for the site admin” box before discarding. Then I run this job from cron every hour…

    sed -i "s/\*\*\*SPAM\*\*\* //" /var/lib/mailman/spam/spam* 2> /dev/null
    sa-learn --spam /var/lib/mailman/spam/spam*
    rm /var/lib/mailman/spam/spam* 2> /dev/null

    The sed bit is in there to remove the \*\*\*SPAM\*\*\* tag from the subject, if it’s there, before SA learns from the message.

    Anyway, it works well for me 🙂

  2. Gabriel: But wouldn’t that require a lot of clicks (@gnome.org gets loads of spam)? Also, can you tell discarded messages apart from accepted ones this way?

  3. Yeah, I guess it would require a bunch of clicking. I don’t run very busy lists, so that’s not an issue for me.

    If you accept them, they get sent off; if you discard them and preserve a copy, you have that copy to learn spam from.
