Bogofilter training

Chris Wilkes cwilkes-bf at ladro.com
Fri Mar 5 19:15:39 CET 2004


On Fri, Mar 05, 2004 at 07:32:54PM +0200, jyry at helzinki.net wrote:
> I have been considering the following method based on what I have read.
> 
> 1. Messages are filtered and sorted into INBOX and INBOX.Spam accordingly.
> 2. Users check INBOX and INBOX.Spam and move/copy false-positives and
> false-negatives into INBOX.NotSpam and INBOX.IsSpam accordingly.
> 3. Bogofilter learns from the messages in INBOX.NotSpam and
> INBOX.IsSpam.
> 
> I have a few questions about the above:
> -Is cron the only method to let bogofilter train from the messages in 
> INBOX.NotSpam and INBOX.IsSpam?

I can't think of another portable way of doing this other than via a
cronjob.  You might be able to tail the IMAP logs or use some sort of
famd configuration, but I find cronjobs just fine for this.

What I would do is only operate on emails that have been in .NotSpam and
.IsSpam for over a minute, just in case a large email is still in the
process of being moved into the folder.  You can pick out those files
with "find"'s time tests (provided that you're using maildir-formatted
folders).
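
To illustrate the "over a minute" check, here is a minimal sketch in
Python rather than "find".  The function name, the 60-second default and
the use of each file's mtime are my assumptions; it only relies on the
standard maildir layout with "cur" and "new" subdirectories.

```python
import os
import time


def settled_messages(maildir, min_age_seconds=60, now=None):
    """Return message paths under maildir/cur and maildir/new whose
    modification time is at least min_age_seconds in the past."""
    now = time.time() if now is None else now
    settled = []
    for sub in ("cur", "new"):
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue  # folder not created yet, nothing to scan
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            age = now - os.path.getmtime(path)
            if os.path.isfile(path) and age >= min_age_seconds:
                settled.append(path)
    return settled
```

A cronjob could call settled_messages() on each user's .IsSpam and
.NotSpam folder and hand only the returned paths to bogofilter.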

I have a similar setup to what you list, except I also train on
correctly classified emails.  If an email has been in their Spam or INBOX
for over an hour I assume that it's been correctly classified, so I run
it through bogofilter with -n or -s to add it to their wordlist.
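
A rough sketch of that training step, assuming the message is fed to
bogofilter on stdin.  The helper names and the injectable "runner"
argument are hypothetical; -s (register as spam) and -n (register as
non-spam) are the bogofilter options mentioned above.

```python
import subprocess

ONE_HOUR = 3600  # seconds a message must sit before it is trusted


def ready_for_training(age_seconds, min_age=ONE_HOUR):
    """True once a message has gone uncorrected for long enough."""
    return age_seconds > min_age


def training_command(is_spam):
    """Build the bogofilter call: -s registers spam, -n non-spam."""
    return ["bogofilter", "-s" if is_spam else "-n"]


def train(path, is_spam, runner=subprocess.run):
    """Feed one message file to bogofilter on stdin."""
    with open(path, "rb") as message:
        return runner(training_command(is_spam), stdin=message, check=True)
```

Passing a different runner makes it possible to dry-run the job without
touching anyone's wordlist.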

It's a little complicated, as 90% of the users here POP their good mail
off the server, so I have to keep a record of the emails, and also an md5
hash of each one, so that when they correct an email by putting it into
one of those folders (the spam folders are IMAP) I don't train it as
good and then correct for spam, etc.
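
One way such a record could work, as a sketch only: keep a ledger keyed
by the md5 of each message and consult it before training.  The ledger
layout, the function names and hashing the raw message bytes are all
assumptions on my part (a header added in transit would change the
hash).  The -Ns and -Sn options are bogofilter's combinations for
unregistering a message from one list and registering it on the other.

```python
import hashlib


def message_digest(raw_bytes):
    """md5 hex digest used as the ledger key for a message."""
    return hashlib.md5(raw_bytes).hexdigest()


def plan_training(ledger, raw_bytes, label):
    """Decide which bogofilter option to use, if any.

    ledger maps digest -> "spam" or "ham" for messages already trained.
    Returns None when the message is already trained with this label.
    """
    digest = message_digest(raw_bytes)
    previous = ledger.get(digest)
    if previous == label:
        return None  # already trained this way, do not count it twice
    ledger[digest] = label
    if previous is None:
        return "-s" if label == "spam" else "-n"  # first training
    return "-Ns" if label == "spam" else "-Sn"    # undo, then retrain
```

With this, a message trained as good after an hour and later dropped
into the .IsSpam folder comes back as "-Ns" instead of a plain "-s".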

> -Should I use a site-wide dictionary or one per user?

The problem I see with site-wide dictionaries is that each person's
spam/nonspam wordlists are different.  Now it might not be that big of a
deal if you're a one-operation shop (i.e. 5 programmers or 5 real
estate people), but if you have a diverse population at your office
there's bound to be someone who needs all the mortgage listings while
another is on a mailing list for deals at the local office supply store.

So then in addition to the site-wide dictionary you'll have to implement
a secondary user dictionary to do individual tests.  Granted you could
save on disk space if that 2nd dictionary only contains corrections, but
still it could grow.  And since disk space is extremely cheap I would
say stick with individual wordlists.
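
For what it's worth, individual wordlists mostly come down to pointing
bogofilter at a different directory per user with its -d option.  A
small sketch; the home_root layout and function names are hypothetical,
and ".bogofilter" is simply the directory bogofilter uses by default
under a user's home.

```python
import os


def user_wordlist_dir(home_root, user):
    """Directory holding one user's own wordlist."""
    return os.path.join(home_root, user, ".bogofilter")


def classify_command(home_root, user):
    """bogofilter call that uses only that user's wordlist."""
    return ["bogofilter", "-d", user_wordlist_dir(home_root, user)]
```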

Chris



