bogofilter's default algorithm

Wed Jan 22 01:19:06 CET 2003

At 02:33 PM 1/21/03, Nick Simicich wrote:
>At 06:17 AM 2003-01-21 -0500, Greg Louis wrote:
>
>
>>Automatic training without manual correction is not going to work
>>anyway.
>
>My current practice is to autotrain on all mail, (except for a particular 
>class that I want delivered whether spam or ham, which I do not bogofilter 
>at all) and to correct all the deliveries that get misfiled.
>
>What this seems to expose me to is a window where a misclassified mail 
>might detrain the database, until it is refiled.
>
>I just ran bogofilter -h and bogofilter -V and it told me that
>
>bogofilter version 0.9.1.2 Copyright (C) 2002 Eric S. Raymond
>
>and
>
>         -g      - select Graham spam calulation method (default).
>         -r      - select Robinson spam calulation method.
>         -f      - select Fisher spam calulation method.

Unfortunately the help message is defective.  Robinson _is_ the default 
algorithm for 0.9.1.2.

>So I guess when I thought I was using Robinson, that was based on a 
>message here .  I am actually using Graham.  I have not seen any 
>indication in the makefile that I made any specific request, and I note 
>that it does make all of graham.o robinson.o and fisher.o.
>
>I guess people believe that Robinson or Robinson-Fisher is better than 
>Graham.  If I want to switch to Robinson (which is still useful because it 
>gives me a yes-no answer) can I use my existing databases or do I have to 
>run all training over again?  I do note that I have not been saving all 
>mail I have used for training.

Yes, you can continue to use the database.  An interesting thing to do is 
to use the script contrib/randomtrain, which provides a "train on error" 
process.  Given a decent set of ham and spam messages and some CPU time, 
randomtrain builds new wordlists using those messages with which bogofilter 
has trouble.  In more detail: the script goes through the two mailboxes in 
random order and has bogofilter classify each message.  If 
bogofilter got it right, randomtrain goes on to the next message.  If 
bogofilter got it wrong, randomtrain trains bogofilter on the message 
(using the appropriate flag, i.e. '-s' for spam or '-n' for non-spam).  
This algorithm only adds tokens to bogofilter's wordlists when bogofilter 
had insufficient info to correctly classify the message.  I did this 
around the first of the year with 3800 spam and 16300 ham messages and 
ended up with wordlists trained on approx 1000 messages.  They were much 
smaller than before and are performing very well.
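
For anyone who wants to see the shape of the "train on error" loop 
without reading the script, here is a rough sketch in Python.  It is not 
the randomtrain script itself and it does not call bogofilter.  The names 
train_on_error and ToyFilter are made up for illustration, and ToyFilter 
is only a crude word-count stand-in for a real classifier.

```python
import random
from collections import Counter


def train_on_error(messages, classify, train, seed=None):
    """Visit (text, is_spam) pairs in random order; train only on mistakes.

    Returns the number of messages that were actually used for training.
    """
    order = list(messages)
    random.Random(seed).shuffle(order)
    trained = 0
    for text, is_spam in order:
        if classify(text) != is_spam:
            # The filter got it wrong, so register the message.
            train(text, is_spam)
            trained += 1
        # Otherwise the filter got it right: skip to the next message.
    return trained


class ToyFilter:
    """Hypothetical stand-in classifier: per-class word counts, majority vote."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def classify(self, text):
        words = text.lower().split()
        spam_score = sum(self.spam[w] for w in words)
        ham_score = sum(self.ham[w] for w in words)
        return spam_score > ham_score  # True means spam; ties count as ham

    def train(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())


if __name__ == "__main__":
    corpus = [("buy cheap pills now", True)] * 5
    corpus += [("meeting notes attached", False)] * 5
    toy = ToyFilter()
    used = train_on_error(corpus, toy.classify, toy.train, seed=0)
    print(f"trained on {used} of {len(corpus)} messages")
```

With a real filter, the classify and train callables would wrap calls to 
bogofilter; the point of the sketch is only that messages the filter 
already handles correctly never reach the wordlists, which is why the 
resulting lists stay small.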

David