New script to train bogofilter

Wed Jul 2 14:24:02 CEST 2003

At 03:57 AM 7/2/03, elijah wrote:
>On Wed, 2 Jul 2003, Boris 'pi' Piwinger wrote:
>
> > Boris 'pi' Piwinger wrote:
> >
> > > I wrote a Perl script which trains bogofilter on error. It
> > > produces very small databases. We'll have to see how well
> > > that works. Any comments are warmly welcome.
> >
> > I reran my script until I got no errors. It was still
> > extremely small: 352 spam and 291 ham
> >
> > So my first estimate: This works perfectly, we need far
> > fewer messages in the database than we thought before. There
> > seems to be no practical reason to avoid multiple
> > classification of the same message.
>
>If I understand correctly, you are correcting for mistakes over and over
>again until bogofilter finally gets it right.
>
>I confess that I do not understand all the bogomath, but I have always
>wondered if high message counts in the database water down new input.

I don't think it does.  For each token in the message, bogofilter computes
a spam score and a ham score, which are (roughly) the fractions of spam
and ham messages the token occurs in, and takes a ratio of the two.  Since
each score is already normalized by the number of messages in its list,
the absolute size of the database largely drops out.
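
To illustrate the point about size, here is a rough Python sketch.  It is
not bogofilter's actual calculation, only a simplified per-token ratio; the
function name `token_ratio` and the hit counts are made up for the example
(only the list sizes 352 and 291 come from this thread).

```python
def token_ratio(spam_hits, spam_msgs, ham_hits, ham_msgs):
    """Simplified per-token spamminess (an assumption, not bogofilter's code).

    Each count is first divided by the size of its own message list, so the
    result depends on relative frequencies, not on absolute list sizes.
    """
    spam_freq = spam_hits / spam_msgs   # fraction of spam messages with the token
    ham_freq = ham_hits / ham_msgs      # fraction of ham messages with the token
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical token seen in 30 of 352 spams and 3 of 291 hams ...
small = token_ratio(30, 352, 3, 291)
# ... and the same proportions in a database 100 times larger.
large = token_ratio(3000, 35200, 300, 29100)

print(small, large)
```

Both calls give the same value, because multiplying every count by the same
factor cancels out in the two fractions.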

>Maybe what is needed is a 'super' spam/ham switch:
>
>bogofilter --force -Ns < some-spammy-message
>
>--force would keep repeating the action until the message was correctly
>identified (in this case repeatedly adding the message to the spam
>wordlist and removing it from the ham wordlist). Of course, in practice
>people make lots of mistakes classifying spam (at least in a server wide
>install). Something like this would really magnify any mistake, but maybe
>it could also be easily corrected. Seems like --force should be
>incompatible with -u.
>
>-elijah

As "--force" can be implemented by a loop in a simple script, there's no 
need to add it to bogofilter.
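
A minimal sketch of such a loop in Python, for what it's worth.  The names
`classify`, `register_spam`, `force_spam` and the `max_rounds` cap are my own
inventions for this example.  It assumes bogofilter's exit status is 0 for
spam, 1 for ham and 2 for unsure, as described in the man page, and it only
registers with -s on each round rather than using -Ns as in the proposal
above.

```python
import subprocess


def classify(path):
    """Ask bogofilter for a verdict, based on its exit status (assumed codes)."""
    with open(path, "rb") as msg:
        status = subprocess.run(["bogofilter"], stdin=msg).returncode
    return {0: "spam", 1: "ham", 2: "unsure"}.get(status, "error")


def register_spam(path):
    """Register the message as spam once (bogofilter -s)."""
    with open(path, "rb") as msg:
        subprocess.run(["bogofilter", "-s"], stdin=msg, check=True)


def force_spam(path, classify=classify, register=register_spam, max_rounds=10):
    """Train on the message until it is classified as spam.

    Returns the number of training rounds used.  max_rounds is a safety cap
    so that a misfiled message cannot be magnified without limit.
    """
    rounds = 0
    while rounds < max_rounds:
        verdict = classify(path)
        if verdict == "error":
            raise RuntimeError("bogofilter reported an error")
        if verdict == "spam":
            break
        register(path)
        rounds += 1
    return rounds
```

Calling `force_spam("some-spammy-message")` would then stand in for the
proposed switch; the classify and register steps are passed in as arguments
so that the loop can be tried out without touching a real wordlist.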