New script to train bogofilter

Wed Jul 2 10:14:14 CEST 2003

elijah wrote:

>> I reran my script until I got no errors. It was still
>> extremely small: 352 spam and 291 ham
>>
>> So my first estimation: This works perfectly, we need far
>> fewer messages in the database than we thought before. There
>> seems to be no practical reason to avoid multiple
>> classification of the same message.
> 
> If I understand correctly, you are correcting for mistakes over and over
> again until bogofilter finally gets it right.

That is correct. I start from an empty database, but that is not required.

> I confess that I do not understand all the bogomath, but I have always
> wondered if high message counts in the database water down new input.

Could be.

> Maybe what is needed is a 'super' spam/ham switch:
> 
> bogofilter --force -Ns < some-spammy-message

You don't need to remove it from the ham list if you have
not added it before. But you can easily do it yourself. Just
repeat:
bogofilter -s < some-spammy-message
bogofilter -v < some-spammy-message
until the output says "Spam".

I am thinking about adding that kind of loop to my script. I
just cannot prove that it terminates ;-) You might well
construct an example which does not, but this won't happen
in real life I guess.
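
A minimal sketch of such a loop, with a cap on the number of rounds so
that it terminates even when the message never flips. The function name
train_until_spam and the default limit of 10 are made up for
illustration, and the check assumes that the output of "bogofilter -v"
contains the word "Spam" for a spam verdict, as described above:

```shell
#!/bin/sh
# Register a message as spam until bogofilter classifies it as spam.
# train_until_spam and the round limit are hypothetical helpers, not part
# of bogofilter; the limit guarantees that the loop terminates.
train_until_spam() {
    msg=$1
    max_rounds=${2:-10}
    round=0
    while [ "$round" -lt "$max_rounds" ]; do
        # Add the message's tokens to the spam wordlist once more.
        bogofilter -s < "$msg"
        round=$((round + 1))
        # Assumption: the verbose output says "Spam" for a spam verdict.
        if bogofilter -v < "$msg" | grep -q 'Spam'; then
            echo "classified as spam after $round round(s)"
            return 0
        fi
    done
    echo "still not spam after $max_rounds round(s)" >&2
    return 1
}
```

Calling "train_until_spam some-spammy-message 20" would allow up to 20
rounds. A non-zero return status means the limit was reached, which is
probably a good hint to look at the message by hand.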

> --force would keep repeating the action until the message was correctly
> identified (in this case repeatedly adding the message to the spam
> wordlist and removing it from the ham wordlist). Of course, in practice
> people make lots of mistakes classifying spam (at least in a server wide
> install).

Mistakes of that kind are always really bad.

> Something like this would really magnify any mistake, but maybe
> it could also be easily corrected. Seems like --force should be
> incompatible with -u.

-u is not really what you would like to use with my script.

pi