New new script to train bogofilter

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Fri Jul 4 13:47:21 CEST 2003


Greg Louis wrote:

> but I'm going to reply to the list because there will be others
> interested in the answer; hope you don't mind.

That is what this list is for;-)

>> .MSG_COUNT              329    283
>> % of read               2.2    1.3
>> 
>> It turns out that not a single message was used twice in
>> training (by accident, but it might ease some worries;-).
>> 
>> I am curious to hear why this is not great and why I should
>> see problems soon.

[What is theoretically broken with bogofilter's statistics.]

> These interrelations can be coped with in the analysis,
> but again, overtraining on a given spam or nonspam will distort the
> overall picture of the population that the Bayesian analysis is
> building, so that messages similar to the overtrained one become easier
> to classify, but at the expense of those that are not so similar.

As you can see, in practice I have no problem with
overtraining. But still, if some spam still looks like ham
after training and I add it again, I don't see that as
so bad. After all, I could have received it twice anyway.
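
For anyone who wants to check the "used twice" question on their
own mail, here is a minimal sketch of train-on-error that counts how
often each message gets registered. Everything in it is an assumption
for illustration: ToyFilter, train_on_error and max_passes are made-up
names, the scoring is a plain naive-Bayes word count rather than
bogofilter's calculation, and the four-message corpus is invented.

```python
import math
from collections import Counter


class ToyFilter:
    """Tiny word-count classifier for illustration; NOT bogofilter's algorithm."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.words[label].update(set(text.lower().split()))
        self.msgs[label] += 1

    def classify(self, text):
        # Sum of per-word log-odds (spam vs. ham) with add-one smoothing.
        score = 0.0
        for w in set(text.lower().split()):
            p_spam = (self.words["spam"][w] + 1) / (self.msgs["spam"] + 2)
            p_ham = (self.words["ham"][w] + 1) / (self.msgs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return "spam" if score > 0 else "ham"


def train_on_error(filt, corpus, max_passes=20):
    """Pass over the corpus repeatedly, training only on misclassified
    messages, until one pass has no errors (or max_passes is reached).
    Returns a Counter mapping message index -> times used in training."""
    used = Counter()
    for _ in range(max_passes):
        errors = 0
        for idx, (text, label) in enumerate(corpus):
            if filt.classify(text) != label:
                filt.train(text, label)
                used[idx] += 1
                errors += 1
        if errors == 0:
            break
    return used


# Invented example corpus: (message text, true label).
corpus = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("buy cheap watches now", "spam"),
    ("lunch on monday?", "ham"),
]

filt = ToyFilter()
used = train_on_error(filt, corpus)
repeated = [idx for idx, n in used.items() if n > 1]
print("messages trained:", len(used), "of", len(corpus))
print("messages trained more than once:", len(repeated))
```

Any index that ends up in repeated is a message that had to be
registered more than once, which is exactly the case the overtraining
concern is about.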

> So the problem you may see down the road if you train "to extinction"
> on errors is that bogofilter will get very good at recognizing the
> types of messages on which you train,

That is the idea. So my hope (and given how I compiled my
training set, I think this is OK) is that my mail collection
is representative of all the mail I receive.

> and rather poor at recognizing
> messages that are similar to, but not strongly similar to, those
> training ones. 

That would mean that new messages are misclassified more
often than with full training. So far my observations
don't show this.

pi

PS: Sorry if several letters are missing; my keyboard is
broken.




