better Bayesian bogofilter

Wed Aug 13 15:17:30 CEST 2003

Greg Louis wrote:

>> Which would be almost impossible to use for train on error. So
>> this part of the test would not work.
> 
> No, I train on error, and that's exactly why my training db's
> .MSG_COUNTs aren't accurately characteristic of the population's.  What
> I'll do is measure the proportion of spam in a training batch and use
> that until next time.  Since I train every couple of weeks, that should
> be close enough, given that the accuracy doesn't change drastically
> with minor changes in the proportion of spam.

My mistake was to think about the ratio in the trained messages
as opposed to the real messages. For the latter this ratio would
change during training, which would in turn change which
messages are used for training.

It seems hard for me to tell what the real ratio is. I don't
keep track of that. I do save all spam, but not all ham,
and the ham is not counted anywhere.
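
For what it's worth, if both classes were counted, the measurement
Greg describes could be sketched roughly like this. This is only an
illustration: the function name and the list-of-labels input are
made up for the example and are not part of bogofilter.

```python
def spam_proportion(labels):
    """Return the fraction of messages in a batch that are labelled 'spam'.

    `labels` is a hypothetical list with one entry per message,
    each either "spam" or "ham".
    """
    labels = list(labels)
    if not labels:
        raise ValueError("cannot measure the proportion of an empty batch")
    spam = sum(1 for label in labels if label == "spam")
    return spam / len(labels)


# Hypothetical batch: three spam messages and one ham message.
batch = ["spam", "ham", "spam", "spam"]
print(spam_proportion(batch))  # prints 0.75
```

The catch in my case is the input: without a count of all ham
received, there is nothing reliable to feed into such a calculation.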

pi