better Bayesian bogofilter

Wed Aug 13 14:32:39 CEST 2003

On 20030813 (Wed) at 1400:45 +0200, Boris 'pi' Piwinger wrote:
> Greg Louis wrote:
> 
> >> What I'm more interested in knowing is exactly _how_ you plan to keep track 
> >> of the ham/spam ratio.  One thought that crosses my mind is having a 
> >> ".SCORE" token rather like .MSG_COUNT.  If I understand your article, 
> >> .SCORE needs to be updated for each ham and each spam scored.
> > 
> > That was my first idea, yes. 
> 
> What is wrong with .MSG_COUNT, as long as you make sure you
> don't divide by zero?

> > What I intend to do first is implement
> > Eq. #5 with a single parameter that can be set manually
> 
> Which would be almost impossible to use for train on error. So
> this part of the test would not work.

No, I train on error, and that's exactly why my training db's
.MSG_COUNTs aren't accurately characteristic of the population's.  What
I'll do is measure the proportion of spam in a training batch and use
that until next time.  Since I train every couple of weeks, that should
be close enough, given that the accuracy doesn't change drastically
with minor changes in the proportion of spam.
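
As a rough illustration of the measuring step (a sketch only, not
bogofilter code: the function name, the way the batch is counted, and
the example counts are all made up for illustration), the proportion
could be computed like this, with the divide-by-zero guard mentioned
above:

```python
def spam_proportion(n_spam, n_ham):
    """Return the fraction of spam in a batch, or None for an empty batch."""
    total = n_spam + n_ham
    if total == 0:
        # Empty batch: no proportion can be measured, so don't divide.
        return None
    return n_spam / total


# Hypothetical counts for one training batch (illustrative numbers only).
p_spam = spam_proportion(620, 380)
print(f"proportion of spam: {p_spam:.2f}")
```

The resulting value would then be set by hand as the single parameter
and left unchanged until the next training batch is measured.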

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |