better Bayesian bogofilter

Greg Louis glouis at dynamicro.on.ca
Tue Aug 12 16:40:10 CEST 2003


On 20030812 (Tue) at 1420:27 +0200, Boris 'pi' Piwinger wrote:

> > On the other hand, if you train with every
> > message or randomly select errors-and-unsures to keep the ratio right,
> > you get to use equation #4, which saves 3 divisions per token over what
> > we do now.
> 
> I guess, most people (including me) don't care about the
> ratio in the database.

I got lucky: when I built my training database the ratio was about
right, and since then I've been adding more spams than nonspams -- but
the proportion of spam to nonspam inbound has been increasing too!
So Eq.#4 is a plausible approximation in my case.  Even so, I think
bogofilter should implement Eq.#5 now, with B' and G' as parameters;
then, at leisure, we can come up with some automated way of doing the
counts (for now I use my classify script, in the way we just
wrote into the FAQ).  I don't think we should encourage people to trust
Eq.#4 to remain valid as their databases grow.
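
To make the concern concrete, here is a minimal sketch.  Eq.#4 and Eq.#5
are not reproduced in this message, so the code does not claim to be
either of them.  It assumes one plausible reading: a per-token estimate
that weights the token frequencies b/B and g/G by the inbound message
counts (called B_in and G_in here, hypothetical stand-ins for B' and G'),
compared against the shortcut b/(b+g), which matches it only while the
database ratio B:G equals the inbound ratio.  All function names and
numbers are illustrative.

```python
def full_estimate(b, g, B, G, B_in, G_in):
    """Assumed form, not taken from the thread's equations.

    b, g       -- spam / nonspam messages in the database containing the token
    B, G       -- total spam / nonspam messages in the database
    B_in, G_in -- spam / nonspam messages actually arriving (assumed B', G')
    """
    spam_weight = (b / B) * B_in
    good_weight = (g / G) * G_in
    return spam_weight / (spam_weight + good_weight)


def shortcut_estimate(b, g):
    """Assumed shortcut: valid only while B:G equals B_in:G_in."""
    return b / (b + g)


# Database ratio 2:1, inbound ratio 2:1 -> the two estimates agree.
matched_full = full_estimate(30, 10, 2000, 1000, 200, 100)
matched_short = shortcut_estimate(30, 10)

# Database has grown to 4:1 while inbound stays 2:1 -> the shortcut drifts.
drifted_full = full_estimate(60, 10, 4000, 1000, 200, 100)
drifted_short = shortcut_estimate(60, 10)

print(f"matched: full={matched_full:.3f} shortcut={matched_short:.3f}")
print(f"drifted: full={drifted_full:.3f} shortcut={drifted_short:.3f}")
```

With these made-up counts both estimates give 0.75 while the ratios
match; once the database over-represents spam, the full estimate stays
at 0.75 but the shortcut rises to about 0.857.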

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



