better Bayesian bogofilter
Greg Louis
glouis at dynamicro.on.ca
Tue Aug 12 16:40:10 CEST 2003
On 20030812 (Tue) at 1420:27 +0200, Boris 'pi' Piwinger wrote:
> > On the other hand, if you train with every
> > message or randomly select errors-and-unsures to keep the ratio right,
> > you get to use equation #4, which saves 3 divisions per token over what
> > we do now.
>
> I guess, most people (including me) don't care about the
> ratio in the database.

I got lucky: when I built my training database the ratio was about
right, and since then I've been adding more spams than nonspams -- but
the proportion of spam to nonspam inbound has been increasing too!

Although Eq.#4 is a plausible approximation in my case, I think
bogofilter should implement Eq.#5 now, with B' and G' as parameters;
then, at leisure, we can come up with some automated way of doing the
counts (for now I use my classify script, in the way we just wrote
into the FAQ). I don't think we should encourage people to trust
Eq.#4 to remain valid as their databases grow.
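
As a rough illustration of why Eq.#4 may stop holding, here is a minimal
Python sketch (not bogofilter code) that compares the spam:nonspam ratio
of the training database with the ratio seen in inbound mail. The
function names, the counts, and the 20% tolerance are all hypothetical;
the sketch does not implement Eq.#4 or Eq.#5 themselves, it only checks
the "ratio is about right" condition discussed above.

```python
def spam_ratio(spam_count, nonspam_count):
    """Return the spam:nonspam ratio as a float."""
    if nonspam_count <= 0:
        raise ValueError("need at least one nonspam message")
    return spam_count / nonspam_count


def ratio_drift(db_spam, db_nonspam, in_spam, in_nonspam):
    """Return the database ratio divided by the inbound ratio.

    1.0 means the two ratios match; 2.0 means the training database is
    twice as spam-heavy as the inbound stream.
    """
    inbound = spam_ratio(in_spam, in_nonspam)
    if inbound == 0:
        raise ValueError("need at least one inbound spam message")
    return spam_ratio(db_spam, db_nonspam) / inbound


def ratios_match(db_spam, db_nonspam, in_spam, in_nonspam, tolerance=0.2):
    """True if the database ratio is within `tolerance` of the inbound ratio.

    The tolerance is an arbitrary assumption, not a value from the thread.
    """
    drift = ratio_drift(db_spam, db_nonspam, in_spam, in_nonspam)
    return abs(drift - 1.0) <= tolerance
```

With equal ratios the drift is 1.0; a database trained at 4:1 against a
2:1 inbound stream gives a drift of 2.0, which this sketch would flag as
a mismatch.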
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |