better Bayesian bogofilter

Tue Aug 12 20:17:59 CEST 2003

On 20030812 (Tue) at 1658:17 +0200, Matthias Andree wrote:
> Greg Louis <glouis at dynamicro.on.ca> writes:
> 
> > By "more complicated calculation" is meant "equation #5", right?  Yeah,
> > just once per token ;)  On the other hand, if you train with every
> > message or randomly select errors-and-unsures to keep the ratio right,
> > you get to use equation #4, which saves 3 divisions per token over what
> > we do now.
> 
> I'd think we should go for the code that doesn't care about the ratio of
> spam to ham used in training. We'd better avoid optimizations that
> depend on the environment or make assumptions about the user.

That tallies with my recommendation that we put in equation #5.  If we
add just one parameter, the proportion of spam in the population (B'),
bogofilter can calculate G' = 1 - B' and the user doesn't have to
maintain two parameter values.
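
A minimal sketch of that single-parameter idea, in Python rather than
bogofilter's C. The names `Priors`, `spam_proportion` and `ham_proportion`
are hypothetical and not taken from the bogofilter source; the only thing
assumed from the discussion above is that G' = 1 - B'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Priors:
    """Population proportions; only the spam share B' is supplied by the user."""

    spam_proportion: float  # B', assumed to lie strictly between 0 and 1

    def __post_init__(self) -> None:
        # Reject values that cannot be a proportion of a mixed population.
        if not 0.0 < self.spam_proportion < 1.0:
            raise ValueError("spam_proportion (B') must be strictly between 0 and 1")

    @property
    def ham_proportion(self) -> float:
        # G' = 1 - B': derived on demand rather than stored, so the two
        # values can never drift out of agreement.
        return 1.0 - self.spam_proportion
```

Because G' is computed from B' instead of being configured separately, a
user who changes the spam proportion cannot leave a stale ham proportion
behind.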

> Does your code affect "make check" results?

Haven't tried, but it should blow them away completely, as changing to
equation #5 requires adjusting the spam cutoff.  So far I've only
pasted in a couple of constants and rewritten the equation for test
purposes; I need to sit down and parameterize it and update docs and
help and all that.  When I get that patch together it can be used to
find the new value for the default spam cutoff, and then make check
might work (I've never run it myself, for historical reasons rather
than any actual aversion to doing so).
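
As an illustration only of what "finding the new cutoff" could look like:
the sketch below assumes one already has scores for a set of known-ham
messages and simply picks the lowest candidate cutoff that keeps the
false-positive rate within a chosen limit. The function names, the
candidate list and the idea of tuning against a false-positive limit are
all assumptions made for the example, not a description of how the patch
will actually be used.

```python
from typing import Iterable, Optional, Sequence


def false_positive_rate(ham_scores: Sequence[float], cutoff: float) -> float:
    """Fraction of known-ham messages that score at or above the cutoff."""
    if not ham_scores:
        raise ValueError("need at least one ham score")
    return sum(score >= cutoff for score in ham_scores) / len(ham_scores)


def lowest_acceptable_cutoff(
    ham_scores: Sequence[float],
    max_fp_rate: float,
    candidates: Iterable[float],
) -> Optional[float]:
    """Return the smallest candidate cutoff whose false-positive rate is
    within max_fp_rate, or None if no candidate qualifies."""
    for cutoff in sorted(candidates):
        if false_positive_rate(ham_scores, cutoff) <= max_fp_rate:
            return cutoff
    return None
```

A lower cutoff catches more spam, so taking the lowest value that still
meets the false-positive limit is one plausible way to trade the two off.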

BTW an HTML version of the paper, with some typos corrected, is now
online at http://www.bgl.nu/bogofilter/bayes.html

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |