Anybody seen this?

Tue Sep 17 23:28:56 CEST 2002

* Paul Tomblin <ptomblin at xcski.com> [2002-09-17 16:50:36 -0400]:
>
> It's an explanation of what the original Paul Graham paper got wrong:
> http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
<snip>

One thing PG has that I don't is a ton of data.  It's hard to call it
wrong if it works... whether it's truly Bayesian is another matter.

I think the #ifdef NON_EQUIPROBABLE code starts to address some of
the points made above, although this code has errors of its own:

msg_prob is meant to substitute for the hard-coded 0.4, i.e. the
spam probability assigned to a word not previously seen. In place of
that constant, it uses the prior probability of spam itself, based on
the message counts.

> This patch fixes the calculation of msg_prob, and adds the
> necessary casts since both operands are int.
> 
> --- bogofilter.c.0.7	Wed Sep 11 14:46:41 2002
> +++ bogofilter.c	Wed Sep 11 14:51:00 2002
> @@ -394,5 +394,6 @@
>  #ifdef NON_EQUIPROBABLE
>      // There is an argument that we should by by number of *words* here.
> -    double	msg_prob = (spam_list.msgcount / ham_list.msgcount);
> +    double	msg_prob = (double)spam_list.msgcount / 
> +        (double)(spam_list.msgcount + ham_list.msgcount);
>  #endif // NON_EQUIPROBABLE
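
To make the difference concrete, here is a minimal standalone sketch of
the two expressions. The struct and function names are hypothetical
stand-ins for the spam_list/ham_list counters, not code from
bogofilter, and the zero-message fallback of 0.5 is my own assumption
rather than part of the patch.

```c
/* Hypothetical stand-in for spam_list.msgcount / ham_list.msgcount. */
struct msg_counts {
    int spam;   /* number of spam messages seen so far */
    int ham;    /* number of non-spam messages seen so far */
};

/* Pre-patch expression: the division is done on ints, so the result is
 * truncated before it is widened to double.  It is also a spam:ham
 * ratio, not a probability.  Caller must ensure c.ham != 0. */
double msg_prob_old(struct msg_counts c)
{
    return (c.spam / c.ham);
}

/* Patched expression: spam / (spam + ham), computed in floating point. */
double msg_prob_new(struct msg_counts c)
{
    int total = c.spam + c.ham;

    if (total == 0)
        return 0.5;     /* assumption: no data yet, treat as equiprobable */
    return (double)c.spam / (double)total;
}
```

With 300 spam and 700 ham messages, the old expression truncates to 0
while the patched one gives about 0.3. With the counts reversed, the
old expression gives 2, which cannot be a probability at all.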

Regards,

-- 
Mark M. Hoffman
mhoffman at lightlink.com

For summary digest subscription: bogofilter-digest-subscribe at aotto.com