Anybody seen this?
Mark M. Hoffman
mhoffman at lightlink.com
Tue Sep 17 23:28:56 CEST 2002
* Paul Tomblin <ptomblin at xcski.com> [2002-09-17 16:50:36 -0400]:
>
> It's an explanation of what the original Paul Graham paper got wrong:
> http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
<snip>
One thing PG has that I don't is a ton of data. It's hard to call it
wrong if it works... whether it's truly Bayesian is another matter.
I think the #ifdef NON_EQUIPROBABLE code starts to address some of
the points made above, although this code has errors of its own:
msg_prob is meant to substitute for the hard-coded 0.4 (the spam
probability for a word not previously seen); instead, it uses the
prior probability of spam itself, based on the message counts.
> This patch fixes the calculation of msg_prob, and adds the
> necessary casts since both ops are int.
>
> --- bogofilter.c.0.7 Wed Sep 11 14:46:41 2002
> +++ bogofilter.c Wed Sep 11 14:51:00 2002
> @@ -394,5 +394,6 @@
> #ifdef NON_EQUIPROBABLE
> // There is an argument that we should by by number of *words* here.
> - double msg_prob = (spam_list.msgcount / ham_list.msgcount);
> + double msg_prob = (double)spam_list.msgcount /
> + (double)(spam_list.msgcount + ham_list.msgcount);
> #endif // NON_EQUIPROBABLE
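
To make the difference concrete, here is a minimal sketch of the two
expressions side by side. The struct and function names are made up for
illustration and are not taken from bogofilter.c; only the two
arithmetic expressions mirror the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the spam_list/ham_list message counters. */
struct counts {
    int spam_msgs;
    int ham_msgs;
};

/* Pre-patch expression: both operands are int, so the division
 * truncates toward zero before the result is converted to double.
 * It is also a spam:ham ratio, not a probability. */
static double prior_truncated(struct counts c)
{
    assert(c.ham_msgs > 0);
    return c.spam_msgs / c.ham_msgs;
}

/* Patched expression: cast first, then divide by the total message
 * count, giving the fraction of all messages that are spam. */
static double prior_spam(struct counts c)
{
    int total = c.spam_msgs + c.ham_msgs;
    assert(total > 0);
    return (double)c.spam_msgs / (double)total;
}
```

With 300 spam and 700 ham messages, the first form truncates to 0, while
the second gives roughly 0.3. With the counts swapped, the first form
yields 2, which is not a valid probability at all.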
Regards,
--
Mark M. Hoffman
mhoffman at lightlink.com
For summary digest subscription: bogofilter-digest-subscribe at aotto.com