Spam / ham registration issue
Tig
tigger at onemoremonkey.com
Wed Mar 3 14:07:45 CET 2004
On Wed, 03 Mar 2004 13:46:21 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> Greg Louis wrote:
>
> >> Is this a bug or user error?
> [...]
> > Looks like a bug to me. I think the scores should be 0.415, 0.415,
> > 0.529 and 0.538 for the four No results, assuming the default s, x
> > and min_dev. Admittedly, a database with just two tokens in it is a
> > bit atypical, but fwiw I can reproduce what you see, using
> > bogofilter version 0.17.2.
>
> Let me redo the experiment using -vvv and -C (note that my
> lexer sees four tokens here) (quoted to avoid line wrapping):
>
<snip>
>
> As we can clearly see, pgood and pbad don't change, which is
> correct. So what is happening? How does the calculation
> proceed?
> f(w)=(robs*robx+n*p(w))/(robs+n)
>
> n=#emails
> p(w)=pbad/(pbad+pgood)=1/2 (after the second training)
>
> So f(w)=(0.00415+n/2)/(0.01+n). As we can see this is not
> changing much with n.
>
> Now of course, none of these tokens ever become significant
> in our experiment; therefore, we always set robx. Everything
> is correct.
>
> pi
Thanks heaps for the replies, people. My understanding now is: each word
in the test case has been registered as both spam and ham, so the two
balance out and give a neutral result. It does not matter how many
times a word is registered as spam or ham, just the fact that it has
been recorded as either or both.
Would this be a correct summary?
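
To check my reading of the f(w) formula quoted above, here is a rough
Python sketch. It is not bogofilter's actual code. The values robs=0.01
and robx=0.415 are my assumptions, inferred from the numbers in this
thread, and p(w)=1/2 is taken from pi's note about the second training.
The function name f_w is made up for illustration.

```python
# Sketch of the quoted formula f(w) = (robs*robx + n*p(w)) / (robs + n).
# Not taken from bogofilter's source; the constants below are assumptions.
ROBS = 0.01   # assumed s, inferred from the "0.01+n" term in the thread
ROBX = 0.415  # assumed x, inferred from the 0.415 scores in the thread


def f_w(n, p_w, robs=ROBS, robx=ROBX):
    """Return the smoothed estimate for a token counted n times."""
    return (robs * robx + n * p_w) / (robs + n)


if __name__ == "__main__":
    # p(w) = 1/2: the token was registered equally as spam and as ham.
    for n in (0, 1, 2, 4, 8):
        print(n, round(f_w(n, 0.5), 4))
```

Under those assumptions f(w) equals robx at n=0 and stays within about
0.001 of 0.5 for any n of 1 or more, which seems to match the point that
the count hardly matters once p(w) is 1/2.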
-Tig