Spam / ham registration issue

Wed Mar 3 14:07:45 CET 2004

On Wed, 03 Mar 2004 13:46:21 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Greg Louis wrote:
> 
> >> Is this a bug or user error?
> [...]
> > Looks like a bug to me.  I think the scores should be 0.415, 0.415,
> > 0.529 and 0.538 for the four No results, assuming the default s, x
> > and min_dev.  Admittedly, a database with just two tokens in it is a
> > bit atypical, but fwiw I can reproduce what you see, using
> > bogofilter version 0.17.2.
> 
> Let me redo the experiment using -vvv and -C (note that my
> lexer sees four tokens here) (quoted to avoid line wrapping):
> 
<snip>
> 
> As we see clearly, pgood and pbad don't change, which is
> correct. Now what is happening? How does the calculation
> proceed:
> f(w)=(robs*robx+n*p(w))/(robs+n)
> 
> n=#emails
> p(w)=pbad/(pbad+pgood)=1/2 (after the second training)
> 
> So f(w)=(0.00415+n/2)/(0.01+n), taking robs=0.01 and
> robx=0.415. As we can see, this hardly changes with n.
> 
> Now of course, none of these tokens ever becomes significant
> in our experiment; therefore, we always set robx. Everything
> is correct.
> 
> pi
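
For anyone who wants to check the arithmetic in the quoted formula, here
is a minimal Python sketch. It is not bogofilter's code. The values
robs=0.01 and robx=0.415 are read off the quoted text, min_dev=0.1 is an
assumed default, and the function names are made up for illustration.

```python
# Sketch of the quoted formula f(w) = (robs*robx + n*p(w)) / (robs + n).
# Assumptions, not confirmed against bogofilter itself:
#   ROBS = 0.01 and ROBX = 0.415 are taken from the quoted thread,
#   MIN_DEV = 0.1 is an assumed default.
ROBS = 0.01
ROBX = 0.415
MIN_DEV = 0.1


def p_w(pbad, pgood):
    """p(w) = pbad / (pbad + pgood), as in the quoted message."""
    return pbad / (pbad + pgood)


def f_w(n, p, robs=ROBS, robx=ROBX):
    """Estimate for a token seen n times with raw probability p."""
    return (robs * robx + n * p) / (robs + n)


def is_significant(f, min_dev=MIN_DEV):
    """A token counts only if f(w) is at least min_dev away from 0.5."""
    return abs(f - 0.5) >= min_dev


if __name__ == "__main__":
    p = p_w(1.0, 1.0)  # token registered equally as spam and ham
    for n in (1, 2, 5, 10):
        f = f_w(n, p)
        print(f"n={n:2d}  f(w)={f:.6f}  significant={is_significant(f)}")
```

Under these assumptions f(w) stays within about 0.001 of 0.5 for every
n >= 1, so no token passes the min_dev test, which matches the
explanation that the score falls back to robx.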

Thanks heaps for the replies, people. My understanding now is: each word
in the test case has been registered as both spam and ham, so the
registrations balance out and give a neutral result. It does not matter
how many times a word is registered as spam or ham, only that it has
been recorded as either or both.

Would this be a correct summary?

-Tig