Spam / ham registration issue

Wed Mar 3 13:46:21 CET 2004

Greg Louis wrote:

>> Is this a bug or user error?
[...]
> Looks like a bug to me.  I think the scores should be 0.415, 0.415,
> 0.529 and 0.538 for the four No results, assuming the default s, x and
> min_dev.  Admittedly, a database with just two tokens in it is a bit
> atypical, but fwiw I can reproduce what you see, using bogofilter
> version 0.17.2.

Let me redo the experiment using -vvv and -C. Note that my
lexer sees four tokens here. The output is quoted to avoid
line wrapping:

> [3.14 at pi ~/tmp/bogo]$ echo "bogo is my friend" > t1
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -s < t1
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -vvv < t1
> X-Bogosity: Yes, tests=bogofilter, spamicity=0.999999, version=0.17.2
>                                      n    pgood     pbad      fw     U
> "head:bogo"                          1  0.000000  1.000000  0.994208 +
> "head:friend"                        1  0.000000  1.000000  0.994208 +
> "head:is"                            1  0.000000  1.000000  0.994208 +
> "head:my"                            1  0.000000  1.000000  0.994208 +
> N_P_Q_S_s_x_md                       4  1.90e-06  1.00e-00  1.00e-00
>                                         1.00e-02  4.15e-01  0.100
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -n < t1
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -vvv < t1
> X-Bogosity: No, tests=bogofilter, spamicity=0.415000, version=0.17.2
>                                      n    pgood     pbad      fw     U
> "head:bogo"                          2  1.000000  1.000000  0.499577 -
> "head:friend"                        2  1.000000  1.000000  0.499577 -
> "head:is"                            2  1.000000  1.000000  0.499577 -
> "head:my"                            2  1.000000  1.000000  0.499577 -
> N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
>                                         1.00e-02  4.15e-01  0.100
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -s < t1
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -vvv < t1
> X-Bogosity: No, tests=bogofilter, spamicity=0.415000, version=0.17.2
>                                      n    pgood     pbad      fw     U
> "head:bogo"                          3  1.000000  1.000000  0.499718 -
> "head:friend"                        3  1.000000  1.000000  0.499718 -
> "head:is"                            3  1.000000  1.000000  0.499718 -
> "head:my"                            3  1.000000  1.000000  0.499718 -
> N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
>                                         1.00e-02  4.15e-01  0.100
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -s < t1
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -d ./ -vvv < t1
> X-Bogosity: No, tests=bogofilter, spamicity=0.415000, version=0.17.2
>                                      n    pgood     pbad      fw     U
> "head:bogo"                          4  1.000000  1.000000  0.499788 -
> "head:friend"                        4  1.000000  1.000000  0.499788 -
> "head:is"                            4  1.000000  1.000000  0.499788 -
> "head:my"                            4  1.000000  1.000000  0.499788 -
> N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
>                                         1.00e-02  4.15e-01  0.100
> [3.14 at pi ~/tmp/bogo]$ my-bogofilter -C -Q
> bogofilter version 0.17.2
> 
> algorithm   = fisher
> robx        = 0.415000 (4.15e-01)
> robs        = 0.010000 (1.00e-02)
> min_dev     = 0.100000 (1.00e-01)
> ham_cutoff  = 0.000000 (0.00e+00)
> spam_cutoff = 0.950000 (9.50e-01)
> 
> block_on_subnets  = no
> replace_nonascii_characters = no
> 
> spam_header_name  = 'X-Bogosity'
> header_format     = '%h: %c, tests=bogofilter, spamicity=%p, version=%v'
> terse_format      = '%1.1c %f'
> log_header_format = '%h: %c, spamicity=%p, version=%v'
> log_update_format = 'register-%r, %w words, %m messages'
> spamicity_tags    = 'Yes', 'No'
> spamicity_formats = '%0.6f', '%0.6f'

BTW: Note that -Q should not say "algorithm   = fisher". But
it could list other things which are set.

As we can clearly see, pgood and pbad do not change after the
second training, which is correct. So how does the
calculation proceed?
f(w)=(robs*robx+n*p(w))/(robs+n)

n=#emails
p(w)=pbad/(pbad+pgood)=1/2 (after the second training)

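With robs=0.01, robx=0.415 and p(w)=1/2 this gives
f(w)=(0.00415+0.5*n)/(0.01+n), i.e. 0.499577, 0.499718 and
0.499788 for n=2, 3 and 4. As we can see, this stays very
close to 0.5 and hardly changes with n.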
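
To double-check the arithmetic, here is a small Python sketch
of the formula above. The function name f_w and its defaults
are made up for illustration; robs and robx are taken from
the -Q output, and p(w) is 1 after the first (spam-only)
registration and 1/2 afterwards.

```python
def f_w(n, p_w, robs=0.01, robx=0.415):
    """Smoothed token score: (robs*robx + n*p(w)) / (robs + n)."""
    return (robs * robx + n * p_w) / (robs + n)

# n = 1: only the spam registration exists, so p(w) = 1.
# n = 2..4: pbad = pgood, so p(w) = 1/2.
scores = [f_w(1, 1.0)] + [f_w(n, 0.5) for n in (2, 3, 4)]

for n, score in zip((1, 2, 3, 4), scores):
    print(f"n={n}  f(w)={score:.6f}")
```

The four values should agree with the fw column in the
transcript (0.994208, 0.499577, 0.499718, 0.499788).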

Now of course, after the second training none of these tokens
ever becomes significant in our experiment, since
|f(w)-0.5| < min_dev=0.1. Therefore the score always falls
back to robx=0.415. Everything is correct.
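
The significance test can be sketched the same way. This is
only an illustration, not bogofilter's code: the helper name
significant is hypothetical, and I am assuming a token counts
when |f(w)-0.5| >= min_dev (whether the boundary is inclusive
does not matter for these values).

```python
ROBX, MIN_DEV = 0.415, 0.1  # values from the -Q output

def significant(fw, min_dev=MIN_DEV):
    """Assumed rule: a token counts only if it deviates enough from 0.5."""
    return abs(fw - 0.5) >= min_dev

after_spam_only = [0.994208] * 4  # fw column, first -vvv run
after_ham_too = [0.499577] * 4    # fw column, second -vvv run

kept_first = [fw for fw in after_spam_only if significant(fw)]
kept_later = [fw for fw in after_ham_too if significant(fw)]

# With no significant tokens left, the transcript reports robx as the score.
print(len(kept_first), len(kept_later))
```

The two counts correspond to the first field of the
N_P_Q_S_s_x_md line in the first and second -vvv runs.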

pi