More understanding bogofilter

Wed May 7 20:04:42 CEST 2003

>  If a token appears just once in goodlist and just once in
> spamlist, why does goodlist have a bigger "weight" in calculating the score?
> This is an example with my database:
> 
> linux-list:~# echo spamfilter | bogofilter -R -vv 
> X-Bogosity: No, tests=bogofilter, spamicity=0.028904, version=0.12.2
>                     n     pgood      pbad        fw  invfwlog     fwlog  U
> "spamfilter"        2  0.016949  0.000091  0.028904  -0.02933  -3.54376  +
> N_P_Q_S_s_x_md      1  9.71e-01  2.89e-02  2.89e-02  1.00e-01  5.00e-01  0.440
> 
> linux-list:~# bogoutil -p .bogofilter spamfilter
>                        spam    good  Gra prob  Rob prob
> spamfilter                1       1  0.400000  0.028057

The pgood and pbad figures are token counts divided by message counts;
it would seem that you have trained with 59 nonspams and about 10,989
spams, and spamfilter appeared once in each group.  A token that appears
in 1/59 of the nonspam messages is deemed more significant than one that
appears in 1/10989 of the spam messages.

(A training database that lopsided is likely to give poor results in
general.)
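
As a rough check of the figures above, here is a small Python sketch
that rebuilds pgood and pbad from the counts.  The message totals (59
and 10989) are estimates read back from the rounded output, not values
taken from the actual database, and the variable names are only
illustrative.

```python
# Message totals are estimates inferred from the printed pgood/pbad.
good_msgs, spam_msgs = 59, 10989
good_hits, spam_hits = 1, 1  # "spamfilter" was seen once in each list

pgood = good_hits / good_msgs  # fraction of nonspam messages with the token
pbad = spam_hits / spam_msgs   # fraction of spam messages with the token

print(f"pgood = {pgood:.6f}")
print(f"pbad  = {pbad:.6f}")
print(f"pgood is about {pgood / pbad:.0f} times pbad")
```

The same single occurrence counts for far more on the small nonspam
side, which is why the token leans so strongly toward "good".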

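Graham's p(w) is pbad / (pgood + pbad), or 0.0053404 (but Graham uses
the default 0.4 because the counts are low).

Robinson's f(w) is (s * x + n * p(w)) / (s + n); with s = 0.1, x = 0.5
and n = 2 that is (0.1 * 0.5 + 2 * 0.0053404)/(0.1 + 2), or about
0.0289.  This agrees with the reported 0.028904; the small difference is
presumably rounding in the printed pbad.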

This is an excellent example of Robinson's method compensating for low
counts by splitting the weight between the actual p(w) and the prior
guess of 0.5, as compared with Graham's approach of giving all the
weight to the prior guess until the counts reach a threshold.
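
To make the comparison concrete, here is a minimal Python sketch of the
two formulas.  It is not bogofilter's own code: the function names are
hypothetical, s = 0.1 and x = 0.5 are the values shown in the verbose
output above, and the message totals are the estimates 59 and 10989.

```python
def graham_p(pgood, pbad):
    """Raw spam probability of a token, from its per-list frequencies."""
    return pbad / (pgood + pbad)


def robinson_f(p, n, s=0.1, x=0.5):
    """Blend p(w) with the prior guess x; s is the weight given to x."""
    return (s * x + n * p) / (s + n)


# Estimated frequencies: one hit in 59 nonspams, one in 10989 spams.
p = graham_p(1 / 59, 1 / 10989)
print(f"p(w) = {p:.7f}")

# With n = 0 the result is the prior x; as n grows it approaches p(w).
for n in (0, 1, 2, 10, 100):
    print(f"n = {n:3d}  f(w) = {robinson_f(p, n):.6f}")
```

With n = 2 this gives roughly 0.0289, close to the reported spamicity;
an exact match would need the real message counts rather than the
rounded estimates used here.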

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |