More understanding bogofilter
Greg Louis
glouis at dynamicro.on.ca
Wed May 7 20:04:42 CEST 2003
> If a token appears just once in goodlist and just once in spamlist,
> why does goodlist have a bigger "weight" when calculating the score?
> This is an example with my database:
>
> linux-list:~# echo spamfilter | bogofilter -R -vv
> X-Bogosity: No, tests=bogofilter, spamicity=0.028904, version=0.12.2
>                   n  pgood     pbad      fw        invfwlog  fwlog     U
> "spamfilter"      2  0.016949  0.000091  0.028904  -0.02933  -3.54376  +
> N_P_Q_S_s_x_md    1  9.71e-01  2.89e-02  2.89e-02  1.00e-01  5.00e-01  0.440
>
> linux-list:~# bogoutil -p .bogofilter spamfilter
>              spam   good   Gra prob   Rob prob
> spamfilter      1      1   0.400000   0.028057
The pgood and pbad figures are the token's counts divided by the number
of messages in each list; it would seem that you have trained with 59
nonspams and 10,989 spams, and spamfilter appeared once in each group.
A token that appears in 1/59 of the nonspam messages is deemed more
significant than one that appears in 1/10989 of the spam messages.
(A training database that lopsided is likely to give poor results in
general.)
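As a rough check of those message counts, here is a small Python sketch
that inverts the displayed pgood and pbad values. The variable names are
mine, not bogofilter's, and the inputs are the rounded figures from the
-vv output, so the recovered counts are only approximate.

```python
# Displayed (rounded) values from the quoted bogofilter -R -vv output.
pgood = 0.016949   # token's good count / number of nonspam messages
pbad = 0.000091    # token's spam count / number of spam messages

# bogoutil reports that the token was seen once in each list.
good_count = 1
bad_count = 1

# Invert count / messages to estimate the number of messages trained.
nonspam_msgs = round(good_count / pgood)
spam_msgs = round(bad_count / pbad)

print(nonspam_msgs, spam_msgs)  # 59 10989
```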
Graham's p(w) is pbad/(pgood + pbad), or 0.0053404 (but Graham uses the
default 0.4 because the counts are low).
Robinson's f(w) is (0.1 * 0.5 + 2 * 0.0053404)/(0.1 + 2), or about
0.0289, in line with the fw of 0.028904 shown in the output.
This is an excellent example of Robinson's method compensating for low
counts by splitting the weight between the actual p(w) and the prior
guess of 0.5, as compared with Graham's approach of giving all the
weight to the prior guess until the counts reach a threshold.
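The arithmetic above can be sketched in a few lines of Python. This is
only an illustration, not bogofilter's code: the function names graham_p
and robinson_f are made up here, s = 0.1 and x = 0.5 are the values used
in the formula above, and the inputs are the rounded pgood and pbad from
the -vv output, so the result agrees with the reported fw only to about
three significant digits.

```python
def graham_p(pgood, pbad):
    """Raw p(w): the spam share of the token's per-list frequencies."""
    return pbad / (pgood + pbad)

def robinson_f(p, n, s=0.1, x=0.5):
    """Robinson's f(w): blend p with the prior guess x.

    s is the weight given to the prior, n is the token's total count.
    """
    return (s * x + n * p) / (s + n)

p = graham_p(pgood=0.016949, pbad=0.000091)
f = robinson_f(p, n=2)            # token seen once in each list, n = 2
print(round(p, 5), round(f, 4))   # 0.00534 0.0289

# With more observations the prior matters less and f(w) moves toward p(w).
f_many = robinson_f(p, n=200)
print(f_many < f)                 # True
```

With n = 2 the prior still pulls f(w) noticeably above p(w); with a
larger n the same p(w) would dominate almost completely.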
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |