Spam / ham registration issue

Greg Louis glouis at dynamicro.on.ca
Wed Mar 3 16:58:08 CET 2004


On 20040303 (Wed) at 1346:21 +0100, Boris 'pi' Piwinger wrote:
> 
> >> Is this a bug or user error?
> [...]
> > Looks like a bug to me.  I think the scores should be 0.415, 0.415,
> > 0.529 and 0.538 for the four No results, assuming the default s, x and
> > min_dev.  Admittedly, a database with just two tokens in it is a bit
> > atypical, but fwiw I can reproduce what you see, using bogofilter
> > version 0.17.2.
> 
> Let me redo the experiment using -vvv and -C (note that my
> lexer sees four tokens here) (quoted to avoid line wrapping):
> 
> > N_P_Q_S_s_x_md                       4  1.90e-06  1.00e-00  1.00e-00
> >                                         1.00e-02  4.15e-01  0.100
> > N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
> >                                         1.00e-02  4.15e-01  0.100
> > N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
> >                                         1.00e-02  4.15e-01  0.100
> > N_P_Q_S_s_x_md                       0  0.00e+00  0.00e+00  4.15e-01
> >                                         1.00e-02  4.15e-01  0.100

I did the calculation manually, with R.  I may well have blown it,
since I used the wrong distribution function; the spamicities for 2
and 3 are higher than I reported initially.  For a correct calculation
see below.  The per-token f(w) value for n=2 (one spam, one nonspam)
is within 0.1 (min_dev) of 0.5; those for n=3 and n=4 are not.

> s <- 0.01
> x <- 0.415
> b <- 1
> g <- 1
> pw <- b/(b+g)
> n <- b+g
> fw <- (x * s + n * pw) / (s + n)
> fw
[1] 0.4995771
> b <- 2
> pw <- b/(b+g)
> n <- b+g
> fw <- (x * s + n * pw) / (s + n)
> fw
[1] 0.6658306
> b <- 3
> pw <- b/(b+g)
> n <- b+g
> fw <- (x * s + n * pw) / (s + n)
> fw
[1] 0.7491646
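
For anyone who wants to repeat the three cases in one go, here is a
minimal sketch of the same f(w) formula wrapped in a function.  The
name token_fw() is made up for illustration and is not a bogofilter
function.  The "counted" column assumes a token is used only when its
f(w) is at least min_dev away from 0.5, which is how I read the
min_dev remark above:

```r
s  <- 0.01   # default s, as in the session above
x  <- 0.415  # default x
md <- 0.1    # default min_dev

# token_fw() is a hypothetical helper, not a bogofilter function
token_fw <- function(b, g) {
  n  <- b + g                 # total messages containing the token
  pw <- b / n                 # raw spam proportion
  (x * s + n * pw) / (s + n)  # same f(w) expression as in the session
}

b <- 1:3                      # one, two, three spam registrations
f <- token_fw(b, g = 1)       # one nonspam registration each time

# assumption: a token counts only if |f(w) - 0.5| >= min_dev
print(data.frame(spam = b, nonspam = 1, fw = f,
                 counted = abs(f - 0.5) >= md))
```

The fw column should match the three values printed above.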

Since standard bogofilter ignores "is" and "my", we have two tokens
with identical fw for each run.  Here's the calculation for three
spam, one nonspam (fw = 0.7491646):

> P <- (1-fw) ** 2
> Q <- fw ** 2
> lp <- -2 * log(P)
> lq <- -2 * log(Q)
> S <- (1 + pchisq(lp,4) - pchisq(lq,4)) / 2
> S
[1] 0.8242374
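
As a cross-check, the same combination can be written as a function
and compared against the closed form of the chi-square distribution
function for 4 degrees of freedom, 1 - exp(-q/2) * (1 + q/2).  This is
only a sketch: spamicity() and pchisq4() are names invented here, and
the k argument assumes k tokens that all share the same fw, as in the
two-token case above.

```r
# spamicity() and pchisq4() are hypothetical helper names
spamicity <- function(fw, k = 2) {
  P  <- (1 - fw)^k   # product of (1 - fw) over k identical tokens
  Q  <- fw^k         # product of fw over k identical tokens
  df <- 2 * k        # chi-square degrees of freedom
  (1 + pchisq(-2 * log(P), df) - pchisq(-2 * log(Q), df)) / 2
}

# closed form of the chi-square CDF, valid for 4 degrees of freedom only
pchisq4 <- function(q) 1 - exp(-q / 2) * (1 + q / 2)

fw <- 0.7491646      # three spam, one nonspam
S  <- spamicity(fw)

lp <- -2 * log((1 - fw)^2)
lq <- -2 * log(fw^2)
S_closed <- (1 + pchisq4(lp) - pchisq4(lq)) / 2

print(c(S = S, S_closed = S_closed))
```

Both numbers should agree with each other and with the 0.8242374
above; if they do, pchisq is at least being called the right way
round.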

Does anyone see an error here?  I don't, as yet.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
