evaluating possible new options
michael at optusnet.com.au
Fri May 16 01:51:48 CEST 2003
Greg Louis <glouis at dynamicro.on.ca> writes:
> summary(aov(pc ~ fold + head + html + fold*head + fold*html +
> + head*html + fold*head*html, data=parms))
>                Df   Sum Sq  Mean Sq  F value    Pr(>F)
> fold            1 0.038226 0.038226  26.5486 0.0008716 ***
> head            1 0.296242 0.296242 205.7430 5.448e-07 ***
> html            1 0.002262 0.002262   1.5709 0.2454608
> fold:head       1 0.061685 0.061685  42.8410 0.0001794 ***
> fold:html       1 0.001369 0.001369   0.9504 0.3581594
> head:html       1 0.000251 0.000251   0.1743 0.6872818
> fold:head:html  1 0.000251 0.000251   0.1746 0.6870339
> Residuals       8 0.011519 0.001440
A run from my corpus of 84875 spam and 48079 hams. The method
was to randomly divide the corpus into 4 equal blocks, then,
in turn, use one block to train and measure against
that block and the other three.
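
A minimal sketch of that block scheme, for anyone wanting to reproduce it. The function names are hypothetical, the integer ranges merely stand in for message identifiers, and the actual bogofilter training and scoring calls are left out:

```python
import random

def split_into_blocks(items, n_blocks=4, seed=0):
    """Shuffle items and deal them into n_blocks blocks of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]

def train_test_pairs(n_blocks=4):
    """Yield (train, test) block indices, including the train == test case."""
    for train in range(n_blocks):
        for test in range(n_blocks):
            yield train, test

# Stand-ins for the real corpora: one integer per message.
spam_blocks = split_into_blocks(range(84875))
ham_blocks = split_into_blocks(range(48079))
pairs = list(train_test_pairs())
```

Each pair corresponds to one "N against M" line in the output: train on block N, then score block M.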
Default bogofilter 0.12.3 with subject tagging turned on:
$ perl ./out-crunch out
CONFIG : Mindev 0.100, RobX 0.415
0 against 0 --> false pos 0 false neg 1425
0 against 1 --> false pos 0 false neg 4049
0 against 2 --> false pos 0 false neg 3977
0 against 3 --> false pos 0 false neg 3863
1 against 0 --> false pos 0 false neg 3770
1 against 1 --> false pos 0 false neg 1468
1 against 2 --> false pos 0 false neg 3873
1 against 3 --> false pos 0 false neg 3812
2 against 0 --> false pos 0 false neg 3859
2 against 1 --> false pos 0 false neg 3977
2 against 2 --> false pos 0 false neg 1467
2 against 3 --> false pos 0 false neg 3829
3 against 0 --> false pos 0 false neg 3923
3 against 1 --> false pos 0 false neg 4026
3 against 2 --> false pos 0 false neg 4026
3 against 3 --> false pos 0 false neg 1505
Then the same data with the latest CVS bogofilter and the -Puh
flag (i.e. turning off case folding).
[root at genconf73 db]# perl ./out-crunch out.1
CONFIG : Mindev 0.100, RobX 0.415
0 against 0 --> false pos 0 false neg 1172
0 against 1 --> false pos 0 false neg 3283
0 against 2 --> false pos 0 false neg 3196
0 against 3 --> false pos 0 false neg 3105
1 against 0 --> false pos 3 false neg 3123
1 against 1 --> false pos 0 false neg 1166
1 against 2 --> false pos 2 false neg 3175
1 against 3 --> false pos 1 false neg 3042
2 against 0 --> false pos 1 false neg 3204
2 against 1 --> false pos 0 false neg 3304
2 against 2 --> false pos 0 false neg 1189
2 against 3 --> false pos 0 false neg 3149
3 against 0 --> false pos 1 false neg 3191
3 against 1 --> false pos 2 false neg 3285
3 against 2 --> false pos 3 false neg 3282
3 against 3 --> false pos 0 false neg 1208
As you can see, there's been a jump in false positives. The 13 false
positives come from 8 unique messages (out of the 48,000 total
hams). Of those 8, 1 was actually spam that had been mis-filed, 1
was a submitted web form that definitely wasn't spam, and the
remaining 6 I couldn't tell.
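
The count of 13 is the sum of the "false pos" column over all 16 runs. A rough sketch of tallying it from result lines follows. The regular expression is an assumption based only on the line format shown above (the out-crunch script itself isn't reproduced here), and the sample is just the "1 against N" rows from the second run:

```python
import re

# Assumed line format: "<train> against <test> --> false pos <n> false neg <n>"
LINE_RE = re.compile(r"(\d+) against (\d+) --> false pos (\d+) false neg (\d+)")

def parse_results(text):
    """Return (train, test, false_pos, false_neg) tuples for matching lines."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows

sample = """\
CONFIG : Mindev 0.100, RobX 0.415
1 against 0 --> false pos 3 false neg 3123
1 against 1 --> false pos 0 false neg 1166
1 against 2 --> false pos 2 false neg 3175
1 against 3 --> false pos 1 false neg 3042
"""

rows = parse_results(sample)
total_fp = sum(r[2] for r in rows)  # false positives in this sample only
```

Run over the full 16 lines of the second run, the same sum gives the 13 quoted above.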
The good news, though, is the huge drop in false negatives. This is an
average drop from 15.6% to 12.7% of total spam volume (or a nearly 20%
drop in the spam getting through).
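
For reference, a sketch of arithmetic that reproduces those percentages. It rests on an assumption not spelled out above: that each rate is the mean false-negative count over all 16 runs (including the runs scored against the training block itself), divided by one quarter of the spam corpus.

```python
SPAM_TOTAL = 84875
N_BLOCKS = 4
block_size = SPAM_TOTAL / N_BLOCKS  # spam messages per test block (assumed)

# False-negative counts copied from the two result tables, in listing order.
fn_default = [1425, 4049, 3977, 3863, 3770, 1468, 3873, 3812,
              3859, 3977, 1467, 3829, 3923, 4026, 4026, 1505]
fn_nofold = [1172, 3283, 3196, 3105, 3123, 1166, 3175, 3042,
             3204, 3304, 1189, 3149, 3191, 3285, 3282, 1208]

def mean_fn_rate(counts):
    """Mean false negatives per run, as a fraction of one spam block."""
    return sum(counts) / len(counts) / block_size

rate_default = mean_fn_rate(fn_default)
rate_nofold = mean_fn_rate(fn_nofold)
relative_drop = 1 - rate_nofold / rate_default

print(f"default: {rate_default:.1%}, no folding: {rate_nofold:.1%}, "
      f"relative drop: {relative_drop:.1%}")
```

Under that assumption the relative drop works out to about 18.5%, consistent with "nearly 20%".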