evaluating possible new options

Fri May 16 02:12:11 CEST 2003

At 07:51 PM 5/15/03, michael at optusnet.com.au wrote:

>Greg Louis <glouis at dynamicro.on.ca> writes:
> > summary(aov(pc ~ fold + head + html + fold*head + fold*html +
> > +   head*html + fold*head*html, data=parms))
> >                Df   Sum Sq  Mean Sq  F value    Pr(>F)
> > fold            1 0.038226 0.038226  26.5486 0.0008716 ***
> > head            1 0.296242 0.296242 205.7430 5.448e-07 ***
> > html            1 0.002262 0.002262   1.5709 0.2454608
> > fold:head       1 0.061685 0.061685  42.8410 0.0001794 ***
> > fold:html       1 0.001369 0.001369   0.9504 0.3581594
> > head:html       1 0.000251 0.000251   0.1743 0.6872818
> > fold:head:html  1 0.000251 0.000251   0.1746 0.6870339
> > Residuals       8 0.011519 0.001440
>
>A run from my corpus of 84875 spam and 48079 hams. The method
>was to randomly divide the corpus into 4 equal blocks, then,
>in turn, train on one block and measure against
>that block and the other three.
>
>Default bogofilter 0.12.3 with subject tagging turned on:
>$ perl ./out-crunch out
>CONFIG : Mindev 0.100, RobX 0.415
>          0 against 0   --> false pos     0 false neg  1425
>          0 against 1   --> false pos     0 false neg  4049
>          0 against 2   --> false pos     0 false neg  3977
>          0 against 3   --> false pos     0 false neg  3863
>          1 against 0   --> false pos     0 false neg  3770
>          1 against 1   --> false pos     0 false neg  1468
>          1 against 2   --> false pos     0 false neg  3873
>          1 against 3   --> false pos     0 false neg  3812
>          2 against 0   --> false pos     0 false neg  3859
>          2 against 1   --> false pos     0 false neg  3977
>          2 against 2   --> false pos     0 false neg  1467
>          2 against 3   --> false pos     0 false neg  3829
>          3 against 0   --> false pos     0 false neg  3923
>          3 against 1   --> false pos     0 false neg  4026
>          3 against 2   --> false pos     0 false neg  4026
>          3 against 3   --> false pos     0 false neg  1505
>
>Then the same data with the latest CVS bogofilter and the -Puh
>flag (i.e. turning off case folding).
>
>[root at genconf73 db]# perl ./out-crunch out.1
>CONFIG : Mindev 0.100, RobX 0.415
>          0 against 0   --> false pos     0 false neg  1172
>          0 against 1   --> false pos     0 false neg  3283
>          0 against 2   --> false pos     0 false neg  3196
>          0 against 3   --> false pos     0 false neg  3105
>          1 against 0   --> false pos     3 false neg  3123
>          1 against 1   --> false pos     0 false neg  1166
>          1 against 2   --> false pos     2 false neg  3175
>          1 against 3   --> false pos     1 false neg  3042
>          2 against 0   --> false pos     1 false neg  3204
>          2 against 1   --> false pos     0 false neg  3304
>          2 against 2   --> false pos     0 false neg  1189
>          2 against 3   --> false pos     0 false neg  3149
>          3 against 0   --> false pos     1 false neg  3191
>          3 against 1   --> false pos     2 false neg  3285
>          3 against 2   --> false pos     3 false neg  3282
>          3 against 3   --> false pos     0 false neg  1208
>
>As you can see, there's been a jump in false positives. The 13 false
>positives come from 8 unique emails (out of the 48,000 total
>hams). Of those 8, 1 was actually spam that was mis-filed, 1
>was a submitted web form that definitely wasn't spam, and the
>remaining 6 I couldn't tell.
>
>The good news though is the huge drop in false negatives.  This is an
>average drop from 15.6% to 12.7% of total spam volume (or a nearly 20%
>drop in the spam getting through).

Michael,

The drop in false negatives is great!  Much more spam is getting caught.
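
For anyone who wants to check the percentages, they can be reproduced from
the two tables quoted above.  This is only a sketch in Python: the counts
are copied from the out-crunch output, the names are made up for
illustration, and it assumes the four blocks hold exactly equal shares of
the 84875 spam.

```python
# False-negative counts copied from the two out-crunch tables quoted above.
# Row = block used for training, column = block measured against.
FN_0_12_3 = [
    [1425, 4049, 3977, 3863],
    [3770, 1468, 3873, 3812],
    [3859, 3977, 1467, 3829],
    [3923, 4026, 4026, 1505],
]
FN_CVS_PUH = [
    [1172, 3283, 3196, 3105],
    [3123, 1166, 3175, 3042],
    [3204, 3304, 1189, 3149],
    [3191, 3285, 3282, 1208],
]

TOTAL_SPAM = 84875
BLOCKS = 4
# Assumption: the random split gives blocks of exactly equal size.
SPAM_PER_BLOCK = TOTAL_SPAM / BLOCKS


def mean_fn_rate(table):
    """Average false-negative rate over all 16 train/measure pairs."""
    counts = [n for row in table for n in row]
    return sum(counts) / len(counts) / SPAM_PER_BLOCK


before = mean_fn_rate(FN_0_12_3)
after = mean_fn_rate(FN_CVS_PUH)
relative_drop = (before - after) / before

print(f"0.12.3 default: {before:.1%}")         # 15.6%
print(f"CVS with -Puh:  {after:.1%}")          # 12.7%
print(f"relative drop:  {relative_drop:.1%}")  # 18.5%
```

The relative drop works out to about 18.5%, which matches the "nearly 20%"
figure.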

You don't mention your spam_cutoff value, or how you're choosing it.
Greg's methodology starts by scoring a corpus of non-spam to determine a
spam_cutoff value, then uses that value to score several corpora of spam
and count the false negatives (a.k.a. missed spam).  Doing that, together
with using wordlists built with the parameters being tested, gives him his
results.
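
Roughly, the two steps look like the sketch below.  It is an illustration
only, not Greg's actual scripts: the function names are hypothetical, the
scores are invented, and the rule for deriving the cutoff (allow a fixed
number of non-spam messages to score above it) is an assumption, as is the
strict "score > cutoff means spam" comparison.

```python
def pick_cutoff(ham_scores, allowed_fp=0):
    """Return a cutoff that at most `allowed_fp` ham scores exceed.

    Assumption for this sketch: a message counts as spam when its
    score is strictly greater than the cutoff.
    """
    ranked = sorted(ham_scores, reverse=True)
    if allowed_fp >= len(ranked):
        raise ValueError("allowed_fp must be smaller than the number of hams")
    # At most `allowed_fp` scores rank above this one.
    return ranked[allowed_fp]


def count_false_negatives(spam_scores, cutoff):
    """Count spam messages that are not scored above the cutoff."""
    return sum(1 for score in spam_scores if score <= cutoff)


# Invented scores, for illustration only.
ham_scores = [0.02, 0.10, 0.31, 0.47]
spam_scores = [0.45, 0.52, 0.90, 0.99]

cutoff = pick_cutoff(ham_scores, allowed_fp=0)
missed = count_false_negatives(spam_scores, cutoff)
print(cutoff, missed)
```

Fixing the false positives first means two parameter sets are compared on
false negatives alone, instead of trading one kind of error for the other.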

My experiments with bogofilter's default parameters (done a week or so
ago, before recent changes) indicate that a min_dev in the range of 0.35
to 0.45 would be best.  Also, when changing min_dev, it's best to change
spam_cutoff as well.  (FWIW, I'm presently using min_dev=0.40 and
spam_cutoff=0.500.)  Anyhow, you might find it interesting to test with
some higher min_dev values.
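
To show why min_dev and spam_cutoff move together: min_dev is the minimum
distance from 0.5 that a token's spam probability needs before the token
is used in the score.  The Python sketch below is not bogofilter's code.
The token probabilities are invented, the function name is hypothetical,
and whether the real comparison is ">=" or ">" is an assumption.

```python
def significant_tokens(token_probs, min_dev):
    """Keep only tokens whose probability is at least min_dev away from 0.5."""
    return {
        token: prob
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= min_dev
    }


# Invented per-token spam probabilities, for illustration only.
token_probs = {
    "viagra": 0.99,
    "meeting": 0.05,
    "free": 0.80,
    "report": 0.35,
    "the": 0.52,
}

low = significant_tokens(token_probs, min_dev=0.1)
high = significant_tokens(token_probs, min_dev=0.4)
print(sorted(low))   # ['free', 'meeting', 'report', 'viagra']
print(sorted(high))  # ['meeting', 'viagra']
```

A higher min_dev leaves fewer, more extreme tokens in the calculation, so
the distribution of message scores shifts and a spam_cutoff tuned for the
old setting may no longer be the right one.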

David