evaluating possible new options
David Relson
relson at osagesoftware.com
Fri May 16 02:12:11 CEST 2003
At 07:51 PM 5/15/03, michael at optusnet.com.au wrote:
>Greg Louis <glouis at dynamicro.on.ca> writes:
> > summary(aov(pc ~ fold + head + html + fold*head + fold*html +
> > + head*html + fold*head*html, data=parms))
> >                Df   Sum Sq  Mean Sq  F value    Pr(>F)
> > fold            1 0.038226 0.038226  26.5486 0.0008716 ***
> > head            1 0.296242 0.296242 205.7430 5.448e-07 ***
> > html            1 0.002262 0.002262   1.5709 0.2454608
> > fold:head       1 0.061685 0.061685  42.8410 0.0001794 ***
> > fold:html       1 0.001369 0.001369   0.9504 0.3581594
> > head:html       1 0.000251 0.000251   0.1743 0.6872818
> > fold:head:html  1 0.000251 0.000251   0.1746 0.6870339
> > Residuals       8 0.011519 0.001440
>
>A run from my corpus of 84875 spam and 48079 hams. The method
>was to randomly divide it into 4 equal blocks, then, in turn,
>train on one block and measure against that block and the
>other three.
>
>Default bogofilter 0.12.3 with subject tagging turned on:
>$ perl ./out-crunch out
>CONFIG : Mindev 0.100, RobX 0.415
> 0 against 0 --> false pos 0 false neg 1425
> 0 against 1 --> false pos 0 false neg 4049
> 0 against 2 --> false pos 0 false neg 3977
> 0 against 3 --> false pos 0 false neg 3863
> 1 against 0 --> false pos 0 false neg 3770
> 1 against 1 --> false pos 0 false neg 1468
> 1 against 2 --> false pos 0 false neg 3873
> 1 against 3 --> false pos 0 false neg 3812
> 2 against 0 --> false pos 0 false neg 3859
> 2 against 1 --> false pos 0 false neg 3977
> 2 against 2 --> false pos 0 false neg 1467
> 2 against 3 --> false pos 0 false neg 3829
> 3 against 0 --> false pos 0 false neg 3923
> 3 against 1 --> false pos 0 false neg 4026
> 3 against 2 --> false pos 0 false neg 4026
> 3 against 3 --> false pos 0 false neg 1505
>
>Then the same data with latest CVS bogofilter with -Puh
>flag. (i.e. turning off case folding).
>
>[root@genconf73 db]# perl ./out-crunch out.1
>CONFIG : Mindev 0.100, RobX 0.415
> 0 against 0 --> false pos 0 false neg 1172
> 0 against 1 --> false pos 0 false neg 3283
> 0 against 2 --> false pos 0 false neg 3196
> 0 against 3 --> false pos 0 false neg 3105
> 1 against 0 --> false pos 3 false neg 3123
> 1 against 1 --> false pos 0 false neg 1166
> 1 against 2 --> false pos 2 false neg 3175
> 1 against 3 --> false pos 1 false neg 3042
> 2 against 0 --> false pos 1 false neg 3204
> 2 against 1 --> false pos 0 false neg 3304
> 2 against 2 --> false pos 0 false neg 1189
> 2 against 3 --> false pos 0 false neg 3149
> 3 against 0 --> false pos 1 false neg 3191
> 3 against 1 --> false pos 2 false neg 3285
> 3 against 2 --> false pos 3 false neg 3282
> 3 against 3 --> false pos 0 false neg 1208
>
>As you can see, there's been a jump in false positives. The 13 false
>positives come from 8 unique emails (out of the 48,000 total
>hams). Of those 8, 1 was actually spam that had been misfiled, 1
>was a submitted web form that definitely wasn't spam, and the
>remaining 6 I couldn't tell.
>
>The good news though is the huge drop in false negatives. This is an
>average drop from 15.6% to 12.7% of total spam volume (or a nearly 20%
>drop in the spam getting through).
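The percentages quoted above can be checked from the two listings. Here is a minimal Python sketch, assuming each of the four spam blocks holds 84875 / 4 messages and that the averages cover all 16 train/test pairs, including the pairs where a block is scored against itself. The variable and function names are made up for illustration.

```python
# False-negative counts copied from the two out-crunch listings above.
# Row = training block, column = block scored against.
DEFAULT_RUN = [
    [1425, 4049, 3977, 3863],
    [3770, 1468, 3873, 3812],
    [3859, 3977, 1467, 3829],
    [3923, 4026, 4026, 1505],
]
NO_FOLD_RUN = [
    [1172, 3283, 3196, 3105],
    [3123, 1166, 3175, 3042],
    [3204, 3304, 1189, 3149],
    [3191, 3285, 3282, 1208],
]

TOTAL_SPAM = 84875
BLOCKS = 4
# Assumption: the random split gives blocks of (nearly) equal size.
SPAM_PER_BLOCK = TOTAL_SPAM / BLOCKS


def false_negative_rate(matrix):
    """Fraction of scored spam that was missed, over all train/test pairs."""
    missed = sum(sum(row) for row in matrix)
    scored = SPAM_PER_BLOCK * len(matrix) * len(matrix[0])
    return missed / scored


before = false_negative_rate(DEFAULT_RUN)
after = false_negative_rate(NO_FOLD_RUN)
relative_drop = (before - after) / before

print(f"default run:      {before:.1%}")
print(f"no case folding:  {after:.1%}")
print(f"relative drop:    {relative_drop:.1%}")
```

Under those assumptions the sketch reproduces the 15.6% and 12.7% averages, and the relative drop comes out at about 18.5%, which fits "nearly 20%".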
Michael,
The drop in false negatives is great! Much more spam is getting caught.
You don't mention your spam_cutoff value or how you're choosing
it. Greg's methodology starts by scoring a corpus of non-spam to determine
a spam_cutoff value, then uses that value to score several corpora of spam
and count the false negatives (a.k.a. missed spam). Doing that, together
with using wordlists built with the parameters being tested, gives him his
results.
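The cutoff-first approach described above might be sketched roughly as follows in Python. This is not Greg's actual tooling: the function names, the `allowed_fp` knob, and the sample scores are all hypothetical, and the sketch assumes a message counts as spam only when its score is strictly above the cutoff.

```python
def pick_cutoff(ham_scores, allowed_fp=0):
    """Pick a cutoff from non-spam scores.

    Returns the (allowed_fp + 1)-th highest ham score, so that at most
    allowed_fp hams score strictly above the cutoff.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    ranked = sorted(ham_scores, reverse=True)
    index = min(allowed_fp, len(ranked) - 1)
    return ranked[index]


def count_false_negatives(spam_scores, cutoff):
    """Count spam that would be missed, i.e. spam scoring at or below the cutoff."""
    return sum(1 for score in spam_scores if score <= cutoff)


# Hypothetical scores, for illustration only.
ham_scores = [0.01, 0.20, 0.45, 0.30]
spam_scores = [0.99, 0.40, 0.50, 0.95]

strict_cutoff = pick_cutoff(ham_scores)                # no false positives tolerated
loose_cutoff = pick_cutoff(ham_scores, allowed_fp=1)   # one false positive tolerated

print(strict_cutoff, count_false_negatives(spam_scores, strict_cutoff))
print(loose_cutoff, count_false_negatives(spam_scores, loose_cutoff))
```

The point of fixing the cutoff from the ham corpus first is that every parameter set is then compared at the same false-positive budget, so the false-negative counts are directly comparable.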
My experiments with bogofilter's default parameters (done a week or so
ago, before the recent changes) indicate that a min_dev in the range of 0.35
to 0.45 would be best. Also, when changing min_dev, it's best to change
spam_cutoff as well. (FWIW, I'm presently using min_dev=0.40 and
spam_cutoff=0.500.) Anyhow, you might find it interesting to test with
some higher min_dev values.
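To illustrate why a higher min_dev changes the picture, here is a small Python sketch. It assumes min_dev works as a minimum distance from 0.5 that a token's spam probability needs before the token takes part in scoring. The token probabilities and the function name are invented for the example, and this is not bogofilter's actual code.

```python
def significant_tokens(token_probs, min_dev):
    """Keep tokens whose spam probability is at least min_dev away from 0.5."""
    return {
        token: prob
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= min_dev
    }


# Hypothetical per-token spam probabilities.
token_probs = {
    "viagra": 0.99,
    "meeting": 0.05,
    "free": 0.80,
    "click": 0.88,
    "the": 0.52,
}

for min_dev in (0.1, 0.4):
    kept = significant_tokens(token_probs, min_dev)
    print(min_dev, sorted(kept))
```

With min_dev at 0.1 only the near-neutral token drops out. At 0.4 only the most strongly spammy or hammy tokens remain, which shifts the distribution of message scores and is one reason spam_cutoff usually needs retuning alongside it.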
David