default parameters - new vs old vs mine
David Relson
relson at osagesoftware.com
Tue Mar 30 03:19:40 CEST 2004
Greetings All,
FWIW, I decided to compare bogofilter's "old" parameter set (as
presently used in all versions of bogofilter) to the new parameters (as
found by Greg's mondo/huge bogotune run). For good measure, I also
included the parameters that I'm currently using on my site (from a
bogotune run, with some minor (manual) changes).
For the comparison, I used all the email received at my site since
bogofilter went into production use in Oct 2002. The message counts are
89,330 ham and 74,917 spam. I ran bogofilter 4 times using the 3
parameter sets described above (and shown in the table below). Using
the results of the "new" parameters, I was able to see that lowering
spam_cutoff from 0.99 to 0.90 would not affect the number of false
positives and that lowering it to 0.70 would duplicate the fp counts for
the "old" parameters. In the tables below, I've included the counts for
these 2 additional values of spam_cutoff.
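The re-thresholding described above can be sketched roughly as follows. This is only an illustration, not bogofilter's code: the function names, the sample scores, and the boundary convention (spam at or above spam_cutoff, ham at or below ham_cutoff) are all assumptions made for the example. The point is that once per-message scores from one run are saved, other values of spam_cutoff can be tried without rescoring.

```python
# Minimal sketch: re-tally saved scores under different spam_cutoff values.
# Everything here is hypothetical; the scores below are made up.

def classify(score, ham_cutoff, spam_cutoff):
    """Return 'h', 'u' or 's' for one score (assumed boundary convention)."""
    if score >= spam_cutoff:
        return "s"
    if score <= ham_cutoff:
        return "h"
    return "u"

def tally(scored, ham_cutoff, spam_cutoff):
    """scored: iterable of (truth, score), truth being 'h' or 's'.

    Returns counts keyed like the accuracy table: hh, hu, hs, sh, su, ss.
    """
    counts = {key: 0 for key in ("hh", "hu", "hs", "sh", "su", "ss")}
    for truth, score in scored:
        counts[truth + classify(score, ham_cutoff, spam_cutoff)] += 1
    return counts

# Hypothetical (truth, score) pairs, NOT taken from the real corpus.
SAMPLE = [("h", 0.02), ("h", 0.60), ("h", 0.95),
          ("s", 0.30), ("s", 0.80), ("s", 0.995)]

for cutoff in (0.99, 0.90, 0.70):
    print(cutoff, tally(SAMPLE, ham_cutoff=0.45, spam_cutoff=cutoff))
```

Only the unsure/spam boundary moves when spam_cutoff changes, so the hh and sh counts stay fixed as long as ham_cutoff is left alone.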
The "accuracy" table shows how many ham were classified as ham, as
unsure, and as spam, as well as how many spam were classified as ham, as
unsure, and as spam. A perfect score would have entries only in the
"hh" (ham scored as ham) and "ss" (spam scored as spam) columns.
Here are the results:
Parameters:
ver       robs      robx      min_dev   spam_co   ham_co
old       0.010000  0.415000  0.100000  0.950000  0.100000
new-0.99  0.017800  0.520000  0.375000  0.990000  0.450000
new-0.90  0.017800  0.520000  0.375000  0.900000  0.450000
new-0.70  0.017800  0.520000  0.375000  0.700000  0.450000
mine      0.017800  0.549138  0.435000  0.501000  0.376000
Classification Accuracy:
ver          hh   hu  hs  sh   su     ss
old       88673  650   7   0  604  74313
new-0.99  88965  362   3   2  850  74065
new-0.90  88965  362   3   2  549  74366
new-0.70  88965  357   7   2  427  74588
mine      88955  359  16   4  274  74639
My main purpose in doing this was to see how well the new parameters
compare to the old parameters. Offhand, I'd say they look good and are
eminently usable.
What does this all mean? That's hard to say. It can be observed that
the new parameters give the fewest false positives (3 at spam_cutoff
0.99 or 0.90), but at 0.99 they also leave more spam unsure (850, versus
604 for "old" and 274 for "mine"). It can also be observed that my local
parameters give the most false positives (16), though that is still only
16 of 89,330 ham, or roughly 1 in 5,600.
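For anyone who wants to check the arithmetic, here is a small sketch that derives false-positive rates and unsure totals from the accuracy table. The counts are copied from the table as posted; the names RESULTS and summarize are made up for this example. Each row is divided by its own total rather than by the stated corpus size, because the new-0.70 row as posted does not add up exactly to 89,330 ham and 74,917 spam.

```python
# Sketch: summary rates from the posted accuracy table.
# Tuple order per row: (hh, hu, hs, sh, su, ss), copied from the table.
RESULTS = {
    "old":      (88673, 650, 7, 0, 604, 74313),
    "new-0.99": (88965, 362, 3, 2, 850, 74065),
    "new-0.90": (88965, 362, 3, 2, 549, 74366),
    "new-0.70": (88965, 357, 7, 2, 427, 74588),
    "mine":     (88955, 359, 16, 4, 274, 74639),
}

def summarize(hh, hu, hs, sh, su, ss):
    """Return rates for one row, using that row's own ham and spam totals."""
    ham_total = hh + hu + hs
    spam_total = sh + su + ss
    return {
        "fp_rate": hs / ham_total,          # ham classified as spam
        "fn_rate": sh / spam_total,         # spam classified as ham
        "unsure": hu + su,                  # all messages left unsure
        # ham messages per false positive; infinite if there were none
        "ham_per_fp": ham_total / hs if hs else float("inf"),
    }

for name, row in RESULTS.items():
    s = summarize(*row)
    print(f"{name:9s} fp={row[2]:2d} (1 in {s['ham_per_fp']:,.0f} ham)  "
          f"fn={row[3]}  unsure={s['unsure']}")
```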
David