subject-tagging test results
David Relson
relson at osagesoftware.com
Fri Feb 14 15:39:36 CET 2003
At 08:06 AM 2/14/03, Greg Louis wrote:
>This will be written up later today or tomorrow on
>http://www.bgl.nu/bogofilter, for those who are interested in the
>details.
>
>I ran a test with 21,969 spams and 19,823 nonspams in the training
>corpus; the test involved two runs, each with 5,948 spams and 3,204
>nonspams. There were two factors: with and without David's new header
>tagging feature, and min_dev varying from 0.025 to 0.125 in steps of
>0.025. At each point, the nonspams were used to establish a
>spam-cutoff value that would give 0.1% false positives, and then the
>number of false negatives at that cutoff was determined.
>
>Without header tagging, the optimum min_dev value was 0.075, and at
>that value there were 3.9% false negatives (233 per run, on average).
>With header tagging, the optimum min_dev was 0.05, with 3.5% false
>negatives (214 per run). At low min_dev values (0.025 and 0.05),
>tagging was beneficial; at 0.075 it was neither beneficial nor harmful;
>at 0.1 and 0.125 tagging yielded more (though not many more) false
>negatives.
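For anyone who wants to repeat this kind of measurement, the procedure described above (choose the cutoff from the nonspam scores, then count spams at or below it) can be sketched roughly as follows. This is only an illustration, not Greg's actual test script: the function names are made up, and it assumes a message is called spam when its score is strictly greater than the cutoff.

```python
def cutoff_for_fp_rate(nonspam_scores, fp_rate):
    """Return a cutoff that lets through at most fp_rate of the nonspams.

    Assumption: a message counts as spam when score > cutoff.
    """
    ranked = sorted(nonspam_scores, reverse=True)
    allowed = int(fp_rate * len(ranked))  # false positives we may tolerate
    if allowed >= len(ranked):
        raise ValueError("fp_rate allows every nonspam to be misclassified")
    # Only scores strictly above ranked[allowed] exceed the cutoff,
    # and there are at most `allowed` of those.
    return ranked[allowed]


def false_negatives(spam_scores, cutoff):
    """Count spams that are not flagged, i.e. score at or below the cutoff."""
    return sum(1 for s in spam_scores if s <= cutoff)


# Tiny made-up example: 4 nonspams, 25% false positives allowed.
nonspams = [0.1, 0.2, 0.3, 0.9]
spams = [0.95, 0.25, 0.99]
cutoff = cutoff_for_fp_rate(nonspams, 0.25)
print(cutoff, false_negatives(spams, cutoff))
```

With 3,204 nonspams per run, a 0.1% target allows 3 false positives, and 233 misses out of 5,948 spams is the 3.9% quoted above.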
Interesting that small changes in min_dev produce noticeable changes in the
results.
With robx at 0.415, changing min_dev from 0.10 to 0.075 includes the
unknown words (which are scored at robx). Decreasing min_dev further to
0.05 includes more tokens that are nearly neutral and gave better
results. It seems that including these nearly neutral words _does_ help
bogofilter's classification abilities.
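To make the robx arithmetic concrete, here is a small sketch of the min_dev test. It is not bogofilter's code: the helper name and the token scores are invented, and it assumes a token is kept when its score is at least min_dev away from the neutral 0.5 (whether the real comparison is ">=" or ">" is a guess on my part).

```python
ROBX = 0.415  # score assigned to words not in the wordlists


def tokens_used(scores, min_dev):
    """Keep tokens whose score deviates from 0.5 by at least min_dev."""
    return {tok: p for tok, p in scores.items() if abs(p - 0.5) >= min_dev}


# Hypothetical token scores, for illustration only.
scores = {
    "spammy": 0.98,       # strongly spam
    "hammy": 0.04,        # strongly nonspam
    "never-seen": ROBX,   # unknown word, deviation 0.085
    "so-so": 0.44,        # nearly neutral, deviation 0.06
}

for min_dev in (0.10, 0.075, 0.05):
    print(min_dev, sorted(tokens_used(scores, min_dev)))
```

The unknown word sits 0.085 from neutral, so it is ignored at min_dev 0.10 and counted at 0.075; the nearly neutral token only comes in at 0.05.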
I'll rerun my test with 0.075 and 0.05 to see if the results match. If so,
perhaps we should change the default min_dev for fisher from 0.1 to 0.05.
>The min_dev value of 0.125 gave the worst results, both without and
>with tagging (4.9 and 5.0% respectively). I mention this to underline
>the point that, although _optimum_ performance depends on appropriate
>tuning of bogofilter's various parameters, it's not necessary to get it
>_exactly_right_ to enjoy the benefit of spam filtering.
Good point. Nick is seeing that a Robinson spam_cutoff around 0.34 or 0.38
works for him - no false positives. Given my use of Robinson-Fisher and
the variation in my unsures, I can't go below 0.885. I guess what they say
is true - "Different strokes for different folks."