subject-tagging test results

David Relson relson at osagesoftware.com
Fri Feb 14 15:39:36 CET 2003


At 08:06 AM 2/14/03, Greg Louis wrote:

>This will be written up later today or tomorrow on
>http://www.bgl.nu/bogofilter, for those who are interested in the
>details.
>
>I ran a test with 21,969 spams and 19,823 nonspams in the training
>corpus; the test involved two runs, each with 5,948 spams and 3,204
>nonspams.  There were two factors: with and without David's new header
>tagging feature, and min_dev varying from 0.025 to 0.125 in steps of
>0.025.  At each point, the nonspams were used to establish a
>spam-cutoff value that would give 0.1% false positives, and then the
>number of false negatives at that cutoff was determined.
>
>Without header tagging, the optimum min_dev value was 0.075, and at
>that value there were 3.9% false negatives (233 per run, on average).
>With header tagging, the optimum min_dev was 0.05, with 3.5% false
>negatives (214 per run).  At low min_dev values (0.025 and 0.05),
>tagging was beneficial; at 0.075 it was neither beneficial nor harmful;
>at 0.1 and 0.125 tagging yielded more (though not many more) false
>negatives.
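
For anyone wanting to repeat the procedure Greg describes, here is a 
minimal sketch in Python.  It is not Greg's actual script: the function 
names are hypothetical, and it assumes a message is called spam when its 
score is strictly above the cutoff.

```python
import math


def cutoff_for_fp_rate(nonspam_scores, fp_rate=0.001):
    """Pick a cutoff so that at most fp_rate of the nonspams score above it.

    Assumption: a message is classified as spam when score > cutoff.
    """
    ranked = sorted(nonspam_scores, reverse=True)
    # Number of nonspams we are allowed to misclassify.
    allowed = math.floor(len(ranked) * fp_rate)
    # Using the (allowed + 1)-th highest nonspam score as the cutoff means
    # at most `allowed` nonspams lie strictly above it.
    return ranked[allowed]


def count_false_negatives(spam_scores, cutoff):
    """Count spams that are not above the cutoff, i.e. spams that get through."""
    return sum(1 for score in spam_scores if score <= cutoff)
```

With 3,204 nonspams per run, a 0.1% target allows 3 false positives.  The 
233 false negatives quoted above, out of 5,948 spams, work out to about 
3.9%.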

Interesting that small changes in min_dev produce noticeable changes in the 
results.

With robx at 0.415, unknown words (which are scored at robx) sit 0.085 away 
from the neutral 0.5, so changing min_dev from 0.10 to 0.075 brings them 
into the calculation.  Decreasing min_dev further to 0.05 includes more 
tokens that are nearly neutral, and that gave better results.  It seems 
that including these nearly neutral words _does_ help bogofilter's 
classification abilities.
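
The effect of min_dev on which tokens count can be sketched in a few lines 
of Python.  This is an illustration only, not bogofilter's code: the token 
names and their scores are made up, and it assumes a token is kept when its 
distance from 0.5 is at least min_dev.

```python
ROBX = 0.415  # score given to words never seen in training (value from this thread)


def tokens_used(token_scores, min_dev):
    """Keep only tokens whose score deviates from 0.5 by at least min_dev."""
    return {
        token: score
        for token, score in token_scores.items()
        if abs(score - 0.5) >= min_dev
    }


# Hypothetical token scores, chosen only to illustrate the thresholds.
scores = {
    "unknown-word": ROBX,  # deviation 0.085
    "spammy": 0.99,        # deviation 0.49
    "hammy": 0.04,         # deviation 0.46
    "nearly": 0.44,        # deviation 0.06
    "neutral": 0.47,       # deviation 0.03
}

for min_dev in (0.10, 0.075, 0.05, 0.025):
    print(min_dev, sorted(tokens_used(scores, min_dev)))
```

The unknown word drops out at 0.10 and comes back at 0.075; the nearly 
neutral tokens only join in at 0.05 and below.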

I'll rerun my test with 0.075 and 0.05 to see if the results match.  If so, 
perhaps we should change the default min_dev for fisher from 0.1 to 0.05.

>The min_dev value of 0.125 gave the worst results, both without and
>with tagging (4.9 and 5.0% respectively).  I mention this to underline
>the point that, although _optimum_ performance depends on appropriate
>tuning of bogofilter's various parameters, it's not necessary to get it
>_exactly_right_ to enjoy the benefit of spam filtering.

Good point.  Nick finds that a Robinson spam_cutoff around 0.34 or 0.38 
works for him, with no false positives.  Given my use of Robinson-Fisher 
and the variation in my unsures, I can't go below 0.885.  I guess what they 
say is true: "Different strokes for different folks."




