Security margins in training (on error and to exhaustion)

David Relson relson at osagesoftware.com
Wed Dec 10 13:34:59 CET 2003


Hi pi,

Some interesting results!

To summarize:  a larger margin builds a larger database and gives better
classification results.

I suspect another way to get comparable results is to train with all
messages except those that score very high or very low.  In other words,
if a message scores below 0.01 or above 0.99, don't use it for training,
but train with all other messages.  The ideal numbers probably aren't
really 0.01 and 0.99, but something similar, perhaps 0.02 & 0.98 or 0.05
& 0.95.

To summarize this idea:  train on all messages except the extrema.
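
As a rough illustration of that idea, here is a minimal Python sketch. It
is not bogofilter code: the function names and the (id, score) input
format are hypothetical, and the 0.01/0.99 thresholds are just the
placeholder values from above.

```python
# Sketch only: "train on everything except the extrema".
# should_train() and select_for_training() are hypothetical helpers,
# not part of bogofilter.

def should_train(score, low=0.01, high=0.99):
    """Return True if the score is not an extreme, i.e. low <= score <= high."""
    return low <= score <= high


def select_for_training(scored_messages, low=0.01, high=0.99):
    """Keep the (message_id, score) pairs whose score is not an extreme."""
    return [
        (message_id, score)
        for message_id, score in scored_messages
        if should_train(score, low, high)
    ]


if __name__ == "__main__":
    # Made-up scores, purely for demonstration.
    sample = [("a", 0.0001), ("b", 0.42), ("c", 0.9999), ("d", 0.97)]
    for message_id, score in select_for_training(sample):
        print(message_id, score)
```

Trying other margins such as 0.02/0.98 or 0.05/0.95 only means passing
different low and high values.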

FYI:  I'm running bogofilter with spam_cutoff at 0.501 and ham_cutoff at
0.376.  There's little room left for a security margin.  Most of my
unsures score 0.499xxx or 0.500000.  As you know, I'm also autoupdating
for ham and spam and am manually classifying the unsures and using a
cron job to load the classified unsures into the wordlist.
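
To show how narrow that leaves the unsure band, here is a small Python
sketch of the three-way decision with my cutoffs. The classify() function
is hypothetical, and the boundary handling (at or above spam_cutoff is
spam, at or below ham_cutoff is ham) is an assumption made for
illustration.

```python
# Sketch only: three-way classification using the cutoffs quoted above.
# classify() is a hypothetical helper; the inclusive boundaries are an
# assumption, not taken from bogofilter's source.

SPAM_CUTOFF = 0.501
HAM_CUTOFF = 0.376


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a score in [0, 1] to 'spam', 'ham', or 'unsure'."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    for score in (0.1, 0.376, 0.499, 0.5, 0.501, 0.9):
        print(score, classify(score))
```

With these values, a score of 0.499xxx or 0.500000 lands in the unsure
band, just under the spam cutoff.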

Cheers!

David



