Security margins in training (on error and to exhaustion)

Wed Dec 10 13:48:30 CET 2003

David Relson wrote:

> Some interesting results!

I hope so.

> To summarize:  a larger margin builds a larger database and gives better
> classification results.

At least up to some point. As a KISS answer I'd suggest using
spam_cutoff +/- 0.3 as the training interval (assuming
ham_cutoff = spam_cutoff, i.e. ham_cutoff=0).
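
To make the interval concrete, here is a minimal Python sketch of
the decision. It is not bogofilter code: the function name
needs_training and the defaults (spam_cutoff 0.5, margin 0.3) are
assumptions for illustration, and the score is whatever the filter
reports for the message.

```python
def needs_training(score, is_spam, spam_cutoff=0.5, margin=0.3):
    """Train-on-error with a security margin (illustrative sketch).

    A spam is trained unless it already scores at least
    spam_cutoff + margin; a ham is trained unless it already scores
    at most spam_cutoff - margin. Misclassified messages always fall
    inside the interval, so they are always trained.
    """
    if is_spam:
        return score < spam_cutoff + margin
    return score > spam_cutoff - margin


# A spam at 0.60 is classified correctly but is still trained,
# because it sits inside the margin; a spam at 0.95 is skipped.
print(needs_training(0.60, is_spam=True))
print(needs_training(0.95, is_spam=True))
```

With margin=0 this reduces to plain train-on-error.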

> I suspect another way to get comparable results is to train with all
> messages except those that score very high or very low.  In other words,
> if a message scores below 0.01 or above 0.99 don't use it for training,
> but train with all other messages. 

Yes, that amounts to the same thing. The only question then is
whether you want to repeat until no errors remain.
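
Repeating until nothing is left to train ("to exhaustion") could be
sketched as follows. This is an assumption-laden illustration, not
bogofilter's implementation: score and train are hypothetical
callables standing in for the filter, and max_passes is an added
safety limit.

```python
def train_to_exhaustion(corpus, score, train,
                        spam_cutoff=0.5, margin=0.3, max_passes=20):
    """Repeat margin-based training until a full pass trains nothing.

    corpus -- list of (message, is_spam) pairs
    score  -- hypothetical callable: message -> score in [0, 1]
    train  -- hypothetical callable: (message, is_spam) -> None
    Returns the number of passes made over the corpus.
    """
    for passes in range(1, max_passes + 1):
        trained = 0
        for msg, is_spam in corpus:
            s = score(msg)
            # Inside the margin (or plainly wrong): train on it.
            if is_spam:
                inside = s < spam_cutoff + margin
            else:
                inside = s > spam_cutoff - margin
            if inside:
                train(msg, is_spam)
                trained += 1
        if trained == 0:
            return passes  # every message is outside the margin
    return max_passes      # gave up without converging
```

The final pass is the one that finds nothing to train, so the return
value is always one more than the number of passes that changed the
wordlist.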

> The ideal numbers probably aren't
> really 0.01 and 0.99, but something similar, perhaps 0.02 & 0.98 or 0.05
> & 0.95.

I'd even assume values closer to the middle; as I said above,
more like 0.2 and 0.8.

> FYI:  I'm running bogofilter with spam_cutoff at 0.501 and ham_cutoff at
> 0.376.  There's little room left for a security margin.

Well, you could train on everything that scores between 0.1 and
0.8, but still classify messages with the cutoffs you have now.
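
The point is that the training window and the classification cutoffs
are two independent settings. A minimal sketch, using the numbers
from this thread; the function names are made up, and the handling of
scores exactly on a boundary (>= and <=) is an assumption:

```python
def classify(score, ham_cutoff=0.376, spam_cutoff=0.501):
    """Three-way verdict using the cutoffs quoted in this thread."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def in_training_window(score, low=0.1, high=0.8):
    """Label-independent variant: train on anything not extreme."""
    return low < score < high


# A message at 0.70 is rated spam, yet it is still used for
# training because it lies inside the 0.1 .. 0.8 window.
print(classify(0.70), in_training_window(0.70))
```

Note that this label-independent variant also skips a message that
scores outside the window on the wrong side; the interval sketch
based on the true class does not.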

> Most of my
> unsures score 0.499xxx or 0.500000.  As you know, I'm also autoupdating
> for ham and spam

That, of course, would need to change to follow this concept:
the autoupdate would have to ignore the extreme messages.

> and am manually classifying the unsures and using a
> cron job to load the classified unsures into the wordlist.

That of course would still be fine.

pi