Security margins in training (on error and to exhaustion)
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 10 13:48:30 CET 2003
David Relson wrote:
> Some interesting results!
I hope so.
> To summarize: a larger margin builds a larger database and gives better
> classification results.
At least up to some point. As a KISS answer, I'd suggest using
spam_cutoff +/- 0.3 as the interval (assuming ham_cutoff =
spam_cutoff, IOW: ham_cutoff=0).
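One way to read that interval rule, as a rough Python sketch: a message is worth training on if it is misclassified, or if it is classified correctly but still lands inside the margin. The function name and the 0.5 default cutoff are assumptions for illustration, not anything bogofilter provides.

```python
def needs_training(score, is_spam, spam_cutoff=0.5, margin=0.3):
    """Decide whether a labelled message should be trained on.

    Hypothetical helper: spam is 'safe' only at or above
    spam_cutoff + margin, ham only at or below spam_cutoff - margin.
    Anything else (errors and near misses) gets trained.
    """
    if is_spam:
        return score < spam_cutoff + margin
    return score > spam_cutoff - margin


# Spam at 0.7 is classified correctly but sits inside the margin.
print(needs_training(0.7, is_spam=True))
# Ham at 0.1 is safely outside the margin.
print(needs_training(0.1, is_spam=False))
```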
> I suspect another way to get comparable results is to train with all
> messages except those that score very high or very low. In other words,
> if a message scores below 0.01 or above 0.99, don't use it for training,
> but train with all other messages.
Yes, that is the same. The only question then is whether you
want to repeat until no errors remain.
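To make "repeat until no errors remain" concrete, here is a toy sketch of training to exhaustion with a margin. ToyFilter is a made-up stand-in scorer, not bogofilter's algorithm, and all names and defaults are assumptions. The pass limit is there because messages sharing tokens across ham and spam may never all clear the margin.

```python
from collections import Counter


class ToyFilter:
    """Stand-in scorer (NOT bogofilter): fraction of a message's
    token occurrences that were seen in spam training."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(tokens)

    def score(self, tokens):
        s = sum(self.spam[t] for t in tokens)
        h = sum(self.ham[t] for t in tokens)
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(filt, corpus, spam_cutoff=0.5, margin=0.3,
                        max_passes=20):
    """Pass over the corpus until no message needs training.

    Returns the number of passes made, or None if max_passes
    was reached without a clean pass.
    """
    for passes in range(1, max_passes + 1):
        trained = 0
        for tokens, is_spam in corpus:
            score = filt.score(tokens)
            # Train on errors and on correct results inside the margin.
            if is_spam:
                needed = score < spam_cutoff + margin
            else:
                needed = score > spam_cutoff - margin
            if needed:
                filt.train(tokens, is_spam)
                trained += 1
        if trained == 0:
            return passes
    return None


corpus = [(["cheap", "pills"], True), (["lunch", "meeting"], False)]
filt = ToyFilter()
passes = train_to_exhaustion(filt, corpus)
```

With this tiny corpus the first pass trains both messages and the second pass finds nothing left to do.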
> The ideal numbers probably aren't
> really 0.01 and 0.99, but something similar, perhaps 0.02 & 0.98 or 0.05
> & 0.95.
I'd even assume values closer to the middle; as I said above,
more like 0.2 and 0.8.
> FYI: I'm running bogofilter with spam_cutoff at 0.501 and ham_cutoff at
> 0.376. There's little room left for a security margin.
Well, you could train on everything that scores between 0.1
and 0.8, but still classify messages with the cutoffs you have now.
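A small sketch of keeping those two decisions apart, using the cutoffs quoted above and the 0.1 to 0.8 training band. Both function names are hypothetical, and the boundary comparisons (>= and <=) are an assumption rather than a statement about bogofilter's internals.

```python
def classify(score, spam_cutoff=0.501, ham_cutoff=0.376):
    """Three-state verdict from the classification cutoffs."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def use_for_training(score, low=0.1, high=0.8):
    """Training band, chosen independently of the cutoffs."""
    return low < score < high


# A message at 0.7 is rated spam, yet is still used for training.
print(classify(0.7), use_for_training(0.7))
```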
> Most of my
> unsures score 0.499xxx or 0.500000. As you know, I'm also autoupdating
> for ham and spam
To follow this concept, that would of course need to change
so that autoupdate ignores the extreme messages.
> and am manually classifying the unsures and using a
> cron job to load the classified unsures into the wordlist.
That of course would still be fine.
pi