training to exhaustion?

Tom Anderson tanderso at oac-design.com
Tue Mar 9 15:23:12 CET 2004


On Tue, 2004-03-09 at 07:20, Greg Louis wrote:
> Practical question: If I set up a Tom-Anderson-like experiment (see his
> recent posting in praise of repetitive training), and it works, then
> after a relatively short time there will be precious few wrongly
> classified nonspam among the new test messages.  Almost all the errors
> will be unsure spam.  pi, do you occasionally pad your training db with
> correctly classified nonspam to rebalance the message counts, or do you
> let it get lopsided?

That is my current situation, yes.  1-2 false negatives and a few unsure
spams.  Never any false positives or unsure hams.  My database is
somewhat lopsided only because I receive 3-4 spams for every ham.  I use
-u on incoming emails so that hams keep getting registered.  That may not be
necessary, though it goes along with my analogy of training a dog... if
he does something I like, give a reward.  I feel no need to balance my
database though, as the spam:ham ratio doesn't really affect
classification.
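
To illustrate why a lopsided database need not skew scores, here is a
minimal sketch. It assumes the filter scores each token from per-class
frequencies (hits divided by the number of messages trained in that
class), as Graham/Robinson-style filters generally do. The function name
and all the counts are made up for illustration, and the sketch leaves
out the smoothing a real filter applies to rare tokens.

```python
def token_spamicity(spam_hits, ham_hits, n_spam, n_ham):
    """Toy per-token spam score built from per-class frequencies.

    Hypothetical helper, not bogofilter's actual code: each hit count is
    divided by the number of messages trained in its class, so the overall
    spam:ham ratio of the database cancels out of the result.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)


# Same token, same per-message frequency, different database balance.
balanced = token_spamicity(spam_hits=50, ham_hits=10, n_spam=1000, n_ham=1000)
lopsided = token_spamicity(spam_hits=200, ham_hits=10, n_spam=4000, n_ham=1000)
```

With four times as many spams in the database and the token appearing
in the same fraction of them, the two scores come out the same.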

> (Anyone who trains on error, repetitively or not, is going to have this
> problem; I usually pad before tuning, or whenever my training db gets
> to be 10-15% out of balance.)

For what reason?  If excessive training to "sit" somehow causes
your dog to sit even when you say "fetch", then simply train the correct
behavior for "fetch", and he should get both right.  I'm fairly certain
that any hams would be classified as unsure long before they made it
into the false positive territory due to excessive repetitive training
of spams.  Correcting the unsure ham would prevent future such
occurrences.
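
The "correct it when it gets one wrong, and repeat" idea can be sketched
roughly as below. This is a toy stand-in, not bogofilter: the ToyFilter
class, the 0.4/0.6 cutoffs, the averaging of token scores and the
four-message corpus are all assumptions chosen to keep the example
small. An "unsure" verdict counts as an error, so unsure hams get
corrected before they could drift toward false-positive territory.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical token-frequency filter used only for illustration."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, tokens, is_spam):
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def score(self, tokens):
        scores = []
        for tok in tokens:
            sf = self.spam[tok] / max(self.n_spam, 1)
            hf = self.ham[tok] / max(self.n_ham, 1)
            # Tokens never seen in either class are treated as neutral.
            scores.append(0.5 if sf + hf == 0 else sf / (sf + hf))
        return sum(scores) / len(scores)

    def classify(self, tokens):
        s = self.score(tokens)
        if s >= 0.6:
            return "spam"
        if s <= 0.4:
            return "ham"
        return "unsure"


def train_to_exhaustion(filt, corpus, max_passes=20):
    """Train only on misclassified or unsure messages; repeat until clean.

    Returns the number of passes made, including the final error-free one.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for tokens, is_spam in corpus:
            truth = "spam" if is_spam else "ham"
            if filt.classify(tokens) != truth:
                filt.train(tokens, is_spam)
                errors += 1
        if errors == 0:
            return passes
    return max_passes


corpus = [
    (["cheap", "pills", "now"], True),
    (["meeting", "at", "noon"], False),
    (["cheap", "meeting", "pills"], True),
    (["lunch", "at", "noon"], False),
]
filt = ToyFilter()
passes = train_to_exhaustion(filt, corpus)
```

Only the messages the filter got wrong are ever registered, so the
database stays small and each class grows only as fast as its errors do.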

BTW, regarding tuning... I'm in it for the long haul.  Kind of like
playing the stock market.  The point is to get the largest return (most
proper filtering) over time, not to constantly monitor localized spikes
one way or the other.  Therefore, I generally tune in only one
direction, slowly narrowing down to my ideal long-term numbers, over a
period of weeks to months.  I don't think tuning before and/or after
each training is productive.  I'd rather monitor the long-term trends
and adjust appropriately.

Tom
