training to exhaustion and the risk of overvaluing irrelevant tokens

Thu Aug 14 01:28:07 CEST 2003

Current editions of the FAQ mention the risk one takes in training to
exhaustion (taking messages that bogofilter misclassifies or is unsure
about, and retraining with them till bogofilter gets them right).  If
one does this, irrelevant tokens present in such messages
acquire higher counts than they ought to, and may for a time degrade
bogofilter's classification accuracy.

It seems this concept is difficult to grasp.  Let me try an analogy.

Imagine bogofilter is used to recognize dogs as long-haired or
short-haired, and is trained with quite a variety of canines.  Now
suppose that a short-haired dog with floppy ears gets misclassified as
long-haired, and a long-haired dog with upright ears gets misclassified
as short-haired (the hair length being near the edge of normal for both
animals).  Because we're training to exhaustion, we show bogofilter the
dogs over and over till it gets them right; it takes twenty passes.
Guess what?  Bogofilter has learned that dogs with floppy ears are
usually short-haired and ones with upright ears are long-haired.  The
next German shepherd and the next St. Bernard both get misclassified. 
Had we trained just once, there wouldn't have been the repeated
exposure to the ear type, and bogofilter wouldn't have learned to look
at the wrong characteristic.
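
For anyone who would rather see the effect in numbers, here is a toy
sketch of the dog example in Python.  It is only an illustration under
stated assumptions: ToyFilter, the feature names, the seed corpus and
the averaging score are all made up for this example and are not
bogofilter's actual calculation.  The point is just to compare the
ear-type counts after one corrective training pass with the counts
after training to exhaustion.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical toy classifier; NOT bogofilter's real algorithm."""

    def __init__(self):
        self.counts = {"long": Counter(), "short": Counter()}

    def train(self, features, label):
        self.counts[label].update(features)

    def p_long(self, feature):
        # Add-one smoothing, so an unseen feature scores a neutral 0.5.
        n_long = self.counts["long"][feature] + 1
        n_short = self.counts["short"][feature] + 1
        return n_long / (n_long + n_short)

    def classify(self, features):
        # Assumed scoring rule: plain average of the per-feature estimates.
        score = sum(self.p_long(f) for f in features) / len(features)
        return "long" if score > 0.5 else "short"


def seeded_filter():
    """Train on an invented corpus in which ear type is only weakly informative."""
    f = ToyFilter()
    corpus = [
        (["hair:long", "ears:floppy"], "long", 6),
        (["hair:long", "ears:upright"], "long", 4),
        (["hair:short", "ears:floppy"], "short", 4),
        (["hair:short", "ears:upright"], "short", 6),
    ]
    for features, label, copies in corpus:
        for _ in range(copies):
            f.train(features, label)
    return f


# Two borderline dogs that the seeded filter gets wrong.
HARD_CASES = [
    (["hair:medium", "ears:floppy"], "short"),
    (["hair:medium", "ears:upright"], "long"),
]


def train_to_exhaustion(f, cases, max_passes=100):
    """Retrain on the hard cases until every one is classified correctly."""
    passes = 0
    while passes < max_passes and any(f.classify(x) != y for x, y in cases):
        for x, y in cases:
            f.train(x, y)
        passes += 1
    return passes


# Variant 1: show each hard case to the filter exactly once.
once = seeded_filter()
for x, y in HARD_CASES:
    once.train(x, y)

# Variant 2: keep showing the hard cases until the filter gets them right.
exhausted = seeded_filter()
passes = train_to_exhaustion(exhausted, HARD_CASES)

print("passes needed:", passes)
print("P(long | floppy ears), trained once:", round(once.p_long("ears:floppy"), 3))
print("P(long | floppy ears), exhausted:   ", round(exhausted.p_long("ears:floppy"), 3))
```

In this toy the borderline hair feature is trained equally into both
classes and so stays neutral; all the corrective weight lands on the
ear type.  After a single pass the floppy-ear estimate has moved only a
little (from about 0.58 to about 0.54), whereas training to exhaustion
pushes it past neutral to about 0.47, so the filter now believes floppy
ears indicate short hair.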

Hope that helps.  It's unfortunately a bit too long to put in the FAQ,
though.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |