training to exhaustion

Greg Louis glouis at dynamicro.on.ca
Thu Aug 14 18:29:22 CEST 2003


On 20030814 (Thu) at 1448:59 +0200, Boris 'pi' Piwinger wrote:

> BTW: Doesn't the theorem assume that the messages you train
> with are chosen randomly? A complete training would satisfy
> this, but any algorithm which makes decisions (which are not
> purely random) about which messages are chosen to train with
> would already violate the assumption.

The theorem assumes that the messages with which you train are chosen
randomly from the population of messages that satisfy the condition for
which you are training.  One way to make that assumption valid is to
let the population be all messages that the user is receiving or going
to receive; then you are right in saying that a selection process
destroys randomness.  However, if you segregate (select) a subcategory
of the population, using constant segregation criteria, and then apply
the theorem to a randomly chosen sample of that subcategory, you're
still OK; strictly speaking, you've taught bogofilter how to deal with
the subcategory, without teaching it anything about how to classify the
larger population.
   
It has been found by experience that teaching bogofilter only the hard
stuff eventually produces a training database that works well for the
easy stuff too.  (This is one of the few things in this area that
you and I agree on, pi :)

So we train bogofilter "on error" by sampling (at random) the
population of messages that cause error or uncertainty, instead of the
population of messages as a whole.  This doesn't violate the assumption
of randomness with regard to that subpopulation.  We then apply the
result to classification of the general population; this does violate
the assumption of randomness, but we already know that we can get away
with that violation.
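
To make the procedure concrete, here is a minimal sketch of training on
error.  It is only an illustration: ToyFilter is a hypothetical
word-count scorer, not bogofilter's algorithm or interface, and the
names train_on_error, classify and train are assumptions made up for
this example.  Messages are visited in random order and registered only
when the current database gets them wrong or is unsure.

```python
import random
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a word-count filter (NOT bogofilter)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.registered = Counter()  # how often each message was registered

    def train(self, text, label):
        # Register each distinct token of the message once under its label.
        self.counts[label].update(set(text.split()))
        self.registered[text] += 1

    def classify(self, text):
        # Positive score leans spam, negative leans ham, zero is "unsure".
        score = sum(self.counts["spam"][tok] - self.counts["ham"][tok]
                    for tok in set(text.split()))
        if score > 0:
            return "spam"
        if score < 0:
            return "ham"
        return "unsure"


def train_on_error(filt, messages, rng):
    """Single pass in random order; register only errors and unsures."""
    order = list(messages)
    rng.shuffle(order)
    trained = 0
    for text, label in order:
        if filt.classify(text) != label:
            filt.train(text, label)
            trained += 1
    return trained


# Made-up corpus: (text, true label) pairs.
messages = [
    ("buy cheap pills now", "spam"),
    ("cheap pills offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday lunch agenda", "ham"),
]
filt = ToyFilter()
trained = train_on_error(filt, messages, random.Random(0))
print(trained, "of", len(messages), "messages were registered")
```

In this toy corpus the easy messages are never registered at all, yet
they are classified correctly afterwards, which is the effect described
above.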

Some people (not all) find that training to exhaustion, which does
severely violate randomness by registering some messages more times
than others, also produces a database that works well with the
general population of messages.
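
As a rough sketch of what training to exhaustion means, the loop below
repeats the train-on-error pass until a whole pass produces no errors.
Again this is an illustration under assumptions: ToyFilter is a
hypothetical word-count scorer rather than bogofilter itself, and
train_to_exhaustion and max_passes are names invented for the example.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a word-count filter (NOT bogofilter)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.registered = Counter()  # how often each message was registered

    def train(self, text, label):
        # Register each distinct token of the message once under its label.
        self.counts[label].update(set(text.split()))
        self.registered[text] += 1

    def classify(self, text):
        # Positive score leans spam, negative leans ham, zero is "unsure".
        score = sum(self.counts["spam"][tok] - self.counts["ham"][tok]
                    for tok in set(text.split()))
        if score > 0:
            return "spam"
        if score < 0:
            return "ham"
        return "unsure"


def train_to_exhaustion(filt, messages, max_passes=20):
    """Repeat train-on-error passes until one pass has no errors.

    Returns the number of passes made (including the final clean one),
    or None if max_passes was reached without a clean pass.
    """
    for pass_no in range(1, max_passes + 1):
        errors = 0
        for text, label in messages:
            if filt.classify(text) != label:
                filt.train(text, label)
                errors += 1
        if errors == 0:
            return pass_no
    return None


# Made-up corpus in which the token "free" appears in both classes.
messages = [
    ("free offer", "spam"),
    ("free lunch", "ham"),
    ("free", "spam"),
]
filt = ToyFilter()
passes = train_to_exhaustion(filt, messages)
print("passes:", passes)
print("registrations:", dict(filt.registered))
```

The registration counts come out unequal because the ambiguous
messages are registered again on later passes; that unevenness is the
violation of randomness mentioned above.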

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



