training to exhaustion

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Aug 14 18:57:38 CEST 2003


Greg Louis <glouis at dynamicro.on.ca> wrote:

>> BTW: Doesn't the theorem assume that the messages you train
>> with are chosen randomly? Complete training would satisfy
>> this, but any algorithm that makes non-random decisions about
>> which messages to train with would already violate the
>> assumption.
>
>The theorem assumes that the messages with which you train are chosen
>randomly from the population of messages that satisfy the condition for
>which you are training.  

ACK. So what is that condition? I say it is the property of
being ham or spam, respectively.

>One way to make that assumption valid is to
>let the population be all messages that the user is receiving or going
>to receive; then you are right in saying that a selection process
>destroys randomness.  

Good.

>However, if you segregate (select) a subcategory
>of the population, using constant segregation criteria, and then apply
>the theorem to a randomly-chosen sampling of that subcategory, you're
>still ok;

What would such a criterion be?

>strictly speaking, you've taught bogofilter how to deal with
>the subcategory, without teaching it anything about how to classify the
>larger population.

And that is wrong in theory; in practice it can work well.

Anyhow, train on error does not use a constant segregation
criterion; the criterion changes with every new message added
to the database.
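
To make that selection mechanism concrete, here is a toy
sketch in Python. classify() and register() are deliberate
simplifications for illustration, not bogofilter's actual
interface:

    from collections import Counter

    def classify(tokens, db):
        """Toy scoring: compare naive per-class token counts."""
        spam = sum(db['spam'][t] for t in tokens)
        ham = sum(db['ham'][t] for t in tokens)
        return 'spam' if spam > ham else 'ham'

    def register(tokens, label, db):
        """Add the message's tokens to the counts of its true class."""
        db[label].update(tokens)

    def train_on_error(stream, db):
        """stream: incoming (tokens, true_label) pairs in arrival order.
        Register a message only when the current database misclassifies
        it. Each registration changes db, so whether a later message is
        trained depends on every earlier decision."""
        for tokens, label in stream:
            if classify(tokens, db) != label:   # evaluate first ...
                register(tokens, label, db)     # ... train only on errors

    db = {'spam': Counter(), 'ham': Counter()}
    train_on_error([("cheap pills now".split(), 'spam'),
                    ("meeting at noon".split(), 'ham')], db)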

>It has been found by experience that teaching bogofilter only the hard
>stuff eventually produces a training database that works well for the
>easy stuff too.  (This is one of the few things in this area that
>you and I agree on, pi :)

Absolutely. Bad in theory, great in practice :-))

>So we train bogofilter "on error" by sampling (at random) the
>population of messages that cause error or uncertainty, instead of the
>population of messages as a whole.

Yes and no. Train on error does not evaluate all messages and
then add the misclassified ones; that is what method 4 does.
But as you said above, this, at least in theory, does not give
the right picture of the complete population.
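
If I read that correctly, a single method-4 pass over the
stored corpus would, in the same toy terms as the sketch above
(again only an illustration), look like this:

    def method4_pass(corpus, db):
        """Evaluate *every* message in the corpus against the current
        database, then register the misclassified ones. All messages
        are looked at, but the training set is still the non-random
        subset of errors."""
        errors = [(tokens, label) for tokens, label in corpus
                  if classify(tokens, db) != label]
        for tokens, label in errors:
            register(tokens, label, db)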

>This doesn't violate the assumption
>of randomness with regard to that subpopulation. 

It is the total absence of randomness. I first evaluate the
message and then, *depending* on the output, train on it or
don't. The decision for a message *depends* on all previous
decisions.

>We then apply the
>result to classification of the general population; this does violate
>the assumption of randomness, but we already know that we can get away
>with that violation.

Well, this is my point: in theory we break some assumptions;
in practice we are happy.

>Some people (not all) find that training to exhaustion, which does
>severely violate randomness by registering different numbers of
>messages different numbers of times, also produces a database that
>works well with the general population of messages.

:-)) I'd like to see the opposite. Not really, but for the
sake of your argument ;-)
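
For completeness: training to exhaustion as you describe it
would, in the same toy terms as above, repeat such error
passes until one pass comes back clean. That is exactly where
the unequal registration counts come from:

    def train_to_exhaustion(corpus, db, max_passes=100):
        """Repeat error-correction passes until one pass yields no
        errors (or we give up). A hard message may get registered
        many times, an easy one never; hence the unequal
        registration counts."""
        for _ in range(max_passes):
            errors = [(tokens, label) for tokens, label in corpus
                      if classify(tokens, db) != label]
            if not errors:   # clean pass; the database has converged
                break
            for tokens, label in errors:
                register(tokens, label, db)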

pi



