training to exhaustion and the risk of overvaluing irrelevant tokens

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Mon Aug 18 09:02:36 CEST 2003


Matthias Andree <matthias.andree at gmx.de> wrote:

[Adding some quotes, so I can answer only one mail.]

>>> Well, if you received a spam message, i.e. a bag of tokens, twice,
>>> then registering it twice is the right thing to do, isn't it?

Very good argument. I do in fact receive several mails more
than once. Examples are mailing lists where people send
(annoying) personal copies, and spam that comes in through
several addresses plus forwards to functional addresses.
Even if the headers are slightly different, the copies are
classified identically.

>> Exactly.  And any other number of times is wrong, theoretically, and
>> unpredictable in practice. 

But can you (or rather: bogofilter) really see the
difference? Isn't it indistinguishable from actually
receiving the message again?

>> The idea being to build up a training db
>> that mirrors the group of messages you're trying to characterize. 

That is one idea behind the concept. The other idea --
possibly without theoretical foundation, though maybe nobody
has cared to work that out -- is to extract the relevant
tokens. That was the aim of the original concept: find them
by their observed probabilities.
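As a toy illustration of those "seen probabilities" (a
Graham-style sketch, not bogofilter's actual code, and the
counts are invented): a token's spamminess can be estimated
from how often it appeared in each list, and registering the
same mail a second time doubles its counts:

```shell
# Toy sketch of a Graham-style token probability (NOT bogofilter's
# actual algorithm; counts are invented for illustration).
# args: bad_count good_count bad_total good_total
spamicity() {
  awk -v b="$1" -v g="$2" -v B="$3" -v G="$4" \
      'BEGIN { fb = b / B; fg = g / G; printf "%.3f\n", fb / (fb + fg) }'
}

spamicity 4 1 1000 1000   # token seen 4x in spam, 1x in ham
spamicity 8 1 1000 1000   # same token after registering the spam again
```

Registering the mail a second time moves the estimate from
0.800 to 0.889, i.e. further away from 0.5.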

>> With
>> training on error, that's "messages that produce uncertainty or error."
>> With full training, it's "messages received".  As pi rightly points
>> out, classification is based on extrapolation from what you've
>> already recorded.

So it is not at all a mirror of what you receive. It depends
on your settings and on the order of arrival. In the above
example, if you train on error, the first of a series of
identical spams (not recognized correctly in the first
place) normally flips the classification of the rest, so you
train on only one of them -- violating the idea of
registering it exactly twice, or however many times you
received it.
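The order dependence can be seen in a toy simulation (all
numbers are hypothetical): three identical copies of a spam
arrive, and train-on-error registers only the first one,
because that single registration pushes the later copies
over the cutoff:

```shell
# Toy train-on-error simulation with invented scores: the first copy
# scores 40 (below a cutoff of 50, so it is misclassified and trained);
# training raises the score of identical copies to 90, so copies 2 and
# 3 are classified correctly and never registered.
score=40; cutoff=50; trained=0
for copy in 1 2 3; do
  if [ "$score" -lt "$cutoff" ]; then
    trained=$((trained + 1))   # misclassified -> register this copy
    score=90                   # the db now knows these tokens
  fi
done
echo "received=3 trained=$trained"
```

The database ends up recording the mail once, not the three
times it was actually received.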

>I still wonder what the most /practical/ approach to
>maintaining this state is. Run bogofilter -u and correct
>any mistakes?

What I do: I take my training archive and use
bogominitrain.pl to build a wordlist. I don't use -u. If a
message comes in with the wrong classification, I train with
it and again run bogominitrain.pl.
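The loop above might look roughly like this (the paths and
archive layout are my assumptions; bogominitrain.pl ships
with bogofilter in the contrib directory, and its exact
invocation may vary between versions):

```shell
# Sketch of the retraining workflow described above; paths and the
# train/ham vs. train/spam archive layout are assumptions.
retrain() {
  if command -v bogominitrain.pl >/dev/null 2>&1; then
    bogominitrain.pl ~/.bogofilter ~/train/ham ~/train/spam
  else
    echo "bogominitrain.pl not installed; skipping actual training"
  fi
}

retrain    # build the initial wordlist from the archive
# when a mail comes in misclassified, file it into the archive, e.g.
#   cp misclassified.msg ~/train/spam/
retrain    # then train again until the archive scores correctly
```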

>Whatever. I have tons of "unsure" with counts near 0.5, and I wonder if
>I should scrap my whole data base and rebuild...

I don't use unsure. Currently, I train such that the
complete training database is classified correctly: every
spam gets a value above 0.701, every ham gets a value below
0.201, and in production I use a cutoff of 0.501.
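In bogofilter.cf terms, the production setup described above
could look roughly like this (the option names are
bogofilter's; the values are the ones quoted above, and
leaving ham_cutoff at 0 keeps bogofilter in two-state mode,
i.e. no "unsure"):

```
# sketch of a bogofilter.cf matching the setup above
spam_cutoff=0.501   # scores at or above this count as spam
ham_cutoff=0.000    # 0 disables the tristate "unsure" band
```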

pi



