training to exhaustion and the risk of overvaluing irrelevant tokens

Greg Louis glouis at
Sat Aug 16 13:52:28 CEST 2003

On 20030815 (Fri) at 1343:28 +0200, Matthias Andree wrote:

> Well, if you received a spam message, i.e. a bag of tokens, twice, then
> registering it twice is the right thing to do, isn't it?
Exactly.  And any other number of times is wrong in theory and
unpredictable in practice.  The idea is to build up a training db
that mirrors the group of messages you're trying to characterize.  With
training on error, that's "messages that produce uncertainty or error."
With full training, it's "messages received".  As pi rightly points
out, classification is based on extrapolation from what you've
already recorded.
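
To make that concrete, here is a toy sketch in Python.  It is not
bogofilter's code or its actual scoring calculation: ToyDB, register
and spamicity are names made up for illustration, the four messages
are invented, and the score is a deliberately simplified ratio of
per-class token frequencies.  It only shows what happens to a token
that occurs equally in spam and ham when one message gets registered
more often than it was received.

```python
from collections import Counter


class ToyDB:
    """Hypothetical token-count store; not bogofilter's real wordlist."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, text, label):
        # A message is a bag of tokens: count each distinct token once.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spamicity(self, token):
        # Simplified ratio of per-class frequencies (an assumption,
        # not bogofilter's actual calculation).
        s = self.counts["spam"][token] / max(self.messages["spam"], 1)
        h = self.counts["ham"][token] / max(self.messages["ham"], 1)
        return 0.5 if s + h == 0 else s / (s + h)


# Invented sample mail: "monday" occurs in one spam and one ham.
received = [
    ("cheap pills monday", "spam"),
    ("win a prize", "spam"),
    ("meeting monday", "ham"),
    ("lunch plans", "ham"),
]

# Full training: register every message once per time it was received.
full = ToyDB()
for text, label in received:
    full.register(text, label)

# Same mail, but the first spam is registered four extra times, as
# repeated retraining on one stubborn message might do.
over = ToyDB()
for text, label in received:
    over.register(text, label)
for _ in range(4):
    over.register("cheap pills monday", "spam")

print(full.spamicity("monday"), over.spamicity("monday"))
```

In the received mail "monday" is neutral, and the fully trained db
scores it at 0.5.  After the extra registrations the same token leans
towards spam (0.625 in this toy), although nothing about the mail
itself has changed.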

| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
| |   (on my website or any keyserver) |
| in signatures helps fight junk email. |
