training to exhaustion and the risk of overvaluing irrelevant tokens

Sat Aug 16 14:15:12 CEST 2003

Greg Louis <glouis at dynamicro.on.ca> writes:

> Exactly.  And any other number of times is wrong, theoretically, and
> unpredictable in practice.  The idea being to build up a training db
> that mirrors the group of messages you're trying to characterize.  With
> training on error, that's "messages that produce uncertainty or error."
> With full training, it's "messages received".  As pi rightly points
> out, classification is based on extrapolation from what you've
> already recorded.

I still wonder what the most /practical/ approach to maintaining this
state is. Run bogofilter -u and correct any mistakes?

We should remember that statistics aren't perfect and don't claim to
be, and bogofilter will never be entirely free of false
classifications. The thresholds we define take care of preferring false
negatives over false positives; the software itself is neutral.
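
To illustrate the point about thresholds, here is a minimal sketch in
Python. It is not bogofilter's code: the function name and the cutoff
values are made up for the example. It only shows how two asymmetric
cutoffs turn a neutral score into a decision that prefers false
negatives over false positives.

```python
# Illustrative cutoffs only; these values are assumptions, not
# bogofilter defaults.
SPAM_CUTOFF = 0.99  # very high: call it spam only when nearly certain
HAM_CUTOFF = 0.10   # scores at or below this count as ham


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Everything in between is left for a human to look at.
    return "unsure"


if __name__ == "__main__":
    for s in (0.05, 0.5, 0.9, 0.995):
        print(s, classify(s))
```

The scoring itself stays symmetric; only the placement of the cutoffs
encodes the preference. With a spam cutoff this high, a borderline
message lands in "unsure" rather than being rejected as spam.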

Whatever. I have tons of "unsure" results with scores near 0.5, and I
wonder if I should scrap my whole database and rebuild... (I'm not
using -u at the moment.)

-- 
Matthias Andree