training to exhaustion and the risk of overvaluing irrelevant tokens

David Relson relson at osagesoftware.com
Sat Aug 16 14:21:13 CEST 2003


At 08:15 AM 8/16/03, Matthias Andree wrote:
>Greg Louis <glouis at dynamicro.on.ca> writes:
>
> > Exactly.  And any other number of times is wrong, theoretically, and
> > unpredictable in practice.  The idea being to build up a training db
> > that mirrors the group of messages you're trying to characterize.  With
> > training on error, that's "messages that produce uncertainty or error."
> > With full training, it's "messages received".  As pi rightly points
> > out, classification is based on extrapolation from what you've
> > already recorded.
>
>I still wonder what the most /practical/ approach is to maintaining this
>state. Run bogofilter -u and correct any mistakes?
>
>We should remember that statistics aren't perfect and don't claim to be,
>and bogofilter will never be entirely free of false
>classifications. The thresholds we define take care of preferring false
>negatives over false positives; the software itself is neutral.
>
>Whatever. I have tons of "unsure" with counts near 0.5, and I wonder if
>I should scrap my whole data base and rebuild... (I'm not using -u at
>the moment.)

Matthias,

I've been using '-u' ever since it became available, and I still see from 1
to 10 unsures per day (with 80 to 120 spams per day).  Most of my unsures
are also right near 0.5, specifically between 0.499 and 0.501.

Every unsure also goes into the wordlists (once I've determined whether it's 
spam or ham).
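
For the archives, that workflow amounts to roughly the following (a sketch
only; the file names are hypothetical, and the flags are the registration
options from bogofilter's man page -- check your version before relying on
them):

```shell
# Classify an incoming message; -u also registers it under whatever
# classification it received (unsures are not auto-registered).
# -p passes the message through with an X-Bogosity header added,
# -e makes bogofilter exit 0 whether or not the message is spam.
bogofilter -u -p -e < message > message.tagged

# Correct a false negative: -u registered it as ham, so move it
# from the ham wordlist to the spam wordlist.
bogofilter -Ns < missed-spam

# Correct a false positive: move it from spam back to ham.
bogofilter -Sn < false-positive

# Register an unsure once you've decided what it actually is:
bogofilter -s < unsure-that-was-spam
bogofilter -n < unsure-that-was-ham
```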

David

More information about the Bogofilter mailing list