training to exhaustion and the risk of overvaluing irrelevant tokens

Sat Aug 16 17:16:33 CEST 2003

On 20030816 (Sat) at 0821:13 -0400, David Relson wrote:
> At 08:15 AM 8/16/03, Matthias Andree wrote:
> >Greg Louis <glouis at dynamicro.on.ca> writes:
> >
> >> Exactly.  And any other number of times is wrong, theoretically, and
> >> unpredictable in practice.  The idea being to build up a training db
> >> that mirrors the group of messages you're trying to characterize.  With
> >> training on error, that's "messages that produce uncertainty or error."
> >> With full training, it's "messages received".  As pi rightly points
> >> out, classification is based on extrapolation from what you've
> >> already recorded.
> >
> >I still wonder what the most /practical/ approach to maintaining this
> >state is. Run bogofilter -u and correct any mistakes?
> >
> >We should remember that statistics aren't perfect and don't claim to
> >be, and bogofilter will never be entirely free of false
> >classifications. The thresholds we define take care of preferring false
> >negatives over false positives; the software itself is neutral.
> >
> >Whatever. I have tons of "unsure" with counts near 0.5, and I wonder if
> >I should scrap my whole database and rebuild... (I'm not using -u at
> >the moment.)
> 
> Matthias,
> 
> I've been using '-u' ever since it became available, and still see from 1 
> to 10 unsures per day (with 80 to 120 spams per day).  Most of my unsures 
> are also right near 0.5, specifically between 0.499 and 0.501.
> 
> Every unsure also goes into the wordlists (once I've determined if it's 
> spam or ham).
> 
I don't use -u and I train on error.  What I do is accumulate a batch
of email, let bogofilter classify it, manually sort the errors and
unsures into spam and nonspam trainer files, and then train on those.
I'm still
experimenting to try to find how wide the ratio window should be -- two
weeks seems natural, since that's the batch period, but in practice I
find it's better to do a moving average over about six weeks.  Dunno if
that's best but for me it's a shade better than using 0.5 as we have
done up to now.
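
For anyone who wants to try the same thing, here is a rough sketch of
that batch procedure in Python.  It is an illustration only, not
bogofilter code: the function and class names are made up, the verdicts
are assumed to come from whatever classifier run you already do, and
since "ratio" isn't pinned down above, the sketch simply assumes it
means the spam fraction of each batch.  Three two-week batches give the
six-week window.

```python
from collections import deque


def select_training_messages(batch):
    """Pick the train-on-error messages out of one classified batch.

    batch: iterable of (message, verdict, truth) tuples, where verdict is
    "spam", "ham" or "unsure" (the filter's answer) and truth is "spam"
    or "ham" (the manual correction).  Hypothetical layout.
    Returns (spam_trainers, ham_trainers).
    """
    spam_trainers, ham_trainers = [], []
    for message, verdict, truth in batch:
        if verdict == truth:
            continue  # classified correctly: train on error skips it
        # Misclassified or unsure: goes into the matching trainer file.
        if truth == "spam":
            spam_trainers.append(message)
        else:
            ham_trainers.append(message)
    return spam_trainers, ham_trainers


class WindowedRatio:
    """Moving average of a per-batch ratio over the last few batches.

    Assumption: the ratio is spam / (spam + ham) for each batch.
    """

    def __init__(self, window_batches=3):
        # deque with maxlen drops the oldest batch automatically.
        self._ratios = deque(maxlen=window_batches)

    def add_batch(self, spam_count, ham_count):
        total = spam_count + ham_count
        if total == 0:
            raise ValueError("empty batch")
        self._ratios.append(spam_count / total)
        return sum(self._ratios) / len(self._ratios)
```

Each call to add_batch() returns the current windowed value, so it can
be compared against a fixed 0.5 from one batch to the next.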

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |