breaking the training db

David Relson relson at osagesoftware.com
Sun Sep 21 15:41:37 CEST 2003


On Sun, 21 Sep 2003 08:42:24 -0400
Greg Louis <glouis at dynamicro.on.ca> wrote:

> Some of the changes we make require that users rebuild their training
> databases to update token counts in the light of the new parsing.  It
> looks as though 0.15.4 is one such; I got 29 fp, out of about 950
> nonspam, within 20 hours of installing it on my personal mail server. 
> I haven't analysed the fp at all, but I'm pretty sure the change in
> header tagging is part of the cause; these are the first fp I've seen
> in my personal mbox for many (>6 at least) weeks.  After registering
> them I reclassified them, and every one had a score less than
> DBL_EPSILON.
>  
> It makes sense that this should happen, and I expect that, in the
> present case, the effect will be transient as people train on the new
> errors; but I feel sorry for our -u'sers.
> 
> Perhaps it would be helpful, especially to the users whose experience
> is limited, to issue a very explicit "needs retraining" notice when
> changes that impact db counts are included in a bogofilter release.

Greg,

You raise an interesting point.  I'm one of those "-u'sers" you mention,
and I haven't seen the problem you describe.  Possibly that's because my
wordlist holds a more comprehensive set of tokens, since I _do_ use
'-u'.  Could this be an indicator of a weak point of train-on-error?
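
To make the token-coverage point concrete, here is a toy sketch.  It is
not bogofilter code: the Wordlist class, the scoring rule, and the sample
messages are all invented for illustration, and '-u' is modelled simply
as "every message ends up registered under its corrected label".

```python
from collections import Counter


def tokens(msg):
    """Split a message into lowercase, whitespace-separated tokens."""
    return msg.lower().split()


class Wordlist:
    """Hypothetical stand-in for a spam/nonspam token database."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def register(self, msg, is_spam):
        # Add the message's tokens to the matching list.
        (self.spam if is_spam else self.ham).update(tokens(msg))

    def classify(self, msg):
        # Invented rule: spam if spam-count hits exceed nonspam-count hits.
        toks = tokens(msg)
        return sum(self.spam[t] for t in toks) > sum(self.ham[t] for t in toks)

    def size(self):
        # Number of distinct tokens known to either list.
        return len(set(self.spam) | set(self.ham))


def train_on_error(wl, stream):
    # Register a message only when the current verdict is wrong.
    for msg, is_spam in stream:
        if wl.classify(msg) != is_spam:
            wl.register(msg, is_spam)


def train_every_message(wl, stream):
    # Assumed model of '-u' plus user corrections: every message is
    # registered under its final, correct label.
    for msg, is_spam in stream:
        wl.register(msg, is_spam)


stream = [
    ("cheap pills now", True),
    ("meeting agenda monday", False),
    ("cheap watches now", True),
    ("lunch monday agenda", False),
]

toe = Wordlist()
train_on_error(toe, stream)

auto = Wordlist()
train_every_message(auto, stream)

print(toe.size(), auto.size())
```

In this toy run the train-on-error list never learns any nonspam tokens,
because no nonspam message was ever misclassified; a parsing change that
renames tokens would leave it with less to fall back on.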

The change I _have_ seen was the 16 messages in my Spam-Unsure folder at
04:30 Friday morning.  Nearly all of those were the latest Microsoft
worm.  Now that I've trained on them as spam, bogofilter is getting
better at recognizing them.  FWIW, I've seen more of this worm (approx
300) than of any other worm _ever_.

David



