breaking the training db

Mon Sep 22 15:36:09 CEST 2003

On Mon, 22 Sep 2003 09:22:43 -0400
Greg Louis <glouis at dynamicro.on.ca> wrote:

> On 20030921 (Sun) at 0842:24 -0400, Greg Louis wrote:
> > Some of the changes we make require that users rebuild their
> > training databases to update token counts in the light of the new
> > parsing.  It looks as though 0.15.4 is one such; I got 29 fp, out of
> > about 950 nonspam, within 20 hours of installing it on my personal
> > mail server. I haven't analysed the fp at all, but I'm pretty sure
> > the change in header tagging is part of the cause; these are the
> > first fp I've seen in my personal mbox for many (>6 at least) weeks.
> > After registering them I reclassified them, and every one had a
> > score less than DBL_EPSILON.
> >  
> > It makes sense that this should happen, and I expect that, in the
> > present case, the effect will be transient as people train on the
> > new errors;
> 
> Seems so.  I had another 12 fp in the last 24 hours; like the 29
> mentioned above, all are addressed to lists that admit spam (aka don't
> require posters to be members), and all yield very very low scores
> after training.  This is _without_ rebuilding the training db (not
> following my own advice), and it seems to suggest I'll be back to
> normal fp levels (<0.0001) in a week or so, without having to rebuild.
> 
> Worth considering, maybe even testing: instead of doing a full rebuild
> after installing a version with changed parsing, it might suffice for
> those who normally train on error to switch back to full training for
> a while.  That would shorten bogofilter's learning curve, perhaps
> enough to keep the irate users from waving their fp in a poor
> mailadmin's face;)

Greg,

I've been thinking about the header tagging changes and have realized
that the effect is more widespread than I initially thought.  The changes add
"head:" to _all_ header tokens that aren't already tagged with subj:,
to:, from:, or rtrn:.  The effect is to stop using a whole group of
tokens and start using a new and different set.  Bogofilter's accuracy
may well be lower until sufficient training is done.  Drat!!
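
To make the effect concrete, here is a rough sketch in Python.  It is
not bogofilter's lexer; the header-to-tag mapping, the header_tokens()
helper and the sample headers are all made up for illustration.  It
only shows how tokens from otherwise untagged headers change name once
"head:" is added, so the counts stored under the old names stop
matching.

```python
# Hypothetical mapping of header names to tag prefixes (an assumption,
# chosen to match the four tags named above).
TAGGED = {"subject": "subj:", "to": "to:", "from": "from:", "return-path": "rtrn:"}

def header_tokens(name, value, tag_others):
    """Split a header value into tokens and prefix each with its tag.

    tag_others=False mimics the old behaviour (other headers untagged);
    tag_others=True mimics the new behaviour (other headers get "head:").
    """
    prefix = TAGGED.get(name.lower())
    if prefix is None:
        prefix = "head:" if tag_others else ""
    return [prefix + word for word in value.lower().split()]

# Made-up sample headers.
headers = [("Subject", "cheap pills"), ("X-Mailer", "bulk sender")]

old = {t for n, v in headers for t in header_tokens(n, v, tag_others=False)}
new = {t for n, v in headers for t in header_tokens(n, v, tag_others=True)}

print(sorted(old & new))  # tokens an existing training db still knows
print(sorted(new - old))  # tokens with no counts in that db yet
```

In this toy case the subj: tokens survive the change, while the
X-Mailer tokens show up only under their new head: names and have to
be learned from scratch.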

David