breaking the training db
David Relson
relson at osagesoftware.com
Mon Sep 22 15:36:09 CEST 2003
On Mon, 22 Sep 2003 09:22:43 -0400
Greg Louis <glouis at dynamicro.on.ca> wrote:
> On 20030921 (Sun) at 0842:24 -0400, Greg Louis wrote:
> > Some of the changes we make require that users rebuild their
> > training databases to update token counts in the light of the new
> > parsing. It looks as though 0.15.4 is one such; I got 29 fp, out of
> > about 950 nonspam, within 20 hours of installing it on my personal
> > mail server. I haven't analysed the fp at all, but I'm pretty sure
> > the change in header tagging is part of the cause; these are the
> > first fp I've seen in my personal mbox for many (>6 at least) weeks.
> > After registering them I reclassified them, and every one had a
> > score less than DBL_EPSILON.
> >
> > It makes sense that this should happen, and I expect that, in the
> > present case, the effect will be transient as people train on the
> > new errors;
>
> Seems so. I had another 12 fp in the last 24 hours; like the 29
> mentioned above, all are addressed to lists that admit spam (aka don't
> require posters to be members), and all yield very very low scores
> after training. This is _without_ rebuilding the training db (not
> following my own advice), and it seems to suggest I'll be back to
> normal fp levels (<0.0001) in a week or so, without having to rebuild.
>
> Worth considering, maybe even testing: instead of doing a full rebuild
> after installing a version with changed parsing, it might suffice for
> those who normally train on error to switch back to full training for
> a while. That would shorten bogofilter's learning curve, perhaps
> enough to keep the irate users from waving their fp in a poor
> mailadmin's face;)
Greg,
I've been thinking about the header tagging changes and have realized
that the effect is more widespread than I initially thought. The changes add
"head:" to _all_ header tokens that aren't already tagged with subj:,
to:, from:, or rtrn:. The effect is to stop using a whole group of
tokens and start using a new and different set. Bogofilter's accuracy
may well be lower until sufficient training is done. Drat!!
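
To make the effect concrete, here is a rough sketch in Python. It is not
bogofilter's lexer code; the function names, the tag_all_headers switch,
the header-name-to-tag mapping, and the sample tokens are all made up for
illustration. It shows how tokens stored under the old parsing stop
matching once "head:" is prepended:

```python
# Illustrative only: an assumed mapping from header names to the tags
# mentioned above. The real lexer may assign these differently.
SPECIAL_TAGS = {
    "subject": "subj:",
    "to": "to:",
    "from": "from:",
    "return-path": "rtrn:",
}


def tag_header_token(header_name, token, tag_all_headers=True):
    """Return the token as it would be stored or looked up.

    tag_all_headers=False imitates the old parsing (only the four
    special headers are tagged); True imitates the new parsing, where
    every other header token gets a "head:" prefix.
    """
    tag = SPECIAL_TAGS.get(header_name.lower())
    if tag is not None:
        return tag + token
    if tag_all_headers:
        return "head:" + token
    return token


def known_tokens(db, tokens):
    """Return the tokens that already have an entry in the training db."""
    return [t for t in tokens if t in db]


# Hypothetical header tokens from one message.
headers = [("Subject", "meeting"), ("X-Mailer", "mutt"), ("List-Id", "bogofilter")]

old_tokens = [tag_header_token(h, t, tag_all_headers=False) for h, t in headers]
new_tokens = [tag_header_token(h, t) for h, t in headers]

# A training db built under the old parsing knows only the old tokens.
training_db = {tok: 1 for tok in old_tokens}

# Under the new parsing only the subj: token is still found.
print(known_tokens(training_db, new_tokens))
```

In this toy case two of the three header tokens have no counts at all
after the upgrade, and stay that way until new training supplies them.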
David