breaking the training db

Greg Louis glouis at dynamicro.on.ca
Mon Sep 22 15:22:43 CEST 2003


On 20030921 (Sun) at 0842:24 -0400, Greg Louis wrote:
> Some of the changes we make require that users rebuild their training
> databases to update token counts in the light of the new parsing.  It
> looks as though 0.15.4 is one such; I got 29 fp, out of about 950
> nonspam, within 20 hours of installing it on my personal mail server. 
> I haven't analysed the fp at all, but I'm pretty sure the change in
> header tagging is part of the cause; these are the first fp I've seen
> in my personal mbox for many (>6 at least) weeks.  After registering
> them I reclassified them, and every one had a score less than
> DBL_EPSILON.
>  
> It makes sense that this should happen, and I expect that, in the
> present case, the effect will be transient as people train on the new
> errors;

Seems so.  I had another 12 fp in the last 24 hours; like the 29
mentioned above, all were addressed to lists that admit spam (i.e. don't
require posters to be members), and all yield very low scores when
reclassified after training.  This is _without_ rebuilding the training
db (so I'm not following my own advice), and it suggests I'll be back to
normal fp levels (<0.0001) in a week or so without having to rebuild.

Worth considering, maybe even testing: instead of doing a full rebuild
after installing a version with changed parsing, it might suffice for
those who normally train on error to switch back to full training for a
while.  That would shorten bogofilter's learning curve, perhaps enough
to keep the irate users from waving their fp in a poor mailadmin's face
;)
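
To make the idea concrete, here is a rough sketch of the two training
policies and of a temporary switch between them.  It is only an
illustration, not bogofilter code: the function names, the "ham"/"spam"
labels and the 7-day grace period (a guess based on the "week or so"
above) are all assumptions.

```python
from typing import Iterable, List, Tuple

# Hypothetical record: (filter verdict, true class), each "spam" or "ham".
Message = Tuple[str, str]


def use_full_training(days_since_upgrade: int, grace_days: int = 7) -> bool:
    """Train on everything for a while after a parsing change.

    grace_days is an assumed value, not a tested recommendation.
    """
    return 0 <= days_since_upgrade < grace_days


def select_for_training(stream: Iterable[Message],
                        full_training: bool) -> List[Message]:
    """Pick the messages that would be registered in the training db.

    Train on error: only messages the filter got wrong.
    Full training: every message, right or wrong.
    """
    return [(verdict, truth) for verdict, truth in stream
            if full_training or verdict != truth]


if __name__ == "__main__":
    day = [("ham", "ham"), ("spam", "ham"), ("spam", "spam"), ("ham", "ham")]
    for days_since_upgrade in (2, 10):
        full = use_full_training(days_since_upgrade)
        print(days_since_upgrade, full, len(select_for_training(day, full)))
```

With train on error only the one false positive in the sample day gets
registered; with full training all four messages do, which is where the
faster update of the token counts would come from.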

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



