divergence?
Tom Allison
tallison at tacocat.net
Mon May 17 13:12:23 CEST 2004
Not sure what else to call it....
I recently decided to rebuild my wordlist to use the block_on_subnet and
replace_nonascii_characters settings, which tested out the same or better
on my corpus. I don't think any of this is related, but it explains why I
rebuilt my wordlist.
After getting everything built up the first time, I found I had to run
some routines to train to exhaustion in order to correct some really bad
errors. For example, I have one address at aol.com that shouldn't be
spam, but everything aol-ish scores >0.9 because everything else from
aol.com is spam. So, after a lot of retraining, I can now get these
emails without trolling for them in my spam folders.
I observed that it took some 15-20 loops (over the same messages) to get
these emails straightened out.
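For what it's worth, the loop I'm running amounts to this (a minimal Python sketch of the train-to-exhaustion idea only; bogofilter itself is driven from the command line, and the classify/train functions here are hypothetical placeholders):

```python
# Illustrative train-to-exhaustion loop, NOT bogofilter's actual code.
# `classify` returns a spam probability in [0, 1]; `train` updates the
# wordlist with the message's correct label.

def train_to_exhaustion(messages, classify, train, max_passes=25):
    """Retrain on misclassified messages until a pass has no errors.

    `messages` is a list of (text, is_spam) pairs. Returns the number of
    passes it took to converge (or max_passes if it never did).
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, is_spam in messages:
            predicted_spam = classify(text) > 0.5
            if predicted_spam != is_spam:
                train(text, is_spam)   # correct the wordlist
                errors += 1
        if errors == 0:
            return passes              # clean pass: converged
    return max_passes                  # gave up without converging
```

The point is that each pass can shift token counts enough to flip other messages, which is why it takes many passes rather than one.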
(NOTE: figure ~2000 of the ham are from mailing lists, that's a lot of
spam!)
But after about four days, it's starting to diverge again, only this
time I have spam coming in as ham with scores < 0.01!
Details show that there is a huge contribution to the overall scores
from the Debian mailing list headers. Debian mailing lists run 100+
emails a day, with ~10 of those being spam. But the overwhelming
majority are ham, which pushes the overall scores way down on the
headers alone.
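A back-of-the-envelope calculation shows why (this is a Graham-style estimate, not bogofilter's exact Robinson computation, and the corpus totals are made-up numbers, but the direction is the same):

```python
# Rough token spamicity: how spam-like is a token, given how often it
# appears in each corpus? Illustrative only; bogofilter's actual
# Robinson/Fisher scoring differs in detail.

def spamicity(spam_count, ham_count, n_spam, n_ham):
    """Estimated P(spam | token) from per-corpus token frequencies."""
    p_spam = spam_count / n_spam   # token frequency in the spam corpus
    p_ham = ham_count / n_ham      # token frequency in the ham corpus
    return p_spam / (p_spam + p_ham)

# A Debian list-header token seen in ~10 spam vs ~90 ham of daily list
# traffic, against hypothetical corpus totals of 3000 spam / 2500 ham:
print(round(spamicity(10, 90, 3000, 2500), 3))  # ~0.085, strongly hammy
```

So every message carrying those list headers starts out heavily weighted toward ham before its body tokens are even considered, which is how spam can slide in with scores < 0.01.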
Overall, it doesn't seem to be working very well right now, and I'm
finding I have to keep running these retraining loops a lot.
I'm looking for ideas on how to manage this better. I'm afraid that if
I just run retraining loops every single day, I'll eventually end up
with a wordlist that has a gross divergence from reality.
Should I really expect to have to run this many retests all the time?
More information about the Bogofilter mailing list