divergence?
Tom Allison
tallison at tacocat.net
Mon May 17 13:12:23 CEST 2004
Not sure what else to call it....
I recently decided to rebuild my wordlist to use the block_on_subnet and
replace_nonascii_characters settings, which tested out the same or better
on my corpus. I don't think any of this is related, but it explains why I
rebuilt my wordlist.
After getting everything built up the first time, I found I had to run
some routines to train to exhaustion in order to correct some really bad
errors. For example, I have one address at aol.com that shouldn't be
spam, but everything aol-ish scores >0.9 because everything else from
aol.com is spam. So, after a lot of retraining, I can now get these
emails without trolling for them in my spam folders.
I observed that it took some 15-20 loops (over the same messages) to get
these emails straightened out.
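For what it's worth, the loop I'm running amounts to this (a minimal Python sketch of the train-to-exhaustion idea only; bogofilter itself is driven from the command line, and the classify/train functions here are hypothetical placeholders):

```python
# Illustrative train-to-exhaustion loop, NOT bogofilter's actual code.
# `classify` returns a spam probability in [0, 1]; `train` updates the
# wordlist with the message's correct label.

def train_to_exhaustion(messages, classify, train, max_passes=25):
    """Retrain on misclassified messages until a pass has no errors.

    `messages` is a list of (text, is_spam) pairs. Returns the number of
    passes it took to converge (or max_passes if it never did).
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, is_spam in messages:
            predicted_spam = classify(text) > 0.5
            if predicted_spam != is_spam:
                train(text, is_spam)   # correct the wordlist
                errors += 1
        if errors == 0:
            return passes              # clean pass: converged
    return max_passes                  # gave up without converging
```

The point is that each pass can shift token counts enough to flip other messages, which is why it takes many passes rather than one.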
(NOTE: figure ~2000 of the ham are from mailing lists, that's a lot of
spam!)
But after about four days, it's starting to diverge again, only this
time I have spam coming in as ham with scores < 0.01!
Details show that there is a huge contribution to the overall scores
from the Debian mailing list headers. Debian mailing lists run 100+
emails a day, with ~10 of those being spam. But the overwhelming
majority are ham, which pushes the overall scores way down on the
headers alone.
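A back-of-the-envelope calculation shows why (this is a Graham-style estimate, not bogofilter's exact Robinson computation, and the corpus totals are made-up numbers, but the direction is the same):

```python
# Rough token spamicity: how spam-like is a token, given how often it
# appears in each corpus? Illustrative only; bogofilter's actual
# Robinson/Fisher scoring differs in detail.

def spamicity(spam_count, ham_count, n_spam, n_ham):
    """Estimated P(spam | token) from per-corpus token frequencies."""
    p_spam = spam_count / n_spam   # token frequency in the spam corpus
    p_ham = ham_count / n_ham      # token frequency in the ham corpus
    return p_spam / (p_spam + p_ham)

# A Debian list-header token seen in ~10 spam vs ~90 ham of daily list
# traffic, against hypothetical corpus totals of 3000 spam / 2500 ham:
print(round(spamicity(10, 90, 3000, 2500), 3))  # ~0.085, strongly hammy
```

So every message carrying those list headers starts out heavily weighted toward ham before its body tokens are even considered, which is how spam can slide in with scores < 0.01.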
Overall, it doesn't seem to be working very well right now, and I'm
finding I have to keep running these retraining loops a lot.
I'm looking for ideas on how to manage this better. I'm afraid that if
I just run retraining loops every single day, I'll eventually end up
with a wordlist that has a gross divergence from reality.
Should I really expect to have to run this many retests all the time?
More information about the Bogofilter mailing list