divergence?

David Relson relson at osagesoftware.com
Mon May 17 15:17:49 CEST 2004


On Mon, 17 May 2004 07:12:23 -0400
Tom Allison wrote:

> Not sure what else to call it....
> 
> I recently decided to rebuild my wordlist to use the settings of 
> block_on_subnet and replace_nonascii_characters which seemed to test
> out the same or better for my corpus.  I don't think any of this is
> related, but it explains why I would rebuild my wordlist.
> 
> After getting everything built up the first time, I found I had to run
> some routines to train to exhaustion in order to correct for some
> really bad errors.  For example, I have one address at aol.com that
> shouldn't be spam, but everything aol-ish scores >0.9 because
> everything else from aol.com is spam.  So, lots of retraining later, I
> can get these emails without trolling for them in my spam folders.
> 
> I observed it took some 15-20 loops (on the same messages) to get
> these emails straightened out.
> 
> (NOTE: figure ~2000 of the ham are from mailing lists; that's a lot of
> spam!)
> 
> But after about four days, it's starting to diverge again, only this 
> time I have spam coming in as ham with scores < 0.01!

Hi Tom,

Have you run some of the problem messages through "bogofilter -vvv" to
see what the high and low scoring tokens are?  That should help you
understand why bogofilter is behaving as it is.  It might also reveal
incorrectly scored tokens that point to training errors.
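For intuition, here's a toy sketch (in Python, not bogofilter's actual code) of the kind of per-token spamicity that the verbose output reports.  The smoothing follows a Robinson-style (s, x) formula, and all of the counts below are invented for the example:

```python
# Toy per-token spamicity, Robinson-style smoothing.
# NOT bogofilter's implementation; counts are invented.

def spamicity(spam_count, ham_count, spam_total, ham_total, s=1.0, x=0.5):
    """Smoothed token probability: the raw spam/ham ratio is pulled
    toward a neutral prior x, weighted by strength s and the number
    of times the token has been seen."""
    if spam_total == 0 or ham_total == 0:
        return x
    if spam_count == 0 and ham_count == 0:
        return x
    p = (spam_count / spam_total) / (
        spam_count / spam_total + ham_count / ham_total)
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)

# hypothetical wordlist counts: token -> (count in spam, count in ham)
wordlist = {"viagra": (40, 0), "debian-devel": (10, 1900), "aol.com": (80, 1)}
spam_msgs, ham_msgs = 1000, 2000

for token, (sc, hc) in wordlist.items():
    print(f"{token:15s} {spamicity(sc, hc, spam_msgs, ham_msgs):.6f}")
```

A token seen almost only in ham (like a list header) ends up with a spamicity near zero, which is exactly how a few dozen such headers can drag a whole message's score down.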

> Details show that there is a huge contribution to the overall scores
> from the mailing list headers from Debian.  Debian mailing lists run
> ~100+ emails a day with ~10 of those being spam.  But the overwhelming
> majority are all ham, thus pushing the overall scores way down on just
> the headers.

Lists that get spammed are a problem.  I encounter the same thing with
gnu.org and python.org lists.  Perhaps I can resurrect the "ignore
wordlist" code.  With it, one manually creates a text ignore list and
adds the problematic list-header tokens; bogofilter then ignores those
tokens during scoring.
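A rough sketch of how such an ignore list could work (the file format, names, and tokens below are hypothetical, not bogofilter's):

```python
# Hypothetical "ignore wordlist": tokens listed in a plain-text file,
# one per line, are dropped before scoring, so heavily ham-weighted
# mailing-list headers can't drag every list message down.

def load_ignore_list(path):
    """Read one token per line; blank lines are skipped."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def tokens_for_scoring(tokens, ignore):
    """Keep only the tokens that should contribute to the score."""
    return [t for t in tokens if t not in ignore]

# hypothetical header tokens one might add to the ignore file
ignore = {"list-id:debian-user", "x-mailing-list:debian"}
msg_tokens = ["viagra", "list-id:debian-user", "mortgage"]
print(tokens_for_scoring(msg_tokens, ignore))
```

Only the tokens that actually discriminate this particular message are left to be scored.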

> Overall, it doesn't seem to be working very well right now and I'm 
> finding I have to keep running these retraining loops a lot.

I've never approved of training loops, preferring to let the "correct if
unsure/error" process work, even if it takes a while.  Admittedly, an
"ignore list" violates the "take a while" principle :-)

> I'm looking for ideas on how to manage this better.  I'm afraid that
> if I just run retraining loops every single day I'll eventually end up
> with a wordlist that has a gross divergence from reality.

Likely so.

> Should I really expect to have to run this many retests all the time?

Sorry :-<  No magic answers.

David


