divergence?

Mon May 17 23:12:44 CEST 2004

David Relson wrote:

> Have you run some of the problem messages using "bogofilter -vvv" to see
> what the high and low scoring tokens are?  That should help understand
> why bogofilter's doing as it's doing.  It might also reveal some
> incorrectly scored tokens which indicate some training errors.
> 
> 
>>Details show that there is a huge contribution to the overall scores 
>>from the mailing list headers from Debian.  Debian mailing lists run 
>>~100+ emails a day with ~10 of those being spam.  But the overwhelming
>>majority are all ham, thus pushing the overall scores way down on just
>>the headers.
> 
> 
> Lists that get spammed are a problem.  I encounter the same thing with
> gnu.org and python.org lists.  Perhaps I can resurrect the "ignore
> wordlist" code.  With it, one manually creates a text ignore list and
> adds the problematical list header tokens.  Bogofilter will then ignore
> those tokens during scoring.
> 

I think I need to perfect some grep processes to strip out certain 
headers prior to going to bogofilter.  But I try to avoid modifying emails.
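
One way to strip the headers without touching the stored mail is to filter
a copy of each message on its way to bogofilter.  The sketch below is only
an illustration in Python, not anything bogofilter ships: the function name
strip_list_headers and the prefix list are assumptions, and the prefixes
should be adjusted to whatever tokens "bogofilter -vvv" shows dominating the
score.

```python
from email import message_from_bytes, policy

# Assumed header-name prefixes added by list software; tune these to the
# tokens that "bogofilter -vvv" reports as dominating the score.
LIST_HEADER_PREFIXES = ("list-", "x-mailing-list", "x-loop", "resent-")


def strip_list_headers(raw: bytes) -> bytes:
    """Return a copy of the message with list-added headers removed."""
    # compat32 keeps the remaining headers as close to the input as possible.
    msg = message_from_bytes(raw, policy=policy.compat32)
    for name in set(msg.keys()):
        if name.lower().startswith(LIST_HEADER_PREFIXES):
            del msg[name]  # removes every occurrence of this header
    return msg.as_bytes()


# Made-up sample message, for demonstration only.
sample = (
    b"From: someone@example.org\n"
    b"List-Id: <debian-user.lists.debian.org>\n"
    b"X-Mailing-List: <debian-user@lists.debian.org>\n"
    b"Subject: hello\n"
    b"\n"
    b"body text\n"
)
print(strip_list_headers(sample).decode("ascii"))
```

Wrapped in a small script that reads stdin and writes stdout, the same
function could sit in a pipe in front of bogofilter, so the delivered
message itself is never modified.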

> 
>>I'm looking for ideas on how to manage this better.  I'm afraid that
>>if I just run retraining loops every single day I'll eventually end up
>>with a wordlist that has a gross divergence from reality.
> 
> 
> Likely so.
> 
> 
>>Should I really expect to have to run this many retests all the time?
> 
> 
> Sorry :-<  No magic answers.

hmmm...  Seems not to work all that well against some of these anyways.

25 cycles and I've removed 4 of 13 so far... ugh.