divergence?
Tom Allison
tallison at tacocat.net
Mon May 17 23:12:44 CEST 2004
David Relson wrote:
> Have you run some of the problem messages using "bogofilter -vvv" to see
> what the high and low scoring tokens are? That should help understand
> why bogofilter's doing as it's doing. It might also reveal some
> incorrectly scored tokens which indicate some training errors.
>
>
>>Details show that there is a huge contribution to the overall scores
>>from the mailing list headers from Debian. Debian mailing lists run
>>~100+ emails a day with ~10 of those being spam. But the overwhelming
>>majority are all ham, thus pushing the overall scores way down on just
>>the headers.
>
>
> Lists that get spammed are a problem. I encounter the same thing with
> gnu.org and python.org lists. Perhaps I can resurrect the "ignore
> wordlist" code. With it, one manually creates a text ignore list and
> adds the problematical list header tokens. Bogofilter will then ignore
> those tokens during scoring.
>
I think I need to work out some grep filters to strip certain headers
before a message goes to bogofilter, though I try to avoid modifying emails.
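
For what it's worth, here is a rough sketch of that idea in Python rather
than grep. It builds a stripped copy of the message for scoring only, so the
delivered email is never modified. The header names in LIST_HEADERS and the
function name strip_list_headers are my own assumptions for illustration, not
anything bogofilter defines; the real set of headers to drop would have to
come from looking at the "bogofilter -vvv" output.

```python
from email import message_from_bytes, policy

# Assumed set of list-management headers to hide from the classifier.
# This is a hypothetical starting point; adjust after inspecting real scores.
LIST_HEADERS = (
    "List-Id",
    "List-Post",
    "List-Help",
    "List-Subscribe",
    "List-Unsubscribe",
    "List-Archive",
    "X-Mailing-List",
)


def strip_list_headers(raw, names=LIST_HEADERS):
    """Return a copy of the raw message (bytes) without the named headers.

    The input bytes are left untouched; only the returned copy differs.
    """
    msg = message_from_bytes(raw, policy=policy.compat32)
    for name in names:
        # Deleting a header removes every occurrence and is a no-op
        # when the header is absent.
        del msg[name]
    return msg.as_bytes()


if __name__ == "__main__":
    import sys

    # Read one message on stdin, write the stripped copy to stdout.
    sys.stdout.buffer.write(strip_list_headers(sys.stdin.buffer.read()))
```

Saved as a script, the output can be piped to bogofilter on stdin for
scoring while the original message goes on to the mailbox unchanged. The
obvious catch is that training would need to see the same stripped form,
otherwise the list tokens end up in the wordlist anyway.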
>
>>I'm looking for ideas on how to manage this better. I'm afraid that
>>if I just run retraining loops every single day I'll eventually end up
>>with a wordlist that has a gross divergence from reality.
>
>
> Likely so.
>
>
>>Should I really expect to have to run this many retests all the time?
>
>
> Sorry :-< No magic answers.
hmmm... Seems not to work all that well against some of these anyways.
25 cycles and I've removed 4 of 13 so far... ugh.