For users without corpora

tallison at tacocat.net
Thu Dec 2 17:19:46 CET 2004


> From: "Tom Allison" <tallison at tacocat.net>
>> Forever running '-u' got me into a lot of trouble.  Over time the emails
>> that originated from aol.com got so skewed that nothing came in that was
>> anywhere close to ham/unsure.  They were getting dumped into spam with
>> very high scores.  In a way, aol.com was saturated.  I have other tokens
>> that did this as well.  Simply excluding specific tokens was impracticable
>> because email from aol.com has a large array of tokens to contend with.
>
> Are you sure it was -u which produced this situation?  I've been using -u
> for over a year, started training from scratch as I described, and I've
> had no such problems.

Absolutely positive.
Out of every 1,000 emails from an AOL account, maybe 1 of them is
legitimate.  bogofilter -vvv bore this out.  The main contributors to
false positives in the case of AOL are the AOL headers themselves.
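
As a rough illustration of the feedback loop described above, here is a toy
model in Python.  It is not bogofilter's actual scoring code: the
Robinson-style smoothing, the values of s and x, the cutoff, and the rule
that a ham message is judged by the header token alone are all assumptions
made for the sketch.  Only the 3,000/3,000 starting corpus and the roughly
1-in-1,000 ham ratio come from this thread.

```python
def token_spamicity(tok_spam, tok_ham, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style smoothed spam probability for one token (toy s and x)."""
    spam_freq = tok_spam / n_spam if n_spam else 0.0
    ham_freq = tok_ham / n_ham if n_ham else 0.0
    total = spam_freq + ham_freq
    p = spam_freq / total if total else x   # unseen token falls back to x
    n = tok_spam + tok_ham
    return (s * x + n * p) / (s + n)


def simulate_autoupdate(n_messages=1000, ham_positions=(1000,), cutoff=0.9):
    """Feed messages sharing one header token through an auto-updating filter."""
    n_spam, n_ham = 3000, 3000   # starting corpus sizes mentioned in the post
    tok_spam, tok_ham = 0, 0     # assumption: the header token starts unseen
    for i in range(1, n_messages + 1):
        score = token_spamicity(tok_spam, tok_ham, n_spam, n_ham)
        truly_ham = i in ham_positions
        # Assumption: real spam is always caught, while a ham message is
        # judged by the shared header token alone.
        verdict_spam = (not truly_ham) or score >= cutoff
        # Auto-update registers the message under the filter's own verdict.
        if verdict_spam:
            n_spam += 1
            tok_spam += 1
        else:
            n_ham += 1
            tok_ham += 1
    return tok_spam, tok_ham, token_spamicity(tok_spam, tok_ham, n_spam, n_ham)


tok_spam, tok_ham, score = simulate_autoupdate()
print(tok_spam, tok_ham, round(score, 4))
```

In this sketch the one legitimate message arrives after the header token has
only ever been seen in spam, so it is scored as spam and then registered as
spam, which pushes the token further in the same direction.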

Exhaustive retraining on these few tokens would have been... well...
exhausting.

I did start with training from a corpus of 3,000 ham and 3,000 spam, so
there is already a large number of tokens present in the wordlist.  But
for the next year, I'm going to try it without running '-u' all the time
and see if it's better behaved for me.
Additionally, I did not have a threshold value before.  Now I'm putting
one in for when I do train.



