For users without corpora

Thu Dec 2 14:50:53 CET 2004

From: "Tom Allison" <tallison at tacocat.net>
> Forever running '-u' got me into a lot of trouble.  Over time the emails 
> that originated from aol.com got so skewed that nothing came in that was 
> anywhere close to ham/unsure.  They were getting dumped into spam with 
> very high scores.  In a way, aol.com was saturated.  I've other tokens that 
> did this as well.  Simply excluding specific tokens was impracticable 
> because email from aol.com has a large array of tokens to contend with.

Are you sure it was -u which produced this situation?  I've been using -u 
for over a year, started training from scratch as I described, and I've had 
no such problems.  Perhaps you should try exhaustively training on each 
error.  When passed "x", bfproxy keeps registering an email over and over 
until it is correctly classified or a limit is reached.  In this way, if I 
ever get an "unsure" containing lots of tokens normally found in spam, it 
keeps registering until those tokens are sufficiently neutral to no longer 
cause a problem.  This counteracts a disproportionate number of spams 
containing tokens which aren't necessarily spammy.  It should be less 
necessary now with thresh_update, though.
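
To make the exhaustive-training idea concrete, here is a minimal sketch in 
Python.  It is only an illustration under stated assumptions: ToyFilter, its 
averaged per-token score and its cutoffs are hypothetical stand-ins, not 
bogofilter's actual algorithm or bfproxy's code.  The only part taken from the 
description above is the loop: register the message again and again until it 
is classified correctly or a limit is reached.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical token-count wordlist; NOT bogofilter's real scoring."""

    def __init__(self, spam_cutoff=0.9, ham_cutoff=0.2):
        self.spam = Counter()  # token -> number of spam registrations
        self.ham = Counter()   # token -> number of ham registrations
        self.spam_cutoff = spam_cutoff
        self.ham_cutoff = ham_cutoff

    def register(self, tokens, label):
        """Count each distinct token once for the given label."""
        target = self.spam if label == "spam" else self.ham
        target.update(set(tokens))

    def score(self, tokens):
        """Average per-token spam fraction; unseen tokens count as 0.5."""
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(0.5 if s + h == 0 else s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5

    def classify(self, tokens):
        value = self.score(tokens)
        if value >= self.spam_cutoff:
            return "spam"
        if value <= self.ham_cutoff:
            return "ham"
        return "unsure"


def train_exhaustively(filt, tokens, label, limit=20):
    """Register `tokens` as `label` until classified as `label` or `limit`
    registrations have been made.  Returns the number of registrations."""
    count = 0
    while filt.classify(tokens) != label and count < limit:
        filt.register(tokens, label)
        count += 1
    return count


if __name__ == "__main__":
    filt = ToyFilter()
    for _ in range(5):  # "aol.com" has only ever been seen in spam
        filt.register(["aol.com", "viagra"], "spam")
    ham_msg = ["aol.com", "lunch"]
    print(filt.classify(ham_msg))  # verdict before exhaustive training
    print(train_exhaustively(filt, ham_msg, "ham"))
    print(filt.classify(ham_msg))  # verdict after exhaustive training
```

In this toy model the repeated ham registrations pull the "aol.com" token back 
toward neutral, which is the effect described above; the limit keeps a 
stubborn message from being registered forever.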

> For a new user, I would probably start with a '-u' option initially to 
> provide a faster learning curve (again, no hard proof) and then as soon as 
> bogofilter exceeds 90% success for a day, remove the '-u' option and train 
> only on error.  If they had a corpus already, you might be able to 
> exclude the '-u' step entirely.

The problem I find is that I receive no false positives or even ham unsures 
(not an awful problem :).  Therefore my wordlist would skew terribly toward 
spam if I didn't use -u.  And I don't want to use bogofilter only to find 
spam, but also to identify ham.

> Additionally, I did not train ANYTHING from a mailing list.  At this 
> point, I suspect that I can start using bogofilter for mailing lists and 
> it will work correctly based on train-on-error only.  Previously, 
> debian-users was saturated as ham and would never catch spam from 
> debian-users.  This is a reversal of my previous aol.com problem.

Exhaustive training, as I described above, should keep common list tokens 
neutral, so that messages are classified only on the content which differs.

> At this point, I would advocate limiting the use of '-u' to a point where 
> the MSG_COUNT is below some magic number.

I see no reason for doing this.  Using -u remains useful.

Tom