For users without corpora
Tom Anderson
tanderso at oac-design.com
Thu Dec 2 14:50:53 CET 2004
From: "Tom Allison" <tallison at tacocat.net>
> Forever running '-u' got me into a lot of trouble. Over time the emails
> that originated from aol.com got so skewed that nothing came in that was
> anywhere close to ham/unsure. They were getting dumped into spam with
> very high scores. In a way, aol.com was saturated. I've other tokens that
> did this as well. Simply excluding specific tokens was impracticable
> because email from aol.com has a large array of tokens to contend with.
Are you sure it was -u which produced this situation? I've been using -u
for over a year, started training from scratch as I described, and I've had
no such problems. Perhaps you should try exhaustively training on each
error: when "x" is passed to bfproxy, it keeps registering an email over and
over until it is correctly classified or a limit is reached. In this way, if I
ever get an "unsure" containing lots of tokens normally in spam, it keeps
registering until those tokens are sufficiently neutral to no longer cause a
problem. This will counteract a disproportionate number of spams containing
tokens which aren't necessarily spammy. This should be less necessary with
thresh_update now though.
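
To make the idea concrete, here is a rough sketch of that register-until-correct
loop. It is not bfproxy's code and it does not call bogofilter; ToyWordlist,
classify, train_exhaustively, the cutoffs and the scoring rule are all made up
for illustration (the score is a plain average of per-token spam fractions, not
bogofilter's real calculation).

```python
from collections import Counter


class ToyWordlist:
    """Hypothetical stand-in for a wordlist: per-token spam and ham counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def register(self, tokens, as_spam):
        # Count each distinct token once per registered message.
        target = self.spam if as_spam else self.ham
        target.update(set(tokens))

    def score(self, tokens):
        # Assumed scoring rule: mean spam fraction over tokens seen before.
        # Unknown tokens are skipped; an all-unknown message scores 0.5.
        fractions = []
        for token in set(tokens):
            s, h = self.spam[token], self.ham[token]
            if s + h:
                fractions.append(s / (s + h))
        return sum(fractions) / len(fractions) if fractions else 0.5


def classify(wordlist, tokens, spam_cutoff=0.9, ham_cutoff=0.3):
    """Three-way result using assumed cutoffs (not bogofilter defaults)."""
    score = wordlist.score(tokens)
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def train_exhaustively(wordlist, tokens, is_spam, limit=10):
    """Register the message repeatedly until it classifies correctly
    or the limit is reached. Returns the number of registrations made."""
    wanted = "spam" if is_spam else "ham"
    rounds = 0
    while classify(wordlist, tokens) != wanted and rounds < limit:
        wordlist.register(tokens, as_spam=is_spam)
        rounds += 1
    return rounds


# A token saturated by spam, then a ham message that contains it.
wordlist = ToyWordlist()
for _ in range(5):
    wordlist.register(["aol.com"], as_spam=True)

message = ["aol.com", "lunch"]
print(classify(wordlist, message))
rounds = train_exhaustively(wordlist, message, is_spam=False)
print(rounds, classify(wordlist, message))
```

In this toy model the saturated token is pulled back toward neutral a little
with each registration, which is the effect described above; the limit keeps a
stubborn message from being registered forever.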
> For a new user, I would probably start with a '-u' option initially to
> provide a faster learning curve (again, no hard proof) and then as soon as
> bogofilter exceeds 90% success for a day, remove the '-u' option and train
> only on error. If they had a corpus already, you might be able to
> exclude the '-u' step entirely.
The problem I find is that I receive no false positives or even ham unsures
(not an awful problem :). Therefore my wordlist would skew terribly toward
spam if I didn't use -u. And I don't want to use bogofilter only to find
spam, but also to identify ham.
> Additionally, I did not train ANYTHING from a mailing list. At this
> point, I suspect that I can start using bogofilter for mailing lists and
> it will work correctly based on train-on-error only. Previously,
> debian-users was saturated as ham and would never catch spam from
> debian-users. This is a reversal of my previous aol.com problem.
Using the exhaustive training as I described above should keep common list
tokens neutral and only classify based on the content which differs.
> At this point, I would advocate limiting the use of '-u' to a point where
> the MSG_COUNT is below some magic number.
I see no reason for doing this. Using -u remains useful.
Tom