For users without corpora

Thu Dec 2 13:38:28 CET 2004

On Thu, 02 Dec 2004 06:09:11 -0500
Tom Allison wrote:

> Todd Slater wrote:
> > What are your thoughts on using bogofilter with a person who doesn't
> > have a collection of spam and ham mails? Is it better to provide
> > them with a collection of both for initial training and then train
> > on error, or to ask them to wait until they've got enough spam and
> > ham mails to do the training?
> > 
> 
> I've made a few observations on this recently.  I'm sure a lot of
> this is discussed in the archives, but now it's happened to me.
> 
> Forever running '-u' got me into a lot of trouble.  Over time the
> emails that originated from aol.com got so skewed that nothing came in
> that was anywhere close to ham/unsure.  They were getting dumped into
> spam with very high scores.  In a way, aol.com was saturated.  Other
> tokens did this as well.  Simply excluding specific tokens was
> impracticable because email from aol.com has a large array of
> tokens to contend with.
> 
> I deleted my wordlist and rebuilt it by training on a known corpus,
> and I now train only on error.  I suspect (no proof here) that
> training on error has a slower learning curve but is less likely to
> saturate certain tokens.  I probably trained it incorrectly
> because I trained all the ham and then all the spam, but I wasn't
> willing to spend much more time than that.  It's working for me.
> 
> About a year ago I started over at zero with '-u' for training. 
> bogofilter was fantastically stupid for about 12 emails.
> After a day it was reasonably consistent.
> By the end of a week it was doing well enough that I could have
> removed the '-u' option.
> 
> For a new user, I would probably start with a '-u' option initially to
> provide a faster learning curve (again, no hard proof) and then as
> soon as bogofilter exceeds 90% success for a day, remove the '-u'
> option and train only on error.  If they already had a corpus, you
> might be able to skip the '-u' step entirely.
> 
> Additionally, I did not train ANYTHING from a mailing list.  At this 
> point, I suspect that I can start using bogofilter for mailing lists
> and it will work correctly based on train-on-error only.  Previously, 
> debian-users was saturated as ham and would never catch spam from 
> debian-users.  This is a reversal of my previous aol.com problem.

I deal with mailing list spam via an ignore list.  gnu.org's list policy is
to allow posting, without requiring subscription.  That opens the door
for spam.  By putting the highly hammish header tokens from a list into
the ignore list, bogofilter's scoring is mainly based on tokens from the
message body.  For my environment, this results in a higher number of
unsures, which is acceptable.

> At this point, I would advocate limiting the use of '-u' to a point 
> where the MSG_COUNT is below some magic number.

My wordlist is over 2 years old and has been maintained with '-u' (except
for the first 2 months).  It holds approx 66,000 spam and 76,000 ham.  It's
stable and accurate.  Size would have been much larger, except that I
started using thresh_update=0.01 so messages scoring below 0.01 or above
0.99 wouldn't autoupdate.
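
To make the thresh_update rule concrete, here is a small Python sketch of
the decision as described above.  It is only an illustration of that
description, not bogofilter's own code; the function name
should_autoupdate is made up for the example.

```python
def should_autoupdate(score: float, thresh_update: float = 0.01) -> bool:
    """Illustrative only: decide whether '-u' would register a message.

    Assumption (from the description above): messages whose score is
    below thresh_update or above 1 - thresh_update are already
    classified with high confidence, so they are not added to the
    wordlist.
    """
    return thresh_update <= score <= 1.0 - thresh_update


if __name__ == "__main__":
    # A clearly hammish, a borderline, and a clearly spammy score.
    for score in (0.001, 0.5, 0.999):
        print(score, should_autoupdate(score))
```

With thresh_update=0.01, only the middle score (0.5) would be registered;
the two extreme scores would be skipped, which is what keeps the wordlist
from growing with messages it already classifies confidently.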

> I started out with 3000 ham, 3000 spam and did a complete rebuild of
> the bogofilter.cf and wordlist.  I think this will last me for a very
> long time.  My previous list, that became saturated, was ~12 months
> old.

As always, mileage varies :-)