For users without corpora

Thu Dec 2 12:09:11 CET 2004

Todd Slater wrote:
> What are your thoughts on using bogofilter with a person who doesn't
> have a collection of spam and ham mails? Is it better to provide them
> with a collection of both for initial training and then train on error,
> or to ask them to wait until they've got enough spam and ham mails to do
> the training?
> 

I've made a few observations on this recently.  I'm sure a lot of it 
has already been discussed in the archives, but now it has happened to me.

Forever running '-u' got me into a lot of trouble.  Over time the emails 
that originated from aol.com got so skewed that nothing came in that was 
anywhere close to ham/unsure.  They were getting dumped into spam with 
very high scores.  In a way, aol.com was saturated.  I have other tokens 
that did this as well.  Simply excluding specific tokens was 
impracticable because email from aol.com has a large array of tokens to 
contend with.

I deleted my wordlist and rebuilt it by training on a known corpus, and 
now I train only on error.  I suspect (no proof here) that training on 
error will have a slower learning curve but be less likely to saturate 
certain tokens.  I probably trained it incorrectly because I trained all 
the ham and then all the spam, but I wasn't willing to spend much more 
time than that.  It's working for me.
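
A minimal sketch of the train-on-error idea, with ham and spam mixed 
rather than fed in as two separate batches.  The names train_on_error, 
classify and train are hypothetical: classify stands in for whatever 
asks bogofilter for a verdict, and train for whatever registers a 
message as spam or ham (bogofilter's '-s' and '-n' options would be the 
usual way to do that).  This is an illustration, not a tested recipe.

```python
import random


def train_on_error(labelled, classify, train, seed=0):
    """Train only on messages the filter currently gets wrong.

    labelled: list of (message, label) pairs, label is "spam" or "ham".
    classify: hypothetical callable returning "spam", "ham" or "unsure".
    train:    hypothetical callable that registers (message, label).
    Returns the number of messages that had to be trained.
    """
    shuffled = list(labelled)
    # Mix ham and spam instead of training all ham and then all spam.
    random.Random(seed).shuffle(shuffled)

    corrections = 0
    for message, label in shuffled:
        # An "unsure" verdict is treated as an error too.
        if classify(message) != label:
            train(message, label)
            corrections += 1
    return corrections
```

Messages the filter already classifies correctly are never registered, 
which is the reason to expect less token saturation than with '-u'.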

About a year ago I started over at zero with '-u' for training. 
bogofilter was fantastically stupid for about 12 emails.
After a day it was reasonably consistent.
By the end of a week it was doing well enough that I could have removed 
the '-u' option.

For a new user, I would probably start with a '-u' option initially to 
provide a faster learning curve (again, no hard proof) and then as soon 
as bogofilter exceeds 90% success for a day, remove the '-u' option and 
train only on error.  If they already have a corpus, you might be able 
to skip the '-u' step entirely.
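
The "90% for a day" rule of thumb could be checked with something like 
the sketch below.  The function name and the idea of keeping a list of 
per-message outcomes for the day are assumptions made for illustration; 
bogofilter does not keep such a tally itself.

```python
def ready_to_drop_u(outcomes, threshold=0.90):
    """Decide whether a day's results justify removing '-u'.

    outcomes:  list of booleans for one day, True where the verdict
               was correct (hypothetical bookkeeping done by the user).
    threshold: success rate that must be exceeded (90% by default).
    """
    if not outcomes:
        # No mail that day: no evidence, so keep '-u' for now.
        return False
    success_rate = sum(outcomes) / len(outcomes)
    return success_rate > threshold
```

For example, 19 correct verdicts out of 20 (95%) would pass, while 9 out 
of 10 (exactly 90%) would not, since the rate has to exceed the threshold.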

Additionally, I did not train ANYTHING from a mailing list.  At this 
point, I suspect that I can start using bogofilter for mailing lists and 
it will work correctly based on train-on-error only.  Previously, 
debian-users was saturated as ham, so bogofilter would never catch spam 
from debian-users.  This is the reverse of my previous aol.com problem.

At this point, I would advocate limiting the use of '-u' to the period 
when the MSG_COUNT is still below some magic number.
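
As a sketch of that policy: a wrapper could decide whether to pass '-u' 
based on the current message count.  The function name and the limit of 
1000 are purely hypothetical (the "magic number" is left open above), 
and how the count is read from the wordlist is left out here.

```python
def bogofilter_command(msg_count, limit=1000):
    """Build the bogofilter argument list for the current message count.

    msg_count: number of messages the wordlist has been trained on.
    limit:     hypothetical "magic number"; above it '-u' is dropped.
    """
    args = ["bogofilter"]
    if msg_count < limit:
        # Young wordlist: keep auto-updating for a faster learning curve.
        args.append("-u")
    return args
```

Once the count passes the limit, the command runs without '-u' and the 
wordlist changes only through explicit train-on-error corrections.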

I started out with 3000 ham and 3000 spam and did a complete rebuild of 
the bogofilter.cf and wordlist.  I think this will last me for a very 
long time.  My previous list, which became saturated, was ~12 months old.