For users without corpora
Tom Allison
tallison at tacocat.net
Thu Dec 2 12:09:11 CET 2004
Todd Slater wrote:
> What are your thoughts on using bogofilter with a person who doesn't
> have a collection of spam and ham mails? Is it better to provide them
> with a collection of both for initial training and then train on error,
> or to ask them to wait until they've got enough spam and ham mails to do
> the training?
>
I've made a few observations on this recently. I'm sure a lot of this
has already been discussed in the archives, but now it's happened to me.
Forever running '-u' got me into a lot of trouble. Over time the emails
that originated from aol.com got so skewed that nothing came in that was
anywhere close to ham/unsure. They were getting dumped into spam with
very high scores. In a way, aol.com was saturated. I've had other tokens
that did this as well. Simply excluding specific tokens was
impractical because email from aol.com carries a large array of tokens to
contend with.
I deleted my wordlist and rebuilt it by training on a known corpus,
and I now train only on error. I suspect (no proof here) that
training on error has a slower learning curve but is less likely to
saturate certain tokens. I probably trained it incorrectly because I
trained all the ham and then all the spam, but I wasn't willing to spend
much more time than that. It's working for me.
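To make the ordering point concrete, here is a minimal sketch of train-on-error over an interleaved corpus rather than all ham followed by all spam. It is only one reading of the remark above, not something bogofilter ships: `interleave`, `train_on_error`, and the `classify`/`register` callbacks are hypothetical names, and a real setup would also have to decide what to do with "unsure" results, which this sketch ignores.

```python
from itertools import zip_longest
from typing import Callable, Iterable, Iterator, Tuple

_MISSING = object()  # sentinel so that falsy messages are not dropped


def interleave(ham: Iterable[str], spam: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (message, is_spam) pairs, alternating ham and spam while both last."""
    for h, s in zip_longest(ham, spam, fillvalue=_MISSING):
        if h is not _MISSING:
            yield h, False
        if s is not _MISSING:
            yield s, True


def train_on_error(
    messages: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],        # hypothetical: True means "spam"
    register: Callable[[str, bool], None],  # hypothetical: record the correct label
) -> int:
    """Register only the messages the filter gets wrong; return how many that was."""
    corrections = 0
    for msg, is_spam in messages:
        if classify(msg) != is_spam:
            register(msg, is_spam)
            corrections += 1
    return corrections
```

Because only misclassified messages are registered, the wordlist counts grow slowly, which fits the guess above that train-on-error learns more slowly but is less prone to saturation.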
About a year ago I started over at zero with '-u' for training.
bogofilter was fantastically stupid for about 12 emails.
After a day it was reasonably consistent.
By the end of a week it was doing well enough that I could have removed
the '-u' option.
For a new user, I would probably start with a '-u' option initially to
provide a faster learning curve (again, no hard proof) and then as soon
as bogofilter exceeds 90% success for a day, remove the '-u' option and
train only on error. If they already had a corpus, you might be able
to skip the '-u' step entirely.
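The "90% success for a day" rule of thumb can be written down as a small check. This is a sketch under assumptions: the `DayTally` bookkeeping is hypothetical (bogofilter does not track hand corrections for you), and the 0.90 target is simply the figure suggested above, not a tested value.

```python
from dataclasses import dataclass


@dataclass
class DayTally:
    """One day of filtered mail (hypothetical bookkeeping kept by the user)."""
    total: int   # messages bogofilter classified that day
    errors: int  # messages the user had to correct by hand


def can_drop_autoupdate(day: DayTally, target: float = 0.90) -> bool:
    """Return True once a day's success rate exceeds the target."""
    if day.total == 0:
        return False  # no mail that day, so no evidence either way
    success = (day.total - day.errors) / day.total
    return success > target
```

For example, `can_drop_autoupdate(DayTally(total=100, errors=5))` is true, while a day with 10 errors out of 100 sits exactly at 90% and does not yet qualify.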
Additionally, I did not train ANYTHING from a mailing list. At this
point, I suspect that I can start using bogofilter for mailing lists and
it will work correctly based on train-on-error only. Previously,
debian-users was saturated as ham, so bogofilter would never catch spam
from debian-users. This is the reverse of my earlier aol.com problem.
At this point, I would advocate limiting the use of '-u' to the period
when MSG_COUNT is still below some magic number.
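A rough sketch of that cutoff, assuming the spam and ham message counts have already been read from the wordlist by some other means. The function name and the default limit of 1000 are hypothetical; the "magic number" is left open above, so treat the value as a placeholder.

```python
from typing import List


def bogofilter_args(spam_count: int, ham_count: int, limit: int = 1000) -> List[str]:
    """Build a bogofilter command line, adding -u only while the wordlist is small.

    `limit` is a placeholder for the "magic number"; no value is proposed here.
    """
    args = ["bogofilter"]
    if spam_count + ham_count < limit:
        args.append("-u")  # still in the fast-learning phase: keep autoupdate on
    return args
```

Past the limit the command runs without '-u', and the wordlist changes only when the user registers a misclassified message by hand.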
I started out with 3000 ham and 3000 spam and did a complete rebuild of
bogofilter.cf and the wordlist. I think this will last me for a very long
time. My previous list, the one that became saturated, was ~12 months old.