Tuning bogofilter
Tom Anderson
tanderson at orderamidchaos.com
Tue Oct 21 03:18:47 CEST 2008
David Relson wrote:
> We recommend against training over and over with the same messages as
> it biases the wordlist. Training with ham and spam yields a wordlist
> that indicates how often individual words (tokens) occur in ham and in
> spam. For a simplified example: if "xyzw" occurs in 10% of your ham
> and in 20% of your spam, then a message with "xyzw" is twice as likely
> to be spam as it is to be ham. If you keep training with the same spam
> you skew the results -- which is not recommended.
I don't really see biasing the wordlist toward classifying a spam message
as spam to be particularly problematic. If a particular word ought to
be hammier, then it
will become so in the course of training your hams. My experience has
been that sometimes you don't receive enough spams to make some tokens
spammy enough, and I therefore train these spams multiple times until
bogofilter recognizes them appropriately as spams. Otherwise, I will
keep receiving them as false negatives. For instance, if "xyzw" has
only occurred twice, but it is absolutely 100% always spammy and you
never ever want to see it again, then just keep training on that spam
until bogofilter recognizes it as such. This is as if you have received
the spam many times, but without the inconvenience of actually having
done so. When you later train a ham message which contains another
token "abcd" which may have appeared alongside "xyzw" and is now
spammier than it should have been, "abcd" will become hammier while
"xyzw" remains very spammy. In the end, that is the result you want.
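
To make the argument above concrete, here is a toy sketch in Python. It
is not bogofilter's actual algorithm or wordlist format: the
ToyWordlist class, its Robinson-style smoothing, and the s and x values
are all assumptions chosen only for illustration. It shows a spam
containing "xyzw" and "abcd" being trained repeatedly, and "abcd" then
being pulled back toward neutral by later ham training.

```python
from collections import Counter


class ToyWordlist:
    """Hypothetical per-token message counts; not bogofilter's real wordlist."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, tokens, is_spam):
        # Count each token once per trained message.
        if is_spam:
            self.spam_tokens.update(set(tokens))
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(set(tokens))
            self.ham_msgs += 1

    def spamicity(self, token, s=1.0, x=0.5):
        # s (prior strength) and x (score for unseen tokens) are assumed values.
        n = self.spam_tokens[token] + self.ham_tokens[token]
        if n == 0:
            return x
        spam_rate = self.spam_tokens[token] / self.spam_msgs if self.spam_msgs else 0.0
        ham_rate = self.ham_tokens[token] / self.ham_msgs if self.ham_msgs else 0.0
        p = spam_rate / (spam_rate + ham_rate)
        # Smooth toward x when the token has been seen only a few times.
        return (s * x + n * p) / (s + n)


wl = ToyWordlist()
for _ in range(20):
    wl.train(["meeting", "lunch"], is_spam=False)
    wl.train(["offer", "free"], is_spam=True)

stubborn_spam = ["xyzw", "abcd", "offer"]
wl.train(stubborn_spam, is_spam=True)
after_once = wl.spamicity("xyzw")

# Train the same spam nine more times, as described above.
for _ in range(9):
    wl.train(stubborn_spam, is_spam=True)
after_repeats = wl.spamicity("xyzw")
abcd_before_ham = wl.spamicity("abcd")

# Later, "abcd" turns up in ordinary ham.
for _ in range(10):
    wl.train(["abcd", "meeting"], is_spam=False)
abcd_after_ham = wl.spamicity("abcd")
xyzw_after_ham = wl.spamicity("xyzw")

print(after_once, after_repeats, abcd_before_ham, abcd_after_ham, xyzw_after_ham)
```

In this toy model "xyzw" goes from 0.75 after one training to about
0.95 after ten, and stays there, while "abcd" drops from about 0.95
back to 0.5 once it has appeared in as many hams as spams. Real
wordlists will behave less tidily.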
As I see it, most of the English language should wind up essentially
neutral in your wordlist, with only the truly hammy and spammy words
standing out, like a whitelist and blacklist respectively. If some
portion of the general language moves slightly hammy or spammy due to
"over-training" some particular emails, it shouldn't have a large effect
on the classification since it is largely only the trigger tokens which
will ultimately decide it. If a message is so wishy-washy as to contain
no obviously hammy or spammy trigger tokens, or is perhaps well-crafted
enough to contain equal numbers of each, then it deserves to be marked
unsure so that you can classify it manually.
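
As a rough illustration of the point above, the following Python sketch
combines per-token spamicities with a chi-square (Fisher-style)
calculation and ignores tokens close to neutral. The function names and
the min_dev, ham_cutoff, and spam_cutoff values are assumptions picked
for the example; they are modeled loosely on bogofilter's configuration
parameters of the same names, not taken from its source.

```python
import math


def chi2q(x2, dof):
    """Upper tail of the chi-square distribution for an even dof."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def classify(spamicities, min_dev=0.1, ham_cutoff=0.45, spam_cutoff=0.99):
    # Keep only "trigger" tokens: those far enough from neutral (0.5).
    strong = [f for f in spamicities if abs(f - 0.5) >= min_dev]
    if not strong:
        return 0.5, "unsure"
    # Clamp to avoid log(0) for tokens scored exactly 0 or 1.
    eps = 1e-6
    strong = [min(max(f, eps), 1.0 - eps) for f in strong]
    n = len(strong)
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in strong), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in strong), 2 * n)
    score = (1.0 + spam_evidence - ham_evidence) / 2.0
    if score >= spam_cutoff:
        return score, "spam"
    if score <= ham_cutoff:
        return score, "ham"
    return score, "unsure"


neutral_only = classify([0.5, 0.52, 0.48, 0.55])
spammy = classify([0.99] * 5 + [0.5] * 20)
hammy = classify([0.01] * 5 + [0.5] * 20)
balanced = classify([0.99, 0.01] * 3)

print(neutral_only[1], spammy[1], hammy[1], balanced[1])
```

With these assumed settings, the twenty neutral tokens have no effect
on the spammy and hammy messages, while the message with no trigger
tokens and the one with equal numbers of each both come out unsure.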
Tom