Anyone having problems training bogofilter with this stuff?
relson at osagesoftware.com
Sat Dec 23 12:02:53 EST 2006
On Sat, 23 Dec 2006 17:04:57 +0100
Nigel Henry wrote:
> I'm receiving some stuff with what looks like Chinese or Japanese
> characters, interspersed with lower- and upper-case Latin characters.
> I've run a few training sessions with it, but bogofilter is
> I've pasted a sample below, as trying to set it as a .txt file in a
> text editor results in some very weird characters.
> I'm still using bogofilter-1.0.2 at the moment, which has been
> working very well up to now. It's processing mail directly dl'd to
> I haven't upgraded bogofilter, because I was concerned about how it
> might affect the wordlist.db, but I do have 2 maildir mailboxes in
> Kmail where I save some of the ham, and all of the spam, apart from
> that which is being correctly identified as spam, so it isn't a big
> problem to recreate the wordlist.db.
> Any suggestions on how to deal with this spam that bogofilter is having
> problems with?
> btw. It is ending up in the unsure mailbox, so bogofilter obviously
> thinks there is something dodgy about it.
It sounds like the message has lots of tokens that aren't in your
wordlist as well as a bunch that are good and a bunch that are spam.
This would cause the "unsure" score.
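To illustrate why such a mix lands in "unsure", here is a rough Python sketch of Fisher-style score combining. It is a simplification, not bogofilter's actual code. The constants (ROBX, MIN_DEV, HAM_CUTOFF, SPAM_CUTOFF) and the function names are assumptions chosen for illustration, so check your own configuration for the real values.

```python
import math

# Illustrative constants (assumptions, not read from any real config).
ROBX = 0.52         # score given to tokens not in the wordlist
MIN_DEV = 0.375     # tokens closer than this to 0.5 are ignored
HAM_CUTOFF = 0.45
SPAM_CUTOFF = 0.99


def chi2_sf(x, dof):
    """Chi-square survival function for an even number of degrees of freedom."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(scores):
    """Combine per-token spam scores into one message score (Fisher-style)."""
    used = [f for f in scores if abs(f - 0.5) >= MIN_DEV]
    if not used:
        return ROBX  # nothing informative: fall back to the unknown-token score
    dof = 2 * len(used)
    h = chi2_sf(-2.0 * sum(math.log(f) for f in used), dof)
    s = chi2_sf(-2.0 * sum(math.log(1.0 - f) for f in used), dof)
    return (1.0 + h - s) / 2.0


def classify(scores):
    score = combine(scores)
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


# Five spammy tokens, five hammy tokens, and twenty unknown tokens.
mixed = [0.99] * 5 + [0.01] * 5 + [ROBX] * 20
print(classify(mixed))
```

In this sketch the unknown tokens sit too close to 0.5 to count, and the spammy and hammy tokens pull against each other, so the combined score ends up near the middle. Training on more of these messages moves the unknown tokens toward the spam end.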
Using "bogofilter -vv" will give you a histogram of token scores and
"bogofilter -vvv" will show each token and its score. The FAQ
describes these options in more detail.
Are you using bogofilter's default encoding, i.e. unicode (a.k.a.
utf-8)? What about replace_nonascii_characters? Using the defaults,
i.e. unicode=yes and replace_nonascii_characters=no, is recommended.
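As a rough illustration of why replace_nonascii_characters=no matters for this kind of mail, the Python sketch below assumes the option replaces every byte with the high bit set by a question mark. The helper name replace_nonascii is hypothetical and the code only approximates the behaviour; it is not taken from bogofilter.

```python
def replace_nonascii(raw: bytes) -> bytes:
    """Replace every high-bit byte with '?' (assumed behaviour of the option)."""
    return bytes(b if b < 0x80 else ord("?") for b in raw)


# Two different CJK characters, each three bytes long in UTF-8.
first = "\u8cb7".encode("utf-8")
second = "\u85ac".encode("utf-8")

print(replace_nonascii(first))
print(replace_nonascii(second))
print(replace_nonascii(first + b" cheap " + second).split())
```

Under that assumption, distinct CJK words of the same byte length collapse into the same run of question marks, so the wordlist cannot tell them apart. Leaving the option off keeps the original tokens available for training.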
Apart from possibly non-ideal settings for those two options, it sounds
like a matter of training.