Anyone having problems training bogofilter with this stuff?

David Relson relson at osagesoftware.com
Sat Dec 23 18:02:53 CET 2006


On Sat, 23 Dec 2006 17:04:57 +0100
Nigel Henry wrote:

> I'm receiving some stuff with what looks like Chinese or Japanese
> characters, interspersed with lower- and upper-case Latin characters.
> I've run a few training sessions with it, but bogofilter is
> struggling.
> 
> I've pasted a sample below, as trying to set it as a .txt file in a
> text editor results in some very weird characters.
> 
> ��シ�`�フ�N���W�b�g�J�[�h���v���[���g�キ���ニ�セ�、�フ���l�ヲ�ス�フ�ナ�キ�ェ�A
> �サ���ェ�~�オ�ゥ�チ�ス���ノ�g�ヲ�ネ�「�フ�ナ�A�ャ�リ���ナ�フ�������l�ヲ�ト�「���フ�ナ�キ�ェ�@�ス�ナ�キ�ゥ
> �H
> ���z�����ォ���ワ�ネ�「�ワ�ワ�f�「�ワ�キ�フ�ナ�A�サ�フ���ナ���z�����ォ�����ナ�����ィ
> �、�ニ�v�チ�ト�「�ワ�キ�B
> ���ヘ�サ�、�「�、�`���l�ヲ�ス�フ�ナ�キ�ェ�A�ヌ�、�ナ�オ���、�ゥ�H
> �ィ�ヤ���ュ�セ�ウ�「�B
> 
> I'm still using bogofilter-1.0.2 at the moment, which has been
> working very well up to now. It's processing mail directly dl'd to
> kmail.
> 
> I haven't upgraded bogofilter, because I was concerned about how it
> might affect the wordlist.db, but I do have 2 maildir mailboxes in
> Kmail where I save some of the ham, and all of the spam, apart from
> that which is being correctly identified as spam, so it isn't a big
> problem to recreate the wordlist.db.
> 
> Any suggestions on how to deal with this spam that bogofilter is having
> problems with?
> 
> btw. It is ending up in the unsure mailbox, so bogofilter obviously
> thinks there is something dodgy about it.
> 
> Nigel.

Nigel,

It sounds like the message has lots of tokens that aren't in your
wordlist as well as a bunch that are good and a bunch that are spam.
This would cause the "unsure" score.
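
To illustrate why such a mix lands in "unsure", here is a simplified
sketch of Fisher-style score combining in Python.  It is not
bogofilter's actual code: the function names, the neutral value for
unknown tokens (0.5) and the MIN_DEV threshold are all hypothetical
choices made for the example.

```python
import math

MIN_DEV = 0.375  # hypothetical: ignore tokens this close to neutral (0.5)

def chi2q(x: float, dof: int) -> float:
    """Upper-tail chi-square probability, valid for even dof only."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(token_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    # Unknown tokens sit near 0.5 and carry no evidence, so drop them.
    used = [p for p in token_probs if abs(p - 0.5) >= MIN_DEV]
    n = len(used)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(p) for p in used)
    ln_ham = sum(math.log(1.0 - p) for p in used)
    spam_side = chi2q(-2.0 * ln_spam, 2 * n)  # close to 1 when tokens look spammy
    ham_side = chi2q(-2.0 * ln_ham, 2 * n)    # close to 1 when tokens look hammy
    return (1.0 + spam_side - ham_side) / 2.0

# Five spammy tokens, five hammy tokens and ten unknown ones.
mixed = [0.99] * 5 + [0.01] * 5 + [0.5] * 10
print(combined_score(mixed))
```

In this sketch the spammy and hammy tokens cancel and the unknown ones
are ignored, so the mixed message scores about 0.5, between any
reasonable ham and spam cutoffs.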

Using "bogofilter -vv" will give you a histogram of token scores and
"bogofilter -vvv" will show each token and its score.  The FAQ
describes these options in more detail.
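
If you'd rather script it, a minimal Python sketch along these lines
should work, assuming bogofilter is on your PATH and the message has
been saved to a file.  The helper names and the file name are
hypothetical; only the -vv/-vvv options come from bogofilter itself.

```python
import shutil
import subprocess

def bogofilter_command(verbosity: int = 3) -> list[str]:
    """Build the argument list, e.g. ['bogofilter', '-vvv']."""
    if verbosity < 1:
        raise ValueError("verbosity must be at least 1")
    return ["bogofilter", "-" + "v" * verbosity]

def show_token_scores(message_path: str, verbosity: int = 3) -> str:
    """Feed a saved message to bogofilter on stdin and return its output."""
    if shutil.which("bogofilter") is None:
        raise FileNotFoundError("bogofilter is not on PATH")
    with open(message_path, "rb") as msg:
        # bogofilter reports the classification through its exit status,
        # so a non-zero status is not treated as a failure here.
        result = subprocess.run(
            bogofilter_command(verbosity),
            stdin=msg,
            capture_output=True,
            text=True,
            errors="replace",
        )
    return result.stdout

# Example (hypothetical file name):
# print(show_token_scores("unsure-sample.eml", verbosity=3))
```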

Are you using bogofilter's default encoding, i.e. Unicode (a.k.a.
UTF-8)?  What about replace_nonascii_characters?  Using the defaults,
i.e. unicode=yes and replace_nonascii_characters=no, is recommended.
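
For reference, the recommended settings would look roughly like this
in a configuration file.  The file location in the comment is an
assumption about your setup, and this presumes your bogofilter version
supports both options.

```
# e.g. ~/.bogofilter.cf (location assumed; adjust for your installation)
unicode=yes
replace_nonascii_characters=no
```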

Unless those two options have non-ideal settings, it sounds like a
matter of training.

HTH,

David


