Problems with Asian Spam

David Relson relson at osagesoftware.com
Wed Nov 22 01:53:59 CET 2006


On Tue, 21 Nov 2006 19:33:05 -0500
dhottinger at harrisonburg.k12.va.us wrote:

> I keep following the list to see if anyone else has been having
> issues with Asian spam.  I'm pulling some of it out with procmail.
> But it seems to keep flooding in.  If I understand this thread
> correctly, if I have my encoding set to UTF-8 (which I do) I should
> be catching it. Is this correct?
> 
> thanks,
> ddh

Unicode (that is, the UTF-8 encoding) is the best setting for
recognizing it.  If you want to see how bogofilter parses a message and
what scores it gives the individual tokens, use bogofilter's "-v" flag.
See the FAQ for the discussion of "-vv" and "-vvv".

As with any language that is new to it, bogofilter needs training
before it starts recognizing the tokens as spammish.  Also, since
bogofilter's parsing is based on alphabetic languages (think English
and other European languages, "abc...z", "01...9", etc.), the parsing
may produce gibberish when applied to Asian languages.

On the other hand, most Asian-language messages arriving on my mail
server are classified as spam.  Another group is classified as "unsure"
because it comes through a mailing list to which I'm subscribed, and the
mailing list headers provide a lot of hammish tokens.

An alternative approach is a procmail (or maildrop) rule that says
"if Asian charset, redirect to /dev/null".

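A rough sketch of such a procmail rule follows.  It is only an
illustration under assumptions: the charset list is a guess at common
East Asian encodings and is not taken from this thread, so adjust it to
whatever actually shows up in your spam.

```
# Hypothetical recipe; the charset names below are an assumed list.
# procmail matches conditions case-insensitively by default.
:0
* ^Content-Type:.*charset="?(big5|gb2312|gbk|euc-kr|euc-jp|iso-2022-jp|iso-2022-kr|shift_jis|ks_c_5601-1987)
/dev/null
```

The condition is checked against the top-level headers only, so a
multipart message that names its charset only inside its body parts
will not match.  It is probably safer to deliver to a folder first and
review what lands there, since anything sent to /dev/null is gone for
good, legitimate mail included.
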
Regards,

David


