chinese-korean-non_latin spam

David Relson relson at osagesoftware.com
Sat Mar 8 01:37:43 CET 2003


At 12:46 PM 3/7/03, Olaf Rogalsky wrote:

>Hello,
>
>yesterday I installed bogofilter and in general I am quite satisfied
>with its performance. But unfortunately I get many spam mails from Asia,
>encoded in some eastern language. Bogofilter has no concept to break up
>the stream of characters into words/glyphs and therefore often fails
>miserably on that kind of spam.
>
>In order to solve this problem a specialized lexer would be needed for
>all major encodings (say latin/chinese/japanese/russian). Further an
>estimator for the encoding is needed, which chooses the right lexer
>(looking at the Content-Type is not enough).
>
>Differentiating between latin and chinese can be done by a simple
>statistical analysis of the content characters, but I am not sure
>about the other encodings. Unfortunately this would mean that one has
>to process the message twice, first for the character analysis and
>second for the word count analysis (at least if not all lexers are
>used simultaneously, and the right one is chosen afterwards).
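The "simple statistical analysis" mentioned above could look something like the following. This is only a rough sketch, not anything bogofilter does: the function names and the 0.5 threshold are made up for illustration, and it only helps for 8-bit encodings such as big5, gb2312 or euc-kr. A 7-bit encoding like iso-2022-jp would pass straight through it.

```python
def high_bit_ratio(raw: bytes) -> float:
    """Fraction of non-whitespace bytes that have the high bit (0x80) set."""
    payload = [b for b in raw if b not in b" \t\r\n"]
    if not payload:
        return 0.0
    return sum(1 for b in payload if b >= 0x80) / len(payload)


def looks_non_latin(raw: bytes, threshold: float = 0.5) -> bool:
    """Guess that text is in a multi-byte Asian encoding.

    The threshold is an arbitrary assumption: Western European text has
    only occasional accented characters, while big5/gb2312/euc-kr text
    consists almost entirely of high-bit bytes.
    """
    return high_bit_ratio(raw) >= threshold


if __name__ == "__main__":
    print(looks_non_latin(b"Hello, world"))
    print(looks_non_latin("你好世界".encode("gb2312")))
```

Because the ratio needs only byte counts, it can be gathered in the same pass that reads the message, so the second pass is only needed if the lexer has to be chosen before tokenizing starts.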

Greetings Olaf,

I use a procmail recipe to identify messages using the Asian character
sets.  Once in a while a message slips through because the recipe doesn't
identify the character set, but it's not often.

Here's what I use:

## Silently drop all completely unreadable mail
:0:
* 1^0 ^\/Subject:.*=\?(.*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987|windows-1251|windows-1256)\?
* 1^0 ^\/Content-Type:.*charset="?(.*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987|windows-1251|windows-1256)
spam-unreadable
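For anyone who wants to try the same charset list outside procmail, here is a rough Python equivalent. It is a sketch only: the name is_unreadable is made up, and it uses the standard email module rather than raw header matching, so it will not behave identically to the recipe in every corner case. Like procmail, it matches case-insensitively, and either header matching is enough.

```python
import re
from email import message_from_string

# Same charset names as in the procmail recipe above.
UNREADABLE = re.compile(
    r"big5|iso-2022-jp|iso-2022-kr|euc-kr|gb2312|ks_c_5601-1987"
    r"|windows-1251|windows-1256",
    re.IGNORECASE,
)

# RFC 2047 encoded words look like =?charset?encoding?text?=
ENCODED_WORD_CHARSET = re.compile(r"=\?([^?]+)\?")


def is_unreadable(raw_message: str) -> bool:
    """True if the Subject or the Content-Type names a listed charset."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    for match in ENCODED_WORD_CHARSET.finditer(subject):
        if UNREADABLE.search(match.group(1)):
            return True
    charset = msg.get_content_charset()  # None if no charset is given
    return bool(charset and UNREADABLE.search(charset))


if __name__ == "__main__":
    sample = 'Subject: hi\nContent-Type: text/html; charset="euc-kr"\n\nbody\n'
    print(is_unreadable(sample))
```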

Bogofilter also has a "replace_nonascii_characters" option in its config
file.  If you choose to use it, it will convert all characters with the
high bit set, i.e. 0x80 and above, to '?'.  This works great for English
speakers like me, though I'm not sure it's as useful for people receiving
mail in other languages (which have various accented characters).  If
you're interested, the code for setting up the character translation table
is in bogofilter/src/charset.c
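To illustrate the idea of a translation table (this is not the code from charset.c, just a minimal Python sketch of the same technique, with made-up names), a 256-entry table that maps every high-bit byte to '?' could be built like this:

```python
# One entry per possible byte value: 7-bit ASCII maps to itself,
# anything with the high bit set (0x80..0xFF) maps to '?'.
TABLE = bytes(b if b < 0x80 else ord("?") for b in range(256))


def replace_nonascii(raw: bytes) -> bytes:
    """Return raw with every high-bit byte replaced by '?'."""
    return raw.translate(TABLE)


if __name__ == "__main__":
    print(replace_nonascii("café".encode("latin-1")))  # b'caf?'
    print(replace_nonascii("你好".encode("gb2312")))    # b'????'
```

The second example shows why this helps against Asian-language spam (each two-byte character collapses into "??", so it no longer produces a stream of unique tokens) and the first shows the cost for languages with accented characters.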

David




