FAQ: Asian spam

Thu Mar 27 13:32:06 CET 2003

David Relson wrote:

>> >>Well, the problem is that Bogofilter cannot really
>> >>understand the text. I'm not sure how the lexer performs
>> >>there and if this potentially blows up the database.
>> >
>> >Bogofilter has the replace_nonascii_characters option that replaces
>> >high-bit characters, i.e. 0x80-0xFF, with question marks.
>>
>>Right, but since I receive lots of German spam, this is not
>>a good option for me I think.
> 
> Correct.

So we actually add lots of tokens to the database when
classifying Asian spam. If it is OK to drop such mail
entirely, the following works well (remove the leading "> "
quote marks):

> ## Silently drop all completely unreadable spam
> :0
> * 1^0 ^\/Subject:.*=\?(.*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987|windows-1251|windows-1256)\?
> * 1^0 ^Content-Type:.*charset="?(.*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987|windows-1251|windows-1256)
> /dev/null

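This fails on multipart messages, where the charset is declared
on the individual parts rather than in the top-level header, but
I think the fix is too risky.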
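
To illustrate the multipart case, here is a minimal sketch in Python
(not procmail, and not something I run in production) that applies the
same charset list to the Subject and to every MIME part. The function
name is_unreadable and the exact matching rule are my own assumptions;
unlike the recipe, it compares whole charset names, apart from
accepting any name that ends in "big5".

```python
import email
import re

# Charsets taken from the procmail recipe above, lowercased.
UNREADABLE = {
    "big5", "iso-2022-jp", "iso-2022-kr", "euc-kr", "gb2312",
    "ks_c_5601-1987", "windows-1251", "windows-1256",
}

# Captures the charset field of an RFC 2047 encoded word: =?charset?...
ENCODED_WORD = re.compile(r"=\?([^?]+)\?")


def _listed(charset):
    """True if the charset name is one of the listed ones."""
    name = charset.lower()
    return name in UNREADABLE or name.endswith("big5")


def is_unreadable(raw_message):
    """Hypothetical helper: check Subject and all MIME parts."""
    msg = email.message_from_string(raw_message)

    # Subject: look at the charset of every encoded word.
    subject = str(msg.get("Subject", ""))
    if any(_listed(cs) for cs in ENCODED_WORD.findall(subject)):
        return True

    # walk() yields the message itself and then every nested part,
    # so charsets declared only on sub-parts are seen as well.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset and _listed(charset):
            return True
    return False
```

Note that this flags a message as soon as any single part uses a
listed charset, which is exactly why extending the drop rule to
multipart mail is risky.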

However, anyone who needs to look at those messages has to use
bogofilter here, but I fear that might really increase the size
of the database.

pi