... convert_unicode.c ...

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Wed Jun 22 00:30:44 CEST 2005


On Mon, 20 Jun 2005, David Relson wrote:

> Attached is a list of 76 charsets in this month's spam.  I don't have
> time to see what iconv_open( "from_charset", "UTF-8" ) thinks of them
> -- have to head to work.

The list is quite interesting. As far as I can tell, most "charsets" in 
that list are bogus. Let's look at a few of them:

charset=                   <-- bogus
charset=big5
charset=%charset           <-- bogus (*)
charset=cp-1252            <-- semibogus, should read windows-1252
charset=%custom_charset    <-- bogus (*)
charset=default            <-- bogus
charset=euc                <-- semibogus, suffix (-kr, -jp) missing
charset=euc-kr
charset=euc-kr[ÇŃąšžî]     <-- bogus
charset=gb2312
charset=iso-0145-4         <-- bogus
charset=iso-0151-6         <-- bogus
charset=iso-0237-6         <-- bogus
charset=iso-0408-1         <-- bogus
charset=iso-0501-5         <-- bogus

(*) It is obvious these two values were placeholders that were supposed 
to be replaced with something, probably some random charset-like value. 
This suggests spammers set bogus charset values intentionally.
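
To make the annotations above concrete, here is a rough sketch of the
purely syntactic part of that triage. Everything in it is hypothetical
(the enum, the function name, the tiny set of rules); it is not code from
bogofilter. It only catches the cases that can be spotted without a
charset table: an empty value, an unexpanded %placeholder, "default", and
the two semibogus names from the list. Names such as iso-0145-4 look well
formed and can only be rejected by asking a converter like iconv.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical classification, mirroring the annotations in the list. */
enum charset_kind {
    CS_NEEDS_LOOKUP,   /* looks sane; must still be checked against iconv */
    CS_SEMIBOGUS,      /* recognizable but misspelled or incomplete */
    CS_BOGUS           /* cannot be a real charset name */
};

/* Classify a charset parameter value using syntax alone.
 * If a semibogus name has an obvious repair, *fixed points to it;
 * otherwise *fixed is set to NULL. */
enum charset_kind classify_charset(const char *name, const char **fixed)
{
    *fixed = NULL;

    if (name == NULL || name[0] == '\0')
        return CS_BOGUS;                   /* "charset=" */
    if (name[0] == '%')
        return CS_BOGUS;                   /* unexpanded template variable */
    if (strcasecmp(name, "default") == 0)
        return CS_BOGUS;

    if (strcasecmp(name, "cp-1252") == 0) {
        *fixed = "windows-1252";           /* the repair suggested above */
        return CS_SEMIBOGUS;
    }
    if (strcasecmp(name, "euc") == 0)
        return CS_SEMIBOGUS;               /* -kr or -jp missing; no safe guess */

    return CS_NEEDS_LOOKUP;
}
```

A caller would treat CS_NEEDS_LOOKUP as "not yet rejected", since this
check says nothing about whether the name actually exists.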

> The solutions I can think of are all hacks
> 
>   ignore tokens for invalid charsets (when scoring and registering)
>   don't register tokens from messages with invalid charsets

I think we should emulate the behaviour of common MUAs--read MS Outlook 
(Express)--in such a situation. This is what spammers expect and this is 
what they optimize their "products" for.

I guess MSOE falls back to ASCII, 8859-1, or the charset of its current
locale when it encounters an unrecognized charset name. The locale charset
is the most likely candidate, but ASCII or perhaps 8859-1 might still be a
reasonable default value.
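
A minimal sketch of such a fallback, assuming we pick ISO-8859-1 as the
default (that choice is only a guess, as said above). The function name
and the FALLBACK_CHARSET macro are made up for illustration. Note that
iconv_open() takes the target encoding first and the source encoding
second, and returns (iconv_t)-1 when it does not know a name.

```c
#include <iconv.h>
#include <stddef.h>

/* Assumed default; the locale charset would be another candidate. */
#define FALLBACK_CHARSET "ISO-8859-1"

/* Open a converter from `charset` to UTF-8.
 * If the name is missing or unknown to iconv, retry with the fallback
 * and set *used_fallback to 1. Returns (iconv_t)-1 only if the
 * fallback cannot be opened either. */
iconv_t open_to_utf8(const char *charset, int *used_fallback)
{
    iconv_t cd = (iconv_t)-1;

    *used_fallback = 0;

    if (charset != NULL && charset[0] != '\0')
        cd = iconv_open("UTF-8", charset);   /* (tocode, fromcode) */

    if (cd == (iconv_t)-1) {
        *used_fallback = 1;
        cd = iconv_open("UTF-8", FALLBACK_CHARSET);
    }
    return cd;
}
```

The used_fallback flag is there so the caller can still record that the
declared charset was unusable, independent of how the body is decoded.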

> Of course, when the bogus charset name is seen (after registration as
> spam), it has a very high spam score -- effectively a red flag!

...assuming the bogus charset name in question is going to appear again.
Spammers might (and as observed above, there is a chance they already do)
use random "nonces" as charset names to fool filters.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."





More information about the bogofilter-dev mailing list