... convert_unicode.c ...
peak at argo.troja.mff.cuni.cz
Tue Jun 21 18:30:44 EDT 2005
On Mon, 20 Jun 2005, David Relson wrote:
> Attached is a list of 76 charsets in this month's spam. I don't have
> time to see what iconv_open( "UTF-8", "from_charset" ) thinks of them
> -- have to head to work.
The list is quite interesting. As far as I can tell, most "charsets" in
that list are bogus. Let's look at a few of them:
charset= <-- bogus
charset=%charset <-- bogus (*)
charset=cp-1252 <-- semibogus, should read windows-1252
charset=%custom_charset <-- bogus (*)
charset=default <-- bogus
charset=euc <-- semibogus, suffix (-kr, -jp) missing
charset=euc-kr[ÇŃąšžî] <-- bogus
charset=iso-0145-4 <-- bogus
charset=iso-0151-6 <-- bogus
charset=iso-0237-6 <-- bogus
charset=iso-0408-1 <-- bogus
charset=iso-0501-5 <-- bogus
(*) It is obvious these two values were supposed to be replaced with
something. Probably by some random charset-like value. This suggests
spammers set bogus charset values intentionally.
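What iconv thinks of each name can be checked with a few lines of C. The sketch below is only an illustration: charset_is_known() is a hypothetical helper, not something taken from convert_unicode.c, and the answer for "semibogus" names such as cp-1252 depends on the alias table of the iconv implementation in use. Note that iconv_open() takes the target charset first and the source charset second.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from bogofilter): returns 1 if iconv can
 * open a converter from `name` to UTF-8, 0 otherwise. */
int charset_is_known(const char *name)
{
    iconv_t cd;

    assert(name != NULL);

    /* Reject the empty name explicitly: some iconv implementations
     * treat "" as "the charset of the current locale", not as an error. */
    if (name[0] == '\0')
        return 0;

    /* Argument order is (tocode, fromcode). */
    cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;

    iconv_close(cd);
    return 1;
}
```

Running this over the attached list would separate the names iconv accepts from the ones it rejects. It cannot tell a typo from a deliberately random value; that judgement still has to be made by hand.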
> The solutions I can think of are all hacks:
>  - ignore tokens for invalid charsets (when scoring and registering)
>  - don't register tokens from messages with invalid charsets
I think we should emulate the behaviour of common MUAs--read MS Outlook
(Express)--in such a situation. This is what spammers expect and this is
what they optimize their "products" for.
I guess MSOE assumes ASCII, ISO-8859-1, or the charset of its current
locale when it encounters an unrecognized charset name. The locale charset
is the most likely, but ASCII or perhaps ISO-8859-1 might still be a
reasonable default value.
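A fallback of this kind could look roughly like the sketch below. It is a minimal illustration under stated assumptions: open_to_utf8() and its parameters are hypothetical names, not existing bogofilter code, and the choice of fallback charset is left to the caller because what MSOE really does is only a guess here.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: open a converter to UTF-8 from the declared
 * charset, or from `fallback` if the declared name is empty or unknown
 * to iconv. Sets *used_fallback to 1 when the fallback was tried.
 * Returns (iconv_t)-1 if neither charset can be opened. */
iconv_t open_to_utf8(const char *declared, const char *fallback,
                     int *used_fallback)
{
    iconv_t cd = (iconv_t)-1;

    assert(declared != NULL && fallback != NULL && used_fallback != NULL);

    *used_fallback = 0;

    /* Skip the empty name: some implementations map it to the locale. */
    if (declared[0] != '\0')
        cd = iconv_open("UTF-8", declared);

    if (cd == (iconv_t)-1) {
        /* Bogus or unsupported name: decode as the fallback instead. */
        *used_fallback = 1;
        cd = iconv_open("UTF-8", fallback);
    }

    return cd;
}
```

A caller might pass "ISO-8859-1" as the fallback, or the result of nl_langinfo(CODESET) from <langinfo.h> to follow the current locale. The used_fallback flag lets the caller still record that the declared charset was bogus.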
> Of course, when the bogus charset name is seen (after registration as
> spam), it has a very high spam score -- effectively a red flag!
...assuming the bogus charset name in question is going to appear again.
Spammers might (and, as observed above, there is a chance they already do)
use random "nonces" as charset names to fool filters.
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."