convert_unicode.c
relson at osagesoftware.com
Tue Jun 21 19:13:26 EDT 2005
On Wed, 22 Jun 2005 00:30:44 +0200 (CEST)
Pavel Kankovsky wrote:
> On Mon, 20 Jun 2005, David Relson wrote:
> > Attached is a list of 76 charsets in this month's spam. I don't have
> > time to see what iconv_open( "from_charset", "UTF-8" ) thinks of them
> > -- have to head to work.
> The list is quite interesting. As far as I can tell, most "charsets" in
> that list are bogus. Let's look at a few of them:
> charset= <-- bogus
> charset=%charset <-- bogus (*)
> charset=cp-1252 <-- semibogus, should read windows-1252
> charset=%custom_charset <-- bogus (*)
> charset=default <-- bogus
> charset=euc <-- semibogus, suffix (-kr, -jp) missing
> charset=euc-kr[Ç________î] <-- bogus
> charset=iso-0145-4 <-- bogus
> charset=iso-0151-6 <-- bogus
> charset=iso-0237-6 <-- bogus
> charset=iso-0408-1 <-- bogus
> charset=iso-0501-5 <-- bogus
> (*) It is obvious these two values were supposed to be replaced with
> something. Probably by some random charset-like value. This suggests
> spammers set bogus charset values intentionally.
Right you are, Pavel. Spam contains lots of invalid charset names.
> > The solutions I can think of are all hacks
> > ignore tokens for invalid charsets (when scoring and registering)
> > don't register tokens from messages with invalid charsets
> I think we should emulate the behaviour of common MUAs--read MS Outlook
> (Express)--in such a situation. This is what spammers expect and this is
> what they optimize their "products" for.
> I guess MSOE assumes ASCII, 8859-1, or the charset of its current locale
> (most likely) when it encounters an unrecognized charset name, though
> ASCII or perhaps 8859-1 might still be a reasonable default value.
> > Of course, when the bogus charset name is seen (after registration as
> > spam), it has a very high spam score -- effectively a red flag!
> ...assuming that bogus charset name in question is going to appear again.
> Spammers might (and as observed above, there is a chance they already do)
> use random "nonces" as charset names to fool filters.
iconv_open( "to_charset", "from_charset" ) does the preparation for
character set translation. Bogofilter's iconvert() function uses
glibc's iconv() to do the work. When iconv_open() rejects the
"from_charset", a message is output to stderr and bogofilter calls
iconv_open() with "iso-8859-1" for both "from_charset" and
"to_charset". The effect of this is that _no_ translation is done. It
would be equally easy to call iconv_open( "utf-8", "iso-8859-1").
Unfortunately I don't have information that says which is better:
1 - no translation
2 - iso-8859-1 to utf-8 translation
Whichever way it's done, there will be cases where bogus tokens are
created. Of course, before 0.95.0, bogus tokens were created regularly,
for example, for Asian-language charsets.
I'm interested in reasons (or precedents) for one way or the other.
Anybody know if there's an RFC that applies?