RFC-2047

Matthias Andree matthias.andree at gmx.de
Tue Jul 22 01:08:21 CEST 2003


Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> writes:

> There is one question though: Decoding removes the charset
> info (as long as we have not implemented Unicode). So it
> might be a good idea to also add the charset to the list
> (which will catch all that Asian spam).

On the Unicode topic, that's something that needs more research.
Unifying character sets discards a bit of information. I have yet
to see spammers use iso-8859-15, but if we re-encoded to utf-8, we
would lose the distinction between iso-8859-1, iso-8859-15, windows-1252
and whatever else may be similar (CP850 maybe, from b0rked
UUCP-to-StrangeNet gates).
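
A minimal Python sketch of that loss, for illustration only (the sample
bytes and variable names are made up here, and this is not bogofilter
code): the same bytes labelled with three different charsets become
byte-identical UTF-8, so the label is the only place the distinction
lives. The 0xA4 byte shows one of the few positions where the charsets
actually disagree.

```python
# Three charset labels that a mail header might carry for the same bytes.
CHARSETS = ["iso-8859-1", "iso-8859-15", "windows-1252"]

# 0xE9 is "e with acute accent" in all three charsets.
raw = b"caf\xe9"

# Decode under each label, then unify everything to UTF-8.
utf8_forms = {cs: raw.decode(cs).encode("utf-8") for cs in CHARSETS}

# After unification the three messages are indistinguishable.
all_identical = len(set(utf8_forms.values())) == 1
print("identical after re-encoding:", all_identical)

# 0xA4 is one of the positions where the charsets differ:
# currency sign in iso-8859-1 and windows-1252, euro sign in iso-8859-15.
currency = {cs: b"\xa4".decode(cs) for cs in CHARSETS}
for cs, ch in currency.items():
    print(cs, hex(ord(ch)))
```

If the charset matters as a spam indicator, it has to be recorded
separately before the text is unified.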

Prefixing individual tokens with the full charset seems bulky to me, but
something like that would likely be effective. Things will get hairy
when you have two different character sets in one header (say, the ¼
symbol in iso-8859-1 and the € symbol in iso-8859-15 -- unless you're
using utf-7, utf-8 or windows-1252, that is).
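
As a rough sketch of the token-prefixing idea, assuming Python and its
standard email.header.decode_header: each RFC 2047 encoded word is
decoded with its own charset and every resulting token is tagged with
that charset. The function name, the "charset:token" format and the
sample Subject value are hypothetical; bogofilter's actual tokenizer
may do something quite different.

```python
from email.header import decode_header


def charset_tagged_tokens(header_value):
    """Split a header into tokens, each prefixed with its source charset.

    The "charset:token" format is an illustrative assumption.
    """
    tokens = []
    for chunk, charset in decode_header(header_value):
        # Unencoded parts come back with charset None; label them us-ascii.
        label = charset or "us-ascii"
        if isinstance(chunk, bytes):
            # An unknown charset name would raise LookupError here;
            # a real filter would need a fallback.
            text = chunk.decode(label, errors="replace")
        else:
            text = chunk
        tokens.extend(f"{label}:{word}" for word in text.split())
    return tokens


# Hypothetical header mixing two charsets: 0xBC is the one-quarter sign
# in iso-8859-1, 0xA4 is the euro sign in iso-8859-15.
subject = "=?iso-8859-1?Q?=BC_off?= =?iso-8859-15?Q?_for_5_=A4?="
tokens = charset_tagged_tokens(subject)
print(tokens)
```

Because the tag is attached per encoded word rather than per header, a
header that mixes charsets yields tokens with different prefixes.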

-- 
Matthias Andree



