RFC-2047

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Jul 22 09:15:02 CEST 2003


Matthias Andree <matthias.andree at gmx.de> wrote:

>> There is one question though: Decoding removes the charset
>> info (as long as we have not implemented Unicode). So it
>> might be a good idea to also add the charset to the list
>> (which will catch all that Asian spam).
>
>On the Unicode topic, that's something that needs to be looked
>into. Unifying character sets discards a bit of information. I have yet
>to see spammers use iso-8859-15, but should we re-encode to utf-8, we
>would lose distinction between iso-8859-1, iso-8859-15, windows-1252 and
>whatever else may be similar (CP850 maybe from b0rked UUCP-to-StrangeNet
>gates).

Right, same as decoding in header fields.
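
To illustrate the loss in header fields, here is a minimal sketch
using Python's standard email.header module. It is an illustration
only, not bogofilter code; the sample subjects and the helper name
decode_keep_charset are made up for this example.

```python
from email.header import decode_header

# Hypothetical sample headers: the same word, declared in two charsets.
SUBJECT_LATIN9 = "=?iso-8859-15?Q?Gr=FC=DFe?="
SUBJECT_UTF8 = "=?utf-8?Q?Gr=C3=BC=C3=9Fe?="


def decode_keep_charset(value):
    """Decode RFC 2047 encoded-words, returning (text, charset) pairs.

    charset is None for parts that were not encoded.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            text = chunk.decode(charset or "ascii", errors="replace")
        else:
            text = chunk
        parts.append((text, charset))
    return parts


latin9_parts = decode_keep_charset(SUBJECT_LATIN9)
utf8_parts = decode_keep_charset(SUBJECT_UTF8)

# Once only the decoded text is kept, the two headers are identical;
# the charset survives only if it is stored separately.
latin9_text = "".join(text for text, _ in latin9_parts)
utf8_text = "".join(text for text, _ in utf8_parts)
```

After decoding, latin9_text and utf8_text compare equal, while the
charset element of each pair still differs. That second element is
exactly the information that is gone if only the text is tokenized.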

>Prefixing individual tokens with the full charset seems bulky to me, 

ACK. We could use body-charset:whatever.
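
A rough sketch of that idea, again in Python and only as an
illustration: one extra token per message instead of a prefix on
every token. The function name charset_token and the "unknown"
fallback are assumptions for this example, not anything that has
been decided here.

```python
from email import message_from_string

# Hypothetical sample message.
RAW_MESSAGE = (
    'Content-Type: text/plain; charset="ISO-8859-15"\n'
    "\n"
    "Hello world\n"
)


def charset_token(raw_message):
    """Return a single 'body-charset:<name>' token for the message.

    Falls back to 'unknown' (an assumption of this sketch) when no
    charset parameter is declared.
    """
    msg = message_from_string(raw_message)
    charset = msg.get_content_charset() or "unknown"
    return "body-charset:" + charset


def tokens_with_charset(raw_message):
    """Naive whitespace tokens of the body plus one charset token."""
    msg = message_from_string(raw_message)
    body_tokens = msg.get_payload().split()
    return body_tokens + [charset_token(raw_message)]


tokens = tokens_with_charset(RAW_MESSAGE)
```

The word list stays the same size per message plus one entry, so
the bulk of per-token prefixes is avoided while the charset still
reaches the classifier.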

But I don't see why the same word should show up several
times because of different encodings. Further, we already
discussed that we cannot even tell what is whitespace or
punctuation if we don't understand the charset.
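
Both points can be shown with a small Python sketch (illustration
only; the sample words and the helper split_words are made up for
this example). The same word has different bytes in different
encodings, and the byte 0xA0 is a no-break space in iso-8859-1 but
the second byte of "à" in UTF-8.

```python
WORD = "Grüße"

# The same word yields different byte strings, hence different
# tokens, unless everything is decoded to a common representation.
as_latin9 = WORD.encode("iso-8859-15")
as_utf8 = WORD.encode("utf-8")
same_after_decoding = (
    as_latin9.decode("iso-8859-15") == as_utf8.decode("utf-8")
)


def split_words(raw, charset):
    """Decode raw bytes with the given charset, split on whitespace."""
    return raw.decode(charset).split()


# "déjà-vu" in UTF-8 contains the byte 0xA0 (second byte of "à").
RAW = "déjà-vu".encode("utf-8")

# Correct charset: one word.
right = split_words(RAW, "utf-8")

# Wrong charset: 0xA0 is read as a no-break space, so the word is
# cut in two.
wrong = split_words(RAW, "iso-8859-1")
```

With the right charset the input is one token; with the wrong one
the tokenizer sees whitespace in the middle of the word.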

pi



