Solutions for the charset issue

Wed Sep 25 11:51:37 CEST 2002

Matthias Andree wrote:

>> cp1252 has some alphabetic characters at 128 (Euro sign), 138 (LATIN
>> CAPITAL LETTER S WITH CARON), 140 (LATIN CAPITAL LIGATURE OE) etc.
> 
> Most of these are in iso-8859-15 -- at different positions than in
> Windows-1252 though.

Right. So if we don't analyze the charset in use, we lose.

>> So the question is: Can somebody come up with a set of characters
>> which are
>> a) always not part of words and
>> b) capture enough to separate words from punctuation and other words?
> 
> The question is: do we want Unicode support?

As a long term goal I think we should support all charsets (well, all
that show up in real life;-), translate them internally to Unicode and
only work with Unicode internally. This way we know a word is the same
word no matter what charset it was coded in.
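
To illustrate the point, here is a minimal sketch in Python (not bogofilter code; the function name canonical_token is made up for this example). It assumes the charset of each input is already known and shows that the same word, coded in two different charsets, becomes one and the same Unicode token:

```python
def canonical_token(raw: bytes, charset: str) -> str:
    """Decode raw bytes from the given charset into a Unicode string.

    Hypothetical helper for illustration only; a real tokenizer would
    also have to split the text into words.
    """
    return raw.decode(charset)


# The Euro sign is byte 0xA4 in ISO-8859-15 and byte 0x80 in Windows-1252.
iso_token = canonical_token(b"\xa41000", "iso-8859-15")
win_token = canonical_token(b"\x801000", "cp1252")

# The byte sequences differ, but the Unicode tokens are identical.
print(iso_token == win_token)
```

Once everything is Unicode internally, the token database only ever sees one spelling of a word, whatever the sender's software used.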

> Is there a flex-like tool that can deal with UTF-8?

How about recode? I don't know if it can be used here, though.

> Do we need to treat UTF-16 instead?

I don't think so.

> Should we
> convert all input to a particular character set, and take that as
> canonical? If so, we'd need to parse MIME, which makes the whole
> software somewhat slower.

Yes, slower, but *much* stronger. I don't think we need full MIME
support. Understanding QP (quoted-printable) is pretty simple. Base64
isn't hard either, but could be deferred a bit longer, I think. That
should be it.
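
As a rough sketch of how little is needed for these two encodings, the following Python uses the standard quopri and base64 modules. The function name decode_body and the idea of passing in the Content-Transfer-Encoding value as a plain string are assumptions for this example; real header parsing is left out:

```python
import base64
import quopri


def decode_body(payload: bytes, transfer_encoding: str) -> bytes:
    """Undo the content transfer encoding of a message body.

    Hypothetical helper: only quoted-printable and base64 are handled,
    anything else (7bit, 8bit, binary, unknown) is passed through as-is.
    """
    encoding = transfer_encoding.strip().lower()
    if encoding == "quoted-printable":
        return quopri.decodestring(payload)
    if encoding == "base64":
        return base64.b64decode(payload)
    return payload


# "=80" is the quoted-printable form of the single byte 0x80.
raw = decode_body(b"=801000", "quoted-printable")
print(raw == b"\x801000")
```

The result is still a byte string in the sender's charset; converting it to Unicode is a separate step.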

> I fear for COMPLETE i18n support, we'd need to
> canonicalize things to a common character set, because that's the only
> way we get the same token for €1000 in ISO-8859-15 and Windows-1252 --
> however, this will fail with foul apple or draughty windows software
> that declares iso-8859-1 when windows-1252 is inside.

There is really no correct way to deal with falsely declared charsets.
There is one approach that could work (Forte Agent uses it):

a) You can set a default charset which applies whenever there is no
charset declared.

b) Use a translation table (as from recode), but enrich it with common
mistakes, such as using cp1252 for the positions that ISO-8859-1 leaves
undefined in mails declared as ISO-8859-1.
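
A minimal sketch of a) and b) together, in Python. This is not how Forte Agent or recode implement it; the names DEFAULT_CHARSET, COMMON_MISTAKES and decode_mail_text, and the choice of ISO-8859-1 as the default, are assumptions for the example:

```python
import codecs
from typing import Optional

# a) Assumed default for mails that declare no charset (or an unknown one).
DEFAULT_CHARSET = "iso-8859-1"

# b) Known mislabelings, keyed by Python's canonical codec name.
#    Mail declared as ISO-8859-1 is decoded as cp1252, which agrees with
#    ISO-8859-1 on 0xA0-0xFF and adds printable characters in 0x80-0x9F.
COMMON_MISTAKES = {"iso8859-1": "cp1252"}


def decode_mail_text(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes to Unicode, tolerating missing or wrong charsets."""
    name = declared or DEFAULT_CHARSET
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        # Unknown charset label: fall back to the default.
        canonical = codecs.lookup(DEFAULT_CHARSET).name
    canonical = COMMON_MISTAKES.get(canonical, canonical)
    try:
        return raw.decode(canonical)
    except UnicodeDecodeError:
        # The bytes do not fit the chosen codec (cp1252, for instance,
        # leaves a few positions unassigned). ISO-8859-1 maps every byte
        # to some code point, so this last resort cannot fail.
        return raw.decode("iso-8859-1")


# Declared as ISO-8859-1, but byte 0x80 only makes sense as cp1252.
print(decode_mail_text(b"\x801000", "iso-8859-1") == "\u20ac1000")
```

The mistakes table is the part that would grow over time as more mislabeled mail shows up.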

> There are libraries that take care of this, like iconv, but I've never
> used those, and I'm not sure how good flex would deal with that. It
> might be necessary to look at another scanner generator or write one.

Yep.

> But let's move this to the bogofilter-dev list.

Well, I don't understand enough C programming to really support
that. I hope that I can share some thoughts about mail standards and
MIME, though.

pi