Solutions for the charset issue
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Sep 25 11:51:37 CEST 2002
Matthias Andree wrote:
>> cp1252 has some alphabetic characters at 128 (Euro sign), 138 (LATIN
>> CAPITAL LETTER S WITH CARON), 140 (LATIN CAPITAL LIGATURE OE) etc.
>
> Most of these are in iso-8859-15 -- at different positions than in
> Windows-1252 though.
Right. So if we don't analyze the charset in use, we lose.
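To make the position mismatch concrete, here is a small sketch. Python
and its standard codecs are only my choice for illustration (bogofilter
itself is written in C), and the helper name positions() is made up:

```python
# The three cp1252 byte values mentioned above.
CP1252_BYTES = [128, 138, 140]


def positions(byte_value):
    """Return (character, position in ISO-8859-15) for one cp1252 byte."""
    char = bytes([byte_value]).decode("cp1252")
    return char, char.encode("iso-8859-15")[0]


for b in CP1252_BYTES:
    char, pos15 = positions(b)
    print(f"{char!r}: cp1252 position {b}, iso-8859-15 position {pos15}")
```

The same character sits at a different byte value in each charset, so a
fixed table of "word bytes" cannot be right for both.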
>> So the question is: Can somebody come up with a set of characters
>> which are
>> a) always not part of words and
>> b) capture enough to separate words from punctuation and other words?
>
> The question is: do we want Unicode support?
As a long-term goal, I think we should support all charsets (well, all
that show up in real life ;-), translate them to Unicode, and work only
with Unicode internally. This way we know a word is the same word no
matter what charset it was encoded in.
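A minimal sketch of that idea, assuming the charset of each message is
already known. It is in Python only for brevity, and to_unicode() is a
hypothetical helper, not anything bogofilter has:

```python
def to_unicode(raw: bytes, charset: str) -> str:
    """Decode raw message bytes using the declared charset."""
    return raw.decode(charset)


# The same text with a Euro sign, encoded in two different charsets.
latin9_bytes = b"\xa41000"   # ISO-8859-15: Euro sign is byte 0xA4
cp1252_bytes = b"\x801000"   # Windows-1252: Euro sign is byte 0x80

# As raw bytes these would be counted as two different tokens ...
assert latin9_bytes != cp1252_bytes

# ... but after translation to Unicode they are one and the same.
token_a = to_unicode(latin9_bytes, "iso-8859-15")
token_b = to_unicode(cp1252_bytes, "cp1252")
assert token_a == token_b
```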
> Is there a flex-like tool that can deal with UTF-8?
How about recode? I don't know if it can be used here, though.
> Do we need to treat UTF-16 instead?
I don't think so.
> Should we
> convert all input to a particular character set, and take that as
> canonical? If so, we'd need to parse MIME, which makes the whole
> software somewhat slower.
Yes, slower, but *much* stronger. I don't think we need full MIME
support. Understanding QP is pretty simple. Base64 isn't hard either,
but it could be postponed a bit longer, I think. That should be it.
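To give an idea of how little is needed, here is a rough sketch using
Python's standard quopri and base64 modules. It assumes that the
Content-Transfer-Encoding and charset values have already been pulled
out of the headers; decode_body() is a made-up name:

```python
import base64
import quopri


def decode_body(body: bytes, transfer_encoding: str, charset: str) -> str:
    """Undo the transfer encoding, then decode the bytes to Unicode."""
    encoding = transfer_encoding.lower()
    if encoding == "quoted-printable":
        body = quopri.decodestring(body)
    elif encoding == "base64":
        body = base64.b64decode(body)
    # Anything else (7bit, 8bit, binary) is passed through unchanged.
    return body.decode(charset)
```

For example, decode_body(b"=A41000", "quoted-printable", "iso-8859-15")
yields the Euro sign followed by "1000".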
> I fear for COMPLETE i18n support, we'd need to
> canonicalize things to a common character set, because that's the only
> way we get the same token for €1000 in ISO-8859-15 and Windows-1252 --
> however, this will fail with foul apple or draughty windows software
> that declares iso-8859-1 when windows-1252 is inside.
There is really no correct way to deal with falsely declared charsets.
There is one approach that could work (Forte Agent uses it):
a) You can set a default charset which applies whenever there is no
charset declared.
b) Use a translation table (as from recode), but enrich it with common
mistakes, such as applying cp1252 at the positions where ISO-8859-1 is
undefined in mails which are declared as ISO-8859-1.
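Roughly, a) and b) could look like the sketch below. This is only my
guess at the idea, not how Forte Agent or recode actually do it. The
default charset, the names C1_FIXUPS and decode_lenient(), and the
restriction to the exact label "iso-8859-1" (aliases such as "latin1"
are not handled) are all assumptions:

```python
from typing import Optional

# a) Assumed default, used whenever a mail declares no charset.
DEFAULT_CHARSET = "iso-8859-1"


def _build_c1_fixups() -> dict:
    """Map positions 0x80-0x9F to what cp1252 puts there, where defined."""
    table = {}
    for b in range(0x80, 0xA0):
        try:
            table[b] = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # unassigned in cp1252 as well; leave it alone
    return table


# b) The "common mistakes" enrichment for mails labeled ISO-8859-1.
C1_FIXUPS = _build_c1_fixups()


def decode_lenient(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode with the declared (or default) charset, fixing cp1252 bytes."""
    charset = (declared or DEFAULT_CHARSET).lower()
    text = raw.decode(charset)
    if charset == "iso-8859-1":
        # ISO-8859-1 has no printable characters at 0x80-0x9F, so bytes
        # there were most likely meant as cp1252.
        text = text.translate(C1_FIXUPS)
    return text
```

With this, the bytes 0x80 "1000" come out as a Euro sign followed by
"1000" even though the mail claims ISO-8859-1, while correctly labeled
mails are left untouched.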
> There are libraries that take care of this, like iconv, but I've never
> used those, and I'm not sure how good flex would deal with that. It
> might be necessary to look at another scanner generator or write one.
Yep.
> But let's move this to the bogofilter-dev list.
Well, I don't understand enough of C programming to really contribute
there. I hope that I can share some thoughts about mail standards and
MIME, though.
pi