Solutions for the charset issue

Matthias Andree matthias.andree at gmx.de
Wed Sep 25 11:38:52 CEST 2002


On Wed, 25 Sep 2002, Boris 'pi' Piwinger wrote:

> UTF-8 will also cause a headache.
> 
> So let's stick with the following charsets for now:
> ISO-8859-x, cp125x
> 
> cp1252 has some alphabetic characters at 128 (Euro sign), 138 (LATIN
> CAPITAL LETTER S WITH CARON), 140 (LATIN CAPITAL LIGATURE OE) etc.

Most of these are also in ISO-8859-15 -- though at different positions
than in Windows-1252.

> So the question is: Can somebody come up with a set of characters
> which are
> a) always not part of words and
> b) capture enough to separate words from punctuation and other words?

The real question is: do we want Unicode support? Is there a flex-like
tool that can deal with UTF-8? Would we need to handle UTF-16 instead?
Should we convert all input to a particular character set and take that
as canonical? If so, we'd need to parse MIME, which makes the whole
software somewhat slower. I fear that for COMPLETE i18n support, we'd
need to canonicalize things to a common character set, because that's
the only way we get the same token for €1000 in ISO-8859-15 and
Windows-1252 -- however, this will fail with foul apple or draughty
windows software that declares iso-8859-1 when windows-1252 is inside.

There are libraries that take care of this, such as iconv, but I've
never used them, and I'm not sure how well flex would deal with that.
It might be necessary to look at another scanner generator or to write
one.

But let's move this to the bogofilter-dev list.

For summary digest subscription: bogofilter-digest-subscribe at aotto.com


