Serious problem with non-ASCII words

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Fri Sep 20 19:30:31 CEST 2002


Matthias Andree <matthias.andree at gmx.de> wrote:

>> I don't speak C myself, but one thing should not happen. bogofilter
>> should not depend on the locales of the system. The mail server might
>> use anything completely unrelated to the mail. And spam comes in many
>> flavors and languages anyways.
>
>Bogofilter is always run explicitly, so you're into "env LC_CTYPE=de_AT
>bogofilter -p" if you wish that. No big deal. The drawback is that GNU
>flex is clueless.

Right. I could do that. But it only solves part of the
problem. Now I could change to some locale which is useful
for the languages of western Europe. But I receive spam from
all over the world, so I would have to find the right locale
before even starting bogofilter, which is already pretty
complicated. And even that fails if the mail has no declared
charset. So my idea would be: let's define what counts as a
word delimiter. Everything else would be considered part of
a word (there might be exceptions for header analysis, URLs,
hostnames etc.).

Clearly whitespace and line endings are word delimiters, and
so is punctuation. This assumes the charsets are compatible
with ASCII, though, but I don't see how we can do better.
How about hyphens?
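
To make that concrete, here is a rough sketch in C of what
such a byte-oriented tokenizer could look like. The delimiter
set (whitespace, control characters and most ASCII
punctuation, with '-' kept inside words) is only my guess for
illustration, not what bogofilter actually implements:

  #include <stdio.h>
  #include <string.h>

  /* Sketch of locale-independent tokenizing. Assumption: the
   * charset is ASCII-compatible, so bytes below 0x80 can be
   * classified directly and every byte >= 0x80 is simply kept
   * as part of the word. */
  static int is_delim(unsigned char c)
  {
      if (c >= 0x80)
          return 0;   /* non-ASCII byte: part of a word */
      if (c <= ' ')
          return 1;   /* whitespace and control characters */
      return strchr("!\"#$%&'()*+,./:;<=>?@[\\]^`{|}~", c) != NULL;
  }

  /* Print the tokens of one line; '-' and '_' stay in words. */
  static void tokenize(const char *line)
  {
      const char *p = line;

      while (*p) {
          while (*p && is_delim((unsigned char)*p))
              p++;    /* skip delimiters */
          const char *start = p;
          while (*p && !is_delim((unsigned char)*p))
              p++;    /* collect word bytes */
          if (p > start)
              printf("%.*s\n", (int)(p - start), start);
      }
  }

  int main(void)
  {
      /* prints: Click, HERE, v1agra, g<0xFC>nstig */
      tokenize("Click HERE, v1agra g\xFCnstig!!!");
      return 0;
  }

Since every byte >= 0x80 is passed through untouched, the same
rule works for Latin-1, KOI8-R or UTF-8 input without knowing
which one it is.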

>As you mentioned UTF-8 yourself: how should -- generally, not of
>programming languages -- the parser work? What should it consider a
>token?

The perfect solution would be to translate everything to
Unicode. But that would mean understanding every charset, and
again we don't know what to do with undeclared charsets.
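
Just to show what that route would involve: with a declared
charset one could convert the text to UTF-8 via iconv(3)
before tokenizing, roughly as below. This is only a sketch of
the mechanism (assuming a glibc-style iconv), not bogofilter
code, and it obviously does not help with mails whose charset
is missing or wrong:

  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  /* Convert a buffer from a declared charset to UTF-8.
   * Returns 0 on success, -1 on failure (unknown charset or
   * bytes invalid in the declared charset). */
  static int to_utf8(const char *charset,
                     char *in, size_t inlen,
                     char *out, size_t outlen)
  {
      iconv_t cd = iconv_open("UTF-8", charset);
      if (cd == (iconv_t)-1)
          return -1;              /* charset unknown to iconv */

      char *inp = in, *outp = out;
      size_t rc = iconv(cd, &inp, &inlen, &outp, &outlen);
      iconv_close(cd);

      if (rc == (size_t)-1)
          return -1;              /* invalid or truncated input */
      *outp = '\0';
      return 0;
  }

  int main(void)
  {
      char latin1[] = "h\xF6her";  /* 0xF6 is Latin-1 o-umlaut */
      char utf8[64];

      if (to_utf8("ISO-8859-1", latin1, strlen(latin1),
                  utf8, sizeof(utf8) - 1) == 0)
          printf("%s\n", utf8);
      return 0;
  }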

pi

For summary digest subscription: bogofilter-digest-subscribe at aotto.com


