Serious problem with non-ASCII words

Matthias Andree matthias.andree at gmx.de
Fri Sep 20 20:56:18 CEST 2002


On Fri, 20 Sep 2002, Boris 'pi' Piwinger wrote:

> Right. I could do that. But it only solves part of the
> problem. Now I could change to some locale which is useful
> for languages in western Europe. But I receive spam from all
> over the world, so I would have to find out what is the
> right one before starting bogofilter, which is already
> pretty complicated. But also that fails if the mail has no

Let me counter this: What human languages can you read besides German
and English? And what character sets are these in? I kill off anything
in Far Eastern character sets at the moment because the languages I
understand can all be written in ISO-8859-15.

OTOH, opening up to Unicode brings interesting new tasks: can you
tell a (printed) Cyrillic H from a Latin H or from a Greek Eta? I can't,
but Unicode can. This will haunt DNS if it goes Unicode. It will also
haunt bogofilter, because spammers could simply play mix'n'match with
the different character subsets that Unicode comprises.
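A minimal sketch of how such mix'n'match could be spotted, in Python. It
assumes that the first word of a character's Unicode name ("LATIN",
"CYRILLIC", "GREEK") is a good enough stand-in for its script, which is a
rough heuristic and not a real script lookup. The helper names scripts_in
and is_mixed_script are made up for this illustration and are not part of
bogofilter.

```python
import unicodedata

def scripts_in(word):
    """Collect a rough script label for every letter in word.

    Heuristic (assumption): the first word of the Unicode character
    name is treated as the script, e.g. 'LATIN', 'CYRILLIC', 'GREEK'.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(word):
    """True if the letters of word come from more than one script."""
    return len(scripts_in(word)) > 1

latin_h = "H"            # U+0048 LATIN CAPITAL LETTER H
cyrillic_en = "\u041d"   # U+041D CYRILLIC CAPITAL LETTER EN, looks like H
greek_eta = "\u0397"     # U+0397 GREEK CAPITAL LETTER ETA, looks like H

for ch in (latin_h, cyrillic_en, greek_eta):
    print(repr(ch), sorted(scripts_in(ch)))

# A word that looks like "CHEAP" but carries a Cyrillic EN in second place.
print(is_mixed_script("C\u041dEAP"))
```

A token flagged this way could be treated as suspicious in itself, but
whether that helps or hurts the statistics is an open question.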

> declared charset. So my idea would be: Let's find out what
> is a word delimiter. Everything else would be considered a
> word (there might be exceptions for header analysis and
> URLs, hostnames etc.).

The parenthesized part is the easy one: header analysis is done in the
US-ASCII domain, as long as bogofilter is limited to mail. Once it goes
for Usenet news, things will change, as Usenet is about to adopt UTF-8.

> Clearly whitespace and line ending are word delimiters. Also
> punctuation. This assumes we have charsets which are
> compatible with ASCII, though. But I don't see how we can do
> better. How about hyphens?

[:alnum:] would probably be the right thing to go for. I'm not sure how
far along iconv and friends are when it comes to canonicalizing the
character set.
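As a rough sketch of that idea in Python: canonicalize the bytes to
Unicode first, then treat every run of alphanumeric characters as a word.
Python's bytes.decode stands in for iconv here, the ISO-8859-15 fallback
for undeclared or broken charsets is an assumption, and to_unicode and
tokenize are hypothetical names, not bogofilter functions.

```python
import re

def to_unicode(raw, declared=None, fallback="iso-8859-15"):
    """Decode raw bytes, trying the declared charset, then US-ASCII,
    then an assumed fallback charset."""
    for charset in (declared, "us-ascii", fallback):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next.
            continue
    # Last resort: latin-1 maps every byte, so this cannot fail.
    return raw.decode("latin-1")

# [^\W_]+ matches runs of Unicode letters and digits, roughly what
# [:alnum:] gives in a suitable locale. Everything else is a delimiter.
TOKEN = re.compile(r"[^\W_]+")

def tokenize(text):
    """Split text into alphanumeric words."""
    return TOKEN.findall(text)

raw = "Größe für alle, E-Mail gratis!".encode("iso-8859-15")
print(tokenize(to_unicode(raw)))
```

Note that under this rule a hyphen is a delimiter, so "E-Mail" becomes
two tokens. Whether that is desirable is exactly the open question.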

> Perfect solution would be to translate everything to
> Unicode. But that would mean to understand all charsets and
> again we don't know what to do with undeclared charsets.

Undeclared means US-ASCII. However, you would rather not know how much
legitimate mail is sent without proper encoding or declaration. I tried
running Postfix with strict_7bit_headers = yes and was in deep trouble.
I had to switch it off after two weeks and replaced it with a
header_regexp that just warns. Most newsletter software is hosed and
will happily send umlauts in the header without RFC 2047 encoding.
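To illustrate the difference, here is a small Python sketch that flags a
header carrying raw 8-bit bytes and decodes a properly RFC 2047 encoded
one using the standard email.header module. The function names
has_raw_8bit and decode_rfc2047 are invented for this example, and it
only warns, in the same spirit as the header check described above.

```python
from email.header import decode_header

def has_raw_8bit(header_bytes):
    """True if the header line contains bytes outside US-ASCII."""
    return any(b > 0x7F for b in header_bytes)

def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value to text."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # No charset given means plain US-ASCII text.
            parts.append(chunk.decode(charset or "us-ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)

broken = "Subject: Grüße".encode("iso-8859-1")      # raw umlauts, no encoding
proper = b"Subject: =?iso-8859-1?q?Gr=FC=DFe?="     # RFC 2047 encoded word

for line in (broken, proper):
    if has_raw_8bit(line):
        print("warning: unencoded 8-bit header:", line)

print(decode_rfc2047("=?iso-8859-1?q?Gr=FC=DFe?="))
```

The sketch does not guard against an encoded word that names a charset
Python does not know. Real mail would need that case handled as well.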

-- 
Matthias Andree

For summary digest subscription: bogofilter-digest-subscribe at aotto.com


