Serious problem with non-ASCII words

Fri Sep 20 23:29:04 CEST 2002

Matthias Andree <matthias.andree at gmx.de> wrote:

>> Right. I could do that. But it only solves part of the
>> problem. Now I could change to some locale which is useful
>> for languages in western Europe. But I receive spam from all
>> over the world, so I would have to find out what is the
>> right one before starting bogofilter, which is already
>> pretty complicated. But also that fails if the mail has no
>
>Let me counter this: What human languages can you read besides German
>and English? And what character sets are these in? 

Well, I can read some Russian and know the Greek alphabet.
Anyway, there is spam in these alphabets. It should be
caught, so the alphabets need to be handled.

>I kill off anything
>in far-east character sets at the moment because the set of languages I
>understand can completely be written in ISO-8859-15.

Yes, I kill the completely unreadable ones before I start
bogofilter. They are a huge part of my spam. But there are
too many languages for which it is not that clear-cut.

>OTOH, opening for Unicode makes up for interesting new tasks: can you
>tell a (printed) Cyrillic H from a Latin H or from a Greek Eta? I can't,
>but Unicode can. 

So bogofilter can.

>This will haunt DNS if it goes Unicode. It will also
>haunt bogofilter, because spammers could simply play mix'n'match with
>the different character subsets that Unicode comprises.

Which will in turn generate words that only show up in spam,
and that puts us in a winning position. But that can already
happen now.
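
To illustrate the point about look-alike letters: a rough sketch in
Python (not bogofilter code; the helper names script_of and scripts_in
are made up for this example). It assumes the text has already been
decoded to Unicode, and it uses the first word of a character's
Unicode name as a crude stand-in for its script.

```python
import unicodedata

def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def scripts_in(word):
    """Set of rough script labels for the letters in a word."""
    return {script_of(ch) for ch in word if ch.isalpha()}

# Three letters that print alike but are distinct code points.
for ch in ("H", "\u041d", "\u0397"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

latin = "Hello"
mixed = "\u041dello"  # Cyrillic EN followed by Latin "ello"

print(latin == mixed)             # False: they are different tokens
print(sorted(scripts_in(mixed)))  # ['CYRILLIC', 'LATIN']
```

A mix'n'match word is simply a different token from the plain Latin
one, so a word-counting filter would collect statistics for it
separately.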

>> declared charset. So my idea would be: Let's find out what
>> is a word delimiter. Everything else would be considered a
>> word (there might be exceptions for header analysis and
>> URLs, hostnames etc.).
>
>The parenthesized part is the easy one: header analysis is done in the
>US-ASCII domain, as long as bogofilter is limited to mail. 

Well, MIME encoded-words (RFC 2047) put non-ASCII text into
headers, too.
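
As an illustration of what that means for header analysis, here is a
small Python sketch that turns a header value with encoded-words back
into text. It relies on decode_header from Python's standard
email.header module; the wrapper decode_mime_words and its fallback
for unknown charsets are assumptions of this example, not anything
bogofilter does.

```python
from email.header import decode_header

def decode_mime_words(value):
    """Decode RFC 2047 encoded-words in a header value to a str."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            try:
                # Undeclared parts are treated as US-ASCII.
                parts.append(data.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset name: keep the bytes visible as Latin-1.
                parts.append(data.decode("latin-1"))
        else:
            parts.append(data)
    return "".join(parts)

print(decode_mime_words("=?iso-8859-1?q?p=F6stal?="))
```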

>Once it goes
>for Usenet news, things will change, as Usenet is about to usurp UTF-8.

After all, Usenet is based on mail. I don't see a difference.

>> Clearly whitespace and line ending are word delimiters. Also
>> punctuation. This assumes we have charsets which are
>> compatible with ASCII, though. But I don't see how we can do
>> better. How about hyphens?
>
>[:alnum:] would probably be the right thing to go for. I'm not sure how
>far iconv and things are, to canonicalize the character set.

If this is a safe class over all charsets, fine. Otherwise
we should be explicit about the character ranges.
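
Being explicit about the ranges could look roughly like the following
Python sketch. It works on raw bytes, without knowing the charset:
ASCII letters and digits plus every byte from 0x80 upwards count as
word characters, everything else is a delimiter. The name tokenize is
hypothetical, and treating the hyphen as a delimiter is just one
possible answer to the open question above.

```python
import re

# Word characters: ASCII letters and digits, plus any byte >= 0x80.
# Whitespace, line endings and ASCII punctuation (including the
# hyphen) fall outside the class and therefore act as delimiters.
WORD = re.compile(rb"[0-9A-Za-z\x80-\xff]+")

def tokenize(raw):
    """Split raw message bytes into word tokens."""
    return WORD.findall(raw)

print(tokenize("Gr\u00fc\u00dfe, Welt!".encode("iso-8859-1")))
print(tokenize("\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440".encode("utf-8")))
```

This only works for charsets that are compatible with ASCII, and it
also swallows 8-bit punctuation such as the no-break space (0xA0 in
ISO-8859-1) into words.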

>> The perfect solution would be to translate everything to
>> Unicode. But that would mean understanding all charsets, and
>> again we don't know what to do with undeclared charsets.
>
>Undeclared means US-ASCII. However, you'd rather not want to know how
>much legitimate mail is sent without proper encoding or declaration.

That is the point. So much is broken we have to take care of
that.
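
One way to take care of it, sketched in Python under assumptions of my
own: try the declared charset first, then UTF-8, and fall back to
ISO-8859-15, which maps every byte to some character and so cannot
fail. The function name decode_body and the order of the fallbacks are
choices made for this example only; a different fallback order may
suit other mail better.

```python
def decode_body(raw, declared=None):
    """Decode message bytes, tolerating missing or wrong declarations."""
    for charset in (declared, "utf-8"):
        if charset is None:
            continue  # nothing was declared
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset name, or bytes that do not fit it
    # Last resort: every byte value is defined in ISO-8859-15.
    return raw.decode("iso-8859-15")

print(decode_body(b"Gr\xfc\xdfe"))                # undeclared 8-bit text
print(decode_body(b"\xa4 100", "us-ascii"))       # wrong declaration
print(decode_body(b"abc", "no-such-charset"))     # bogus charset name
```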

pi

For summary digest subscription: bogofilter-digest-subscribe at aotto.com