Serious problem with non-ASCII words
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Fri Sep 20 23:29:04 CEST 2002
Matthias Andree <matthias.andree at gmx.de> wrote:
>> Right. I could do that. But it only solves part of the
>> problem. Now I could change to some locale which is useful
>> for languages in western Europe. But I receive spam from all
>> over the world, so I would have to find out what is the
>> right one before starting bogofilter, which is already
>> pretty complicated. But also that fails if the mail has no
>
>Let me counter this: What human languages can you read besides German
>and English? And what character sets are these in?
Well, I can read some Russian and know the Greek alphabet.
Anyway, there is spam in those as well. It should be caught, so
these alphabets need to be handled.
>I kill off anything
>in far-east character sets at the moment because the set of languages I
>understand can completely be written in ISO-8859-15.
Yes, I kill the completely unreadable ones before I start
bogofilter. They are a huge part of my spam. But there are too
many languages for which it is not that clear.
>OTOH, opening for Unicode makes up for interesting new tasks: can you
>tell a (printed) Cyrillic H from a Latin H or from a Greek Eta? I can't,
>but Unicode can.
So bogofilter can.
>This will haunt DNS if it goes Unicode. It will also
>haunt bogofilter, because spammers could simply play mix'n'match with
>the different character subsets that Unicode comprises.
That will in turn generate words which only show up in spam,
and then we are in a winning position. But that can already
happen now.
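A minimal Python sketch of this point, not bogofilter code: the three
look-alike letters are distinct code points, so a mixed-script word is
simply a different token from its plain Latin twin. The helper name
scripts_in and the "first word of the Unicode character name" heuristic
are assumptions made for illustration only.

```python
import unicodedata

def scripts_in(word):
    """Return the scripts of the letters in `word`.

    Heuristic: the first word of a character's Unicode name
    ("LATIN", "CYRILLIC", "GREEK", ...) is taken as its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }

latin_h = "H"            # U+0048 LATIN CAPITAL LETTER H
cyrillic_en = "\u041d"   # U+041D CYRILLIC CAPITAL LETTER EN
greek_eta = "\u0397"     # U+0397 GREEK CAPITAL LETTER ETA

# A "mix'n'match" word: Latin letters with a Cyrillic EN in place of H.
plain = "CHEAP"
mixed = "C" + cyrillic_en + "EAP"

print(sorted(scripts_in(plain)))
print(sorted(scripts_in(mixed)))
print(plain == mixed)
```

A filter that counts tokens by code point never confuses the two
spellings, so the mixed one can pick up its own spam score.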
>> declared charset. So my idea would be: Let's find out what
>> is a word delimiter. Everything else would be considered a
>> word (there might be exceptions for header analysis and
>> URLs, hostnames etc.).
>
>The parenthesized part is the easy one: header analysis is done in the
>US-ASCII domain, as long as bogofilter is limited to mail.
Well, MIME encoded-words can show up there.
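As a rough illustration of why headers are not purely US-ASCII in
practice, here is a sketch that decodes encoded-words with Python's
standard email.header module. The function name decode_mime_words and
the fallback choices are assumptions, not anything bogofilter does.

```python
from email.header import decode_header

def decode_mime_words(raw_header):
    """Decode RFC 2047 encoded-words in a header value to text."""
    parts = []
    for data, charset in decode_header(raw_header):
        if isinstance(data, bytes):
            try:
                # Unlabelled byte runs are treated as US-ASCII.
                parts.append(data.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: fall back to a byte-preserving decode.
                parts.append(data.decode("latin-1"))
        else:
            # Headers without any encoded-word come back as str.
            parts.append(data)
    return "".join(parts)

print(decode_mime_words("=?iso-8859-1?q?Gr=FC=DFe?="))
```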
>Once it goes
>for Usenet news, things will change, as Usenet is about to usurp UTF-8.
After all, Usenet is based on mail. I don't see a difference.
>> Clearly whitespace and line ending are word delimiters. Also
>> punctuation. This assumes we have charsets which are
>> compatible with ASCII, though. But I don't see how we can do
>> better. How about hyphens?
>
>[:alnum:] would probably be the right thing to go for. I'm not sure how
>far iconv and things are, to canonicalize the character set.
If this is a safe class over all charsets, fine. Else we
should be explicit about the character ranges.
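Being explicit could look like the following sketch: ASCII whitespace
and ASCII punctuation delimit words, and every other byte, including
everything from 0x80 upwards, counts as part of a word. It works on raw
bytes and therefore assumes an ASCII-compatible charset, as said above.
The function name tokenize and the keep_hyphen switch are made up for
this example; whether hyphens should split words is left open.

```python
import re
import string

def tokenize(raw, keep_hyphen=True):
    """Split raw message bytes into words at ASCII delimiters only."""
    punct = string.punctuation
    if keep_hyphen:
        # Treat "-" as a word character instead of a delimiter.
        punct = punct.replace("-", "")
    delims = (string.whitespace + punct).encode("ascii")
    # One or more delimiter bytes separate two words.
    pattern = re.compile(b"[" + re.escape(delims) + b"]+")
    return [tok for tok in pattern.split(raw) if tok]

sample = b"Gr\xfc\xdfe, Welt! buy-now"   # ISO-8859-1 bytes
print(tokenize(sample))
print(tokenize(sample, keep_hyphen=False))
```

No locale is consulted, so the result does not depend on guessing the
right charset before the filter starts.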
>> The perfect solution would be to translate everything to
>> Unicode. But that would mean understanding all charsets, and
>> again we don't know what to do with undeclared charsets.
>
>Undeclared means US-ASCII. However, you'd rather not want to know how
>much legitimate mail is sent without proper encoding or declaration.
That is the point. So much is broken that we have to take care
of it.
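One way to take care of it, sketched under loud assumptions: try the
declared charset if there is one, then UTF-8, then a western European
default. The order of guesses, the ISO-8859-15 default and the name
decode_body are all hypothetical choices for illustration, not a
description of bogofilter.

```python
def decode_body(raw, declared=None, fallback="iso-8859-15"):
    """Decode message bytes, tolerating a missing or wrong charset label.

    Returns a (text, charset_used) pair.
    """
    candidates = [c for c in (declared, "utf-8") if c] + [fallback]
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (LookupError, UnicodeDecodeError):
            # Unknown label or bytes that do not fit: try the next guess.
            continue
    # latin-1 maps every byte to a character, so this cannot fail.
    return raw.decode("latin-1"), "latin-1"

print(decode_body(b"Gr\xfc\xdfe"))
```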
pi