charsets [was: decoding implementation]

David Relson relson at osagesoftware.com
Sun Nov 24 23:33:14 CET 2002


At 05:19 PM 11/24/02, Clint Adams wrote:

> > One other thing I can imagine though: How the heck can we treat Greek
> > omikron, Latin o (oh) and Cyrillic o the same? Three different
>
>I don't think that we should.  The difference is significant.
>Except in a message like this, I don't think anyone from whom I want to
>receive mail is going to spell 'zoo' as zοо, zоo, or zoο.  Barring some
>accident, they're going to spell it 'zoo'.  Assuming this is true, I'd
>want 'zoo' to be a non-spam token, and the Greco-Russian spellings to
>be spam tokens.
>
>The same is true of high-bit apostrophes and such; I'd want the lexer to
>differentiate between them on all platforms, since I doubt that anyone
>will send me legitimate mail containing them.

Clint,

This discussion seems to be heading off in a new, interesting direction - 
character sets!

The problem that Allyn encountered is that hp-ux interprets the character 
set differently than linux does.  This caused different token parsing, 
which led to a different set of tokens being extracted from the message, 
which in turn led to a different spamicity being calculated.  The differing 
result caused the test framework to raise a red flag, known as "test failed".
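To make the failure mode concrete, here is a minimal sketch (not 
bogofilter's actual lexer) of how the same high-bit byte can be classified 
differently depending on which locale a platform supplies, which in turn 
shifts token boundaries.  The locale names are assumptions; hp-ux and linux 
spell them differently.

/* Sketch: locale-dependent classification of one ISO-8859-1 byte. */
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    unsigned char c = 0xE9;   /* 'e' with acute accent in ISO-8859-1 */

    setlocale(LC_CTYPE, "C");
    printf("C locale:       isalpha(0x%02X) = %d\n", c, isalpha(c) != 0);

    /* Locale name is an assumption and varies by platform. */
    if (setlocale(LC_CTYPE, "en_US.ISO-8859-1") != NULL)
        printf("8859-1 locale:  isalpha(0x%02X) = %d\n", c, isalpha(c) != 0);

    return 0;
}

If one platform says the byte is alphabetic and another says it is not, the 
lexer splits tokens at different places, and the two runs score the message 
differently.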

Obviously we can modify the test so it passes.  We can remove any 
"controversial" characters from the test cases.  Test messages with just 
a-z, A-Z, ... are "safer" and less "interesting".

We can modify bogofilter so that it's not affected by hp-ux's 
interpretation of the character set.  By converting to a "lowest common" 
character set, we avoid the interpretation problem.  As you point out, 
there are issues with this approach.
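For illustration, a hedged sketch of the "lowest common" idea: fold every 
high-bit byte to a fixed placeholder before the lexer runs, so every 
platform sees the same byte stream.  The helper name and the choice of '?' 
as placeholder are mine, not bogofilter code.

#include <stddef.h>
#include <stdio.h>

/* Replace any byte with the high bit set by a single placeholder. */
static void fold_to_ascii(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            buf[i] = '?';
    }
}

int main(void)
{
    unsigned char msg[] = "na\xEFve z\xF6o";   /* ISO-8859-1 bytes */
    fold_to_ascii(msg, sizeof msg - 1);
    printf("%s\n", msg);                       /* prints: na?ve z?o */
    return 0;
}

The cost, as you note, is that distinct characters (Greek, Cyrillic, 
high-bit apostrophes) collapse into the same token and can no longer be 
told apart.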

What do you suggest be done?

David
