charsets [was: decoding implementation]

David Relson relson at osagesoftware.com
Sun Nov 24 23:33:14 CET 2002


At 05:19 PM 11/24/02, Clint Adams wrote:

> > One other thing I can imagine though: How the heck can we treat Greek
> > omikron, Latin o (oh) and Cyrillic o the same? Three different
>
>I don't think that we should.  The difference is significant.
>Except in a message like this, I don't think anyone from whom I want to
>receive mail is going to spell 'zoo' as zοо, zоo, or zoο.  Barring some
>accident, they're going to spell it 'zoo'.  Assuming this is true, I'd
>want 'zoo' to be a non-spam token, and the Greco-Russian spellings to
>be spam tokens.
>
>The same is true of high-bit apostrophes and such; I'd want the lexer to
>differentiate between them on all platforms, since I doubt that anyone
>will send me legitimate mail containing them.

Clint,

This discussion seems to be heading off in a new, interesting direction - 
character sets!

The problem that Allyn encountered is that hp-ux interprets the character 
set differently than linux does.  This caused different token parsing, 
which led to a different set of tokens being extracted from the message, 
which in turn led to a different spamicity being calculated.  The differing 
result caused the test framework to raise a red flag, known as "test failed".
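To make the failure mode concrete, here is a minimal sketch (not 
bogofilter's actual lexer) of how the same high-bit byte can be classified 
differently depending on which locale a platform supplies, which in turn 
shifts token boundaries.  The locale names are assumptions; hp-ux and linux 
spell them differently.

/* Sketch: locale-dependent classification of one ISO-8859-1 byte. */
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    unsigned char c = 0xE9;   /* 'e' with acute accent in ISO-8859-1 */

    setlocale(LC_CTYPE, "C");
    printf("C locale:       isalpha(0x%02X) = %d\n", c, isalpha(c) != 0);

    /* Locale name is an assumption and varies by platform. */
    if (setlocale(LC_CTYPE, "en_US.ISO-8859-1") != NULL)
        printf("8859-1 locale:  isalpha(0x%02X) = %d\n", c, isalpha(c) != 0);

    return 0;
}

If one platform says the byte is alphabetic and another says it is not, the 
lexer splits tokens at different places, and the two runs score the message 
differently.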

Obviously we can modify the test so it passes.  We can remove any 
"controversial" characters from the test cases.  Test messages with just 
a-z, A-Z, ... are "safer" and less "interesting".

We can modify bogofilter so that it's not affected by hp-ux's 
interpretation of the character set.  By converting to a "lowest common" 
character set, we avoid the interpretation problem.  As you point out, 
there are issues with this approach.
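For illustration, a hedged sketch of the "lowest common" idea: fold every 
high-bit byte to a fixed placeholder before the lexer runs, so every 
platform sees the same byte stream.  The helper name and the choice of '?' 
as placeholder are mine, not bogofilter code.

#include <stddef.h>
#include <stdio.h>

/* Replace any byte with the high bit set by a single placeholder. */
static void fold_to_ascii(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            buf[i] = '?';
    }
}

int main(void)
{
    unsigned char msg[] = "na\xEFve z\xF6o";   /* ISO-8859-1 bytes */
    fold_to_ascii(msg, sizeof msg - 1);
    printf("%s\n", msg);                       /* prints: na?ve z?o */
    return 0;
}

The cost, as you note, is that distinct characters (Greek, Cyrillic, 
high-bit apostrophes) collapse into the same token and can no longer be 
told apart.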

What do you suggest be done?

David
