unicode

John xd890cc2b41c31d74 at f4n.org
Tue Jun 28 12:32:38 CEST 2005


On Wed, Jun 22, 2005 at 11:02:56 +0200, Matthias Andree wrote:
> Pragmatic solution:
> 
> 1. Unicode encodings are unique by definition (discounting homographs,
>    i. e. Cyrillic A, Greek A and Latin A)
> 
> 2. if iconv() violates the 
> 
> In the long run, spam may try to use homograph attacks to evade filters,
> in which case a separate filter could discard the mail if alphabets are
> mixed in the same word.
(Point 2 is truncated?)

Shouldn't bogofilter apply Unicode normalization to the tokens
before comparing/storing them? For example, "\x61\xcc\x81" and
"\xc3\xa1" (using C-style escapes) are both valid UTF-8 encoded
strings, and both correspond to "á": the first is "a" followed by
U+0301 (combining acute accent), the second is the precomposed U+00E1.
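
To make the example concrete, here is a minimal sketch in Python (not
bogofilter code; normalize_token is a hypothetical helper name, and
choosing NFC rather than NFD is my assumption). It shows that the two
byte strings differ as decoded strings but compare equal once normalized:

```python
import unicodedata

# The two byte strings from the example, decoded from UTF-8.
decomposed = b"\x61\xcc\x81".decode("utf-8")   # "a" + U+0301
precomposed = b"\xc3\xa1".decode("utf-8")      # U+00E1


def normalize_token(token: str) -> str:
    """Hypothetical helper: bring a token into NFC before lookup/storage."""
    return unicodedata.normalize("NFC", token)


# Without normalization the two spellings are different tokens.
raw_equal = decomposed == precomposed
# After normalization they are the same token.
normalized_equal = normalize_token(decomposed) == normalize_token(precomposed)
```

Either normalization form would do for comparison purposes, as long as
the same form is used when storing and when looking up.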

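Similarly, normalization would put several combining marks on the same
base character into a canonical order (at least for marks of different
combining classes), so that different input orders compare equal.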
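
A small Python sketch of the reordering, under the assumption that the
two marks have different canonical combining classes (here U+0323, dot
below, class 220, and U+0301, acute, class 230). The variable names and
the canonical() helper are made up for illustration:

```python
import unicodedata

# Same base letter, same two marks, typed in opposite orders.
dot_then_acute = "a\u0323\u0301"
acute_then_dot = "a\u0301\u0323"


def canonical(token: str) -> str:
    """Hypothetical helper: decompose and apply canonical ordering (NFD)."""
    return unicodedata.normalize("NFD", token)


# The raw strings differ, the normalized ones do not.
raw_equal = dot_then_acute == acute_then_dot
normalized_equal = canonical(dot_then_acute) == canonical(acute_then_dot)
```

Marks with the same combining class are not reordered, so this only
removes the differences that Unicode itself considers insignificant.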

(This has nothing to do with mapping "spâmmy" --> "spammy" and so on,
although I'm guessing that kind of obfuscation will become more of a
problem once spammers use less visible combining marks. The "more than
one alphabet" idea is great.)
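
For what it's worth, the mixed-alphabet check could look roughly like
this in Python. This is only a sketch: Python's standard library does
not expose the Unicode Script property, so I use the first word of the
character name ("LATIN", "CYRILLIC", "GREEK", ...) as a crude stand-in,
and both function names are hypothetical.

```python
import unicodedata


def scripts_in_word(word: str) -> set:
    """Collect a rough script label for every letter in the word.

    Assumption: the first word of the Unicode character name is a good
    enough approximation of the script for Latin, Cyrillic and Greek.
    """
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ", 1)[0])
    return scripts


def mixes_alphabets(word: str) -> bool:
    """True if the letters of the word come from more than one script."""
    return len(scripts_in_word(word)) > 1
```

With this, "vi\u0430gra" (Cyrillic "а" in second-to-third position) is
flagged, while plain "viagra" and accented "spâmmy" are not, since "â"
is still a Latin letter.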

Sorry if this is already taken care of, and in any case sorry for the
late reply.


