unicode
John
xd890cc2b41c31d74 at f4n.org
Tue Jun 28 12:32:38 CEST 2005
On Wed, Jun 22, 2005 at 11:02:56 +0200, Matthias Andree wrote:
> Pragmatic solution:
>
> 1. Unicode encodings are unique by definition (discounting homographs,
> i. e. Cyrillic A, Greek A and Latin A)
>
> 2. if iconv() violates the
>
> In the long run, spam may try to use homograph attacks to evade filters,
> in which case a separate filter could discard the mail if alphabets are
> mixed in the same word.
(Point 2 is truncated?)
Shouldn't bogofilter apply Unicode normalization to the tokens
before comparing or storing them? For example, "\x61\xcc\x81" ("a"
followed by U+0301 COMBINING ACUTE ACCENT) and "\xc3\xa1" (the
precomposed U+00E1), written with C-style escapes, are both valid
UTF-8 encoded strings, and both render as "á".
Similarly, normalization would put several combining marks on the
same base character into a canonical order.
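
To make the point concrete, here is a minimal sketch in Python (not
bogofilter code; bogofilter itself is written in C and would need a
library such as ICU or its own tables for this). The helper name
normalize_token is hypothetical, and the choice of NFC rather than NFD
or NFKC is an assumption; any single form applied consistently would
make the two byte strings compare equal.

```python
import unicodedata


def normalize_token(token: bytes) -> bytes:
    """Decode a UTF-8 token, apply NFC, and re-encode it as UTF-8.

    Hypothetical helper for illustration only; NFC is an assumed choice.
    """
    text = token.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


decomposed = b"\x61\xcc\x81"   # "a" + U+0301 COMBINING ACUTE ACCENT
precomposed = b"\xc3\xa1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# As raw bytes the two tokens differ, so they would be counted separately.
print(decomposed == precomposed)                                    # False
# After normalization they are the same token.
print(normalize_token(decomposed) == normalize_token(precomposed))  # True

# Two combining marks typed in different orders:
# U+0323 COMBINING DOT BELOW and U+0301 COMBINING ACUTE ACCENT.
order_1 = "a\u0323\u0301"
order_2 = "a\u0301\u0323"
same_after_nfc = (
    unicodedata.normalize("NFC", order_1)
    == unicodedata.normalize("NFC", order_2)
)
print(same_after_nfc)                                               # True
```

Reordering only applies to marks with different combining classes, as
in the dot-below/acute pair here; two marks of the same class keep the
order in which they were typed.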
(This is a separate issue from mapping "spâmmy" --> "spammy" and so
on, although I'm guessing that will become more of a problem once
spammers start using less visible combining marks. The "more than one
alphabet" idea is great.)
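
A rough sketch of what such a mixed-alphabet check could look like,
again in Python rather than C. The function names are hypothetical, and
taking the first word of a character's Unicode name as its "alphabet"
is only a heuristic I am assuming here, since Python's standard library
does not expose the Unicode Script property.

```python
import unicodedata


def scripts_in_word(word: str) -> set:
    """Return the set of alphabets used by the letters of a word.

    Heuristic (an assumption, not the real Script property): the first
    word of the character name, e.g. "LATIN", "CYRILLIC", "GREEK".
    """
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(word: str) -> bool:
    """True if the letters of the word come from more than one alphabet."""
    return len(scripts_in_word(word)) > 1


print(is_mixed_script("paypal"))        # False: all Latin
print(is_mixed_script("p\u0430ypal"))   # True: U+0430 is CYRILLIC SMALL LETTER A
```

A real filter would need exceptions: with this heuristic, ordinary
Japanese words that combine kanji and hiragana would also be flagged as
mixed.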
Sorry if this is already taken care of, and in any case sorry for the
late reply.