unicode

Wed Jun 29 23:40:26 CEST 2005

John <xd890cc2b41c31d74 at f4n.org> writes:

> On Wed, Jun 22, 2005 at 11:02:56 +0200, Matthias Andree wrote:
>> Pragmatic solution:
>> 
>> 1. Unicode encodings are unique by definition (discounting homographs,
>>    i.e. Cyrillic A, Greek A and Latin A)
>> 
>> 2. if iconv() violates the 
>> 
>> In the long run, spam may try to use homograph attacks to evade filters,
>> in which case a separate filter could discard the mail if alphabets are
>> mixed in the same word.
> (Point 2 is truncated?)
>
> Shouldn't bogofilter normalize (Unicode Normalization) the tokens
> before comparing/storing them? For example, "\x61\xcc\x81" and
> "\xc3\xa1" (using C-style escapes) are both valid UTF-8 encoded
> strings, corresponding to "á".

It should. Do you happen to know a good and GPL-compatible library that
can do this job for us?

What normalization form would you suggest? Any objections to NFKC?
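
To make the question concrete, here is a minimal sketch in Python of
what normalizing a token before comparing or storing it would do to the
two byte sequences from the example above. It is an illustration only,
not bogofilter code: bogofilter is written in C and would need a C
library for this. The helper name normalize_token is hypothetical, and
defaulting to NFC is an assumption made for the demonstration, not a
decision.

```python
import unicodedata


def normalize_token(raw: bytes, form: str = "NFC") -> bytes:
    """Decode a UTF-8 token, apply a Unicode normalization form, re-encode.

    Hypothetical helper for illustration; `form` is one of the names
    accepted by unicodedata.normalize: "NFC", "NFD", "NFKC", "NFKD".
    """
    text = raw.decode("utf-8")
    return unicodedata.normalize(form, text).encode("utf-8")


decomposed = b"\x61\xcc\x81"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = b"\xc3\xa1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Without normalization the two spellings are different byte strings,
# so they would be counted as two different tokens.
assert decomposed != precomposed

# After normalization both spellings map to the same token.
assert normalize_token(decomposed) == normalize_token(precomposed)

# The compatibility forms go further: NFKC folds the "fi" ligature
# (U+FB01) into the two letters "fi", while NFC leaves it untouched.
ligature = "\ufb01".encode("utf-8")
assert normalize_token(ligature, "NFKC") == b"fi"
assert normalize_token(ligature, "NFC") == ligature
```

The last two assertions show the trade-off between the forms: a
compatibility form such as NFKC merges more look-alike spellings into
one token, at the price of discarding distinctions that the canonical
forms NFC and NFD preserve.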

-- 
Matthias Andree