matthias.andree at gmx.de
Wed Jun 29 17:40:26 EDT 2005
John <xd890cc2b41c31d74 at f4n.org> writes:
> On Wed, Jun 22, 2005 at 11:02:56 +0200, Matthias Andree wrote:
>> Pragmatic solution:
>> 1. Unicode encodings are unique by definition (discounting homographs,
>> i.e. Cyrillic A, Greek A and Latin A)
>> 2. if iconv() violates the
>> In the long run, spam may try to use homograph attacks to evade filters,
>> in which case a separate filter could discard the mail if alphabets are
>> mixed in the same word.
> (Point 2 is truncated?)
> Shouldn't bogofilter normalize (Unicode Normalization) the tokens
> before comparing/storing them? For example, "\x61\xcc\x81" and
> "\xc3\xa1" (using C-style escapes) are both valid UTF-8 encoded
> strings, corresponding to "á".
It should. Do you happen to know a good and GPL-compatible library that
can do this job for us?
What normalization form would you suggest? Any objections to NFKC?
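For illustration only (bogofilter itself is written in C, so this is not a
candidate library, just a demonstration of the effect): the equivalence in
the question above can be checked with Python's unicodedata module, which
implements the standard Unicode normalization forms.

```python
import unicodedata

# The two byte sequences from the example above, decoded as UTF-8:
decomposed = b"\x61\xcc\x81".decode("utf-8")   # 'a' + U+0301 combining acute
precomposed = b"\xc3\xa1".decode("utf-8")      # U+00E1, precomposed 'a-acute'

# As raw strings they differ, so a naive token comparison treats them
# as two different tokens...
assert decomposed != precomposed

# ...but after normalization (NFC here; NFKC behaves the same for this
# pair) they compare equal:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFKC", decomposed) == \
       unicodedata.normalize("NFKC", precomposed)
```

NFKC additionally folds compatibility characters (fullwidth letters,
ligatures, etc.), which is usually what a spam filter wants, at the cost
of not being round-trippable.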
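On the mixed-alphabet filter mentioned in the quoted mail: a crude sketch
of the idea (in Python, purely illustrative; the `scripts` helper is
hypothetical, and a real implementation would consult the Unicode Scripts
property rather than parsing character names).

```python
import unicodedata

def scripts(word):
    # Crude heuristic: infer the script from the character's Unicode
    # name prefix. Only covers the three scripts from the example.
    found = set()
    for ch in word:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC", "GREEK"):
            if name.startswith(script):
                found.add(script)
    return found

# "paypal" spelled with a Cyrillic U+0430 instead of Latin 'a'
# mixes two scripts in one word -- a homograph-attack indicator:
assert scripts("p\u0430ypal") == {"LATIN", "CYRILLIC"}
assert scripts("paypal") == {"LATIN"}
```

A filter along these lines could flag or discard tokens whose characters
span more than one script.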