unicode

Wed Jun 22 11:02:56 CEST 2005

David Relson <relson at osagesoftware.com> writes:

> On Tue, 21 Jun 2005 08:17:34 +0200 (CEST)
> Boris 'pi' Piwinger wrote:
>
>> David Relson said:
>> 
>> > This release provides unicode support for new and converted wordlists.
>> 
>> That sounds really interesting. So is my understanding correct, that
>> this will make sure that the same word (e.g. Österreich) will show
>> up in the database always the same, no matter if it was encoded in
>> ISO-8859-1, ISO-8859-15, UTF-8 or any other charset? Presumably,
>> this will increase accuracy. Has anybody tested that already?
>
> Hi pi,
>
> The iconv() function does the work of converting from one charset to
> another.  Testing of the type you mention needs to be done.  The
> messages available to me for testing are primarily ISO-8859-1,
> Windows-125x, and a few others.  I lack test cases for verifying that
> iconv() generates the same UTF-8.

Pragmatic solution:

1. Unicode encodings are unique by definition (discounting homographs,
   e.g. Cyrillic A, Greek A and Latin A)

2. if iconv() violates the 
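
Point 1 can be spot-checked for a given word. The following is only a
rough sketch: it uses Python's built-in codecs rather than iconv(), and
the names WORD, CHARSETS, to_utf8 and all_agree are made up for this
example. It encodes one word in several charsets, converts each byte
string to UTF-8, and checks that the results are identical.

```python
# Sketch only: Python codecs stand in for iconv(); all names are hypothetical.
WORD = "\u00d6sterreich"  # "Österreich"
CHARSETS = ["iso-8859-1", "iso-8859-15", "cp1252", "utf-8"]


def to_utf8(raw, charset):
    """Convert bytes in the declared charset to UTF-8 bytes."""
    return raw.decode(charset).encode("utf-8")


def all_agree(word, charsets):
    """True if every charset yields the same UTF-8 bytes for word."""
    results = {to_utf8(word.encode(cs), cs) for cs in charsets}
    return len(results) == 1


if __name__ == "__main__":
    print(all_agree(WORD, CHARSETS))
```

This only holds when the declared charset is correct. Byte 0xA4, for
instance, is the currency sign in ISO-8859-1 but the euro sign in
ISO-8859-15, so a mislabeled message still yields different tokens.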

In the long run, spam may try to use homograph attacks to evade filters,
in which case a separate filter could discard the mail if alphabets are
mixed in the same word.
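
Such a mixed-alphabet check could look roughly like this. It is a
heuristic sketch, not anything bogofilter does: it takes the first word
of each letter's Unicode character name (LATIN, CYRILLIC, GREEK, ...) as
a stand-in for the script, and the function names scripts_in_word and
is_mixed_script are invented for the example.

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of script guesses for the letters in word.

    Assumption: the first word of the Unicode character name
    approximates the script. This is a heuristic, not a real
    script property lookup.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(word):
    """True if the letters of word appear to come from several alphabets."""
    return len(scripts_in_word(word)) > 1


if __name__ == "__main__":
    # Latin "p", Cyrillic "a" (U+0430), then Latin "ypal"
    print(is_mixed_script("p\u0430ypal"))
    print(is_mixed_script("\u00d6sterreich"))
```

Note that ordinary Japanese text mixes kanji, hiragana and katakana
within what a lexer may treat as one token, so a real filter would need
to exempt such combinations rather than discard every mixed token.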

OTOH, I'm seeing more and more spam (usually by illegal pharmacies) use
GIF images to convey the message,

> Again, the answer is "insufficient information and test cases".  We
> know that the lexer has never followed the "rules" of Asian grammar.

In some cases, the lexer will be unable to tell words apart. Japanese
names are written family name first, without a space. Where we would
write "Naohiro Takahara", they spell it - in Japanese "letters" -
takaharanaohiro. I'm not sure whether that affects text other than
names.

-- 
Matthias Andree