unicode [was: bogofilter-0.95.0 - New Current Release]

Tue Jun 21 13:17:44 CEST 2005

On Tue, 21 Jun 2005 08:17:34 +0200 (CEST)
Boris 'pi' Piwinger wrote:

> David Relson said:
> 
> > This release provides unicode support for new and converted wordlists.
> 
> That sounds really interesting. So is my understanding correct, that
> this will make sure that the same word (e.g. Österreich) will show
> up in the database always the same, no matter if it was encoded in
> ISO-8859-1, ISO-8859-15, UTF-8 or any other charset? Presumably,
> this will increase accuracy. Has anybody tested that already?

Hi pi,

The iconv() function does the work of converting from one charset to
another.  Testing of the type you mention needs to be done.  The
messages available to me for testing are primarily ISO-8859-1,
Windows-125x, and a few others.  I lack test cases for verifying that
iconv() generates the same UTF-8 for a word regardless of its source
charset.  If you have a good set of test messages, they could be used
to create an additional test case for make check.
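
To make the kind of check under discussion concrete, here is a minimal
sketch in Python.  It is not bogofilter code: Python's built-in codecs
stand in for iconv(), and the names WORD, SAMPLES, to_utf8 and
normalized_forms are made up for illustration.  It encodes one word in
several charsets and checks that all of them convert to the same UTF-8
bytes.

```python
# Sketch only: Python's codecs stand in for iconv(); not bogofilter code.
WORD = "Österreich"

# The same word as raw bytes in several source charsets (hypothetical test data).
SAMPLES = {
    "iso-8859-1": WORD.encode("iso-8859-1"),
    "iso-8859-15": WORD.encode("iso-8859-15"),
    "cp1252": WORD.encode("cp1252"),
    "utf-8": WORD.encode("utf-8"),
}


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes using the declared charset and re-encode as UTF-8."""
    return raw.decode(charset).encode("utf-8")


def normalized_forms(samples: dict) -> set:
    """Return the set of distinct UTF-8 byte strings after conversion."""
    return {to_utf8(raw, charset) for charset, raw in samples.items()}


if __name__ == "__main__":
    forms = normalized_forms(SAMPLES)
    # A single element means every charset produced the same token bytes.
    print(len(forms) == 1)
```

The result is only as good as the charset declared in the message.  The
byte 0xA4, for example, is the currency sign in ISO-8859-1 but the euro
sign in ISO-8859-15, so a mislabelled message would still yield a
different token.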

> On the other hand: With say Chinese texts we had good success in training.
> I don't fully understand why it worked, but it did. Will this change now,
> because we actually better understand those texts?

That should improve!

> Also: How about word boundaries? In Unicode there is much more whitespace
> than in the small charsets. How do we do this? Now the question which
> character makes a word really changes. I work with this as a legal character
> for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
> Does this fully translate to Unicode? That would seem great.

Again, the answer is "insufficient information and test cases".  We
know that the lexer has never followed the "rules" of Asian grammar.
In spite of that, its parsing produces tokens that work well for
classifying email.  With better (more consistent) input for parsing,
this can only improve.
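
For experimenting with the word-boundary question, here is a rough
Python sketch.  It is not how bogofilter's lexer works.  It takes the
punctuation from the character class quoted above and, as an assumption
of this sketch, treats every Unicode separator (categories Z*) and
control or format character (categories C*) as a delimiter in place of
[:blank:] and [:cntrl:].  The names PUNCT_DELIMS, is_delimiter and
tokenize are hypothetical.

```python
import unicodedata

# Punctuation delimiters copied from the quoted character class.
PUNCT_DELIMS = set('<>;&%@|/\\{}^"*,[]=()+?:#$._!\'`~-')


def is_delimiter(ch: str) -> bool:
    """True if ch should end a token under this sketch's assumptions."""
    if ch in PUNCT_DELIMS:
        return True
    # Assumption: Z* (space/line/paragraph separators) and C* (control,
    # format, unassigned) replace the POSIX [:blank:] and [:cntrl:] classes.
    return unicodedata.category(ch)[0] in ("Z", "C")


def tokenize(text: str) -> list:
    """Split text into tokens at every delimiter character."""
    tokens, current = [], []
    for ch in text:
        if is_delimiter(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # U+00A0 (no-break space) and U+3000 (ideographic space) are not ASCII
    # whitespace, but both are in category Zs and so split tokens here.
    print(tokenize("Österreich\u00a0ist\u3000schön"))
```

Whether categories Z* and C* are the right translation of the POSIX
classes is exactly the open question.  Test messages would be needed to
tell whether this splits too much or too little.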

> > Command line options "--unicode=yes" and "--unicode=no" can be used.
> 
> Are there also config file options?

Yes.  

Bogofilter checks the database for the .ENCODING token and, if present,
uses its value.  The config file option only affects bogofilter when
creating a new wordlist.

Bogolexer doesn't use the wordlist, so the option controls how it
parses each message.

Bogoutil only uses the option in maintenance mode.
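
The precedence described above for bogofilter can be sketched roughly
as follows.  This Python sketch is illustrative only.  The dict stands
in for the wordlist, the stored values "utf-8" and "raw" are
placeholders rather than the real on-disk representation, and the
handling of an existing wordlist that has no .ENCODING token is an
assumption.

```python
ENCODING_TOKEN = ".ENCODING"


def effective_unicode(wordlist: dict, config_unicode: bool) -> bool:
    """Decide whether to tokenize as Unicode for the given wordlist.

    wordlist is a plain dict standing in for the database; the stored
    values "utf-8" and "raw" are placeholders chosen for this sketch.
    """
    stored = wordlist.get(ENCODING_TOKEN)
    if stored is not None:
        # The value recorded in the wordlist wins over the config option.
        return stored == "utf-8"
    if not wordlist:
        # Brand-new wordlist: the config option decides and is recorded.
        wordlist[ENCODING_TOKEN] = "utf-8" if config_unicode else "raw"
        return config_unicode
    # Assumption: an existing wordlist without the token is left as raw.
    return False


if __name__ == "__main__":
    converted = {ENCODING_TOKEN: "utf-8", "hello": 3}
    print(effective_unicode(converted, config_unicode=False))
```

The point of the sketch is that the config option is consulted only
when the wordlist is empty.  Once the token is present, the option no
longer changes the outcome.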

> > For a wordlist containing tokens from multiple languages, particularly
> > non-European languages, the conversion methods described above may not
> > work well for you.  Building a new wordlist (from scratch) will likely
> > work better as the new wordlist will be based solely on unicode.
> 
> I will (once I upgrade) certainly do that, since for all train-on-error
> methods (in particular training to exhaustion) the set of messages used
> will certainly look different.

You can build it and test it from the command line.  No need to replace
your mail delivery tool chain :->

> BTW: How about MIME encoding in mail headers? Is this treated properly?

I need to double-check that.  It's been a while since the initial
implementation.  Recent work has been in the areas of options for
configuring and running, handling of .ENCODING, etc.

More later ...

David