bogofilter-0.95.0 - New Current Release

Tue Jun 21 08:17:34 CEST 2005

David Relson said:

> This release provides unicode support for new and converted wordlists.

That sounds really interesting. So is my understanding correct that
the same word (e.g. Österreich) will now always show up in the
database as the same token, no matter whether it was encoded in
ISO-8859-1, ISO-8859-15, UTF-8 or any other charset? Presumably
this will increase accuracy. Has anybody tested that already?
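
To make the question concrete, here is a minimal Python sketch of the
effect I am asking about. It is only an illustration, not bogofilter's
code, and the helper name to_utf8_token is made up for this example. It
assumes the tokenizer knows each message's declared charset and converts
everything to UTF-8 before storing the token.

```python
# Illustration only: shows why charset conversion unifies tokens.
word = "Österreich"

# The same word as it might arrive in messages with different charsets.
encoded = {
    "iso-8859-1": word.encode("iso-8859-1"),
    "iso-8859-15": word.encode("iso-8859-15"),
    "utf-8": word.encode("utf-8"),
}

# Without conversion, the raw bytes are what ends up in the wordlist.
# The two 8-bit charsets agree on this word, but UTF-8 differs.
raw_tokens = set(encoded.values())


def to_utf8_token(raw: bytes, charset: str) -> bytes:
    """Hypothetical helper: decode with the declared charset, store as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# With conversion, every variant collapses into a single token.
unified_tokens = {to_utf8_token(raw, cs) for cs, raw in encoded.items()}

print(len(raw_tokens), "distinct raw tokens")
print(len(unified_tokens), "distinct token after conversion")
```

If bogofilter behaves like this, the counts for such a word are no
longer split across several database entries.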

On the other hand: with, say, Chinese texts we have had good success in
training. I don't fully understand why it worked, but it did. Will this
change now that we actually understand those texts better?

Also: how about word boundaries? Unicode defines many more whitespace
characters than the small 8-bit charsets do. How is this handled? The
question of which characters can make up a word really changes now. I
use this set of legal characters
for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
Does this translate fully to Unicode? That would be great.
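
A small Python sketch of the whitespace concern, again only an
illustration and not how bogofilter's lexer works. The function names
split_ascii and split_unicode are invented here. It assumes an
ASCII-only notion of "blank" (space and tab) on one side and Python's
Unicode-aware str.split() on the other.

```python
import re

# A few characters that Unicode classifies as white space but that an
# ASCII-only [:blank:] (space and tab) does not cover.
unicode_spaces = {
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def split_ascii(text: str) -> list[str]:
    """Split on ASCII space and tab only."""
    return [tok for tok in re.split(r"[ \t]+", text) if tok]


def split_unicode(text: str) -> list[str]:
    """Split on anything Python's str.split() treats as white space."""
    return text.split()


text = "free\u00a0money\u3000now"

print(split_ascii(text))    # one long token, separators stay inside it
print(split_unicode(text))  # three separate tokens
```

With the ASCII-only rule the whole string stays one token, so the
choice of separator set directly changes what lands in the wordlist.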

> Command line options "--unicode=yes" and "--unicode=no" can be used.

Are there also config file options?

> For a wordlist containing tokens from multiple languages, particularly
> non-european languages, the conversion methods described above may not
> work well for you.  Building a new wordlist (from scratch) will likely
> work better as the new wordlist will be based solely on unicode.

I will certainly do that once I upgrade, since with any train-on-error
method (training to exhaustion in particular) the set of messages used
for training will surely turn out different.

BTW: How about MIME encoding in mail headers? Is this treated properly?
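
What I mean by "treated properly", sketched with Python's standard
email.header module. This is an illustration of RFC 2047 decoding in
general and says nothing about what bogofilter actually does. The
helper name decode_mime_header is made up for this example.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Hypothetical helper: turn an RFC 2047 encoded header into text."""
    return str(make_header(decode_header(raw)))


# The same word, once quoted-printable in ISO-8859-1 and once base64
# in UTF-8, as it could appear in a Subject: line.
q_encoded = "=?ISO-8859-1?Q?=D6sterreich?="
b_encoded = "=?UTF-8?B?w5ZzdGVycmVpY2g=?="

print(decode_mime_header(q_encoded))
print(decode_mime_header(b_encoded))
```

Both header forms decode to the same text, so ideally both would yield
the same header token.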

pi