unicode [was: bogofilter-0.95.0 - New Current Release]

Boris 'pi' Piwinger 3.14 at piology.org
Tue Jun 21 19:54:16 CEST 2005


David Relson said:

>> Also: How about word boundaries? In Unicode there is much more whitespace
>> than in the small charsets. How do we handle this? Now the question of which
>> characters make up a word really changes. I use this as the set of legal
>> characters for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
>> Does this fully translate to Unicode? That would be great.
>
> Again, the answer is "insufficient information and test cases".

Actually, my question is mainly about [:blank:]. Punctuation (like French
quotes) will be a problem anyway, but that is unchanged.
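
To make the [:blank:] question concrete, here is a minimal Python sketch of
the character class quoted above, extended to Unicode. It is not bogofilter's
lexer. It assumes that "blank" means TAB plus the Unicode general category Zs
(space separators) and that [:cntrl:] maps to category Cc; the names is_blank,
is_token_char and tokenize are made up for illustration.

```python
import unicodedata

# ASCII punctuation excluded by the character class quoted above.
EXCLUDED_PUNCT = set('<>;&%@|/\\{}^"*,[]=()+?:#$._!\'`~-')


def is_blank(ch):
    # Assumption: Unicode "blank" = TAB plus category Zs (space separators).
    return ch == "\t" or unicodedata.category(ch) == "Zs"


def is_token_char(ch):
    # Assumption: category Cc (control) stands in for [:cntrl:].
    if is_blank(ch) or unicodedata.category(ch) == "Cc":
        return False
    return ch not in EXCLUDED_PUNCT


def tokenize(text):
    """Split text into runs of legal token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

Under these assumptions, NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE
(U+3000) split words just like an ASCII space, while French quotes stay glued
to the word because they are not in the excluded punctuation set.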

>> > Command line options "--unicode=yes" and "--unicode=no" can be used.
>>
>> Are there also config file options?
>
> Yes.

Which are?

> Bogofilter checks the database for the .ENCODING token and, if present,
> uses its value.  The config file option only affects bogofilter when
> creating a new wordlist.

Good enough.
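
The lookup order described above can be sketched as follows. This is only an
illustration in Python, not bogofilter's code: the wordlist is modeled as a
plain dict, and the function name and the values "utf-8" and "raw" are
hypothetical placeholders for whatever bogofilter actually stores.

```python
def effective_encoding(wordlist, config_unicode):
    """Pick the encoding for a wordlist.

    wordlist: dict standing in for the database (hypothetical model).
    config_unicode: bool standing in for the config file option.
    """
    stored = wordlist.get(".ENCODING")
    if stored is not None:
        # An existing wordlist wins; the config option is ignored.
        return stored
    # No .ENCODING token: treat as a new wordlist and use the config option.
    return "utf-8" if config_unicode else "raw"
```

So an existing wordlist keeps its encoding no matter what the config file
says, and the option only matters the first time the wordlist is created.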

>> > For a wordlist containing tokens from multiple languages, particularly
>> > non-European languages, the conversion methods described above may not
>> > work well for you.  Building a new wordlist (from scratch) will likely
>> > work better as the new wordlist will be based solely on Unicode.
>>
>> I will (once I upgrade) certainly do that, since for all train-on-error
>> methods (in particular training to exhaustion) the set of messages used
>> will certainly look different.
>
> You can build it and test it from the command line.  No need to replace
> your mail delivery tool chain :->

Well, I won't be able to build before the weekend (which could be a weekend
a month from now ;-). But I will rebuild the database for sure.

pi


