unicode [was: bogofilter-0.95.0 - New Current Release]
Boris 'pi' Piwinger
3.14 at piology.org
Tue Jun 21 19:54:16 CEST 2005
David Relson said:
>> Also: How about word boundaries? In Unicode there is much more whitespace
>> than in the small charsets. How do we do this? Now the question of which
>> characters make up a word really changes. I work with this as a legal character
>> for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
>> Does this fully translate to Unicode? That would seem great.
>
> Again, the answer is "insufficient information and test cases".
Actually, my question is mainly about [:blank:]. Punctuation (like French
quotes) will be a problem anyway, but that is unchanged.
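
To make the [:blank:] concern concrete, here is a small Python sketch. It does
not show how bogofilter's lexer behaves; it only illustrates that Unicode
defines more blank characters than ASCII space and tab. The helper names
is_blank and split_on_blanks are made up for this illustration, and treating
"TAB plus general category Zs" as the Unicode analogue of [:blank:] is an
assumption, not something bogofilter documents.

```python
import unicodedata


def is_blank(ch: str) -> bool:
    """Assumed analogue of [:blank:]: TAB or any Unicode space separator (Zs)."""
    return ch == "\t" or unicodedata.category(ch) == "Zs"


def split_on_blanks(text: str) -> list:
    """Split text into tokens at every character for which is_blank() is true."""
    tokens, current = [], []
    for ch in text:
        if is_blank(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000) are both in
# category Zs, so they separate tokens just like an ASCII space does.
tokens = split_on_blanks("free\u00a0money\u3000now here")

# French quotes are punctuation (categories Pi and Pf), not blanks, so
# they stay attached to the token unless they are excluded separately.
quoted = split_on_blanks("\u00abbonjour\u00bb monde")
```

An ASCII-only check for space and tab would leave "free\u00a0money\u3000now" as
a single token, which is the kind of difference the question is about.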
>> > Command line options "--unicode=yes" and "--unicode=no" can be used.
>>
>> Are there also config file options?
>
> Yes.
Which are?
> Bogofilter checks the database for the .ENCODING token and, if present,
> uses its value. The config file option only affects bogofilter when
> creating a new wordlist.
Good enough.
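
For my own understanding, the precedence described above can be sketched like
this. It is only a model of the rule as stated, not bogofilter's code: the
wordlist is represented as a plain mapping, effective_encoding is a
hypothetical name, and the values "unicode" and "raw" are placeholders since
the actual stored values were not given.

```python
from typing import Mapping, Optional

ENCODING_TOKEN = ".ENCODING"  # token name as given in the message above


def effective_encoding(wordlist: Optional[Mapping[str, str]],
                       config_value: str) -> str:
    """Return the encoding to use, following the rule described above.

    wordlist is None when no wordlist exists yet (a new one will be created).
    """
    if wordlist is not None and ENCODING_TOKEN in wordlist:
        # An existing wordlist with the token: its value wins.
        return wordlist[ENCODING_TOKEN]
    # No token found: fall back to the configured value.
    # (Placeholder behaviour for an existing wordlist that lacks the token;
    # the message above does not say what happens in that case.)
    return config_value


existing = {ENCODING_TOKEN: "raw", "viagra": "spam"}
from_db = effective_encoding(existing, "unicode")   # token overrides config
from_config = effective_encoding(None, "unicode")   # new wordlist uses config
```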
>> > For a wordlist containing tokens from multiple languages, particularly
>> > non-European languages, the conversion methods described above may not
>> > work well for you. Building a new wordlist (from scratch) will likely
>> > work better as the new wordlist will be based solely on unicode.
>>
>> I will (once I upgrade) certainly do that, since for all train-on-error
>> methods (in particular training to exhaustion) the set of messages used
>> will certainly look different.
>
> You can build it and test it from the command line. No need to replace
> your mail delivery tool chain :->
Well, I won't be able to build before the weekend (could be the weekend a
month from now;-). But I will rebuild the database for sure.
pi