bogofilter-0.95.0 - New Current Release

Wed Jun 22 16:10:27 CEST 2005

Matthias Andree said:

>> That sounds really interesting. So is my understanding correct, that
>> this will make sure that the same word (e.g. Österreich) will show
>> up in the database always the same, no matter if it was encoded in
>> ISO-8859-1, ISO-8859-15, UTF-8 or any other charset?
>
> Yes. (Not that it mattered between ISO-8859-1 and -15 though.)

I only listed both to make sure that no encoding information would be
stored anywhere.
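
To illustrate the point, here is a minimal sketch in Python (not
bogofilter's own C code) of what "always the same in the database" could
mean: decode each token with its declared charset and re-encode it as
UTF-8. The helper name normalize_token and the NFC normalization step are
my own assumptions, not something confirmed in this thread.

```python
import unicodedata


def normalize_token(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes with the declared charset and re-encode as UTF-8.

    Hypothetical helper for illustration only.
    """
    text = raw.decode(charset, errors="replace")
    # Assumption: NFC so precomposed and combining forms compare equal.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


word = "Österreich"
variants = {
    "iso-8859-1": word.encode("iso-8859-1"),
    "iso-8859-15": word.encode("iso-8859-15"),
    "utf-8": word.encode("utf-8"),
}

# One database key per distinct normalized form.
keys = {normalize_token(raw, charset) for charset, raw in variants.items()}
```

With this input all three variants collapse to a single key. The
ISO-8859-1 and ISO-8859-15 byte strings are identical for this word to
begin with, which matches the remark that the distinction does not matter
here.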

>> Also: How about word boundaries? In Unicode there is much more whitespace
>> than in the small charsets. How do we handle this? The question of which
>> characters make up a word really changes. I use this as the legal character class
>> for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
>> Does this fully translate to Unicode? That would seem great.
>
> This would need to be adjusted for national spacing conventions where
> they use something other than ASCII SPACE (0x20 in ASCII and 0x0020 in
> Unicode), unless we want to rely on LC_CTYPE which I don't suggest.

I was thinking of spaces such as the em space, the en space and things like that.
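
A small Python sketch of the problem, assuming the text has already been
decoded to Unicode: a separator class that only knows ASCII SPACE and TAB
leaves words glued together, while a Unicode-aware whitespace class splits
them. The two patterns are illustrative only and are not bogofilter's
lexer rules.

```python
import re

# EM SPACE (U+2003), EN SPACE (U+2002) and a plain ASCII space.
text = "cheap\u2003pills\u2002now here"

# ASCII-only separators: the em and en spaces stay inside the token.
ascii_tokens = re.split(r"[ \t]+", text)

# On str patterns, \s also matches Unicode whitespace characters.
unicode_tokens = re.split(r"\s+", text)
```

Here ascii_tokens has two entries, the first still containing both wide
spaces, while unicode_tokens has the four separate words.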

>> BTW: How about MIME encoding in mail headers? Is this treated
>> properly?
>
> Not yet. RFC 2047 is decoded, but recursively, i.e. if the decoded
> result looks like an encoded word, bogofilter decodes the new result
> again.

So it is guessing UTF-8?
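
For reference, an RFC 2047 encoded word carries its own charset label, so
a decoder can use that label instead of guessing. The sketch below uses
Python's standard email.header module to show single-pass versus repeated
decoding as described above. It says nothing about how bogofilter itself
picks the charset; decode_once, decode_recursive and max_rounds are
hypothetical names for this illustration.

```python
import base64
from email.header import decode_header


def decode_once(value: str) -> str:
    """Decode RFC 2047 encoded words in value exactly one time."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Use the charset named in the encoded word, if any.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


def decode_recursive(value: str, max_rounds: int = 5) -> str:
    """Decode again while the result still changes (bounded for safety)."""
    for _ in range(max_rounds):
        decoded = decode_once(value)
        if decoded == value:
            break
        value = decoded
    return value


def b64_word(charset: str, payload: bytes) -> str:
    """Build a base64 encoded word for the given charset label."""
    return "=?%s?B?%s?=" % (charset, base64.b64encode(payload).decode("ascii"))


inner = b64_word("UTF-8", "Österreich".encode("utf-8"))
outer = b64_word("US-ASCII", inner.encode("ascii"))  # encoded twice

once = decode_once(outer)
fully = decode_recursive(outer)
```

A single pass over the doubly encoded header yields the inner encoded
word; the repeated pass yields the plain text.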

pi