bogofilter-0.95.0 - New Current Release

Wed Jun 22 21:59:39 CEST 2005

"Boris 'pi' Piwinger" <3.14 at piology.org> writes:

>>> Also: How about word boundaries? In Unicode there is much more whitespace
>>> than in the small charsets. How do we handle this? The question of which
>>> characters make up a word really changes now. I use this as the set of
>>> legal token characters: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
>>> Does this fully translate to Unicode? That would seem great.
>>
>> This would need to be adjusted for national spacing conventions that
>> use something other than ASCII SPACE (0x20 in ASCII and 0x0020 in
>> Unicode), unless we want to rely on LC_CTYPE, which I don't suggest.
>
> I was thinking of spaces like em space, en space and things like that.
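
To make the em space / en space point concrete, here is a minimal sketch
in Python (not bogofilter code, and not its lexer). It assumes Python's
built-in Unicode tables as a stand-in for a complete UTF-8 CTYPE table;
SAMPLE, split_on_unicode_space() and describe() are hypothetical names
made up for this illustration.

```python
import unicodedata

# Hypothetical sample: words separated by EM SPACE, EN SPACE and
# NO-BREAK SPACE instead of ASCII SPACE (U+0020).
SAMPLE = "cheap\u2003pills\u2002now\u00a0today"


def split_on_unicode_space(text):
    """Split on every code point that Python's Unicode tables treat as whitespace."""
    # str.split() with no argument uses the same test as str.isspace().
    return text.split()


def describe(ch):
    """Return the code point, Unicode name and general category of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


if __name__ == "__main__":
    print(split_on_unicode_space(SAMPLE))
    for ch in "\u0020\u00a0\u2002\u2003":
        print(describe(ch), ch.isspace())
```

Not everything with "space" in its name counts as whitespace in such
tables: ZERO WIDTH SPACE (U+200B), for example, is a format character, so
a tokenizer that relies on the tables alone would not split on it.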

Hm... for locales that have working and complete UTF-8 CTYPE tables,
this might work by just 

>>> BTW: How about MIME encoding in mail headers? Is this treated
>>> properly?
>>
>> Not yet. RFC-2047 is decoded, but recursively, i.e. if the decoded
>> result looks like an encoded word, bogofilter decodes the new result
>> again.
>
> So it is guessing UTF-8?

No, it reads the character set from the encoded words. OTOH, I must admit
I haven't checked whether RFC-2047 decoded words are put through iconv().
If they aren't, that's for 0.95.1 :)
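
For illustration only, this is roughly what "read the character set from
the encoded word, then convert" looks like. It is a sketch in Python using
the standard email.header module, not bogofilter's C code; bytes.decode()
stands in for iconv() here, and decode_encoded_words() is a hypothetical
helper name.

```python
from email.header import decode_header


def decode_encoded_words(raw_header):
    """Decode RFC 2047 encoded words, using the charset each word declares."""
    parts = []
    for payload, charset in decode_header(raw_header):
        if isinstance(payload, bytes):
            # The charset comes from the encoded word itself (no guessing);
            # unencoded fragments carry no charset, so assume ASCII for them.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without any encoded word come back as plain str.
            parts.append(payload)
    return "".join(parts)


if __name__ == "__main__":
    # The same word, once in ISO-8859-1 (Q encoding), once in UTF-8 (base64).
    for header in ("=?ISO-8859-1?Q?Gr=FC=DFe?=", "=?UTF-8?B?R3LDvMOfZQ==?="):
        text = decode_encoded_words(header)
        print(repr(text), text.encode("utf-8"))
```

Both inputs end up as the same UTF-8 bytes, which is what a charset
conversion step would have to guarantee for the resulting tokens to match.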

-- 
Matthias Andree