bogofilter-0.95.0 - New Current Release

Wed Jun 22 10:49:08 CEST 2005

"Boris 'pi' Piwinger" <3.14 at piology.org> writes:

> That sounds really interesting. So is my understanding correct, that
> this will make sure that the same word (e.g. Österreich) will show
> up in the database always the same, no matter if it was encoded in
> ISO-8859-1, ISO-8859-15, UTF-8 or any other charset?

Yes. (Not that it mattered between ISO-8859-1 and -15 though.)
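
A minimal sketch of the idea in Python (not bogofilter's actual code; the
helper name normalize_token is hypothetical): decoding each message with its
declared charset and re-encoding to UTF-8 gives one database key per word.

```python
def normalize_token(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes using the declared charset and return UTF-8 bytes."""
    return raw.decode(charset).encode("utf-8")


word = "Österreich"

# The same word as it would arrive in three differently encoded messages.
variants = {cs: word.encode(cs) for cs in ("iso-8859-1", "iso-8859-15", "utf-8")}

# After normalization, all variants collapse to a single key.
keys = {normalize_token(raw, cs) for cs, raw in variants.items()}

print(variants["iso-8859-1"] == variants["iso-8859-15"])  # Ö has the same byte in both
print(variants["iso-8859-1"] == variants["utf-8"])        # raw bytes differ from UTF-8
print(len(keys))                                          # distinct keys after normalization
```

ISO-8859-1 and ISO-8859-15 differ only in a handful of code points, none of
which occur in this word, so their raw bytes already match here.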

> On the other hand: With say Chinese texts we had good success in training.
> I don't fully understand why it worked, but it did. Will this change now,
> because we actually better understand those texts?

The number of character sets to encode Chinese is limited.  I don't
think it will change.

> Also: How about word boundaries? In Unicode there is much more whitespace
> than in the small charsets. How do we do this? Now the question which
> character makes a word really changes. I work with this as a legal character
> for tokens: [^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]
> Does this fully translate to Unicode? That would seem great.

This would need to be adjusted for national spacing conventions that
use something other than ASCII SPACE (0x20 in ASCII and 0x0020 in
Unicode), unless we want to rely on LC_CTYPE, which I don't suggest.
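
To illustrate, here is a rough Python sketch (an assumption about how one
might do it, not how bogofilter does it) that splits on Unicode separator
and control characters using the Unicode character database rather than
the locale. The names is_separator and split_tokens are hypothetical.

```python
import unicodedata


def is_separator(ch):
    # Zs/Zl/Zp are the Unicode separator categories; Cc covers control
    # characters such as TAB and LF. No locale (LC_CTYPE) is consulted.
    return unicodedata.category(ch) in {"Zs", "Zl", "Zp", "Cc"}


def split_tokens(text):
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# NO-BREAK SPACE, IDEOGRAPHIC SPACE, EM SPACE and TAB as separators.
sample = "spam\u00a0ham\u3000eggs\u2003bacon\ttoast"

print(split_tokens(sample))
print(len(sample.split(" ")))  # splitting on ASCII SPACE alone finds no boundary
```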

> BTW: How about MIME encoding in mail headers? Is this treated
> properly?

Not yet. RFC-2047 is decoded, but recursively, i.e. if the decoded
result looks like an encoded word, bogofilter decodes the new result
again.  bogofilter's analyzer should make only one pass over the header.
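
The difference between the two behaviours can be sketched with Python's
standard email.header module (this is an illustration only, not
bogofilter's analyzer; decode_once and decode_recursively are hypothetical
names):

```python
import base64
from email.header import decode_header


def decode_once(value):
    """Make a single RFC 2047 decoding pass over a header value."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(data)
    return "".join(parts)


def decode_recursively(value):
    """Keep decoding until the value stops changing (the undesired behaviour)."""
    previous = None
    while value != previous:
        previous, value = value, decode_once(value)
    return value


# An encoded word whose payload itself looks like an encoded word.
inner = "=?utf-8?q?spam?="
outer = "=?utf-8?b?" + base64.b64encode(inner.encode("ascii")).decode("ascii") + "?="

print(decode_once(outer))         # one pass leaves the inner encoded word as literal text
print(decode_recursively(outer))  # repeated passes unwrap it as well
```

With a single pass, the inner encoded word survives as literal text, which
is what a mail reader following RFC 2047 would display.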

-- 
Matthias Andree