Encoded tokens of subject with large text get split from MUA

Matthias Andree matthias.andree at gmx.de
Fri Aug 1 19:29:06 CEST 2003


David Relson <relson at osagesoftware.com> writes:

> P.S.  My MUA doesn't get it quite right, either.  It has an extra space
> in its subject.

No wonder. X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
                             ---------

This "can't do RFC-2047 right" is shared by more than one Windows
mailer, so I wonder if they all use the same broken "system"^WOE
library.

SCNR.

Back on topic, how can we attack the real problem? One issue is: what do
we do when adjacent encoded words don't have the same character set?
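For concreteness, here is a made-up Subject header (not taken from the
original report) where two adjacent encoded words declare different
character sets:

    Subject: =?ISO-8859-1?Q?f=FCr?= =?UTF-8?Q?f=C3=BCr?= bogofilter

Both encoded words decode to the same text "für", but one labels it
ISO-8859-1 and the other UTF-8.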

We saw with case blindness that folding away information discards
something useful, so folding isn't desirable here either. Folding all
character sets into UTF-8 likewise loses the character set information.
Other than using a special string format that encodes each character as
three or four bytes (two bytes for the character set and one or two for
the character itself), I only see one simple solution: decode, emit the
character set as a token of its own, and otherwise ignore it, assuming
that words in different character sets don't clash (for example, the
word "window" read in a different character set might happen to make
sense in another language). The latter approach is easier, and if words
do clash, the user might get some misclassifications until he has
trained bogofilter enough that the token becomes unspecific for
ham/spam and gets ignored.
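To make the "decode, emit character set as token, otherwise ignore it"
idea concrete, here is a rough Python sketch built on the standard
library's email.header.decode_header. Bogofilter itself is C, and the
function name subject_tokens and the "charset:" prefix are made up for
illustration only:

    # Sketch of the proposed scheme: decode each RFC 2047 encoded word,
    # emit its declared character set as a pseudo-token, and otherwise
    # ignore the charset when tokenizing the decoded text.
    from email.header import decode_header

    def subject_tokens(raw_subject):
        for chunk, charset in decode_header(raw_subject):
            if isinstance(chunk, bytes):
                # Decode with the declared charset (US-ASCII if none);
                # if the label is bogus or the bytes don't fit, fall
                # back to Latin-1 so the word isn't lost entirely.
                try:
                    text = chunk.decode(charset or "us-ascii")
                except (LookupError, UnicodeDecodeError):
                    text = chunk.decode("latin-1")
            else:
                text = chunk
            if charset:
                # The character set itself becomes a token of its own.
                yield "charset:" + charset.lower()
            for word in text.split():
                yield word.lower()

    # Two adjacent encoded words that do not share a character set:
    raw = "=?ISO-8859-1?Q?f=FCr?= =?UTF-8?Q?f=C3=BCr?= bogofilter"
    print(list(subject_tokens(raw)))
    # ['charset:iso-8859-1', 'für', 'charset:utf-8', 'für', 'bogofilter']

With this scheme the word tokens stay comparable across character sets,
and the charset pseudo-tokens still let bogofilter learn that, say, a
particular charset shows up mostly in spam.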

-- 
Matthias Andree



