Russian charsets and functions

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sun Jan 9 21:24:43 CET 2005


On Sat, 8 Jan 2005, David Relson wrote:

> Using charsets, bogofilter's default character set is "us-ascii" with
> special handling of some characters (for example mapping 0x92 to
> apostrophe).  

Hmm... the appearance of non-ASCII (or non-ISO-8859-x) characters when the
source charset is alleged to be ASCII (or ISO-8859-x) could cause problems
(see below). This has to be worked around.

I know spammers send texts allegedly encoded in ASCII or ISO-8859-1 and
assume recipients will interpret them as CP1252(?). Does anyone have
ham making the same (...insert your favourite expletive here...)
assumption?
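
To make the mismatch concrete, here is a minimal Python sketch (not
bogofilter code; decode_as_sent and the two constants are hypothetical
names). Bytes 0x80-0x9F are control characters in ISO-8859-1 but mostly
printable punctuation in CP1252, so one possible workaround is to decode as
CP1252 whenever such bytes show up under a Western label:

```python
# Bytes 0x80-0x9F: C1 controls in ISO-8859-1, punctuation in CP1252.
C1_RANGE = range(0x80, 0xA0)
# Labels that senders commonly use while actually meaning CP1252 (assumption).
WESTERN_LABELS = {"us-ascii", "ascii", "iso-8859-1"}


def decode_as_sent(raw: bytes, declared: str) -> str:
    """Decode raw bytes, guessing CP1252 when the declared label looks wrong."""
    label = declared.lower()
    if label in WESTERN_LABELS and any(b in C1_RANGE for b in raw):
        # A few CP1252 positions are undefined, hence errors="replace".
        return raw.decode("cp1252", errors="replace")
    return raw.decode(label, errors="replace")


sample = b"don\x92t miss out"
print(decode_as_sent(sample, "iso-8859-1"))
```

Here 0x92 comes out as U+2019 (right single quotation mark), which is the
character that the special-case mapping to an apostrophe is standing in for.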

> For consistency with present databases, tocode and fromcode should both
> be "us-ascii", i.e. whatever charset is set for DEFAULT_CHARSET.

us-ascii -> us-ascii is unable to handle non-ASCII characters (>= 128).
A completely backward-compatible mode should probably skip the conversion 
altogether.
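
A rough Python sketch of what skipping the conversion could look like
(to_target is a hypothetical helper, not an existing bogofilter function):
when source and target name the same charset, the bytes are passed through
untouched instead of being pushed through a strict decoder that would
reject anything >= 128.

```python
import codecs


def to_target(raw: bytes, fromcode: str, tocode: str) -> bytes:
    """Convert raw from fromcode to tocode, bypassing identity conversions."""
    # codecs.lookup() normalizes aliases, e.g. "us-ascii" and "ASCII".
    if codecs.lookup(fromcode).name == codecs.lookup(tocode).name:
        return raw  # backward-compatible: no conversion, bytes unchanged
    return raw.decode(fromcode).encode(tocode)


# Identity case: the stray 0x92 byte survives as-is.
print(to_target(b"don\x92t", "us-ascii", "us-ascii"))
# Real conversion still happens when the charsets differ.
print(to_target(b"caf\xe9", "iso-8859-1", "utf-8"))
```

Without the early return, the first call would raise UnicodeDecodeError,
since a strict ASCII decoder has no mapping for 0x92.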

> Since ./configure allows "--default-charset=UTF-8" (or CP866 or KOI8-R),
> the value of DEFAULT_CHARSET can be set to personal preferences.

Some handling would be needed for characters that cannot be converted to 
the desired target charset.
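
One possible policy, sketched in Python (convert_lossy is a hypothetical
name, and substituting "?" is only one option; dropping or escaping the
character are others): replace whatever the target charset cannot
represent rather than aborting the whole conversion.

```python
def convert_lossy(text: str, tocode: str) -> bytes:
    """Encode text into tocode, substituting '?' for unrepresentable characters."""
    return text.encode(tocode, errors="replace")


# KOI8-R covers Cyrillic but has no Latin letters with diacritics,
# so the "i with diaeresis" is replaced while the Cyrillic word is kept.
mixed = "na\u00efve \u043f\u0440\u0438\u0432\u0435\u0442"
print(convert_lossy(mixed, "koi8-r").decode("koi8-r"))
```

Whichever policy is chosen, it has to be applied consistently at training
and at scoring time, otherwise the same word produces different tokens.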

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."



