Russian charsets and functions

David Relson relson at osagesoftware.com
Sun Jan 9 21:40:39 CET 2005


On Sun, 9 Jan 2005 21:24:43 +0100 (CET)
Pavel Kankovsky wrote:

> On Sat, 8 Jan 2005, David Relson wrote:
> 
> > Using charsets, bogofilter's default character set is "us-ascii" with
> > special handling of some characters (for example mapping 0x92 to
> > apostrophe).  
> 
> Hmm... the appearance of non-ASCII (or non-ISO-8859-x) characters when the
> source charset is alleged to be ASCII (or ISO-8859-x) could cause problems
> (see below). This has to be worked around.
> 
> I know spammers send texts allegedly encoded in ASCII or ISO-8859-1 and
> assume recipients will interpret them in CP 1252(?). Does anyone have
> ham making the same (...insert your favourite expletive here...)
> assumption?
> 
> > For consistency with present databases, tocode and fromcode should both
> > be "us-ascii", i.e. whatever charset is set for DEFAULT_CHARSET.
> 
> us-ascii -> us-ascii is unable to handle non-ASCII characters (>= 128).
> A completely backward-compatible mode should probably skip the conversion 
> altogether.
> 
> > Since ./configure allows "--default-charset=UTF-8" (or CP866 or KOI8-R),
> > the value of DEFAULT_CHARSET can be set to personal preferences.
> 
> Some handling would be needed for characters that cannot be converted to 
> the desired target charset.

iconv() stops converting characters in the following cases:

  done   - no problem!
  E2BIG  - output buffer has no more room
  EINVAL - incomplete multibyte sequence
  EILSEQ - invalid multibyte sequence

The E2BIG case is handled with the next read.  For EINVAL and EILSEQ,
the new code just copies the problem character.  An alternative would be
to replace the problem character with a known character, for example a
space or a question mark.  'Tis not at all clear if there's a "right"
thing to do or what the "best" thing to do is :-<
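
To make the two options concrete, here is a minimal sketch of such a
conversion loop.  It is not bogofilter's actual code: the function name
convert_lossy() and the subst parameter are made up for illustration.
With subst == 0 the problem byte is copied through unchanged; any other
value (e.g. '?' or ' ') is written in its place.  Note that it steps
over one byte at a time, so a single bad multibyte character may
produce several copied or substituted bytes.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (illustration only, not bogofilter code).
 * Converts in[0..inlen) into out[0..outcap) using descriptor cd.
 * Returns the number of output bytes written and stores the number
 * of input bytes consumed in *consumed. */
size_t convert_lossy(iconv_t cd, const char *in, size_t inlen,
                     char *out, size_t outcap, char subst,
                     size_t *consumed)
{
    char *inp = (char *)in;   /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    assert(in != NULL && out != NULL && consumed != NULL);

    while (inleft > 0) {
        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (rc != (size_t)-1)
            break;            /* done -- no problem! */
        if (errno == E2BIG)
            break;            /* output full: caller flushes and calls again */
        if (errno == EILSEQ || errno == EINVAL) {
            if (outleft == 0)
                break;        /* no room for the byte: treat like E2BIG */
            /* copy the problem byte, or replace it with subst */
            *outp++ = subst ? subst : *inp;
            outleft--;
            inp++;
            inleft--;
            continue;
        }
        break;                /* unexpected error: stop converting */
    }
    *consumed = inlen - inleft;
    return outcap - outleft;
}
```

A caller would open a descriptor with iconv_open(tocode, fromcode),
call convert_lossy() for each buffer read, and resubmit any input past
*consumed once the output buffer has been flushed.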


> 
> --Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
> "Resistance is futile. Open your source code and prepare for assimilation."
> 
> _______________________________________________
> Bogofilter-dev mailing list
> Bogofilter-dev at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter-dev


-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800


