... convert_unicode.c ...

David Relson relson at osagesoftware.com
Fri Jun 24 13:12:20 CEST 2005


On Fri, 24 Jun 2005 14:24:46 +0400
Yar Tikhiy wrote:

> On Fri, Jun 24, 2005 at 09:55:53AM +0200, Matthias Andree wrote:
> > Yar Tikhiy <yar at comp.chem.msu.su> writes:
> > 
> > > Hoping I may speak for Cyrillic users, they would rather choose
> > > between Windows-1251 and KOI8-R as their default-from-charset since
> > > literally nobody uses CP866 on the Net side.  Interestingly, I
> > > receive most ham in KOI8-R and most spam in Windows-1251, and I've
> > > never seen an email in CP866.  However, today most non-English
> > > spammers seem to specify charset right for their recipients to be
> > > able to read the junk in one click--who will ever spend two clicks
> > > to read spam?  Therefore US-ASCII is a reasonable default-from-charset
> > > for Cyrillic users.  I hope that it is for Chinese folks, too :-)
> > 
> > I'm not sure if someone hears your hopes.
> > 
> > I see considerable amounts of Korean and Chinese spam which has specific
> > patterns of character pairs, so it seems somewhat common in Asia to
> > override or default the specified charset to some national common
> > charset.
> 
> Perhaps someone concerned will show up on the list some day; I can't
> see how we should care about this issue until then.
> 
> -- 
> Yar

Yar,

I share your point of view.  

The parsing of Korean and Chinese may not produce tokens that are
meaningful to someone who speaks those languages.  However, the tokens
work well for bogofilter's classification, and that's what we need.

We'll deal with the issue when someone who knows the problem domain
shows up.

Regards,

David



More information about the bogofilter-dev mailing list