... convert_unicode.c ...

Matthias Andree matthias.andree at gmx.de
Wed Jun 22 11:13:33 CEST 2005


"Pavel Kankovsky" <peak at argo.troja.mff.cuni.cz> writes:

> I think we should emulate the behaviour of common MUAs--read MS Outlook 
> (Express)--in such a situation. This is what spammers expect and this is 
> what they optimize their "products" for.

Right. The Unix browsers I use have a setting along the lines of "character set
to assume for pages lacking specification"; I don't know about MSOE.

> I guess MSOE assumes ASCII, 8859-1, or the charset of its current locale
> (most likely but ASCII or perhaps 8859-1 might still be a reasonable
> default value) when it encounters an unrecognized charset name.

I'd presume that "Western" Windows MSOE installations assume Codepage
1252 (a superset of ISO-8859-1 containing the z caron, euro currency
symbol, oe ligatures and some other stuff). Central European
installations (are there localized Windows versions for Polish, Czech
and so on?) might default to something else. A default or fallback
character set might work.
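
A minimal sketch of the fallback idea, under stated assumptions: the
known_charsets list and the choose_charset() helper are hypothetical names
made up for illustration, not taken from convert_unicode.c, and the fallback
(for example "CP1252") is whatever the caller configures.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical list of charset names the converter understands. */
static const char *const known_charsets[] = {
    "US-ASCII", "ISO-8859-1", "ISO-8859-2", "UTF-8", "CP1252"
};

/* Case-insensitive comparison of two ASCII names; returns 1 if equal. */
static int charset_name_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/*
 * Return the canonical name of the declared charset if it is recognized,
 * otherwise the caller-supplied fallback. A missing declaration (NULL)
 * is treated like an unrecognized name.
 */
const char *choose_charset(const char *declared, const char *fallback)
{
    size_t i;

    assert(fallback != NULL);
    if (declared != NULL) {
        for (i = 0; i < sizeof known_charsets / sizeof known_charsets[0]; i++) {
            if (charset_name_equal(declared, known_charsets[i]))
                return known_charsets[i];
        }
    }
    return fallback;
}
```

A caller would pass the charset name found in the message together with the
configured default, e.g. choose_charset(name_from_header, "CP1252").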

A more expensive approach would be to guess the character set by a
"maximum likelihood" principle, based on a Markov model that gives
compound probabilities for runs of characters. It would, however, have to
estimate the language and character set at the same time, which is going
to be pretty expensive.
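
A rough sketch of how such a guess could look, assuming one first-order
(byte bigram) Markov model per (language, charset) pair, trained on sample
bytes with add-one smoothing. All names here (struct model, model_train,
guess_model, ...) are hypothetical, and a real detector would need far more
training data than a sketch suggests. To avoid underflow the product of
probabilities is rescaled by 2^64 as it shrinks, standing in for the more
usual sum of logarithms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RESCALE 18446744073709551616.0  /* 2^64, exactly representable */

/* One first-order Markov model per (language, charset) pair. */
struct model {
    const char *label;
    unsigned count[256][256];   /* count[a][b]: byte b seen right after a */
    unsigned total[256];        /* total[a]: transitions seen leaving a */
};

/* Likelihood stored as mantissa * RESCALE^(-rescaled), mantissa <= 1. */
struct score {
    double mantissa;
    int rescaled;
};

void model_init(struct model *m, const char *label)
{
    memset(m, 0, sizeof *m);
    m->label = label;
}

/* Count byte-to-byte transitions in a sample already in the model's charset. */
void model_train(struct model *m, const char *sample)
{
    const unsigned char *s = (const unsigned char *)sample;

    for (; s[0] != '\0' && s[1] != '\0'; s++) {
        m->count[s[0]][s[1]]++;
        m->total[s[0]]++;
    }
}

/* Likelihood of the text under the model, with add-one smoothing. */
static struct score score_text(const struct model *m, const char *text)
{
    struct score sc = { 1.0, 0 };
    const unsigned char *s = (const unsigned char *)text;

    for (; s[0] != '\0' && s[1] != '\0'; s++) {
        sc.mantissa *= (m->count[s[0]][s[1]] + 1.0) / (m->total[s[0]] + 256.0);
        while (sc.mantissa < 1.0 / RESCALE) {   /* keep mantissa in range */
            sc.mantissa *= RESCALE;
            sc.rescaled++;
        }
    }
    return sc;
}

/* Fewer rescalings means a larger value; ties are decided by the mantissa. */
static int score_better(struct score a, struct score b)
{
    if (a.rescaled != b.rescaled)
        return a.rescaled < b.rescaled;
    return a.mantissa > b.mantissa;
}

/* Return the model under which the text is most likely. */
const struct model *guess_model(const struct model *models, size_t n,
                                const char *text)
{
    const struct model *best = NULL;
    struct score best_score = { 0.0, 0 };
    size_t i;

    assert(models != NULL && n > 0);
    for (i = 0; i < n; i++) {
        struct score sc = score_text(&models[i], text);
        if (best == NULL || score_better(sc, best_score)) {
            best = &models[i];
            best_score = sc;
        }
    }
    return best;
}
```

Each model would be trained with model_train() on text in one language,
encoded in one charset (say German in ISO-8859-1 and Czech in ISO-8859-2);
the label of the model returned by guess_model() then names both at once.
The cost shows up directly: every candidate pair needs its own 256x256
table and its own pass over the message.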

-- 
Matthias Andree



More information about the bogofilter-dev mailing list