... convert_unicode.c ...

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sun Jun 26 11:40:07 CEST 2005


On Sat, 25 Jun 2005, Matthias Andree wrote:

> > One final observation: it appears to be quite common feature of Asian HTML
> > spam not to provide the right charset in HTML <meta> tag rather in MIME 
> > headers i.e. Content-Type (in fact, I can find a few samples where there 
> > is an explicit (!) bogus charset in Content-Type, e.g. US-ASCII).
> 
> This appears a bit contradictory. Are you saying that we should look at
> the HTML META tag if present, and not at the Content-Type?

This is similar to HTTP: you can have the charset in Content-Type, or in 
<meta>, or in both. The standards say Content-Type should take precedence 
when both are present, but MSIE (what a surprise) does the exact opposite.

I suspect MSOE, which uses the MSIE engine, does the same thing and 
overrides the charset in Content-Type with the charset in <meta>.
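
To make the two precedence policies concrete, here is a minimal Python
sketch (not bogofilter code). The names pick_charset, meta_charset and
header_charset and the prefer_meta flag are made up for illustration, the
regex is only a rough approximation of how a browser scans <meta>, and
prefer_meta=True models the MSIE/MSOE behaviour that is merely suspected
above.

```python
import re
from email import message_from_string

# Rough approximation: first charset=... found inside a <meta ...> tag.
META_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)

def header_charset(msg):
    """Charset from the MIME Content-Type header (lowercased), or None."""
    return msg.get_content_charset()

def meta_charset(html):
    """Charset declared in an HTML <meta> tag (lowercased), or None."""
    m = META_CHARSET_RE.search(html)
    return m.group(1).lower() if m else None

def pick_charset(msg, html, prefer_meta=False):
    """Choose a charset; prefer_meta=True emulates the suspected MSIE order."""
    hdr = header_charset(msg)
    meta = meta_charset(html)
    if prefer_meta:
        return meta or hdr      # <meta> overrides Content-Type
    return hdr or meta          # standards order: Content-Type wins

# Made-up sample: bogus US-ASCII in the header, Big5 in <meta>.
raw = ('Content-Type: text/html; charset=US-ASCII\n'
       '\n'
       '<html><head><meta http-equiv="Content-Type" '
       'content="text/html; charset=Big5"></head></html>')
msg = message_from_string(raw)
html = msg.get_payload()

standard_choice = pick_charset(msg, html)                   # "us-ascii"
msie_like_choice = pick_charset(msg, html, prefer_meta=True)  # "big5"
```

With the made-up sample, the standards order selects the bogus header
charset while the MSIE-like order selects the one from <meta>.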

Should we break the standards too and emulate the (suspected) MSOE 
behaviour? I don't know. It depends on two things: whether the (possible) 
exploitation of that misfeature correlates positively with spamicity, and 
whether bogus tokens increase or decrease classification accuracy. I am 
not sure about the latter. Afaict there are many cases where tokenizing 
Asian spam without proper iconv() yields *zero* meaningful tokens from 
the content of a message, so the classification is based on tokens from 
headers or HTML markup. It works, but it is quite fragile imho.
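
The effect of a wrong charset on tokens can be illustrated with a small
Python sketch. This is not bogofilter's lexer: tokenize and tokens_for are
hypothetical helpers, the Big5 body is a made-up sample, and Python's
bytes.decode stands in for iconv().

```python
import re

def tokenize(text):
    # Crude stand-in for a real lexer: runs of Unicode word characters.
    return re.findall(r"\w+", text)

def tokens_for(raw_bytes, charset):
    # errors="replace" maps undecodable bytes to U+FFFD instead of raising.
    return tokenize(raw_bytes.decode(charset, errors="replace"))

# Made-up message body ("free gift") as it would travel in Big5.
body = "免費 贈品".encode("big5")

good = tokens_for(body, "big5")      # charset the body really uses
bad = tokens_for(body, "us-ascii")   # bogus charset taken from the header
```

Decoded as Big5, the body yields the two intended words. Decoded as
US-ASCII, every high byte becomes a replacement character, so whatever
tokens remain are stray ASCII fragments that share nothing with the real
content.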

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."



