... convert_unicode.c ...
Pavel Kankovsky
peak at argo.troja.mff.cuni.cz
Sun Jun 26 11:40:07 CEST 2005
On Sat, 25 Jun 2005, Matthias Andree wrote:
> > One final observation: it appears to be quite common feature of Asian HTML
> > spam not to provide the right charset in HTML <meta> tag rather in MIME
> > headers i.e. Content-Type (in fact, I can find a few samples where there
> > is an explicit (!) bogus charset in Content-Type, e.g. US-ASCII).
>
> This appears a bit contradictory. Are you saying that we should look at
> the HTML META tag if present, and not at the Content-Type?
This is similar to HTTP: you can have a charset in Content-Type, or in
<meta>, or in both. The standards say Content-Type should take precedence
in the last case, but MSIE (what a surprise) does the exact opposite.
I suspect MSOE, which uses the MSIE engine, does the same thing and
overrides the charset in Content-Type with the charset in <meta>.
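
To make the two precedence rules concrete, here is a minimal sketch in C.
Everything in it is hypothetical: pick_charset(), the charset_policy enum
and the "us-ascii" fallback are illustration only, not code from bogofilter,
and the "MSIE-like" branch encodes the suspected behaviour, not a verified
one.

```c
#include <stddef.h>

/* Hypothetical policy switch; not part of bogofilter. */
enum charset_policy {
    POLICY_STANDARD,  /* Content-Type wins when both are present */
    POLICY_MSIE_LIKE  /* <meta> wins when both are present (suspected) */
};

/* A charset counts as present if it is a non-empty string. */
static int is_present(const char *s)
{
    return s != NULL && *s != '\0';
}

/*
 * Choose the charset to decode an HTML part with.
 * ct_charset:   charset parameter from the MIME Content-Type, or NULL
 * meta_charset: charset found in the HTML <meta> tag, or NULL
 * Returns one of the two inputs, or a default when neither is given.
 */
const char *pick_charset(const char *ct_charset, const char *meta_charset,
                         enum charset_policy policy)
{
    int have_ct = is_present(ct_charset);
    int have_meta = is_present(meta_charset);

    if (have_ct && have_meta)
        return policy == POLICY_MSIE_LIKE ? meta_charset : ct_charset;
    if (have_ct)
        return ct_charset;
    if (have_meta)
        return meta_charset;
    return "us-ascii"; /* assumption: fall back to the MIME text default */
}
```

With the spam samples described above (bogus US-ASCII in Content-Type, the
real charset in <meta>), POLICY_STANDARD returns "US-ASCII" while
POLICY_MSIE_LIKE returns the <meta> value.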
Should we break the standards too and emulate the (suspected) MSOE
behaviour? I don't know. It depends on two things: whether exploitation
of that misfeature (if that is what it is) correlates positively with
spamicity, and whether bogus tokens increase or decrease classification
accuracy. I am not sure about the latter. AFAICT there are many cases
where tokenizing Asian spam without proper iconv() yields *zero*
meaningful tokens from the content of a message, so the classification
is based on tokens from headers or HTML markup. It works, but it is
quite fragile IMHO.
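
For reference, "proper iconv()" before tokenization could look roughly like
the sketch below. It uses only the standard iconv_open()/iconv()/
iconv_close() calls; the to_utf8() helper, its signature and the choice of
UTF-8 as the target are my assumptions, not what convert_unicode.c
actually does.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert in[0..in_len) from the named charset
 * to UTF-8, writing at most out_cap bytes into out.
 * Returns the number of bytes written, or (size_t)-1 on any failure
 * (unknown charset, invalid input, or output buffer too small).
 */
size_t to_utf8(const char *charset, const char *in, size_t in_len,
               char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes non-const pointers but does not modify the input. */
    char *src = (char *)in;
    size_t src_left = in_len;
    char *dst = out;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        /* Flush any pending shift state for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);

    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

A caller would pass the charset chosen from Content-Type or <meta> and
tokenize the converted buffer. On failure it could fall back to tokenizing
the raw bytes, which is the situation described above.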
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."