some SPAN results
Pavel Kankovsky
peak at argo.troja.mff.cuni.cz
Wed Jul 20 10:28:22 CEST 2005
On Mon, 18 Jul 2005, David Relson wrote:
> With a lexer modified to ignore <span...>...</span>, [...]
> It seems that the "span" change doesn't make a useful difference with
> my wordlist.
To be honest, I think that lexer modification was not a good idea for yet
another reason: the mere fact that a part of an HTML document is wrapped in
<span> does not imply anything about its visibility. In fact, it is
possible (and quite easy) to make a document such that any text in <span>
is visible while the rest is invisible.
If Bogofilter ignored anything within <span>, it would have a
*deterministic* blind spot. That would be a bad thing, wouldn't it?
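
To make the blind spot concrete, here is a rough sketch in Python. It is not Bogofilter's lexer, just a toy text extractor that drops everything inside <span>. The sample message, the class name SpanSkippingExtractor and the helper tokens_seen are all made up for illustration. The stylesheet hides the body and re-enables visibility for <span> only, so a browser would typically show just the span text, which is exactly the text the extractor never sees.

```python
from html.parser import HTMLParser

# Hypothetical message: the stylesheet hides everything except <span> content.
MESSAGE = """<html><head><style>
body { visibility: hidden; }
span { visibility: visible; }
</style></head><body>
<p>harmless filler text <span>buy cheap pills</span> more filler</p>
</body></html>"""


class SpanSkippingExtractor(HTMLParser):
    """Toy extractor: collects words, ignoring anything inside <span>."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0    # > 0 while inside one or more <span> elements
        self.in_style = False  # True while inside a <style> element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_depth += 1
        elif tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth > 0:
            self.span_depth -= 1
        elif tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Keep text only when it is outside <span> and outside <style>.
        if self.span_depth == 0 and not self.in_style:
            self.words.extend(data.split())


def tokens_seen(html):
    """Return the words a span-ignoring extractor would pass to the classifier."""
    parser = SpanSkippingExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


if __name__ == "__main__":
    print(tokens_seen(MESSAGE))
```

The extractor reports only the filler words, while the reader sees only the span text. A sender who knows the rule can exploit it every time.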
It would be nice if we were able to discriminate between visible and
invisible parts of HTML. Unfortunately, one would need a full-featured
HTML+CSS+JS interpreter to be able to do it reliably.
A feasible approach might be to tokenize HTML tags and their attributes
(and CSS stylesheets) to help Bogofilter recognize subtle signs of markup
intended to fool readers.
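
A minimal sketch of what such tokenization might look like, again in Python rather than in Bogofilter's own lexer. The token prefixes ("tag:", "attr:", "css:") and the names MarkupTokenizer, css_tokens and markup_tokens are my own assumptions, not anything Bogofilter does today. The CSS handling is a crude regular expression over "property: value" pairs, not a real CSS parser.

```python
import re
from html.parser import HTMLParser

# Crude "property: value" matcher. It is only an approximation: it would
# also misread selectors such as "a:hover" as a declaration.
CSS_DECLARATION = re.compile(r"([a-zA-Z-]+)\s*:\s*([^;{}]+)")


def css_tokens(css):
    """Turn CSS text into tokens such as 'css:font-size=0'."""
    tokens = []
    for prop, value in CSS_DECLARATION.findall(css):
        # Lowercase and remove whitespace so equivalent spellings match.
        normalized = "".join(value.lower().split())
        tokens.append(f"css:{prop.lower()}={normalized}")
    return tokens


class MarkupTokenizer(HTMLParser):
    """Emits tokens for tags, attribute names and CSS, plus ordinary words."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.in_style = False  # True while inside a <style> element

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"tag:{tag}")
        for name, value in attrs:
            self.tokens.append(f"attr:{tag}.{name}")
            if name == "style" and value:
                self.tokens.extend(css_tokens(value))
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.tokens.extend(css_tokens(data))
        else:
            self.tokens.extend(data.lower().split())


def markup_tokens(html):
    """Return word tokens interleaved with markup and CSS tokens."""
    parser = MarkupTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    sample = '<p><span style="font-size:0; color: #FFF">hidden words</span>Visible</p>'
    print(markup_tokens(sample))
```

With tokens like these in the wordlist, the classifier could learn from training data that certain style declarations correlate with spam, without having to decide by itself what is actually visible.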
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."