some SPAN results

Wed Jul 20 10:28:22 CEST 2005

On Mon, 18 Jul 2005, David Relson wrote:

> With a lexer modified to ignore <span...>...</span>, [...]
> It seems that the "span" change doesn't make a useful difference with
> my wordlist.

To be honest, I think that lexer modification was not a good idea for yet
another reason: the mere fact a part of an HTML document is wrapped in
<span> does not imply anything about its visibility. In fact, it is
possible (and quite easy) to make a document such that any text in <span>
is visible while the rest is invisible.

If Bogofilter ignored anything within <span>, it would have a
*deterministic* blind spot: a spammer could put the whole visible message
inside <span> and the filter would never see it. That would be a bad
thing, wouldn't it?
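
To make the blind spot concrete, here is a minimal Python sketch. It is
not Bogofilter's lexer: strip_spans() is a hypothetical regex stand-in for
"ignore <span...>...</span>", and the sample message and its CSS are
invented for illustration. The stylesheet hides the body text and makes
only the <span> content visible, so the reader sees exactly the part the
filter throws away.

```python
import re

# Hypothetical message (an assumption, not real spam): the stylesheet
# hides body text by default and re-enables it only inside <span>.
HTML = (
    "<html><head><style>"
    "body { font-size: 0; color: white } "
    "span { font-size: 12pt; color: black }"
    "</style></head><body>"
    "quarterly meeting agenda attached "
    "<span>cheap pills online</span>"
    " regards from the project team"
    "</body></html>"
)


def strip_spans(html):
    """Naive stand-in for a lexer that ignores <span ...>...</span>."""
    return re.sub(r"<span\b[^>]*>.*?</span>", " ", html, flags=re.I | re.S)


def words(html):
    """Drop <style> blocks and remaining tags, then split into words."""
    html = re.sub(r"<style\b[^>]*>.*?</style>", " ", html, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())


# What a span-ignoring filter would score: only the invisible decoy text.
seen_by_filter = words(strip_spans(HTML))
# What a filter that keeps span content would score: decoy plus payload.
seen_without_stripping = words(HTML)
```

Here seen_by_filter holds only the innocuous decoy words, while the
payload appears only in seen_without_stripping.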

It would be nice if we were able to discriminate between visible and
invisible parts of HTML. Unfortunately, one would need a full-featured
HTML+CSS+JS interpreter to be able to do it reliably.

A feasible approach might be to tokenize HTML tags and their attributes
(and CSS stylesheets) to help Bogofilter recognize subtle signs of markup
intended to fool readers.
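
A rough sketch of what such tokenization could look like, using Python's
standard html.parser module. The token naming scheme ("tag:", "attr:",
"css:") and the MarkupTokenizer class are assumptions made up for this
example, not anything Bogofilter implements, and the CSS handling is a
crude regex rather than a real CSS parser (for instance, it would misread
a selector such as "a:hover" as a declaration).

```python
import re
from html.parser import HTMLParser

# Crude "property: value" matcher; an assumption, not a CSS parser.
CSS_DECL = re.compile(r"([a-z-]+)\s*:\s*([^;{}]+)")


class MarkupTokenizer(HTMLParser):
    """Emit tokens for words, tags, attribute names and CSS declarations."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._in_style = False

    def _css(self, text):
        # One token per declaration, e.g. "css:font-size:0".
        for prop, value in CSS_DECL.findall(text.lower()):
            self.tokens.append("css:%s:%s" % (prop, "_".join(value.split())))

    def handle_starttag(self, tag, attrs):
        self.tokens.append("tag:" + tag)
        if tag == "style":
            self._in_style = True
        for name, value in attrs:
            self.tokens.append("attr:%s:%s" % (tag, name))
            if name == "style" and value:
                self._css(value)

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self._css(data)          # stylesheet text
        else:
            self.tokens.extend(data.lower().split())  # ordinary words


def tokenize(html):
    parser = MarkupTokenizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.tokens


tokens = tokenize('<p>hello <span style="font-size:0; color: #FFF">'
                  'Hidden</span></p>')
```

With tokens like "css:font-size:0" in the wordlist, the classifier could
learn from training data which markup tricks correlate with spam, without
anyone having to decide in advance what counts as invisible.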

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."