camouflage [was: some SPAN results]

David Relson relson at osagesoftware.com
Wed Jul 20 13:27:02 CEST 2005


On Wed, 20 Jul 2005 10:28:22 +0200 (CEST)
Pavel Kankovsky wrote:

> On Mon, 18 Jul 2005, David Relson wrote:
> 
> > With a lexer modified to ignore <span...>...</span>, [...]
> > It seems that the "span" change doesn't make a useful difference with
> > my wordlist.
> 
> To be honest, I think that lexer modification was not a good idea for yet
> another reason: the mere fact a part of an HTML document is wrapped in
> <span> does not imply anything about its visibility. In fact, it is
> possible (and quite easy) to make a document such that any text in <span>
> is visible while the rest is invisible.
> 
> If Bogofilter ignored anything within <span> then it would have a
> *deterministic* blind spot. It would be a bad thing wouldn't it?
> 
> It would be nice if we were able to discriminate between visible and
> invisible parts of HTML. Unfortunately, one would need a full-featured
> HTML+CSS+JS interpreter to be able to do it reliably.
> 
> A feasible approach might be to tokenize HTML tags and their attributes
> (and CSS stylesheets) to help Bogofilter recognize subtle signs of markup
> intended to fool readers.
> 
> --Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
> "Resistance is futile. Open your source code and prepare for assimilation."

Hi Pavel,

Don't worry about the "SPAN" patch.  It was an experiment and is not
destined for inclusion in bogofilter.

Bogofilter currently ignores most HTML tags.  There are 3 tags that are
parsed, i.e. "a", "img", and "font".  The tokens within these tags are
used in scoring.
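
To illustrate the idea (this is not bogofilter's lexer, which is written
in C), here is a minimal Python sketch that keeps body text, keeps the
attribute values of "a", "img", and "font", and drops every other tag.
The class name TagTokenizer, the helper tokenize(), and the simple
alphanumeric token pattern are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

PARSED_TAGS = {"a", "img", "font"}      # the three tags named above
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")  # assumed, simplified token pattern


class TagTokenizer(HTMLParser):
    """Hypothetical tokenizer: body text plus attributes of a few tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag not in PARSED_TAGS:
            return  # every other tag is ignored entirely
        for _name, value in attrs:
            if value:  # value is None for attributes without "=..."
                self.tokens.extend(TOKEN_RE.findall(value))

    def handle_data(self, data):
        # Text between tags is always tokenized.
        self.tokens.extend(TOKEN_RE.findall(data))


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

With this sketch, the class attribute of a <span> contributes nothing,
while the URL in an <a href=...> is split into tokens alongside the
visible text.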

There's a term for "invisible" text.  It's called camouflage :->  As
you point out, detecting it is difficult.  If one ignores CSS and JS,
detecting white text on a white background isn't hard -- just check for
color "white" or color "#FFFFFF".  However there's also "#FEFEFE"
which is almost white.  Does one consider that white also?
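
One way to handle the "almost white" case is a tolerance rather than an
exact match.  The Python sketch below is only an illustration, not
anything bogofilter does: the function names, the tolerance of 8, and
the decision to recognize only "white" and #RGB/#RRGGBB values are all
assumptions.  It also ignores CSS and JS, as discussed above.

```python
import re

HEX_COLOR_RE = re.compile(r"#([0-9a-f]{3}|[0-9a-f]{6})")


def parse_color(value):
    """Return (r, g, b) for "white", "#rgb" or "#rrggbb", else None."""
    v = value.strip().lower()
    if v == "white":
        return (255, 255, 255)
    match = HEX_COLOR_RE.fullmatch(v)
    if match is None:
        return None  # unknown name or malformed value
    digits = match.group(1)
    if len(digits) == 3:
        # Short form: "#fff" means "#ffffff".
        digits = "".join(ch * 2 for ch in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def is_near_white(value, tolerance=8):
    """True if every channel is within `tolerance` of 255 (assumed cutoff)."""
    rgb = parse_color(value)
    return rgb is not None and all(255 - c <= tolerance for c in rgb)
```

The tolerance is the judgment call: any fixed cutoff can be sidestepped
by picking a color just past it, which is part of why the problem is
hard.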

Camouflage is a problem I worried about at one time (a year or two
ago).  I'm not presently worrying about it as bogofilter is doing well.

Regards,

David


