obscured URL not being tokenized

David Relson relson at osagesoftware.com
Sat Dec 20 22:43:21 CET 2003


On Sat, 20 Dec 2003 20:52:12 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson <relson at osagesoftware.com> wrote:
> 
> >As mentioned in a previous message, color info appears in a number
> >of tags.  Examples include:
> >
> >	<body bgcolor="white" text="black">
> >	<font color="#000000">
> >	<TABLE BGCOLOR="#cccccc">
> >	<td="#9EBAC6">
> >	<td bgcolor=lightblue>
> >
> >Likely, the most value for the least effort would come from
> >recognizing 000000 as black and FFFFFF as white.
> >
> >Question:  what other tags allow color info?
> 
> It is not that easy. As you said before, even minimal
> changes will make this impossible. But even worse, basically
> any HTML element can have a color by CSS. And CSS will
> overwrite any of those in your example. Without complete
> understanding of HTML we cannot do anything.
> 
> pi

I was thinking about this on the way to my daughter's soccer game...

The general subject can be termed "hidden text".  One technique is
matching the text color to the background color, which is how this
thread started.  Another technique is sending unrelated MIME parts, for
example a text/html part selling dirty pictures alongside a text/plain
part on something totally different, such as several paragraphs on
archery.  There are other techniques, but I can't think of them at the
moment.

Anyhow, as regards white text on a white background, it's relatively
easy to look at the most recent color directives and make a decision
based on them.  Unfortunately, this isn't adequate.  Directives are
nested: for example, a table contains table data, which can in turn
include font directives.  Processing all this properly requires a stack
that saves the current state at each start tag and is popped as end tags
are encountered.  It all gets more complicated because the HTML may be
improperly formed, as in <table><tr><td><font>...</table>, where the
single end tag has to pop several stack levels.
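
To make the stack idea concrete, here is a rough sketch in Python (not
bogofilter code, and not a proposal for how the lexer should do it).
The class name HiddenTextFinder, the two-entry color table, and the
assumed default of black text on a white background are all made up for
illustration.  It looks only at the color, text, and bgcolor attributes
and ignores CSS entirely, so pi's objection still stands.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny color table; real mail uses many more names.
NAMED = {"white": "ffffff", "black": "000000"}


def normalize(value):
    """Map 'white', '#FFFFFF' or 'ffffff' to a lowercase hex string."""
    v = value.strip().lower().lstrip("#")
    return NAMED.get(v, v)


class HiddenTextFinder(HTMLParser):
    """Keep a stack of (tag, fg, bg) and record text where fg == bg."""

    def __init__(self):
        super().__init__()
        # Assumed defaults: black text on a white background.
        self.stack = [("#root", "000000", "ffffff")]
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        # Start from the enclosing element's colors, then apply overrides.
        _, fg, bg = self.stack[-1]
        for name, value in attrs:
            if value is None:
                continue
            if name in ("color", "text"):
                fg = normalize(value)
            elif name == "bgcolor":
                bg = normalize(value)
        self.stack.append((tag, fg, bg))

    def handle_endtag(self, tag):
        # Find the nearest matching open tag and pop it together with
        # everything opened after it.  This is how </table> closes an
        # unterminated <tr><td><font>.  Index 0 (the root) is never popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return
        # No matching open tag: ignore the stray end tag.

    def handle_data(self, data):
        _, fg, bg = self.stack[-1]
        if data.strip() and fg == bg:
            self.hidden.append(data.strip())


finder = HiddenTextFinder()
finder.feed('<body bgcolor="white" text="black"><table><tr><td>'
            '<font color="#FFFFFF">buy now</table>visible</body>')
finder.close()
print(finder.hidden)
```

With the malformed input above, the </table> end tag pops four levels at
once, so "visible" is judged against the body colors again rather than
the leftover white-on-white font state.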

Getting it right isn't simple ...



