obscured URL not being tokenized

David Relson relson at osagesoftware.com
Sun Dec 21 02:52:02 CET 2003


On 20 Dec 2003 20:14:29 -0500
Tom Anderson <tanderso at oac-design.com> wrote:

> On Sat, 2003-12-20 at 16:43, David Relson wrote:
> > based on them.  Unfortunately, this isn't adequate.  Directives are
> > nested, for example a table contains table data which can include
> > font directives.  Proper processing of all this requires a stack for
> > saving the previous state and popping the stack as end tags are
> > encountered. It all gets more complicated since the html may be
> > improperly formed, as in <table><tr><td><font>...</table>, where the
> > end directive pops several stack levels.
> 
> I don't think bogofilter needs to be interpreting html.  Simply
> recognizing the tokens such as 'size=-5' and 'color="#fffffe"' ought
> to be enough.  If there were to be such functionality as html
> interpretation and image recognition attempted, it should be in a
> preprocessor separate from bogofilter which perhaps sets its own
> "x-bogosity"-type line.  This way it could be turned on or off easily.
> 
> Let's not bloat the core of bogofilter with such stuff.
> 
> Tom

Tom,

Of necessity, bogofilter is already doing some html interpretation:

comments - removed from consideration so "ju<!xxx>nk" is recognized
tags a, img, font - innards parsed because of demonstrable benefits
html character decoding - "&#97;" converted to 'a'
url character decoding - "%30" converted to '0'
etc
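
To make the list above concrete, here is a rough Python sketch of the three text transformations. It is not bogofilter's code (bogofilter does this in its own lexer); the function names are made up for illustration, and the comment pattern is deliberately simplistic.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical pattern: treats anything of the form <!...> as a comment.
COMMENT_RE = re.compile(r"<!.*?>", re.DOTALL)

def strip_comments(text):
    """Remove comment-like markup so split words are rejoined."""
    return COMMENT_RE.sub("", text)

def decode_html_chars(text):
    """Decode HTML character references such as a numeric code for 'a'."""
    return html.unescape(text)

def decode_url_chars(text):
    """Decode %XX escapes as found in URLs."""
    return unquote(text)

print(strip_comments("ju<!xxx>nk"))
print(decode_html_chars("&#97;"))
print(decode_url_chars("%30"))
```

The URL decoding would only make sense on text already known to be a URL; the sketch leaves that decision to the caller.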

For the heck of it, I'm experimenting with color as regards hidden text.
The following ideas have become clear:

Proper color processing requires a stack to track levels (nesting) and
attributes.
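
A minimal sketch of what such a stack might look like, written in Python for brevity rather than as bogofilter code. The class name, the white default background, and the black default text color are all assumptions for illustration. An end tag pops its matching start tag and everything opened after it, which covers the badly formed <table><tr><td><font>...</table> case.

```python
from html.parser import HTMLParser

# Tags that never take an end tag, so they are not pushed.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class ColorTracker(HTMLParser):
    """Hypothetical tracker: collects text whose color equals the background."""

    def __init__(self, background="#ffffff"):
        super().__init__()
        self.background = background
        # Each entry is (tag, text color in effect); the bottom entry is the default.
        self.stack = [("body", "#000000")]
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        color = dict(attrs).get("color") if tag == "font" else None
        inherited = self.stack[-1][1]
        self.stack.append((tag, (color or inherited).lower()))

    def handle_endtag(self, tag):
        # Pop the nearest matching open tag and every level above it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return
        # A stray end tag with no matching start tag is ignored.

    def handle_data(self, data):
        if data.strip() and self.stack[-1][1] == self.background:
            self.hidden.append(data.strip())

tracker = ColorTracker()
tracker.feed('<table><tr><td><font color="#FFFFFF">secret</font>shown'
             '<font color="#ffffff">also</table>after')
tracker.close()
print(tracker.hidden)
```

Only exact matches against the background are caught here; near matches such as "#fffffe" on white would need some notion of color distance.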

Including additional html tags in the parser _may_ be space efficient --
remains to be seen.  Including additional html tags in the parser will
_definitely_ be faster than a preprocessor.

An alternate experiment could accept "size=-5", "color=red",
"color=#ff0000", etc as tokens.
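
For comparison, the token approach needs no stack at all. A rough Python sketch, with made-up names and a deliberately small set of attributes (size and color only):

```python
import re

# Hypothetical patterns: find tags, then pull size/color attributes out of each.
TAG_RE = re.compile(r"<[^<>]+>")
ATTR_RE = re.compile(r"""\b(size|color)\s*=\s*["']?([^\s"'>]+)""", re.IGNORECASE)

def attribute_tokens(html_text):
    """Return tokens like 'size=-5' or 'color=red' for attributes inside tags."""
    tokens = []
    for tag in TAG_RE.findall(html_text):
        for name, value in ATTR_RE.findall(tag):
            tokens.append(f"{name.lower()}={value.lower()}")
    return tokens

print(attribute_tokens('<font size=-5 color="#FF0000">hi</font> <font color=red>'))
```

Quotes are dropped and case is folded so that color="#FF0000" and color=#ff0000 yield the same token; whether that normalization helps or hurts would have to be tested.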

It'd be interesting to see which method is more effective.

David



