obscured URL not being tokenized
David Relson
relson at osagesoftware.com
Sun Dec 21 02:52:02 CET 2003
On 20 Dec 2003 20:14:29 -0500
Tom Anderson <tanderso at oac-design.com> wrote:
> On Sat, 2003-12-20 at 16:43, David Relson wrote:
> > based on them. Unfortunately, this isn't adequate. Directives are
> > nested, for example a table contains table data which can include
> > font directives. Proper processing of all this requires a stack for
> > saving the previous state and popping the stack as end tags are
> > encountered. It all gets more complicated since the html may be
> > improperly formed, as in <table><tr><td><font>...</table>, where the
> > end directive pops several stack levels.
>
> I don't think bogofilter needs to be interpreting html. Simply
> recognizing the tokens such as 'size=-5' and 'color="#fffffe"' ought
> to be enough. If there were to be such functionality as html
> interpretation and image recognition attempted, it should be in a
> preprocessor separate from bogofilter which perhaps sets its own
> "x-bogosity"-type line. This way it could be turned on or off easily.
>
> Let's not bloat the core of bogofilter with such stuff.
>
> Tom
Tom,
Of necessity, bogofilter is already doing some html interpretation:
  comments - removed from consideration so "ju<!xxx>nk" is recognized
    as "junk"
  tags a, img, font - innards parsed because of demonstrable benefits
  html character decoding - "&#97;" converted to 'a'
  url character decoding - "%30" converted to '0'
  etc.
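
To make the first, third and fourth items concrete, here is a rough sketch in Python (not bogofilter's C lexer, and not how bogofilter actually implements any of this). The name normalize and the comment pattern are made up for illustration, and unlike a real lexer it applies URL decoding to all text rather than only to URLs.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical pattern: treat anything of the form <!...> as a comment.
COMMENT_RE = re.compile(r"<!.*?>", re.DOTALL)

def normalize(text: str) -> str:
    """Strip comments, then decode character references and %XX escapes."""
    text = COMMENT_RE.sub("", text)  # "ju<!xxx>nk" becomes "junk"
    text = html.unescape(text)       # "&#97;" becomes "a"
    text = unquote(text)             # "%30" becomes "0" (crude: applied to all text)
    return text
```

Removing comments before tokenizing is what lets a word split by an embedded comment be seen as a single token.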
For the heck of it, I'm experimenting with color as regards hidden text.
The following ideas have become clear:

  Proper color processing requires a stack to track levels (nesting) and
  attributes.

  Including additional html tags in the parser _may_ be space efficient --
  remains to be seen. Including additional html tags in the parser will
  _definitely_ be faster than a preprocessor.
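
The stack idea can be sketched roughly as follows, again in Python for brevity and not as a description of bogofilter's parser. The class HiddenTextFinder, its default colors, and the rule "text is hidden when its color equals the background" are all assumptions made for this illustration. Colors are compared as plain strings, so "white" and "#ffffff" are not treated as equal.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Track nested colors on a stack; collect text drawn in the background color."""

    def __init__(self, background="#ffffff"):
        super().__init__()
        self.background = background.lower()
        # Each entry is (tag, foreground color in effect inside that tag).
        # The bottom entry is the document default and is never popped.
        self.stack = [("", "#000000")]
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        color = dict(attrs).get("color")
        inherited = self.stack[-1][1]
        self.stack.append((tag, color.lower() if color else inherited))

    def handle_endtag(self, tag):
        # One end tag may pop several levels when the html is badly formed,
        # e.g. </table> in <table><tr><td><font>...</table>.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return
        # No matching open tag: ignore the stray end tag.

    def handle_data(self, data):
        if data.strip() and self.stack[-1][1] == self.background:
            self.hidden.append(data.strip())
```

Feeding it '<table><tr><td><font color="#FFFFFF">secret</table>visible' leaves "secret" in the hidden list and unwinds the stack back to the default level when </table> arrives.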
An alternate experiment could accept "size=-5", "color=red",
"color=#ff0000", etc. as tokens.
It'd be interesting to see which method is more effective.
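
For comparison, the alternate (token-only) experiment needs no stack at all. A minimal sketch, once more in Python and with a hypothetical class name and attribute list, might look like this:

```python
from html.parser import HTMLParser

class AttributeTokenizer(HTMLParser):
    """Emit selected tag attributes as name=value tokens, with no nesting logic."""

    def __init__(self, names=("size", "color")):
        super().__init__()
        self.names = set(names)  # attribute names worth keeping (an assumption)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attributes without a value (value is None) are skipped.
            if name in self.names and value is not None:
                self.tokens.append(f"{name}={value.lower()}")
```

Given '<font size=-5 color="#FF0000">', this yields the tokens size=-5 and color=#ff0000, which the classifier could then score like any other word.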
David