interesting use of eyespace
David Relson
relson at osagesoftware.com
Tue Mar 4 15:26:44 CET 2003
Nick,
Remember the lexer issue of what tokens should come out of
tsts/bogofilter/inputs/spam.mbx? I used galeon to take a look and
discovered something interesting.
The problem HTML involves the <FONT> tags in the following:
<B><FONT COLOR="ff0008"><BR>
No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT
COLOR="#fef0d0">zzzzz</FONT>No
Commitments</FONT></B>
The old lexer gives "... dates zzzzzz no..." while the new lexer gives "...
dateszzzzzzno ...".
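To make the difference concrete, here is a minimal sketch of the two behaviors. It is not bogofilter's actual lexer; the function names and the regular expression are hypothetical, and the only assumption is that one lexer treats a tag as a token separator while the other simply deletes it.

```python
import re

# Hypothetical, simplified tag pattern; real HTML needs a proper parser.
TAG = re.compile(r"<[^>]*>")

def tokens_tag_as_separator(html):
    """Replace each tag with a space, so a tag always splits tokens."""
    return TAG.sub(" ", html).lower().split()

def tokens_tag_removed(html):
    """Delete each tag outright, so text on both sides runs together."""
    return TAG.sub("", html).lower().split()

sample = 'No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges'

print(tokens_tag_as_separator(sample))
print(tokens_tag_removed(sample))
```

The first call yields "dates", "zzzzzz", "no" as separate tokens; the second fuses them into "dateszzzzzzno".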
I looked at the line in a slightly larger context and noticed that the html
is bracketed by "<TD ... bgcolor=fef0d0> ... </TD>". The interesting part
is the bgcolor, i.e. fef0d0, which is the same color as the "zzzzzz"
strings. What's actually happening is that the "zzzzzz" text is drawn in the
background color, so it's invisible and works as spacing between the words.
So when bogofilter gets smart enough, it'll know that the "zzzzzz"s are just
background and will totally ignore them.
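A rough sketch of what "smart enough" could look like, using Python's standard html.parser rather than anything in bogofilter itself. The class name, the CONTAINERS set, and the simplified <TD> wrapper in the sample are all assumptions made for illustration; the original <TD> tag has other attributes that are elided above. It only compares literal hex colors and ignores CSS, named colors, and near-matches.

```python
from html.parser import HTMLParser

# Assumption: only these tags are treated as carrying a background color.
CONTAINERS = {"body", "table", "tr", "td", "th"}

def norm(color):
    """Normalize 'fef0d0' and '#FEF0D0' to the same string."""
    return color.lstrip("#").lower()

class HiddenTextFilter(HTMLParser):
    """Collect text, replacing runs whose font color equals the background."""

    def __init__(self):
        super().__init__()
        self.bg = []   # stack of effective background colors
        self.fg = []   # stack of effective font colors
        self.out = []  # visible text pieces

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "font":
            inherited = self.fg[-1] if self.fg else None
            self.fg.append(norm(a["color"]) if a.get("color") else inherited)
        elif tag in CONTAINERS:
            inherited = self.bg[-1] if self.bg else None
            self.bg.append(norm(a["bgcolor"]) if a.get("bgcolor") else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.fg:
            self.fg.pop()
        elif tag in CONTAINERS and self.bg:
            self.bg.pop()

    def handle_data(self, data):
        fg = self.fg[-1] if self.fg else None
        bg = self.bg[-1] if self.bg else None
        hidden = fg is not None and fg == bg
        # Invisible text acts as spacing, so emit a single space instead.
        self.out.append(" " if hidden else data)

def visible_words(html):
    parser = HiddenTextFilter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out).lower().split()

sample = ('<TD bgcolor=fef0d0><B><FONT COLOR="ff0008"><BR>\n'
          'No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT\n'
          'COLOR="#fef0d0">zzzzz</FONT>No\n'
          'Commitments</FONT></B></TD>')

print(visible_words(sample))
```

With this approach the "zzzzzz" runs never reach the token stream, and the words on either side stay separate.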
For the time being, I've taken the expedient route. I've added a couple of
spaces to spam.mbx so that both lexers parse it in the same way.
David