interesting use of eyespace

David Relson relson at osagesoftware.com
Tue Mar 4 15:26:44 CET 2003


Nick,

Remember the lexer issue of what tokens should come out of 
tsts/bogofilter/inputs/spam.mbx?  I used galeon to take a look and 
discovered something interesting.

The problematic HTML involves the <FONT> tags in the following:

<B><FONT COLOR="ff0008"><BR>
No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT 
COLOR="#fef0d0">zzzzz</FONT>No
Commitments</FONT></B>

The old lexer gives "... dates zzzzzz no..." while the new lexer gives "... 
dateszzzzzzno ...".
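
As a rough sketch of why the two results differ (Python used purely for illustration; this is not bogofilter's lexer, and the regexes and the `tokens` helper are assumptions): replacing each tag with a space reproduces the old tokens, while deleting tags outright reproduces the new ones.

```python
import re

# Naive patterns, assumed for illustration only.
TAG = re.compile(r"<[^>]*>")      # any <...> tag, including ones split across lines
WORD = re.compile(r"[a-z0-9]+")   # a token is a run of letters/digits

SAMPLE = ('<B><FONT COLOR="ff0008"><BR>\n'
          'No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT \n'
          'COLOR="#fef0d0">zzzzz</FONT>No\n'
          'Commitments</FONT></B>')

def tokens(html, tag_replacement):
    """Strip tags, substituting tag_replacement for each, then tokenize."""
    text = TAG.sub(tag_replacement, html)
    return WORD.findall(text.lower())

old_style = tokens(SAMPLE, " ")   # tags act as token separators
new_style = tokens(SAMPLE, "")    # tags vanish, so neighbouring text fuses

print(old_style)
print(new_style)
```

With tags as separators the list contains "dates", "zzzzzz", "no" as separate tokens; with tags deleted it contains "dateszzzzzzno" instead.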

I looked at the line in a slightly larger context and noticed that the html 
is bracketed by "<TD ... bgcolor=fef0d0> ... </TD>".  The interesting part 
is the bgcolor, i.e. fef0d0, which is the same color as the "zzzzzz" 
strings.  What's actually happening is that "zzzzzz" is being used as 
spaces between the words, i.e. as " ".  So when bogofilter gets smart 
enough, it'll know that the "zzzzzz"s are just background and will totally 
ignore them.
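
A minimal sketch of that idea, assuming the background color is already known from the enclosing <TD> (Python's standard html.parser is used for illustration; the `VisibleText` class and `visible_words` helper are hypothetical names, not anything in bogofilter): text inside a <FONT> whose color equals the background is replaced by a single space.

```python
from html.parser import HTMLParser

SAMPLE = ('<B><FONT COLOR="ff0008"><BR>\n'
          'No Due Dates<FONT Color="fef0d0">zzzzzz</FONT>No Hidden Charges<FONT \n'
          'COLOR="#fef0d0">zzzzz</FONT>No\n'
          'Commitments</FONT></B>')

def norm(color):
    """Normalize 'FEF0D0' and '#fef0d0' to the same form."""
    return color.lstrip("#").lower()

class VisibleText(HTMLParser):
    """Hypothetical helper: collect text, blanking out background-colored runs."""

    def __init__(self, bgcolor):
        super().__init__()
        self.bg = norm(bgcolor)
        self.colors = []   # stack of active <font> colors (None if unspecified)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            color = dict(attrs).get("color")
            self.colors.append(norm(color) if color else None)

    def handle_endtag(self, tag):
        if tag == "font" and self.colors:
            self.colors.pop()

    def handle_data(self, data):
        # The innermost <font> that specifies a color wins.
        current = next((c for c in reversed(self.colors) if c), None)
        self.parts.append(" " if current == self.bg else data)

def visible_words(html, bgcolor):
    parser = VisibleText(bgcolor)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()

print(visible_words(SAMPLE, "fef0d0"))
```

With bgcolor fef0d0 the "zzzzzz" runs drop out and only the eight visible words remain; with any other background they stay in the text. A real implementation would also have to handle CSS, named colors, and near-matching colors, none of which this sketch attempts.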

For the time being, I've taken the expedient route. I've added a couple of 
spaces to spam.mbx so that both lexers parse it in the same way.

David




