bogolexer

David Relson relson at osagesoftware.com
Mon Feb 3 13:47:28 CET 2003


Hello Nick,

At 01:12 AM 2/3/03, Nick Simicich wrote:
>At 09:37 AM 2003-02-02 -0500, David Relson wrote:
>
> >However that's not how flex operates and the
> >discarded comment is treated as a delimiter.  Thus "chara<!--junk-->cter" is
> >two tokens.
>
>Does this mean that you can't write a flex/lex html parser?  That seems 
>odd.  I think that this may be more of an artifact of how the 
>implementation was done.  I am not a lex/Flex expert, but I think that you 
>can modify things and then push them back onto the stack with a 
>REJECT.  Thus, a match could modify the input stream and then push it back 
>for tokenizing.  Hmmm...supposedly the longer matches win, and the matches 
>that are earlier beat matches that are later.

It means that, at the present time, _I_ don't know flex/lex well enough to 
write an html parser.

>I think it would be possible to do the eliding of the comments with 
>lex.  But if the right thing to do is to re-order the html section to push 
>all tokens to the beginning or end of the section, then that might be 
>beyond lex.
>
> >That made it necessary for killing html comments to be a
> >preprocessor pass.  Life would have been good if all spammers used "<!--"
> >and "-->" to begin and end their comments.  However some spam uses ">" as
> >the end.  So, the code changes as reality intrudes.
>
>Does this not break things?  I am sort of surprised, as I thought you 
>could comment out html tags?

You are correct: using ">" to end html comments is less than 
optimal.  Initially we checked only for "<!--" and "-->".  However, 
shortly after the first beta of the new code, several sample messages that 
violate proper syntax were found.  One contained "<!--#rotate>" in three 
places and another had a style sheet that starts with "<!--" but has no 
"-->".  Since we want bogofilter to do something reasonable, even with 
broken html, the code was changed to accept ">" as an end of comment.

As I think about it, the code does handle nested levels of html comments, 
though it uses only "<!--" as the start of a new level.  Once in a comment, 
it might be reasonable to use "<" and ">" for counting levels.
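To make that level counting concrete, here is a minimal sketch (my own 
illustration, not bogofilter's actual code) where "<!--" opens a new 
comment level and a bare ">" closes the current one, as described above:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the level counting described above:
 * "<!--" opens a (possibly nested) comment level and ">" closes
 * one level.  An illustration only, not bogofilter's code. */
static void strip_comments(const char *in, char *out)
{
    int level = 0;

    while (*in != '\0') {
        if (strncmp(in, "<!--", 4) == 0) {
            level++;            /* new comment level */
            in += 4;
        } else if (level > 0 && *in == '>') {
            level--;            /* ">" ends the current level */
            in++;
        } else {
            if (level == 0)
                *out++ = *in;   /* keep text outside comments */
            in++;
        }
    }
    *out = '\0';
}
```

With this rule, "chara<!--junk>cter" comes out as the single token 
"character", which is the behaviour the preprocessor pass is after.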

>As in:
>
>startsville
><!-- We are commenting this stuff:
>
><h3>This is gone</h3>
>Gone, man.<P>
>
>-->
>endsville
>
>If you pop the comment at the first >, what does that do?
>
>OK, lynx formats this as "startsville endsville".  That is correct.  You 
>put out "startsville this gone gone man endsville".
>
>I understand that you noted that some spam is using <!-- comment > -- 
>closing comments with a naked >. ...the question is, does any formatter 
>understand when a spammer does it that way?  If a spammer puts out 
>something that no one can format, do you really care?
>
>Seriously, if I am a spammer, and I have formatted "Com<!-- 
>postmaster >mon spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it 
>is formatted out as "Com", does it really matter?  The alternative is that 
>you misformat things where someone has properly commented out things and 
>buried stuff that does not matter.

I presume that spammers test their messages to confirm that common MUAs 
can read them.  In this area, I'm a bit more familiar with browsers.  As 
best I can tell, no two browsers deal identically with broken html.  I 
expect that MUAs show similar differences.

>What, again, was the reason for using > to terminate the comments?

If you want, I can send you the non-conformant messages in question.

>I guess I could see it - if there were no --> at all.  But you have to 
>allow for -->

Are you suggesting that bogofilter read the whole message looking for 
"-->" and, if it isn't found, back up and rescan accepting ">" as the 
comment terminator?  It _could_ be done.
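That two-pass idea could be reduced to a cheap pre-scan: if a proper 
"-->" appears anywhere in the message, assume the comments are 
well-formed; otherwise fall back to the bare ">" terminator.  A 
hypothetical sketch for discussion, not what bogofilter ships:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pre-scan: if a well-formed "-->" appears anywhere in
 * the message, use it as the comment terminator; otherwise fall back
 * to a bare ">".  An assumption sketched for discussion only. */
static const char *comment_terminator(const char *msg)
{
    return (strstr(msg, "-->") != NULL) ? "-->" : ">";
}
```

This avoids a full rescan: one strstr() decides which terminator the 
real comment-stripping pass should use.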


>>At present, bogofilter also discards the contents of html tags.
>
>I got some indication at 3:00 AM (and I am not 100% sure that this is 
>reality, of course, I mention the time to indicate reliability) that the 
>contents of the html tags are not being discarded 100%.  I was testing my 
>tag eliding change and I noted that the tests failed.  I started comparing 
>output and I believe that it was picking up, at least ip addresses from URLs.

Send me a sample and I'll take a look.

>>  That's likely to change, though we developers need feedback as to what 
>> people think should be done with them.  Should we discard the standard 
>> keywords or keep them?  What should we do with URL's?  with color 
>> values? etc, etc.  There are many things that can be done and there's 
>> the whole future in which to do them.
>
>Personally, I think that you should start by simply keeping all strings of 
>letters and numbers (and things that look like domain names - periods 
>followed by alphamerics) that are longer than 2 characters.

Sounds like you want to process tokens inside "<" and ">" rather than 
discard them.  May I suggest my favorite solution - a config file option?
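As a strawman for such an option, the rule Nick describes (keep runs of 
alphanumerics longer than two characters, allowing embedded periods so 
domain names survive) might look like this; the function name and 
details are my own invention, not proposed bogofilter code:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: from the contents of an html tag, keep
 * alphanumeric runs (periods allowed inside, for domain names)
 * longer than two characters; discard everything else. */
static void keep_tokens(const char *tag, char *out)
{
    const char *p = tag;

    out[0] = '\0';
    while (*p != '\0') {
        if (isalnum((unsigned char)*p)) {
            const char *start = p;
            size_t len;

            while (isalnum((unsigned char)*p) || *p == '.')
                p++;
            len = (size_t)(p - start);
            while (len > 0 && start[len - 1] == '.')
                len--;          /* drop trailing periods */
            if (len > 2) {
                if (out[0] != '\0')
                    strcat(out, " ");
                strncat(out, start, len);
            }
        } else {
            p++;
        }
    }
}
```

For "<a href=\"http://www.example.com/x\">" this keeps "href", "http", 
and "www.example.com", which is roughly the token set a Bayesian 
filter would want from a link.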

>>Moving html tags to the beginning or end of the buffer could be done.
>
>I am not sure it can be done in lex/flex.  Maybe it needs to be a separate 
>step, like eliding comments is.

I'm not sure either.  However, I expect that others on the list can tell us.





More information about the Bogofilter mailing list