bogolexer

David Relson relson at osagesoftware.com
Mon Feb 3 13:47:28 CET 2003


Hello Nick,

At 01:12 AM 2/3/03, Nick Simicich wrote:
>At 09:37 AM 2003-02-02 -0500, David Relson wrote:
>
> >However that's not how flex operates and the
> >discarded comment is treated as a delimiter.  Thus "chara<!--junk-->cter" is
> >two tokens.
>
>Does this mean that you can't write a flex/lex html parser?  That seems 
>odd.  I think that this may be more of an artifact of how the 
>implementation was done.  I am not a lex/Flex expert, but I think that you 
>can modify things and then push them back onto the stack with a 
>REJECT.  Thus, a match could modify the input stream and then push it back 
>for tokenizing.  Hmmm...supposedly the longer matches win, and the matches 
>that are earlier beat matches that are later.

It means that, at the present time, _I_ don't know flex/lex well enough to 
write an html parser.

>I think it would be possible to do the eliding of the comments with 
>lex.  But if the right thing to do is to re-order the html section to push 
>all tokens to the beginning or end of the section, then that might be 
>beyond lex.
>
> >That made it necessary for killing html comments to be a
> >preprocessor pass.  Life would have been good if all spammers used "<!--"
> >and "-->" to begin and end their comments.  However some spam uses ">" as
> >the end.  So, the code changes as reality intrudes.
>
>Does this not break things?  I am sort of surprised, as I thought you 
>could comment out html tags?

You are correct: using ">" to end html comments is less than 
optimal.  Initially we checked only for "<!--" and "-->".  However, 
shortly after the first beta of the new code, several sample messages that 
violate proper syntax were found.  One contained "<!--#rotate>" in three 
places and another had a style sheet that starts with "<!--" but has no 
"-->".  Since we want bogofilter to do something reasonable, even with 
broken html, the code was changed to accept ">" as an end of comment.

As I think about it, the code does handle nested levels of html comments, 
though it uses only "<!--" as the start of a new level.  Once in a comment, 
it might be reasonable to use "<" and ">" for counting levels.
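To make that level counting concrete, here is a minimal sketch (my own 
illustration, not bogofilter's actual code) where "<!--" opens a new 
comment level and a bare ">" closes the current one, as described above:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the level counting described above:
 * "<!--" opens a (possibly nested) comment level and ">" closes
 * one level.  An illustration only, not bogofilter's code. */
static void strip_comments(const char *in, char *out)
{
    int level = 0;

    while (*in != '\0') {
        if (strncmp(in, "<!--", 4) == 0) {
            level++;            /* new comment level */
            in += 4;
        } else if (level > 0 && *in == '>') {
            level--;            /* ">" ends the current level */
            in++;
        } else {
            if (level == 0)
                *out++ = *in;   /* keep text outside comments */
            in++;
        }
    }
    *out = '\0';
}
```

With this rule, "chara<!--junk>cter" comes out as the single token 
"character", which is the behaviour the preprocessor pass is after.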

>As in:
>
>startsville
><!-- We are commenting this stuff:
>
><h3>This is gone</h3>
>Gone, man.<P>
>
>-->
>endsville
>
>If you pop the comment at the first >, what does that do?
>
>OK, lynx formats this as "startsville endsville".  That is correct.  You 
>put out "startsville this gone gone man endsville".
>
>I understand that you noted that some spam is using <!-- comment > -- 
>closing comments with a naked >. ...the question is, does any formatter 
>understand when a spammer does it that way?  If a spammer puts out 
>something that no one can format, do you really care?
>
>Seriously, if I am a spammer, and I have formatted "Com<!-- 
>postmaster >mon spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it 
>is formatted out as "Com", does it really matter?  The alternative is that 
>you misformat things where someone has properly commented out things and 
>buried stuff that does not matter.

I presume that spammers test their messages to confirm that common MUAs 
can read them.  In this area, I'm a bit more familiar with browsers.  As 
best I can tell, no two browsers deal identically with broken html.  I 
expect that MUAs show similar differences.

>What, again, was the reason for using > to terminate the comments?

If you want, I can send you the non-conformant messages in question.

>I guess I could see it - if there were no --> at all.  But you have to 
>allow for -->

Are you suggesting that bogofilter read the whole message looking for 
"-->" and, if it isn't found, back up and rescan accepting ">" as the 
comment terminator?  It _could_ be done.
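That two-pass idea could be reduced to a cheap pre-scan: if a proper 
"-->" appears anywhere in the message, assume the comments are 
well-formed; otherwise fall back to the bare ">" terminator.  A 
hypothetical sketch for discussion, not what bogofilter ships:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pre-scan: if a well-formed "-->" appears anywhere in
 * the message, use it as the comment terminator; otherwise fall back
 * to a bare ">".  An assumption sketched for discussion only. */
static const char *comment_terminator(const char *msg)
{
    return (strstr(msg, "-->") != NULL) ? "-->" : ">";
}
```

This avoids a full rescan: one strstr() decides which terminator the 
real comment-stripping pass should use.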


>>At present, bogofilter also discards the contents of html tags.
>
>I got some indication at 3:00 AM (and I am not 100% sure that this is 
>reality, of course, I mention the time to indicate reliability) that the 
>contents of the html tags are not being discarded 100%.  I was testing my 
>tag eliding change and I noted that the tests failed.  I started comparing 
>output and I believe that it was picking up, at least ip addresses from URLs.

Send me a sample and I'll take a look.

>>  That's likely to change, though we developers need feedback as to what 
>> people think should be done with them.  Should we discard the standard 
>> keywords or keep them?  What should we do with URL's?  with color 
>> values? etc, etc.  There are many things that can be done and there's 
>> the whole future in which to do them.
>
>Personally, I think that you should start by simply keeping all strings of 
>letters and numbers (and things that look like domain names - periods 
>followed by alphamerics) that are longer than 2 characters.

Sounds like you want to process tokens inside "<" and ">" rather than 
discard them.  May I suggest my favorite solution - a config file option?
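As a strawman for such an option, the rule Nick describes (keep runs of 
alphanumerics longer than two characters, allowing embedded periods so 
domain names survive) might look like this; the function name and 
details are my own invention, not proposed bogofilter code:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: from the contents of an html tag, keep
 * alphanumeric runs (periods allowed inside, for domain names)
 * longer than two characters; discard everything else. */
static void keep_tokens(const char *tag, char *out)
{
    const char *p = tag;

    out[0] = '\0';
    while (*p != '\0') {
        if (isalnum((unsigned char)*p)) {
            const char *start = p;
            size_t len;

            while (isalnum((unsigned char)*p) || *p == '.')
                p++;
            len = (size_t)(p - start);
            while (len > 0 && start[len - 1] == '.')
                len--;          /* drop trailing periods */
            if (len > 2) {
                if (out[0] != '\0')
                    strcat(out, " ");
                strncat(out, start, len);
            }
        } else {
            p++;
        }
    }
}
```

For "<a href=\"http://www.example.com/x\">" this keeps "href", "http", 
and "www.example.com", which is roughly the token set a Bayesian 
filter would want from a link.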

>>Moving html tags to the beginning or end of the buffer could be done.
>
>I am not sure it can be done in lex/flex.  Maybe it needs to be a separate 
>step, like eliding comments is.

I'm not sure either.  However, I expect that others on the list can tell us.





More information about the Bogofilter mailing list