bogolexer
David Relson
relson at osagesoftware.com
Mon Feb 3 13:47:28 CET 2003
Hello Nick,
At 01:12 AM 2/3/03, Nick Simicich wrote:
>At 09:37 AM 2003-02-02 -0500, David Relson wrote:
>
> >However that's not how flex operates and the
> >discarded comment is treated as a delimiter. Thus "chara<!--junk-->cter" is
> >two tokens.
>
>Does this mean that you can't write a flex/lex html parser? That seems
>odd. I think that this may be more of an artifact of how the
>implementation was done. I am not a lex/Flex expert, but I think that you
>can modify things and then push them back onto the stack with a
>REJECT. Thus, a match could modify the input stream and then push it back
>for tokenizing. Hmmm...supposedly the longer matches win, and the matches
>that are earlier beat matches that are later.
It means that, at the present time, _I_ don't know flex/lex well enough to
write an html parser.
>I think it would be possible to do the eliding of the comments with
>lex. But if the right thing to do is to re-order the html section to push
>all tokens to the beginning or end of the section, then that might be
>beyond lex.
>
> >That made it necessary for killing html comments to be a
> >preprocessor pass. Life would have been good if all spammers used "<!--"
> >and "-->" to begin and end their comments. However some spam uses ">" as
> >the end. So, the code changes as reality intrudes.
>
>Does this not break things? I am sort of surprised, as I thought you
>could comment out html tags?
You are correct: using ">" to end html comments is less than
optimal. Initially we checked only for "<!--" and "-->". However,
shortly after the first beta of the new code, sample messages that
violate proper syntax were found. One contained "<!--#rotate>" in three
places, and another had a style sheet that started with "<!--" but had no
matching "-->". Since we want bogofilter to do something reasonable, even with
broken html, the code was changed to accept ">" as an end of comment.
As I think about it, the code does handle nested levels of html comments,
though it uses only "<!--" as the start of a new level. Once in a comment,
it might be reasonable to use "<" and ">" for counting levels.
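To make the idea concrete, here is a minimal sketch of a comment-stripping pass of the kind described above: "<!--" opens a nesting level, "-->" closes one, and (because of the broken messages mentioned earlier) a bare ">" is also accepted as an end of comment. This is a hypothetical illustration, not bogofilter's actual code; the function name strip_comments is invented for the example.

```c
#include <string.h>

/* Copy src to dst, discarding html comments and tracking nesting depth.
 * "<!--" opens a level; "-->" or a bare '>' closes one.
 * Hypothetical sketch, not bogofilter's real implementation. */
static void strip_comments(const char *src, char *dst)
{
    int depth = 0;
    while (*src) {
        if (strncmp(src, "<!--", 4) == 0) {
            depth++;                 /* new comment level */
            src += 4;
        } else if (depth > 0 && strncmp(src, "-->", 3) == 0) {
            depth--;                 /* proper end of comment */
            src += 3;
        } else if (depth > 0 && *src == '>') {
            depth--;                 /* broken html: bare '>' ends it too */
            src++;
        } else if (depth == 0) {
            *dst++ = *src++;         /* outside any comment: keep */
        } else {
            src++;                   /* inside a comment: discard */
        }
    }
    *dst = '\0';
}
```

With this, "chara<!--junk-->cter" comes back as a single token "character" rather than two.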
>As in:
>
>startsville
><!-- We are commenting this stuff:
>
><h3>This is gone</h3>
>Gone, man.<P>
>
>-->
>endsville
>
>If you pop the comment at the first >, what does that do?
>
>OK, lynx formats this as "startsville endsville". That is correct. You
>put out "startsville this gone gone man endsville".
>
>I understand that you noted that some spam is using <!-- comment > --
>closing comments with a naked >. ...the question is, does any formatter
>understand when a spammer does it that way? If a spammer puts out
>something that no one can format, do you really care?
>
>Seriously, if I am a spammer, and I have formatted "Com<!--
>postmaster >mon spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it
>is formatted out as "Com", does it really matter? The alternative is that
>you misformat things where someone has properly commented out things and
>buried stuff that does not matter.
I presume that spammers test their messages to confirm that common MUAs
can read them. In this area, I'm a bit more familiar with browsers. As
best I can tell, no two browsers deal identically with broken html. I
expect that MUAs show similar differences.
>What, again, was the reason for using > to terminate the comments?
If you want, I can send you the non-conformant messages we received.
>I guess I could see it - if there were no --> at all. But you have to
>allow for -->
Are you suggesting that bogofilter read the whole message looking for
"-->" and, if it isn't found, back up and rescan, allowing ">" to end the
comment? It _could_ be done.
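That two-pass idea could be as simple as a pre-scan that picks the terminator for the whole message: if "-->" appears anywhere, require it; otherwise fall back to the bare ">". A hedged sketch (the function comment_end_token is invented for illustration):

```c
#include <string.h>

/* Pre-scan the message to choose a comment terminator.
 * If a proper "-->" exists anywhere, use it; otherwise fall
 * back to a bare ">".  Hypothetical two-pass sketch. */
static const char *comment_end_token(const char *msg)
{
    return strstr(msg, "-->") != NULL ? "-->" : ">";
}
```

The second pass would then strip comments using whichever terminator the pre-scan selected.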
>>At present, bogofilter also discards the contents of html tags.
>
>I got some indication at 3:00 AM (and I am not 100% sure that this is
>reality, of course, I mention the time to indicate reliability) that the
>contents of the html tags are not being discarded 100%. I was testing my
>tag eliding change and I noted that the tests failed. I started comparing
>output and I believe that it was picking up, at least ip addresses from URLs.
Send me a sample and I'll take a look.
>> That's likely to change, though we developers need feedback as to what
>> people think should be done with them. Should we discard the standard
>> keywords or keep them? What should we do with URL's? with color
>> values? etc, etc. There are many things that can be done and there's
>> the whole future in which to do them.
>
>Personally, I think that you should start by simply keeping all strings of
>letters and numbers (and things that look like domain names - periods
>followed by alphamerics) that are longer than 2 characters.
Sounds like you want to process tokens inside "<" and ">" rather than
discard them. May I suggest my favorite solution - a config file option?
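If we did process tag contents, your rule (keep runs of letters and numbers longer than 2 characters, with periods allowed so domain names survive) might look something like this. Purely a sketch of the suggestion; the function tag_tokens is invented here, and bogofilter's real lexer is flex-based:

```c
#include <ctype.h>
#include <string.h>

/* Extract alphanumeric tokens (periods allowed, so domain names and
 * ip addresses survive) of length > 2 from a tag body.
 * Hypothetical sketch of the suggestion above. */
static int tag_tokens(const char *tag, char tokens[][32], int max)
{
    int n = 0, len = 0;
    char cur[32];
    for (;; tag++) {
        if (*tag && (isalnum((unsigned char)*tag) || *tag == '.')) {
            if (len < 31)
                cur[len++] = *tag;   /* extend the current run */
        } else {
            if (len > 2 && n < max) {
                cur[len] = '\0';     /* keep runs longer than 2 chars */
                strcpy(tokens[n++], cur);
            }
            len = 0;
            if (!*tag)
                break;
        }
    }
    return n;
}
```

Run on a tag body like "a href=http://10.0.0.1/x img", it keeps "href", "http", "10.0.0.1", and "img" while dropping the one-character noise.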
>>Moving html tags to the beginning or end of the buffer could be done.
>
>I am not sure it can be done in lex/flex. Maybe it needs to be a separate
>step, like eliding comments is.
I'm not sure either. However, I expect that others on the list can tell us.