bogolexer

Nick Simicich njs at scifi.squawk.com
Mon Feb 3 19:17:47 CET 2003


At 07:47 AM 2003-02-03 -0500, David Relson wrote:

>Hello Nick,
>
>At 01:12 AM 2/3/03, Nick Simicich wrote:
>>At 09:37 AM 2003-02-02 -0500, David Relson wrote:
>>
>> >However that's not how flex operates and the
>> >discarded comment is treated as a delimiter.  Thus
>> >"chara<!--junk-->cter" is two tokens.
>>
>>Does this mean that you can't write a flex/lex html parser?  That seems 
>>odd.  I think that this may be more of an artifact of how the 
>>implementation was done.  I am not a lex/Flex expert, but I think that 
>>you can modify things and then push them back onto the stack with a 
>>REJECT.  Thus, a match could modify the input stream and then push it 
>>back for tokenizing.  Hmmm...supposedly the longer matches win, and the 
>>matches that are earlier beat matches that are later.
>
>It means that, at the present time, _I_ don't know flex/lex well enough to 
>write an html parser.

Hey, guess what, that is something we have in common. :-)
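
For what it's worth, here is roughly the mechanism I was imagining - a
rough, untested flex fragment of my own, not anything from bogofilter's
actual lexer.  (Reading the flex docs again, unput(), rather than
REJECT, looks like the piece that pushes text back onto the input.)  A
rule can swallow a word plus the comment that interrupts it, then push
the word back so that it fuses with whatever follows the comment.  The
comment-body pattern here is simplified and does not handle "--" inside
the comment:

[A-Za-z]+"<!--"([^-]|-[^-])*"-->"   {
        char word[256];
        int  n = (int)strcspn(yytext, "<");   /* length of the word part */

        if (n > (int)sizeof(word) - 1)
            n = (int)sizeof(word) - 1;
        memcpy(word, yytext, n);
        /* unput() trashes yytext, so push back from the copy, last
         * character first; "chara<!--junk-->cter" is then rescanned
         * as "character" */
        while (n > 0)
            unput(word[--n]);
    }

(string.h would need to be included in the definitions section for
strcspn() and memcpy().)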

>>I think it would be possible to do the eliding of the comments with 
>>lex.  But if the right thing to do is to re-order the html section to 
>>push all tokens to the beginning or end of the section, then that might 
>>be beyond lex.
>>
>> >That made it necessary for killing html comments to be a
>> >preprocessor pass.  Life would have been good if all spammers used "<!--"
>> >and "-->" to begin and end their comments.  However some spam uses ">" as
>> >the end.  So, the code changes as reality intrudes.
>>
>>Does this not break things?  I am sort of surprised, as I thought you 
>>could comment out html tags?
>
>You are correct, using ">" to end html comments is less than
>optimal.  Initially we were checking for "<!--" and "-->".  However,
>shortly after the first beta of the new code, several sample messages that 
>violate proper syntax were found.  One contained "<!--#rotate>" in three 
>places and the other has a style sheet that starts with "<!--" but has no 
>"-->".  Since we want bogofilter to do something reasonable, even with 
>broken html, the code was changed to accept ">" as an end of comment.
>
>As I think about it, the code does handle nested levels of html comments, 
>though it uses only "<!--" as the start of a new level.  Once in a 
>comment, it might be reasonable to use "<" and ">" for counting levels.

I found these pages:

http://www.htmlhelp.com/reference/wilbur/misc/comment.html
http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5

As it turns out, comments are more complex than I thought.  Apparently:

<! and > delimit the comment declaration
-- and -- delimit the comment itself.

But...any character is legal in a comment, including: >

So <!-- this is a comment --
         --> this is a 'greater' comment --
         -- this is the last comment -->

No whitespace is allowed between <! and the first --.  Whitespace is 
allowed at other points.

They also comment that no one gets it quite right - that some (many)
browsers will end on the string --> when what they should do is count
the -- sequences and close only when an even number of them has been
seen (that is, when the dashes come in multiples of four).

I created a page called http://scifi.squawk.com/comments.html, which sets 
up a number of comment test cases, and I tried it against all of the 
browsers I could find.  Essentially, Mozilla, IE, and Konqueror all close a 
comment at --> and only -->, unconditionally, irrespective of the number of 
-- pairs.

Links and lynx will close at the most correct comment closing they can 
find, it seems.  In other words, they will skip > signs in some 
circumstances to close at --> or even -- > if that is more right.  But if 
there is no --> available, they seem to close at >.  The case that brings 
this out is test case seven.  However, I try to nest comments in my test 
case eight - this should work under sgml syntax (it is clear that the
complex comment syntax allows at least some nesting of comments):
<!-- something
stuff <!-- another comment -->
more stuff -->

This actually works if you parse the comments according to strict sgml
rules: the first --> does not close the comment, because an even number
of -- pairs has not yet been seen....
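
To make the counting rule concrete, here is a rough, untested sketch of
my own (this is not bogofilter code) of the strict scan: inside the
declaration, each -- flips you into or out of a comment, and a > only
closes the declaration when you are outside a comment, i.e. after an
even number of -- sequences.

#include <stdio.h>
#include <string.h>

/* Return a pointer just past the end of a strict sgml comment
 * declaration starting at p, or NULL if it never ends. */
const char *sgml_comment_end(const char *p)
{
    int in_comment = 0;

    if (strncmp(p, "<!--", 4) != 0)
        return NULL;
    p += 2;                        /* skip "<!"; the loop sees the first "--" */
    while (*p) {
        if (p[0] == '-' && p[1] == '-') {
            in_comment = !in_comment;   /* each "--" flips comment state */
            p += 2;
        } else if (*p == '>' && !in_comment) {
            return p + 1;          /* > outside a comment: declaration ends */
        } else {
            p++;
        }
    }
    return NULL;                   /* unterminated declaration */
}

int main(void)
{
    /* same shape as the three-part example above */
    const char *s = "<!-- one -- --> two -- -- three -->after";
    const char *end = sgml_comment_end(s);

    printf("%s\n", end ? end : "unterminated");   /* prints "after" */
    return 0;
}

Fed my test case eight, this skips the first --> for exactly the reason
above: an odd number of -- sequences precede its >, so that > still
falls inside a comment.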

However, no other browser I could find would do that.  Most of the 
formatters simply refuse to close a comment at anything other than a --> 
and then close it willy-nilly.  Lynx has more complex rules that generally 
work better than the willy-nilly close at -->, except for when they do not. :-)

My point is this: something was brought out at the spam conference, in
the part that I pointed to with the realvideo link I posted here:
eyespace.  The speaker who claimed the enormously good numbers by
reducing phrases to hashes and then doing bayesian analysis on the
hashes, whose name escapes me, had a point, and that was that he was
trying to do all analysis based on what the end user *saw*, at least in
terms of the html text.  He was mostly talking about things like, I
think, ignoring comments, meaningless formatting tags, and text that
matched the background.  That is the same thing we are looking at, or
at least I am looking at.  Specifically, the issue of properly removing
comments is what we are looking at now; then I want to move on to doing
something, hopefully the right thing, with formatting tags.

My point is that I do not see any formatters supporting nested html
comments (even though the SGML standard makes some provision for them),
so I do not think that you *should*.

I seriously wonder what the formatters did with those messages.  Do you 
still happen to have those spams around?  What happened to them when you 
dropped them into, say, mozilla, or IE, or even lynx?

If I know that bogofilter works like this, I can then format a spam like:

big pe<!-- > tt -->n<!-- > ux -->is

The renderers will render that as "big penis", but bogofilter will
break it into "big", probably neutral, followed by four two-letter
chunks that will be tossed by the lexer for being too small - because
the renderer ends comments on --> unconditionally, while bogofilter is
ending the comments on the enclosed >.
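
Here is a quick throwaway demonstration of the mismatch (my code, not
bogofilter's; strip_to() is a made-up helper for illustration):

#include <stdio.h>
#include <string.h>

/* Strip "<!--"-comments from src into dst, ending each comment at the
 * first occurrence of end: "-->" for the big renderers, ">" for what
 * bogofilter is doing now. */
static void strip_to(const char *src, const char *end, char *dst)
{
    while (*src) {
        if (strncmp(src, "<!--", 4) == 0) {
            const char *close = strstr(src + 4, end);
            src = close ? close + strlen(end) : src + strlen(src);
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

int main(void)
{
    const char *spam = "big pe<!-- > tt -->n<!-- > ux -->is";
    char out[128];

    strip_to(spam, "-->", out);
    printf("renderer sees:   %s\n", out);   /* big penis */
    strip_to(spam, ">", out);
    printf("bogofilter sees: %s\n", out);   /* big pe tt -->n ux -->is */
    return 0;
}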


>>As in:
>>
>>startsville
>><!-- We are commenting this stuff:
>>
>><h3>This is gone</h3>
>>Gone, man.<P>
>>
>>-->
>>endsville
>>
>>If you pop the comment at the first >, what does that do?
>>
>>OK, lynx formats this as "startsville endsville".  That is correct.  You 
>>put out "startsville this gone gone man endsville".
>>
>>I understand that you noted that some spam is using <!-- comment > -- 
>>closing comments with a naked >. ...the question is, does any formatter 
>>understand when a spammer does it that way?  If a spammer puts out 
>>something that no one can format, do you really care?
>>
>>Seriously, if I am a spammer, and I have formatted "Com<!-- 
>>postmaster >mon spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it 
>>is formatted out as "Com", does it really matter?  The alternative is 
>>that you misformat things where someone has properly commented out things 
>>and buried stuff that does not matter.
>
>I presume that spammers test their messages to confirm that common MUAs
>can read them.  In this area, I'm a bit more familiar with browsers.  As
>best I can tell, no two browsers deal identically with broken html.  I
>expect that MUAs show similar differences.

Some MUAs use their own internal renderers, but others use the system
renderer.  I suspect the common real-world situation is that AOL uses
Netscape's renderer (don't they?) and the rest of the world uses the
Microsoft renderer.  I purposely use Eudora because it has a weak, broken 
renderer that works fine for all legit, non-marketing e-mail while not 
rendering javascript, cookies, frames, or any of the rest of that crap that 
no one legit uses in their messages.

>>What, again, was the reason for using > to terminate the comments?
>
>If you want, I can send you the non-conformant messages that were sent to you.

I would really appreciate that.  I am beginning to believe, fairly
strongly, that a couple of broken messages are not a good reason to have
the html interpreted in any way other than the way the major renderers do
it (unconditionally end the comment on -->, ignore everything else).

>>I guess I could see it - if there were no --> at all.  But you have to 
>>allow for -->
>
>Are you suggesting that bogofilter read the whole message looking for the 
>"-->" and, if not found, back up and rescan allowing ">" for the end 
>comment?  It _could_ be done.

No, I am suggesting that bogofilter interpret comments the same way the 
renderers that people use to read their mail will.  If that is what they 
do, then that is what bogofilter should do. It might be that these e-mails 
are just broken.



>>>At present, bogofilter also discards the contents of html tags.
>>
>>I got some indication at 3:00 AM (and I am not 100% sure that this is 
>>reality, of course, I mention the time to indicate reliability) that the 
>>contents of the html tags are not being discarded 100%.  I was testing my 
>>tag eliding change and I noted that the tests failed.  I started 
>>comparing output and I believe that it was picking up at least IP
>>addresses from URLs.
>
>Send me a sample and I'll take a look.

First I have to figure out if this is real, and second I have to figure out 
if it happens to an unmodified bogofilter.  Remember (I tried to make this 
clear) that I was playing.

>>>  That's likely to change, though we developers need feedback as to what 
>>> people think should be done with them.  Should we discard the standard 
>>> keywords or keep them?  What should we do with URL's?  with color 
>>> values? etc, etc.  There are many things that can be done and there's 
>>> the whole future in which to do them.
>>
>>Personally, I think that you should start by simply keeping all strings 
>>of letters and numbers (and things that look like domain names - periods 
>>followed by alphamerics) that are longer than 2 characters.
>
>Sounds like you want to process tokens inside "<" and ">" rather than 
>discard them.  May I suggest my favorite solution - a config file option?

I think that it is reasonable to have config file options that control
whether comments and tags should be processed, and whether they should
be reordered.  That gives two options with a total of four states.  I am
probably willing to write the code here.  I just do not want to shoot at
a moving target.
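
Something like the following, with entirely hypothetical option names
(these are not existing bogofilter settings, just a way to make the
four states concrete):

# process markup contents instead of discarding them?
process_html_markup = yes
# move tag tokens to the end of the buffer before tokenizing?
reorder_html_markup = no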

>>>Moving html tags to the beginning or end of the buffer could be done.
>>
>>I am not sure it can be done in lex/flex.  Maybe it needs to be a 
>>separate step, like eliding comments is.
>
>I'm not sure either.  However I expect that others on the list can tell us.

Good question.  Is this something that can be done in lex/flex and is it a 
good use for that tool?
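
If flex turns out to be the wrong tool, a separate pass, like the
comment-eliding one, seems workable.  Here is a rough, untested sketch
of my own (not bogofilter code) of a pass that moves all of the tag
text to the end of the buffer in place:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rewrite buf so the plain text comes first and all "<...>" tag text
 * follows it, each part keeping its original order. */
static void push_tags_to_end(char *buf)
{
    size_t len  = strlen(buf);
    char  *text = malloc(len + 1);
    char  *tags = malloc(len + 1);
    size_t t = 0, g = 0, i;
    int    in_tag = 0;

    if (text == NULL || tags == NULL) {
        free(text);
        free(tags);
        return;                      /* sketch: give up quietly */
    }
    for (i = 0; i < len; i++) {
        if (buf[i] == '<')
            in_tag = 1;
        if (in_tag)
            tags[g++] = buf[i];
        else
            text[t++] = buf[i];
        if (buf[i] == '>')
            in_tag = 0;
    }
    memcpy(buf, text, t);            /* plain text first ... */
    memcpy(buf + t, tags, g);        /* ... then all the tags */
    buf[t + g] = '\0';
    free(text);
    free(tags);
}

int main(void)
{
    char buf[] = "startsville <h3>gone</h3> endsville";

    push_tags_to_end(buf);
    puts(buf);   /* startsville gone endsville<h3></h3> */
    return 0;
}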

General help?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


