bogolexer

Nick Simicich njs at scifi.squawk.com
Mon Feb 3 07:12:51 CET 2003


At 09:37 AM 2003-02-02 -0500, David Relson wrote:

 >Naturally all these tasks seemed easier to implement than they really were.
 >Ideally a lexer rule to discard an html comment would elide the text before
 >and after the comment.

"Elide" means to omit.  I think you mean "combine" rather than "elide".
Perhaps you do mean to elide, but I think that would be a bad idea -- the
point is to stop spammers from splitting words so that the halves get
ignored.

 >However that's not how flex operates and the
 >discarded comment is treated as a delimiter.  Thus "chara<!--junk-->cter" is
 >two tokens.

Does this mean that you can't write a flex/lex html parser?  That seems 
odd.  I think that this may be more of an artifact of how the 
implementation was done.  I am not a lex/flex expert, but I think that an 
action can modify the matched text and push it back onto the input with 
unput() or yyless().  (REJECT, by contrast, just falls through to the 
next-best matching rule.)  Thus, a match could modify the input stream and 
then push it back for tokenizing.  Hmmm...supposedly the longest match 
wins, and when two matches are the same length, the rule listed earlier 
wins.

I think it would be possible to do the eliding of the comments with 
lex.  But if the right thing to do is to re-order the html section to push 
all tokens to the beginning or end of the section, then that might be 
beyond lex.
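
To make the "combine rather than discard" point concrete, here is a minimal
sketch in Python rather than flex (an assumption made purely for
illustration; it is not bogofilter's code).  The names COMMENT, WORD,
strip_comments and tokens are hypothetical.  It contrasts treating a comment
as a delimiter with removing it before tokenizing.

```python
import re

# Hypothetical patterns and helpers, for illustration only.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # a well-formed html comment
WORD = re.compile(r"[A-Za-z0-9]+")              # a run of letters and digits

def strip_comments(text):
    """Delete each comment so the text on both sides is joined."""
    return COMMENT.sub("", text)

def tokens(text):
    """Return the words found in text."""
    return WORD.findall(text)

sample = "chara<!--junk-->cter"

# Comment treated as a delimiter: the word is split in two.
as_delimiter = tokens(COMMENT.sub(" ", sample))

# Comment removed before lexing: the halves are rejoined.
joined = tokens(strip_comments(sample))
```

With this input, as_delimiter holds the two halves "chara" and "cter", while
joined holds the single token "character".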

 >That made it necessary for killing html comments to be a
 >preprocessor pass.  Life would have been good if all spammers used "<!--"
 >and "-->" to begin and end their comments.  However some spam uses ">" as
 >the end.  So, the code changes as reality intrudes.

Doesn't this break things?  I am sort of surprised, as I thought you could 
comment out html tags.

As in:

startsville
<!-- We are commenting this stuff:

<h3>This is gone</h3>
Gone, man.<P>

-->
endsville

If you end the comment at the first >, what does that do?

OK, lynx formats this as "startsville endsville".  That is correct.  You, 
ending the comment at the first >, put out "startsville this gone gone man 
endsville".
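
The difference can be reproduced with a small Python sketch.  This is an
illustration under stated assumptions, not bogofilter's preprocessor: the
patterns PROPER, NAKED, TAG and WORD and the helper words are hypothetical,
and the rule of keeping only words of three or more letters is assumed.

```python
import re

# Hypothetical patterns, for illustration only.
PROPER = re.compile(r"<!--.*?-->", re.DOTALL)  # comment ends at "-->"
NAKED = re.compile(r"<!--.*?>", re.DOTALL)     # comment ends at the first ">"
TAG = re.compile(r"<[^>]*>")                   # any remaining tag
WORD = re.compile(r"[A-Za-z]{3,}")             # assumed: 3 or more letters

def words(html, comment_pattern):
    """Drop comments, then tags, and return the lowercased words left."""
    text = comment_pattern.sub(" ", html)
    text = TAG.sub(" ", text)
    return [w.lower() for w in WORD.findall(text)]

html = """startsville
<!-- We are commenting this stuff:

<h3>This is gone</h3>
Gone, man.<P>

-->
endsville"""

proper = words(html, PROPER)  # comment closed by "-->"
naked = words(html, NAKED)    # comment closed by the ">" of <h3>
```

Here proper contains only "startsville" and "endsville", while naked also
picks up "this", "gone", "gone" and "man" from the commented-out block.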

I understand that you noted that some spam uses "<!-- comment >", closing 
comments with a naked >.  The question is, does any formatter understand it 
when a spammer does it that way?  If a spammer puts out something that no 
one can format, do you really care?

Seriously, if I am a spammer, and I have written "Com<!-- postmaster >mon 
spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it is rendered as 
just "Com", does it really matter?  The alternative is that you misformat 
messages where someone has properly commented things out and buried stuff 
that does not matter.

What, again, was the reason for using > to terminate the comments?

I guess I could see it if there were no --> at all.  But you have to 
allow for a proper --> as well.
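
One way to read that suggestion, sketched in Python: end a comment at "-->"
when one follows, and fall back to a naked ">" only when no "-->" follows at
all.  The function name strip_comments and the handling of an unterminated
comment (keep the rest as text) are assumptions for illustration, not
bogofilter's behaviour.

```python
def strip_comments(html):
    """Remove html comments, preferring "-->" over a naked ">" as the end.

    Hypothetical helper, for illustration only.
    """
    out = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])       # no more comments
            break
        out.append(html[pos:start])      # text before the comment
        end = html.find("-->", start + 4)
        if end != -1:
            pos = end + 3                # proper terminator found
            continue
        end = html.find(">", start + 4)
        if end == -1:
            out.append(html[start:])     # assumed: keep unterminated text
            break
        pos = end + 1                    # fall back to the naked ">"
    return "".join(out)

spam = "Com<!-- postmaster >mon spa<!-- postmaster >mmer phr<!-- postmaster >ase"
cleaned = strip_comments(spam)
```

A known limitation of this sketch: if a naked-> comment is followed later
in the same message by a properly closed comment, everything between the
two is swallowed, because the first "<!--" pairs with the later "-->".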

>At present, bogofilter also discards the contents of html tags.

I got some indication at 3:00 AM (I mention the time to indicate how 
reliable this is; I am not 100% sure that it is reality) that the contents 
of the html tags are not being entirely discarded.  I was testing my tag 
eliding change and I noted that the tests failed.  I started comparing 
output and I believe that it was picking up at least IP addresses from URLs.

>  That's likely to change, though we developers need feedback as to what 
> people think should be done with them.  Should we discard the standard 
> keywords or keep them?  What should we do with URL's?  with color values? 
> etc, etc.  There are many things that can be done and there's the whole 
> future in which to do them.

Personally, I think that you should start by simply keeping all strings of 
letters and numbers (and things that look like domain names - periods 
followed by alphamerics) that are longer than 2 characters.

>Moving html tags to the beginning or end of the buffer could be done.

I am not sure it can be done in lex/flex.  Maybe it needs to be a separate 
step, like eliding comments is.
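
As a separate step outside the lexer, the reordering could look something
like the following Python sketch.  The helper name move_tags_to_end and the
simple tag pattern are assumptions for illustration; real html would need
more care (quoted ">" inside attributes, for instance).

```python
import re

TAG = re.compile(r"<[^>]*>")  # simplistic tag pattern, assumed for the sketch

def move_tags_to_end(html):
    """Return the text with tags removed, followed by the tags themselves."""
    tags = TAG.findall(html)
    text = TAG.sub("", html)  # removing tags rejoins words they had split
    if not tags:
        return text
    return text + " " + " ".join(tags)

reordered = move_tags_to_end("cha<b>rac</b>ter")
```

Here reordered is "character <b> </b>": the word is whole again for the
lexer, and the tags are still available at the end of the buffer.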

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


