bogolexer
Nick Simicich
njs at scifi.squawk.com
Mon Feb 3 07:12:51 CET 2003
At 09:37 AM 2003-02-02 -0500, David Relson wrote:
>Naturally all these tasks seemed easier to implement than they really were.
>Ideally a lexer rule to discard an html comment would elide the text before
>and after the comment.
Elide means to omit. I think you mean combine rather than elide. Perhaps
you do mean to elide, but I think that would be a bad idea -- the point is
to stop the spammers from splitting words and getting the halves
ignored.
>However that's not how flex operates and the
>discarded comment is treated as a delimiter. Thus "chara<!--junk-->cter" is
>two tokens.
Does this mean that you can't write a flex/lex html parser? That seems
odd. I think that this may be more of an artifact of how the
implementation was done. I am not a lex/flex expert, but I think that you
can modify the matched text and then push it back onto the input with
unput() or yyless(). Thus, a match could modify the input stream and then
push it back for tokenizing. (REJECT is something different: it falls
through to the next-best matching rule.) Hmmm...supposedly the longer
matches win, and among matches of equal length the rule listed earlier
beats the rule listed later.
I think it would be possible to do the eliding of the comments with
lex. But if the right thing to do is to re-order the html section to push
all tokens to the beginning or end of the section, then that might be
beyond lex.
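To make the combine-versus-delimit distinction concrete, here is a minimal
sketch of the preprocessor idea in Python rather than lex. It is not
bogofilter's code. The names strip_comments and tokens are made up for
illustration, and it assumes comments are well formed (opened by "<!--" and
closed by "-->").

```python
import re

# Assumption: comments open with "<!--" and close with "-->".
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(text, joiner=""):
    """Remove HTML comments, leaving `joiner` where each comment was."""
    return COMMENT.sub(joiner, text)

def tokens(text):
    """Hypothetical tokenizer: runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

sample = "chara<!--junk-->cter"

# Comment removed and the two halves combined into one word.
joined = tokens(strip_comments(sample))
# Comment treated as a delimiter, which is the behavior described above.
split = tokens(strip_comments(sample, " "))

print(joined)  # ['character']
print(split)   # ['chara', 'cter']
```

Running the pass before the lexer sees the text is what lets the halves
come back together as a single token.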
>That made it necessary for killing html comments to be a
>preprocessor pass. Life would have been good if all spammers used "<!--"
>and "-->" to begin and end their comments. However some spam uses ">" as
>the end. So, the code changes as reality intrudes.
Does this not break things? I am sort of surprised, as I thought you could
comment out html tags?
As in:
startsville
<!-- We are commenting this stuff:
<h3>This is gone</h3>
Gone, man.<P>
-->
endsville
If you pop the comment at the first >, what does that do?
OK, lynx formats this as "startsville endsville". That is correct. You
put out "startsville this gone gone man endsville".
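A small Python sketch reproduces the difference between the two
terminators on the example above. It is only an illustration: the
patterns and the visible_words helper are my own assumptions, not
bogofilter's preprocessor, and it drops words shorter than 3 characters
so the output lines up with what I quoted.

```python
import re

STRICT = re.compile(r"<!--.*?-->", re.DOTALL)  # comment ends at "-->"
LOOSE = re.compile(r"<!--.*?>", re.DOTALL)     # comment ends at the first ">"
TAG = re.compile(r"<[^>]*>")                   # remaining tags are discarded

def visible_words(html, comment_re, min_len=3):
    """Strip comments with `comment_re`, strip tags, return lowercase words."""
    text = comment_re.sub(" ", html)
    text = TAG.sub(" ", text)
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if len(w) >= min_len]

sample = """startsville
<!-- We are commenting this stuff:
<h3>This is gone</h3>
Gone, man.<P>
-->
endsville"""

print(visible_words(sample, STRICT))
# ['startsville', 'endsville']
print(visible_words(sample, LOOSE))
# ['startsville', 'this', 'gone', 'gone', 'man', 'endsville']
```

With the loose rule the comment ends at the ">" of the <h3> tag, so
everything after it leaks out as if it were visible text.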
I understand that you noted that some spam is using <!-- comment > --
closing comments with a naked >. ...the question is, does any formatter
understand when a spammer does it that way? If a spammer puts out
something that no one can format, do you really care?
Seriously, if I am a spammer, and I have formatted "Com<!-- postmaster >mon
spa<!-- postmaster >mmer phr<!-- postmaster >ase" and it is formatted out
as "Com", does it really matter? The alternative is that you misformat
things where someone has properly commented out things and buried stuff
that does not matter.
What, again, was the reason for using > to terminate the comments?
I guess I could see it - if there were no --> at all. But you have to
allow for -->
>At present, bogofilter also discards the contents of html tags.
I got some indication at 3:00 AM (I mention the time to indicate
reliability; I am not 100% sure that this is reality) that the
contents of the html tags are not being discarded 100%. I was testing my
tag eliding change and I noted that the tests failed. I started comparing
output and I believe that it was picking up at least IP addresses from URLs.
> That's likely to change, though we developers need feedback as to what
> people think should be done with them. Should we discard the standard
> keywords or keep them? What should we do with URL's? with color values?
> etc, etc. There are many things that can be done and there's the whole
> future in which to do them.
Personally, I think that you should start by simply keeping all strings of
letters and numbers (and things that look like domain names - periods
followed by alphamerics) that are longer than 2 characters.
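As a rough sketch of that rule in Python (the pattern, the keep_tokens
name and the sample tag are all assumptions for illustration, and this is
not how bogofilter's lexer is written):

```python
import re

# Runs of letters and digits, optionally continued by ".label" groups so
# that domain names and dotted IP addresses stay in one piece.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def keep_tokens(text, min_len=3):
    """Keep only tokens longer than 2 characters."""
    return [t for t in TOKEN.findall(text) if len(t) >= min_len]

# Hypothetical tag contents; 192.0.2.7 is a documentation-only address.
tag = '<a href="http://192.0.2.7/buy"><font color="#ff0000">'
print(keep_tokens(tag))
# ['href', 'http', '192.0.2.7', 'buy', 'font', 'color', 'ff0000']
```

Note that this keeps the standard keywords (href, font, color) as well;
whether those are worth keeping is exactly the open question.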
>Moving html tags to the beginning or end of the buffer could be done.
I am not sure it can be done in lex/flex. Maybe it needs to be a separate
step, like eliding comments is.
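Done as a separate step, it might look something like this Python sketch.
The tags_to_end name is made up, and it assumes comments have already
been removed, since a ">" inside a comment would confuse the simple tag
pattern.

```python
import re

# Assumption: comments were stripped earlier, so "<...>" is always a tag.
TAG = re.compile(r"<[^>]*>")

def tags_to_end(html):
    """Return the text with tags removed, followed by the tags in order."""
    tags = TAG.findall(html)
    text = TAG.sub("", html)
    if not tags:
        return text
    return text + " " + " ".join(tags)

print(tags_to_end("spa<b>mmer</b> phrase"))
# spammer phrase <b> </b>
```

The words split by tags are rejoined in the text part, and the tag
contents are still available at the end for whatever tokenizing is
wanted.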
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!