bogolexer
David Relson
relson at osagesoftware.com
Mon Feb 3 19:48:56 CET 2003
At 01:17 PM 2/3/03, Nick Simicich wrote:
>At 07:47 AM 2003-02-03 -0500, David Relson wrote:
>>It means that, at the present time, _I_ don't know flex/lex well enough
>>to write an html parser.
>
>Hey, guess what, that is something we have in common. :-)
I figure there are at least two other commonalities: bogofilter and Eudora.
>I found this page:
>
>http://www.htmlhelp.com/reference/wilbur/misc/comment.html
>http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
>
>As it turns out, comments are more complex than I thought.
It's _always_ more complicated.
>Apparently, the
>
><! and > delimit the comment declaration
>-- and -- delimit the comment itself.
Interesting details, of which I was unaware.
>But...any character is legal in a comment, including: >
>
>So <!-- this is a comment --
> --> this is a 'greater' comment --
> -- this is the last comment -->
>
>No whitespace is allowed between <! and the first --. Whitespace is
>allowed at other points.
>
>They also comment that no one gets it quite right - that some (many)
>browsers will end on the string --> when what they should do is to
>count the number of -- sequences to insure that there are a multiple of four.
... to ensure that there are an even number of pairs ...
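To make the strict rule concrete, here is a minimal sketch of a scanner that follows the SGML reading described above: "<!" opens the declaration, each comment runs from "--" to the next "--", whitespace may follow a closing "--", and ">" ends the declaration. The function name sgml_comment_end is made up for illustration, and this is Python rather than bogofilter's flex lexer.

```python
def sgml_comment_end(text, start=0):
    """Return the index just past the '>' that closes the comment
    declaration starting at text[start], or -1 if it is malformed."""
    # No whitespace is allowed between "<!" and the first "--".
    if not text.startswith("<!--", start):
        return -1
    i = start + 2  # index of the first "--"
    while i < len(text):
        # Every comment must open with "--" ...
        if not text.startswith("--", i):
            return -1
        # ... and runs up to the next "--", whatever lies between.
        close = text.find("--", i + 2)
        if close == -1:
            return -1
        i = close + 2
        # Whitespace is allowed after a comment's closing "--".
        while i < len(text) and text[i].isspace():
            i += 1
        # A ">" here ends the declaration; otherwise loop for another comment.
        if i < len(text) and text[i] == ">":
            return i + 1
    return -1


SAMPLE = ("<!-- this is a comment --\n"
          " --> this is a 'greater' comment --\n"
          " -- this is the last comment -->")
```

On SAMPLE the strict scanner consumes all three comments as one declaration, while a plain search for "-->" stops at the start of the second line. A string such as "<!-- a -- b -->" has an odd number of "--" pairs and is rejected.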
>My point is that I do not see any formatters supporting nested html
>comments (even though the SGML standard makes some provision for them), so
>I do not think that you *should*.
>
>I seriously wonder what the formatters did with those messages. Do you
>still happen to have those spams around? What happened to them when you
>dropped them into, say, mozilla, or IE, or even lynx?
>
>If I know that bogofilter works like this, I can then format a spam like:
>
>big pe<!-- > tt -->n<!-- > ux -->is and whereas the renderers will render
>it as "big penis" bogofilter will break it into "big", probably neutral,
>followed by 4 two letter chunks that will be tossed by the lexer for being
>too small, because the renderer ends comments on --> unconditionally, and
>bogofilter is ending the comments on the enclosed >.
Like any other open source product, bogofilter will always have a handicap
in the fight against spam. Of course, seeing bogofilter specific tricks
would indicate that bogofilter is considered a worthy opponent :-)
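The difference between the two comment-ending rules can be shown with a short sketch. It is Python with regular expressions, not bogofilter's actual lexer; the function names and the minimum token length of 3 are assumptions made for illustration.

```python
import re

SPAM = "big pe<!-- > tt -->n<!-- > ux -->is"


def strip_to_arrow(text):
    # Renderer-style rule: a comment ends at the first "-->".
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)


def strip_to_gt(text):
    # The rule being criticised: a comment ends at the first ">".
    return re.sub(r"<!--.*?>", "", text, flags=re.DOTALL)


def tokens(text, min_len=3):
    # Hypothetical tokenizer: runs of letters, short words discarded.
    return [w for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]
```

With the "-->" rule the text collapses to "big penis" and both words survive as tokens. With the ">" rule the leftovers are "pe", "tt", "n", "ux" and "is", all too short to keep, so only "big" remains.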
>>>As in:
>>>
>>>startsville
>>><!-- We are commenting this stuff:
>>>
>>><h3>This is gone</h3>
>>>Gone, man.<P>
>>>
>>>-->
>>>endsville
We can add that to the test suite!
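As a rough idea of what such a test could check, here is a sketch in Python. It assumes the expected result is that only the text outside the comment survives; visible_words is a hypothetical helper, not part of bogofilter or its test suite.

```python
import re

MESSAGE = """startsville
<!-- We are commenting this stuff:

<h3>This is gone</h3>
Gone, man.<P>

-->
endsville
"""


def visible_words(html):
    # Drop comments first, ending each one on "-->", even across lines.
    no_comments = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Then drop whatever tags remain outside the comments.
    no_tags = re.sub(r"<[^>]*>", "", no_comments)
    return no_tags.split()


assert visible_words(MESSAGE) == ["startsville", "endsville"]
```

The tags inside the comment are what make this case useful: a lexer that ends the comment on the first ">" would stop at "<h3>" and let "This is gone" through.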
>>If you want, I can send you the non-conformant messages that were sent to
>>you.
>
>I would really appreciate that. I am beginning to believe, fairly
>strongly, that a couple of broken messages is not a good reason to have
>the html interpreted in any other way than the way the major renderers do
>it (unconditionally end the comment on -->, ignore everything else).
They're in the attached tarball.
>>>I guess I could see it - if there were no --> at all. But you have to
>>>allow for -->
>>
>>Are you suggesting that bogofilter read the whole message looking for the
>>"-->" and, if not found, back up and rescan allowing ">" for the end
>>comment? It _could_ be done.
>
>No, I am suggesting that bogofilter interpret comments the same way the
>renderers that people use to read their mail will. If that is what they
>do, then that is what bogofilter should do. It might be that these e-mails
>are just broken.
Sounds reasonable.
>>>>At present, bogofilter also discards the contents of html tags.
>>>
>>>I got some indication at 3:00 AM (and I am not 100% sure that this is
>>>reality, of course, I mention the time to indicate reliability) that the
>>>contents of the html tags are not being discarded 100%. I was testing
>>>my tag eliding change and I noted that the tests failed. I started
>>>comparing output and I believe that it was picking up at least IP
>>>addresses from URLs.
>>
>>Send me a sample and I'll take a look.
>
>First I have to figure out if this is real, and second I have to figure
>out if it happens to an unmodified bogofilter. Remember (I tried to make
>this clear) that I was playing.
The current state of the art, bogofilter-0.10.2.cvs, is available for your
experimentation. Its biggest change, so far, is that source code and
tests are in a "src" directory. The content is otherwise unchanged. The
content (but not the functionality) will change as the text_t code is released.
>I think that it is reasonable to have config file options that discuss
>whether comments and tags should be processed, and reordered. That gives
>two options with a total of four states. I am probably willing to write
>the code here. I just do not want to shoot at a moving target.
An understandable desire. Give me a couple of days to get text_t as I want
it and then the rate of change should slow down significantly.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msg.gs.0123.tgz
Type: application/x-compressed
Size: 4473 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20030203/81ac316f/attachment.bin>