bogolexer

Mon Feb 3 19:48:56 CET 2003

At 01:17 PM 2/3/03, Nick Simicich wrote:
>At 07:47 AM 2003-02-03 -0500, David Relson wrote:
>>It means that, at the present time, _I_ don't know flex/lex well enough 
>>to write an html parser.
>
>Hey, guess what, that is something we have in common. :-)

I figure there are at least two other commonalities: bogofilter and Eudora.

>I found this page:
>
>http://www.htmlhelp.com/reference/wilbur/misc/comment.html
>http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
>
>As it turns out, comments are more complex than I thought.

It's _always_ more complicated.

>Apparently, the
>
><!  and > delimit the comment declaration
>-- and -- delimit the comment itself.

Interesting details, of which I was unaware.

>But...any character is legal in a comment, including: >
>
>So <!-- this is a comment --
>         --> this is a 'greater' comment --
>         -- this is the last comment -->
>
>No whitespace is allowed between <! and the first --.  Whitespace is 
>allowed at other points.
>
>They also comment that no one gets it quite right - that some (many) 
>browsers will end on the string    --> when what they should do is to 
>count the number of -- sequences to insure that there are a multiple of four.

... to ensure that there are an even number of pairs ...
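
To make the two readings concrete, here is a rough sketch in Python (not flex, and not bogofilter code). It assumes the rules as quoted above: "<!" must be followed immediately by "--", each comment runs from one "--" to the next, and whitespace may sit between comments and before the closing ">". The function names are made up for illustration.

```python
def sgml_comment_end(text, start=0):
    """Index just past a comment declaration that begins at `start`,
    read the strict way; -1 if no complete declaration is found."""
    if not text.startswith("<!--", start):
        return -1                      # no whitespace allowed after "<!"
    pos = start + 2                    # positioned on the first "--"
    while text.startswith("--", pos):
        close = text.find("--", pos + 2)   # the "--" that closes this comment
        if close == -1:
            return -1                  # odd number of "--": unterminated
        pos = close + 2
        while pos < len(text) and text[pos].isspace():
            pos += 1                   # whitespace is allowed here
    if pos < len(text) and text[pos] == ">":
        return pos + 1
    return -1


def browser_comment_end(text, start=0):
    """Index just past the first "-->", which is where the renderers
    described above are said to stop; -1 if there is none."""
    if not text.startswith("<!--", start):
        return -1
    close = text.find("-->", start + 4)
    return -1 if close == -1 else close + 3


sample = ("<!-- this is a comment --\n"
          "         --> this is a 'greater' comment --\n"
          "         -- this is the last comment -->")
print(sgml_comment_end(sample) == len(sample))    # strict reading eats it all
print(browser_comment_end(sample) < len(sample))  # "-->" reading stops early
```

On the three-comment example quoted above, the strict reading consumes the whole declaration, while the "-->" reading stops at the start of the second line.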

>My point is that I do not see any formatters supporting nested html 
>comments (even though the SGML standard makes some provision for them), so 
>I do not think that you *should*.
>
>I seriously wonder what the formatters did with those messages.  Do you 
>still happen to have those spams around?  What happened to them when you 
>dropped them into, say, mozilla, or IE, or even lynx?
>
>If I know that bogofilter works like this, I can then format a spam like:
>
>big pe<!-- > tt -->n<!-- > ux -->is and whereas the renderers will render 
>it as "big penis" bogofilter will break it into "big", probably neutral, 
>followed by 4 two letter chunks that will be tossed by the lexer for being 
>too small, because the renderer ends comments on --> unconditionally, and 
>bogofilter is ending the comments on the enclosed >.

Like any other open source product, bogofilter will always have a handicap 
in the fight against spam.  Of course, seeing bogofilter specific tricks 
would indicate that bogofilter is considered a worthy opponent :-)
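
For the record, the trick is easy to reproduce. The sketch below is Python, not the lexer itself, and rests on two assumptions: that ending a comment at the first ">" can be modelled with a non-greedy regex, and that tokens shorter than three characters are thrown away (MIN_TOKEN_LEN is a stand-in for "too small", not a value taken from the source).

```python
import re

MIN_TOKEN_LEN = 3  # assumption: stands in for the lexer's "too small" cutoff


def strip_comments_at_gt(html):
    # Model of the behaviour being criticised: comment ends at the first ">".
    return re.sub(r"<!--.*?>", "", html, flags=re.S)


def strip_comments_at_arrow(html):
    # Model of what the renderers are said to do: comment ends at "-->".
    return re.sub(r"<!--.*?-->", "", html, flags=re.S)


def tokens(text):
    # Crude tokenizer: runs of letters/digits, short ones discarded.
    return [w for w in re.findall(r"[A-Za-z0-9]+", text)
            if len(w) >= MIN_TOKEN_LEN]


spam = "big pe<!-- > tt -->n<!-- > ux -->is"
print(tokens(strip_comments_at_gt(spam)))
print(tokens(strip_comments_at_arrow(spam)))
```

Ending at ">" leaves only "big" after the short fragments are dropped; ending at "-->" recovers both words, which is what the reader actually sees.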

>>>As in:
>>>
>>>startsville
>>><!-- We are commenting this stuff:
>>>
>>><h3>This is gone</h3>
>>>Gone, man.<P>
>>>
>>>-->
>>>endsville

We can add that to the test suite!

>>If you want, I can send you the non-conformant messages that were sent to 
>>you.
>
>I would really appreciate that.  I am beginning to believe, fairly 
>strongly, that a couple of broken messages is not a good reason to have 
>the html interpreted in any other way than the way the major renderers do 
>it (unconditionally end the comment on -->, ignore everything else).

They're in the attached tarball.

>>>I guess I could see it - if there were no --> at all.  But you have to 
>>>allow for -->
>>
>>Are you suggesting that bogofilter read the whole message looking for the 
>>"-->" and, if not found, back up and rescan allowing ">" for the end 
>>comment?  It _could_ be done.
>
>No, I am suggesting that bogofilter interpret comments the same way the 
>renderers that people use to read their mail will.  If that is what they 
>do, then that is what bogofilter should do. It might be that these e-mails 
>are just broken.

Sounds reasonable.
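
For comparison, the "look for --> first, fall back to >" idea I floated above might look something like this. It is only a sketch in Python with a made-up function name, and dropping the rest of the text when a comment is never closed is my assumption, not settled behaviour.

```python
def strip_comments_with_fallback(html):
    """Remove comments, ending each at "-->" when one exists later in the
    text and otherwise at the first ">" (the rescan idea)."""
    out = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])        # no more comments
            break
        out.append(html[pos:start])
        end = html.find("-->", start + 4)
        if end != -1:
            pos = end + 3                 # normal case: end on "-->"
            continue
        end = html.find(">", start + 4)   # fallback: no "-->" anywhere ahead
        if end == -1:
            break                         # assumption: drop unterminated tail
        pos = end + 1
    return "".join(out)


print(strip_comments_with_fallback("a<!-- > x -->b"))  # "-->" wins
print(strip_comments_with_fallback("a<!-- x >b"))      # falls back to ">"
```

One weakness shows up quickly: a broken comment followed later by a proper one gets merged with it, because the search for "-->" runs to the end of the message. That is another argument for simply doing what the renderers do.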

>>>>At present, bogofilter also discards the contents of html tags.
>>>
>>>I got some indication at 3:00 AM (and I am not 100% sure that this is 
>>>reality, of course, I mention the time to indicate reliability) that the 
>>>contents of the html tags are not being discarded 100%.  I was testing 
>>>my tag eliding change and I noted that the tests failed.  I started 
>>>comparing output and I believe that it was picking up, at least ip 
>>>addresses from URLs.
>>
>>Send me a sample and I'll take a look.
>
>First I have to figure out if this is real, and second I have to figure 
>out if it happens to an unmodified bogofilter.  Remember (I tried to make 
>this clear) that I was playing.

The current state of the art, bogofilter-0.10.2.cvs, is available for your 
experimentation.  Its biggest change, so far, is that source code and 
tests are in a "src" directory.  The content is otherwise unchanged.  The 
content (but not the functionality) will change as the text_t code is released.

>I think that it is reasonable to have config file options that discuss 
>whether comments and tags should be processed, and reordered.  That gives 
>two options with a total of four states.  I am probably willing to write 
>the code here.  I just do not want to shoot at a moving target.

An understandable desire.  Give me a couple of days to get text_t as I want 
it and then the rate of change should slow down significantly.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msg.gs.0123.tgz
Type: application/x-compressed
Size: 4473 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter/attachments/20030203/81ac316f/attachment.bin>