What to do for HTML comment processing ???

David Relson relson at osagesoftware.com
Thu Mar 6 17:17:37 CET 2003


Hi Nick,

At 02:41 AM 3/6/03, Nick Simicich wrote:
>At 05:50 PM 2003-03-04 -0500, David Relson wrote:
>
>>Nick,
>>
>>Your feedback on processing html comments is appreciated.  May I suggest 
>>a compromise?
>
>Why?  Unless you have a REASON!  Personal preference is not a good 
>reason.  Because we have done it that way was not good enough to block the 
>flag changes.
>
>And "because it is a standard" is not a good reason.  NO ONE FOLLOWS 
>STANDARDS.

The standard says a comment begins with "<!--" and ends with "-->", which 
is what bogofilter is presently doing.  What bogofilter is not doing is 
nesting (since your research showed that none of the browsers is doing that).
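For illustration only, the rule described above (open at "<!--", close at the first "-->", no nesting) can be sketched in a few lines of Python. This is not bogofilter's lexer code; the names STRICT_COMMENT and strip_strict_comments are made up for the example.

```python
import re

# A comment opens with "<!--" and closes at the first "-->" that follows.
# The non-greedy ".*?" is what makes this non-nesting: an inner "<!--"
# is just ordinary comment text.
STRICT_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_strict_comments(html: str) -> str:
    """Remove every standard-style comment from html (hypothetical helper)."""
    return STRICT_COMMENT.sub("", html)


# A comment splitting a word is removed and the word is rejoined.
print(strip_strict_comments("mort<!-- junk -->gage"))        # mortgage

# No nesting: the first "-->" ends the comment, the rest is plain text.
print(strip_strict_comments("a<!-- x <!-- y --> z -->b"))    # a z -->b
```

The second call shows the non-nesting behaviour: the text after the first "-->" is left in the message.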

Unfortunately, spammers don't always include the dashes.  Since 
bogofilter's purpose is to recognize spam, there's a valid reason for it to 
process messages without the dashes.  Life would be simpler if all html 
email followed the standards, but it doesn't.  Bogofilter exists in "the 
real world" so it should be able to deal with real messages.
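As a rough sketch of what handling the dash-less case could look like, the Python below adds a looser rule on top of the standard one. The looser rule (anything from "<!" to the next ">") is an assumption made for this example, not a description of what bogofilter does, and the names STRICT, LOOSE and strip_comments are hypothetical.

```python
import re

# Standard form: "<!--" up to the first "-->".
STRICT = re.compile(r"<!--.*?-->", re.DOTALL)

# Assumed looser form for this sketch: "<!" up to the next ">".
# This also catches "<!-- ... >" with the closing dashes missing.
LOOSE = re.compile(r"<!.*?>", re.DOTALL)


def strip_comments(html: str, aggressive: bool = False) -> str:
    """Strip standard comments; with aggressive=True also strip loose ones."""
    text = STRICT.sub("", html)
    if aggressive:
        text = LOOSE.sub("", text)
    return text


print(strip_comments("mort<!x>gage"))                       # mort<!x>gage
print(strip_comments("mort<!x>gage", aggressive=True))      # mortgage
print(strip_comments("mort<!-- x >gage", aggressive=True))  # mortgage
```

One side effect of the loose pattern as written: it would also swallow declarations such as "<!DOCTYPE ...>", which may or may not be wanted.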

>>As the default, have bogofilter follow the standard in processing html 
>>comments and also have an "aggressive comment" mode that would be more 
>>aggressive, as has been requested.
>
>Please do not be silly.  This is a stupid suggestion.  There is no valid 
>compromise.  What was done in the past, adjusting to spam that opens with 
><!-- and closes with > is wrong - the rest of the e-mail (unless there is 
>a --> somewhere) is a comment and will not be displayed.  Try displaying 
>that in outlook, you know, the mail reader that many spam victims use?  Or 
>the Netscape based AOL?
>
>NO BROWSER, NOT ONE SINGLE ONE FOLLOWS THE STANDARDS!!!!!  THE TESTING 
>PROVED THAT. THEREFORE IT WOULD BE STUPID FOR BOGOFILTER TO FOLLOW THE 
>STANDARDS.
>
>Doing it any way other than the way IE does it is just plain dumb.
>
>My comments (on how to process comments) were based on actually testing 
>how IE and Netscape process comments.  If you do things any other way, you 
>are simply allowing people to use comments to eat holes in bogofilter.  I 
>will listen to someone who has actually tested IE and/or Netscape and come 
>to a different conclusion about how they display comments. Or there might 
>be a reason I have not heard yet. IE is also the "Microsoft HTML Renderer" 
>so if you use the Microsoft Renderer from within, oh, Eudora, it will also 
>process the comments the way I mentioned.
>
>I also believe, by the way, that we should process tokens out of comments 
>and use those, so that if someone has, for example, javascript routines 
>that are common to the spam world, like obfuscators, we will recognize 
>them.  The point is to move the comments out of words.  If they are not in 
>words, you process them in place.
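One possible reading of the suggestion quoted above, sketched in Python: a comment that sits inside a word is lifted out so the word is rejoined, and its contents are still tokenized; a comment that is not inside a word is tokenized where it stands. This is only an interpretation. The token pattern, the choice to append lifted contents at the end of the token list, and the name tokens_with_innards are assumptions for the example.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
WORD = re.compile(r"[A-Za-z0-9]+")   # simplistic token pattern (assumed)


def tokens_with_innards(html: str) -> list[str]:
    """Tokenize html, keeping comment contents as tokens (hypothetical)."""
    moved = []  # contents of comments that were lifted out of a word

    def handle(m: re.Match) -> str:
        start, end = m.start(), m.end()
        inside_word = (
            start > 0 and html[start - 1].isalnum()
            and end < len(html) and html[end].isalnum()
        )
        if inside_word:
            moved.append(m.group(1))   # lift out so the word is rejoined
            return ""
        return " " + m.group(1) + " "  # otherwise process in place

    text = COMMENT.sub(handle, html)
    return WORD.findall(text) + WORD.findall(" ".join(moved))


print(tokens_with_innards("buy mort<!-- asdf -->gage now <!-- qwer --> ok"))
# ['buy', 'mortgage', 'now', 'qwer', 'ok', 'asdf']
```

Since the classifier works on token counts rather than token order, appending the lifted contents at the end should not matter for scoring.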

I'm waiting for feedback from the bogofilter user community on whether to 
process the innards of html comments and tags.  So far that feedback has 
been lacking.

I know of one significant problem in accepting "innards" and that's the 
random character sequences spammers have started to include.  I grepped 
some recent email for "asdf" (straight from the keyboard!) and found that 
148 of the 2064 spam messages I received last month had that "random" 
character sequence.  So, perhaps I'm making the case for using tokens from 
inside html tags/comments, but the concern is that random sequences will 
consume large amounts of database space and will make bogofilter less 
accurate.
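For anyone who wants to repeat the count on their own mail, here is a minimal Python sketch of the same kind of check. It assumes the messages are already loaded as strings; the helper name count_containing and the three-message sample are made up, and only the 148 and 2064 figures come from the paragraph above.

```python
def count_containing(messages, needle):
    """Count messages whose text contains needle (case-sensitive).

    A message with several occurrences is still counted once.
    """
    return sum(1 for body in messages if needle in body)


# Tiny made-up sample, only to exercise the function.
sample = ["Buy now asdf", "hello there", "xx asdf yy asdf"]
print(count_containing(sample, "asdf"), "of", len(sample))   # 2 of 3

# The figures quoted above expressed as a share of the month's spam.
print(f"{148 / 2064:.1%}")                                   # 7.2%
```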


>And I think that there should not be a flag to make that processing 
>optional.  It is a useless complexity.  OK, we can have a flag if we want, 
>but we should ignore it. :-)
>
>But compromise for no reason is simply going to add complexity for no reason.

I respectfully submit that there _is_ a reason, so adding the option is 
providing something useful - not needless complexity.

David




