What to do for HTML comment processing ???

David Relson relson at osagesoftware.com
Fri Mar 7 00:58:01 CET 2003


At 06:01 PM 3/6/03, Matthias Andree wrote:

>On Thu, 06 Mar 2003, David Relson wrote:
>
> > Unfortunately, spammers don't always include the dashes.  Since
>
>Well, <! is the "markup declaration open delimiter", and the -- in the
>markup declaration introduces a comment.
>
>According to HTML 4.01, <!-- comment --    > would be a valid comment.

Since white space is allowed on either side of the delimiters and anything 
is allowed in the comment portion (including angle brackets), "<! -- 
<comment><garbage>>><<<> -- >" is perfectly valid.  For the moment, comment 
removal is done in C.  I'm awaiting performance numbers for doing it via flex.

><! this is a markup declaration> - and therefore still invisible, although
>not a comment.
>
> > bogofilter's purpose is to recognize spam, there's valid reason for it to
> > process messages without the dashes.  Life would be simpler if all html
> > email followed the standards, but it doesn't.  Bogofilter exists in "the
> > real world" so should be able to deal with real messages.
>
>Yes, and therefore I think Nick's right when he writes:
>
> > >My comments (on how to process comments) were based on actually testing
> > >how IE and Netscape process comments.  If you do things any other way,
> > >you are simply allowing people to use comments to eat holes in bogofilter.
> > >
> > >I also believe, by the way, that we should process tokens out of comments
> > >and use those, so that if someone has, for example, javascript routines
> > >that are common to the spam world, like obfuscators, we will recognize
> > >them.  The point is to move the comments out of words.  If they are
> > >not in words, you process them in place.
>
>The moot point about the latter paragraph, though, is whether these
>should be processed. The parsing of comments can be either indicative or
>misleading: if $pammers stuff long innocuous text into the comments,
>this may fool the filter, and that's the reason why it was chosen to
>kill the stuff.
>
>The technical compromise would be to calculate spamicity two times: once
>with comments accounted for, once with comments ignored, and use the
>maximum of these two values and see if the false positive rate is still
>acceptable. That way, spammers will not be able to pull the spamicity
>value down by putting William Shakespeare or Johann Wolfgang [von]
>Goethe texts into their comments or markup declarations or tags.
>
> > I'm waiting for feedback from the bogofilter user community on whether to
> > process the innards of html comments and tags.  So far that feedback has
> > been lacking.
>
>I hope this mail adds to the "feedback" category ;-)

Mostly, I am interested in what people want bogofilter to do with tokens 
inside of html tags and comments.  Obvious choices include the following:

1a - discard all tokens inside html tags
1b - discard all tokens inside html comments
2a - score all tokens inside html tags
2b - score all tokens inside html comments
3  - identify and parse URLs inside html tags
4  - score the html tags, but not other content, e.g. "<body bgcolor=12345>" would give "body"
5  - score known html keywords, e.g. "<body bgcolor=12345>" would give "body" and "bgcolor"
6  - only allow properly constructed html comments, i.e. "<!" and ">" required, whitespace allowed, leading/trailing "--", anything allowed between pairs of dashes
7  - allow improperly constructed html comments, i.e. don't check for "--"
...

Yes, I know that #7 is contrary to the standard.  Several bogofilter users 
want bogofilter to work that way.


> > I know of one significant problem in accepting "innards" and that's the
> > random character sequences spammers have started to include.  I grepped
> > some recent email for "asdf" (straight from the keyboard!) and found that
> > 148 of the 2064 spam I received last month had that "random" character
> > sequence.  So, perhaps I'm making the case for using tokens from inside
> > html tags/comments, but the concern is that random sequences will consume
> > large amounts of database space and will make bogofilter less accurate.
>
>Well, we have weed-out functionality in place (although I wonder if it
>will work properly in the long run, because we need to decrease the
>message count accordingly lest we skew our probabilities) that would
>allow us to free tokens in the database.

It's not clear whether accepting "spammer generated random character 
sequences" matters or not.  With standard bogofilter parameters, e.g. 
robx=0.415 and min_dev=0.10, previously unseen words score within min_dev 
of the neutral 0.5, so they don't affect the spam score of the current 
message.  If the message is registered as spam, then those words will 
become spam indicators in subsequent messages.




