What to do for HTML comment processing ???

David Relson relson at osagesoftware.com
Fri Mar 7 00:58:01 CET 2003


At 06:01 PM 3/6/03, Matthias Andree wrote:

>On Thu, 06 Mar 2003, David Relson wrote:
>
> > Unfortunately, spammers don't always include the dashes.  Since
>
>Well, <! is the "markup declaration open delimiter", and the -- in the
>markup declaration introduces a comment.
>
>According to HTML 4.01, <!-- comment --    > would be a valid comment.

Since white space is allowed on either side of the delimiters and anything 
is allowed in the comment portion (including angle brackets), "<! -- 
<comment><garbage>>><<<> -- >" is perfectly valid.  For the moment, comment 
removal is done in C.  I'm awaiting performance numbers for doing it via flex.

><! this is a markup declaration> - and therefore still invisible, although
>not a comment.
>
> > bogofilter's purpose is to recognize spam, there's valid reason for it to
> > process messages without the dashes.  Life would be simpler if all html
> > email followed the standards, but it doesn't.  Bogofilter exists in "the
> > real world" so should be able to deal with real messages.
>
>Yes, and therefore I think Nick's right when he writes:
>
> > >My comments (on how to process comments) were based on actually testing
> > >how IE and Netscape process comments.  If you do things any other way,
> > >you are simply allowing people to use comments to eat holes in bogofilter.
> > >
> > >I also believe, by the way, that we should process tokens out of comments
> > >and use those, so that if someone has, for example, javascript routines
> > >that are common to the spam world, like obfuscators, we will recognize
> > >them.  The point is to move the comments out of words.  If they are
> > >not in words, you process them in place.
>
>The moot point about the latter paragraph, though, is whether these
>should be processed. The parsing of comments can be either indicative or
>misleading: if $pammers stuff long innocuous text into the comments,
>this may fool the filter, and that's the reason why it was chosen to
>kill the stuff.
>
>The technical compromise would be to calculate spamicity two times: once
>with comments accounted for, once with comments ignored, and use the
>maximum of these two values and see if the false positive rate is still
>acceptable. That way, spammers will not be able to pull the spamicity
>value down by putting William Shakespeare or Johann Wolfgang [von]
>Goethe texts into their comments or markup declarations or tags.
>
> > I'm waiting for feedback from the bogofilter user community on whether to
> > process the innards of html comments and tags.  So far that feedback has
> > been lacking.
>
>I hope this mail adds to the "feedback" category ;-)

Mostly, I am interested in what people want bogofilter to do with tokens 
inside of html tags and comments.  Obvious choices include the following:

1a - discard all tokens inside html tags
1b - discard all tokens inside html comments
2a - score all tokens inside html tags
2b - score all tokens inside html comments
3  - identify and parse URLs inside html tags
4  - score the html tags, but not other content, e.g. "<body bgcolor=12345>" would give "body"
5  - score known html keywords, e.g. "<body bgcolor=12345>" would give "body" and "bgcolor"
6  - only allow properly constructed html comments, i.e. "<!" and ">" required, whitespace allowed, leading/trailing "--", anything allowed between pairs of dashes
7  - allow improperly constructed html comments, i.e. don't check for "--"
...

Yes, I know that #7 is contrary to the standard.  Several bogofilter users 
want bogofilter to work that way.


> > I know of one significant problem in accepting "innards" and that's the
> > random character sequences spammers have started to include.  I grepped
> > some recent email for "asdf" (straight from the keyboard!) and found that
> > 148 of the 2064 spam I received last month had that "random" character
> > sequence.  So, perhaps I'm making the case for using tokens from inside
> > html tags/comments, but the concern is that random sequences will consume
> > large amounts of database space and will make bogofilter less accurate.
>
>Well, we have weed-out functionality in place (although I wonder if it
>will work properly in the long run, because we need to decrease the
>message count accordingly lest we skew our probabilities) that would
>allow us to free tokens in the database.

It's not clear whether accepting "spammer generated random character 
sequences" matters or not.  With standard bogofilter parameters, e.g. 
robx=0.415 and min_dev=0.10, previously unseen words score within min_dev 
of the neutral 0.5, so they don't affect the spam score of the current 
message.  If the message is registered as spam, then those words will 
become spam indicators in subsequent messages.




