What to do for HTML comment processing ???

Fri Mar 7 00:01:39 CET 2003

On Thu, 06 Mar 2003, David Relson wrote:

> Unfortunately, spammers don't always include the dashes.  Since 

Well, <! is the "markup declaration open delimiter", and the -- in the
markup declaration introduces a comment.

According to HTML 4.01, <!-- comment --    > would be a valid comment.

<! this is a markup declaration> - and therefore still invisible, although
not a comment.
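
To make the distinction concrete, here is a minimal sketch of how a
filter could drop both kinds of invisible markup. It is not bogofilter's
lexer; the names COMMENT, DECLARATION and strip_invisible are made up
for this illustration, and the patterns assume a simplified reading of
the HTML 4.01 rules (a comment ends at "--", optional white space, ">";
any other "<!...>" counts as a markup declaration).

```python
import re

# Assumption: simplified HTML 4.01 / SGML rules, not a full parser.
# A comment is "<!--", any text, "--", optional white space, then ">".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)
# Anything else that starts with "<!" and runs to the next ">" is treated
# as a non-comment markup declaration.
DECLARATION = re.compile(r"<![^>]*>")

def strip_invisible(html: str) -> str:
    """Remove comments first, then any remaining markup declarations."""
    without_comments = COMMENT.sub("", html)
    return DECLARATION.sub("", without_comments)
```

Removing comments before declarations matters, because a comment may
itself contain a ">" that the simpler declaration pattern would stop at.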

> bogofilter's purpose is to recognize spam, there's valid reason for it to 
> process messages without the dashes.  Life would be simpler if all html 
> email followed the standards, but it doesn't.  Bogofilter exists in "the 
> real world" so should be able to deal with real messages.

Yes, and therefore I think Nick's right when he writes:

> >My comments (on how to process comments) were based on actually testing 
> >how IE and Netscape process comments.  If you do things any other way, you 
> >are simply allowing people to use comments to eat holes in bogofilter.  I 

> >I also believe, by the way, that we should process tokens out of comments 
> >and use those, so that if someone has, for example, javascript routines 
> >that are common to the spam world, like obfuscators, we will recognize 
> >them.  The point is to move the comments out of words.  If they are not in 
> >words, you process them in place.

The moot point about the latter paragraph, though, is whether the
tokens inside comments should be processed at all. Parsing comments can
be either indicative or misleading: if $pammers stuff long innocuous
text into the comments, this may fool the filter, and that is the reason
why it was chosen to kill the stuff.

The technical compromise would be to calculate the spamicity twice, once
with comments accounted for and once with comments ignored, use the
maximum of the two values, and then check whether the false positive
rate is still acceptable. That way, spammers will not be able to pull
the spamicity value down by putting William Shakespeare or Johann
Wolfgang [von] Goethe texts into their comments or markup declarations
or tags.
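
A rough sketch of the "score twice, take the maximum" idea follows. The
scorer is only a stand-in (a plain mean of per-token spam probabilities),
not bogofilter's actual calculation, and the names toy_spamicity,
max_spamicity and token_probs are hypothetical.

```python
from statistics import mean

def toy_spamicity(tokens, token_probs, unknown=0.5):
    """Stand-in scorer: mean per-token spam probability.

    Assumption: this is NOT bogofilter's formula, just a placeholder so
    the max-of-two idea can be shown. Unknown tokens count as neutral.
    """
    if not tokens:
        return unknown
    return mean(token_probs.get(t, unknown) for t in tokens)

def max_spamicity(visible_tokens, comment_tokens, token_probs):
    """Score once without and once with comment tokens; keep the larger."""
    ignored = toy_spamicity(visible_tokens, token_probs)
    included = toy_spamicity(visible_tokens + comment_tokens, token_probs)
    return max(ignored, included)
```

With this, innocuous filler in comments cannot lower the result below the
comment-free score, while spammy comment content can still raise it. The
cost is two scoring passes per message, and the effect on false
positives would have to be measured.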

> I'm waiting for feedback from the bogofilter user community on whether to 
> process the innards of html comments and tags.  So far that feedback has 
> been lacking.

I hope this mail adds to the "feedback" category ;-)

> I know of one significant problem in accepting "innards" and that's the 
> random character sequences spammers have started to include.  I grepped 
> some recent email for "asdf" (straight from the keyboard!) and found that 
> 148 of the 2064 spam I received last month had that "random" character 
> sequence.  So, perhaps I'm making the case using tokens from inside html 
> tags/comments, but the concern is that random sequences will consume large 
> amounts of database space and will make bogofilter less accurate.

Well, we have weed-out functionality in place that would allow us to
free tokens in the database (although I wonder if it will work properly
in the long run, because we need to decrease the message count
accordingly lest we skew our probabilities).

-- 
Matthias Andree