What to do for HTML comment processing ???

Fri Mar 7 15:01:12 CET 2003

At 11:17 AM 2003-03-06 -0500, David Relson wrote:

>Hi Nick,
>
>At 02:41 AM 3/6/03, Nick Simicich wrote:
>>At 05:50 PM 2003-03-04 -0500, David Relson wrote:
>>
>>>Nick,
>>>
>>>Your feedback on processing html comments is appreciated.  May I 
>>>suggest a compromise?
>>
>>Why?  Unless you have a REASON!  Personal preference is not a good 
>>reason.  Because we have done it that way was not good enough to block 
>>the flag changes.
>>
>>And "because it is a standard" is not a good reason.  NO ONE FOLLOWS 
>>STANDARDS.
>
>The standard says a comment begins with "<!--" and ends with "-->", which 
>is what bogofilter is presently doing.  What bogofilter is not doing is 
>nesting (since your research showed that none of the browsers is doing that).
>
>Unfortunately, spammers don't always include the dashes.  Since 
>bogofilter's purpose is to recognize spam, there's valid reason for it to 
>process messages without the dashes.  Life would be simpler if all html 
>email followed the standards, but it doesn't.  Bogofilter exists in "the 
>real world" so should be able to deal with real messages.

When a tag is coded as <! it ends with >.  That is not a comment, it is a 
tag, and it should be processed as a tag.  I covered this in another 
message.  The browsers that I have tested simply ignore tags that they do 
not know about, so those tags do not cause a visible break in the text.  
Therefore they should be elided from tokens rather than treated as token 
separators.  In the html parser I gave you, I had experimented with which 
tags caused visible breaks.
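
A minimal sketch of the distinction drawn above, assuming the scanner has 
already found a "<!" at some position in the message text.  The function 
name markup_end is hypothetical and is not taken from bogofilter or from 
the parser mentioned above; edge cases such as "<!-->" are ignored.

```python
def markup_end(text, start):
    """Return the index just past the construct that begins at text[start].

    Assumes text[start:] begins with "<!".  Returns -1 if the construct
    is never closed.
    """
    if text.startswith("<!--", start):
        # A real comment: it runs to the first "-->" after the opener,
        # so a ">" inside the comment does not end it.
        close = text.find("-->", start + 4)
        return -1 if close == -1 else close + 3
    # Anything else starting with "<!" is treated as a tag and ends at
    # the first ">".
    close = text.find(">", start + 2)
    return -1 if close == -1 else close + 1
```

With this split, "<!asdf>" is consumed as a seven-character tag, while 
"<!-- a > b -->" is consumed whole as a comment.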

>[...]
>I'm waiting for feedback from the bogofilter user community on whether to 
>process the innards of html comments and tags.  So far that feedback has 
>been lacking.
>
>I know of one significant problem in accepting "innards" and that's the 
>random character sequences spammers have started to include.  I grepped 
>some recent email for "asdf" (straight from the keyboard!) and found that 
>148 of the 2064 spam I received last month had that "random" character 
>sequence.  So, perhaps I'm making the case for using tokens from inside 
>html tags/comments, but the concern is that random sequences will consume 
>large amounts of database space and will make bogofilter less accurate.

Sure seems like a spam indication to me, that asdf thing.  The way around 
the database space issue you raise above is to start aging database 
contents, so that the database adapts to new spam.  There is also 
"reducing", that is, hashing the tokens.  One of our competitors is hashing 
not only individual tokens but also phrases, up to six words, and putting 
the hashes into the database.
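
A rough sketch of what hashing tokens and phrases of up to six words could 
look like.  This is an illustration only: the competitor's actual scheme is 
not described here, and the function name, the 8-byte digest size, and the 
choice of BLAKE2b are all assumptions.

```python
import hashlib

def hashed_features(words, max_phrase_len=6, digest_bytes=8):
    """Hash every run of 1..max_phrase_len consecutive words.

    Each feature is stored as a fixed-size digest, so a long random
    token costs the same database space as a short one.
    """
    features = []
    for i in range(len(words)):
        for n in range(1, max_phrase_len + 1):
            if i + n > len(words):
                break  # phrase would run past the end of the message
            phrase = " ".join(words[i:i + n])
            digest = hashlib.blake2b(
                phrase.encode("utf-8"), digest_size=digest_bytes
            ).digest()
            features.append(digest)
    return features
```

Every database key then has the same length, at the cost of many more keys 
per message, which is one more reason to age old entries out.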

>>And I think that there should not be a flag to make that processing 
>>optional.  It is a useless complexity.  OK, we can have a flag if we 
>>want, but we should ignore it. :-)
>>
>>But compromise for no reason is simply going to add complexity for no reason.
>
>I respectfully submit that there _is_ a reason, so adding the option is 
>providing something useful - not needless complexity.

The point is that there is (or should already be) processing that elides 
those tokens and it is not the comment processing.  They should be casually 
elided by the "unknown tag processing" (because that is how they are 
processed by browsers, not as comments).  Just allow the first character of 
an unknown tag to be [A-Za-z!0-9].
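
The unknown-tag rule above could be sketched as follows.  The set of known 
tags is a made-up placeholder, the names are hypothetical, and real 
"<!-- ... -->" comments are assumed to have been handled by a separate 
step before this one runs.

```python
import re

# Hypothetical list of tags the parser knows about; a real list would be longer.
KNOWN_TAGS = {"p", "br", "b", "i", "a", "font", "table", "tr", "td"}

# A tag name may start with a letter, a digit, or "!".
TAG_RE = re.compile(r"<(/?)([A-Za-z!0-9][A-Za-z0-9]*)[^>]*>")

def elide_unknown_tags(text):
    """Delete unknown tags without leaving a gap, as a browser would render."""
    def replace(match):
        name = match.group(2).lower()
        if name in KNOWN_TAGS:
            return match.group(0)  # leave known tags for normal tag handling
        return ""                  # unknown tag: drop it, joining the text
    return TAG_RE.sub(replace, text)
```

For example, elide_unknown_tags("vi<!asdf>agra") gives "viagra", so the 
tokenizer sees the word the reader sees.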

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!