What to do for HTML comment processing ???

Fri Mar 7 17:23:35 CET 2003

Nick,

Thank you for your comments.  They extend my understanding, which clarifies 
my thoughts.  Now I recognize 3 flavors of html "tags" (and have names for 
them as well)!

1 - valid html tags - html, body, font, a, br, hr, etc
2 - html comments - "<!-- ... -->"
3 - invalid html tags - "<asdf ...>" and friends

I'm thinking that two or three options would provide sufficient flexibility:

score_html_comments - a boolean with values Yes/No
score_html_tags - a boolean with values Yes/No, or
         an enum with values - None/Valid/All
score_html_urls - a boolean with values Yes/No

What bogofilter is currently doing corresponds to settings No/None for the 
above options.   For score_html_tags, the distinction between All and Valid 
is whether to include everything or just known tags.  For score_html_urls, 
bogofilter could identify hrefs and parse them as tokens.
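
To make the three options concrete, here is a rough sketch in Python.  It is 
not bogofilter code (bogofilter's lexer is not written this way); the 
function name html_tokens, the helper words, the regular expressions, and 
the short KNOWN_TAGS list are all made up for illustration.  Only the three 
option names come from the proposal above.  Markup is replaced by a space 
purely to keep the sketch simple.

```python
import re

# Hypothetical list of "known" tags; a real list would be far longer.
KNOWN_TAGS = {"html", "body", "font", "a", "br", "hr"}

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
TAG_RE = re.compile(r"</?([A-Za-z!0-9][A-Za-z0-9]*)([^>]*)>")
HREF_RE = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def words(text):
    """Split text into lower-cased alphanumeric tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def html_tokens(text, score_html_comments=False,
                score_html_tags="none", score_html_urls=False):
    """Return tokens for text; score_html_tags is "none", "valid" or "all"."""
    tokens = []

    def on_comment(match):
        # Flavor 2: a well-formed "<!-- ... -->" comment.
        if score_html_comments:
            tokens.extend(words(match.group(1)))
        return " "

    def on_tag(match):
        # Flavors 1 and 3: a known tag or an unknown one such as <asdf ...>.
        name, attrs = match.group(1).lower(), match.group(2)
        if score_html_urls:
            tokens.extend(url.lower() for url in HREF_RE.findall(attrs))
        if score_html_tags == "all" or (
                score_html_tags == "valid" and name in KNOWN_TAGS):
            tokens.append(name)
            tokens.extend(words(attrs))
        return " "

    text = COMMENT_RE.sub(on_comment, text)   # comments first
    text = TAG_RE.sub(on_tag, text)           # then every remaining tag
    tokens.extend(words(text))                # visible text is always scored
    return tokens
```

With all three options left at their defaults, only the visible text yields 
tokens, which matches the No/None behavior described above.  Turning on 
score_html_tags="all" is the setting that would let "asdf" style junk into 
the token list.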

At 09:01 AM 3/7/03, Nick Simicich wrote:
>At 11:17 AM 2003-03-06 -0500, David Relson wrote:

... [snip] ...

>>Unfortunately, spammers don't always include the dashes.  Since 
>>bogofilter's purpose is to recognize spam, there's valid reason for it to 
>>process messages without the dashes.  Life would be simpler if all html 
>>email followed the standards, but it doesn't.  Bogofilter exists in "the 
>>real world" so should be able to deal with real messages.
>
>When a tag is coded as <! it ends with >.  That is not a comment, it is a 
>tag, and it should be processed as a tag. I covered this in another 
>message. The browsers that I have tested simply ignore tags that they do 
>not know about, so that they do not cause eyespace breaks.  Therefore they 
>should be elided from tokens.  In the html parser I gave you, I had 
>experimented with which tokens caused eyebreaks.

I know of the eyebreak experiment.  I hadn't understood the distinction 
you've made between html comments and invalid html tags.  I was lumping 
them together because they seemed to call for the same treatment.

>>[...]
>>I'm waiting for feedback from the bogofilter user community on whether to 
>>process the innards of html comments and tags.  So far that feedback has 
>>been lacking.
>>
>>I know of one significant problem in accepting "innards" and that's the 
>>random character sequences spammers have started to include.  I grepped 
>>some recent email for "asdf" (straight from the keyboard!) and found that 
>>148 of the 2064 spam I received last month had that "random" character 
>>sequence.  So, perhaps I'm making the case for using tokens from inside 
>>html tags/comments, but the concern is that random sequences will 
>>consume large amounts of database space and will make bogofilter less 
>>accurate.
>
>Sure seems like a spam indication to me, that asdf thing.  The way around 
>your above issue is to start aging database contents to adapt to new 
>spam.  There is also "reducing", hashing the tokens.  One of our 
>competitors is hashing not only individual tokens but also phrases, up to 
>six words, and putting the hashes into the database.

Bogofilter is not presently in the hashing business.  Bogofilter figures 
"tokens is tokens", though it does convert upper case to lower case.

>>>And I think that there should not be a flag to make that processing 
>>>optional.  It is a useless complexity.  OK, we can have a flag if we 
>>>want, but we should ignore it. :-)
>>>
>>>But compromise for no reason is simply going to add complexity for no 
>>>reason.
>>
>>I respectfully submit that there _is_ a reason, so adding the option is 
>>providing something useful - not needless complexity.
>
>The point is that there is (or should already be) processing that elides 
>those tokens and it is not the comment processing.  They should be 
>casually elided by the "unknown tag processing" (because that is how they 
>are processed by browsers, not as comments).  Just allow the first 
>character of an unknown tag to be [A-Za-z!0-9].

I'll see what can be done.
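
As a first cut at the "unknown tag processing" Nick describes, something 
like the following Python sketch might do.  The name elide_tags and the 
exact pattern are my own assumptions; only the character class 
[A-Za-z!0-9] for the first character comes from Nick's message.  Tags are 
removed without leaving a space, on the theory that unknown tags cause no 
eyespace break in a browser.

```python
import re

# "<", an optional "/", one character from [A-Za-z!0-9], then everything
# up to the next ">".  Known and unknown tags are treated alike.
ANY_TAG_RE = re.compile(r"</?[A-Za-z!0-9][^>]*>")


def elide_tags(text):
    """Drop every tag-like span and keep only the visible text."""
    return ANY_TAG_RE.sub("", text)
```

For example, elide_tags("vi<!asdf>ag<xyz1 foo>ra") gives back "viagra", and 
a "comment" coded without the closing dashes, such as "<!-- note >", is 
dropped as an ordinary unknown tag.  A well-formed comment that contains a 
">" inside it would be cut short by this pattern, so real comments would 
still need their own handling ahead of this step.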