What to do for HTML comment processing ???
Nick Simicich
njs at scifi.squawk.com
Fri Mar 7 15:01:12 CET 2003
At 11:17 AM 2003-03-06 -0500, David Relson wrote:
>Hi Nick,
>
>At 02:41 AM 3/6/03, Nick Simicich wrote:
>>At 05:50 PM 2003-03-04 -0500, David Relson wrote:
>>
>>>Nick,
>>>
>>>Your feedback on processing html comments is appreciated. May I
>>>suggest a compromise?
>>
>>Why? Unless you have a REASON! Personal preference is not a good
>>reason. Because we have done it that way was not good enough to block
>>the flag changes.
>>
>>And "because it is a standard" is not a good reason. NO ONE FOLLOWS
>>STANDARDS.
>
>The standard says a comment begins with "<!--" and ends with "-->", which
>is what bogofilter is presently doing. What bogofilter is not doing is
>nesting (since your research showed that none of the browsers is doing that).
>
>Unfortunately, spammers don't always include the dashes. Since
>bogofilter's purpose is to recognize spam, there's valid reason for it to
>process messages without the dashes. Life would be simpler if all html
>email followed the standards, but it doesn't. Bogofilter exists in "the
>real world" so should be able to deal with real messages.
When a tag is coded as <! it ends with >. That is not a comment, it is a
tag, and it should be processed as a tag. I covered this in another
message. The browsers that I have tested simply ignore tags that they do
not know about, so those tags do not cause eyespace breaks. Therefore such
tags should be elided from the tokens. In the html parser I gave you, I had
experimented with which tokens caused eyebreaks.
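
A minimal sketch of that distinction, in Python rather than bogofilter's C. The name scan_bang and both patterns are illustrative assumptions, not bogofilter's lexer: a "<!--" that has a closing "-->" is a comment, and anything else starting with "<!" is a tag that ends at the first ">".

```python
import re

# Hypothetical patterns for illustration only.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # standard comment, no nesting
BANG_TAG = re.compile(r"<![^>]*>")              # other "<!" markup ends at the first ">"

def scan_bang(text, pos):
    """Classify the markup starting at text[pos]; return (kind, end_offset)."""
    m = COMMENT.match(text, pos)
    if m:
        return "comment", m.end()
    m = BANG_TAG.match(text, pos)
    if m:
        return "tag", m.end()
    # No closing ">" at all: treat the "<" as ordinary text and move on.
    return "text", pos + 1
```

With these patterns, "<!x>" is reported as a tag, so it would be handled by whatever processing already deals with unknown tags.
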
>[...]
>I'm waiting for feedback from the bogofilter user community on whether to
>process the innards of html comments and tags. So far that feedback has
>been lacking.
>
>I know of one significant problem in accepting "innards" and that's the
>random character sequences spammers have started to include. I grepped
>some recent email for "asdf" (straight from the keyboard!) and found that
>148 of the 2064 spam I received last month had that "random" character
>sequence. So, perhaps I'm making the case for using tokens from inside html
>tags/comments, but the concern is that random sequences will consume large
>amounts of database space and will make bogofilter less accurate.
Sure seems like a spam indication to me, that asdf thing. The way around
your above issue is to start aging database contents to adapt to new
spam. There is also "reducing", hashing the tokens. One of our
competitors is hashing not only individual tokens but also phrases, up to
six words, and putting the hashes into the database.
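
A rough sketch of the hashing idea, in Python. I do not know the details of the other filter's scheme, so the hash function (BLAKE2b), the 8-byte digest size, and the name hashed_features are all assumptions made for illustration. Every run of one to six consecutive words is reduced to a fixed-size hash, so a long random string costs the same database space as a short word.

```python
import hashlib

MAX_PHRASE_WORDS = 6  # "up to six words", as described above

def hashed_features(words, max_len=MAX_PHRASE_WORDS, digest_bytes=8):
    """Return one fixed-size hex hash per run of 1..max_len consecutive words."""
    features = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            phrase = " ".join(words[start:start + length])
            digest = hashlib.blake2b(phrase.encode("utf-8"),
                                     digest_size=digest_bytes).hexdigest()
            features.append(digest)
    return features
```

For example, hashed_features(["buy", "cheap", "pills"]) yields six hashes: three single words, two two-word phrases, and one three-word phrase.
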
>>And I think that there should not be a flag to make that processing
>>optional. It is a useless complexity. OK, we can have a flag if we
>>want, but we should ignore it. :-)
>>
>>But compromise for no reason is simply going to add complexity for no reason.
>
>I respectfully submit that there _is_ a reason, so adding the option is
>providing something useful - not needless complexity.
The point is that there is (or should already be) processing that elides
those tokens and it is not the comment processing. They should be casually
elided by the "unknown tag processing" (because that is how they are
processed by browsers, not as comments). Just allow the first character of
an unknown tag to be [A-Za-z!0-9].
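
As a sketch of what that could look like (Python, not bogofilter's actual lexer; the set of break-causing tags and the name visible_text are hypothetical): a tag name may start with any of [A-Za-z!0-9], tags known to cause a break become a space, and every other tag is elided so the text on either side joins up.

```python
import re

# Hypothetical, incomplete list of tags that cause a visible break.
BREAKING_TAGS = {"p", "br", "div", "td", "tr", "li"}

# The first character of a tag name may be a letter, a digit, or "!".
TAG = re.compile(r"</?([A-Za-z!0-9][A-Za-z0-9-]*)[^>]*>")

def visible_text(html):
    """Replace break-causing tags with a space and elide all other tags."""
    def replace(match):
        name = match.group(1).lower()
        return " " if name in BREAKING_TAGS else ""
    return TAG.sub(replace, html)
```

Here visible_text("via<!x>gra") gives "viagra" and visible_text("one<br>two") gives "one two". Real "<!-- ... -->" comments that contain a ">" would still need the separate comment processing.
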
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!