HTML comment idea [was: How to avoid s p lit up wor ds?]

David Relson relson at osagesoftware.com
Sat Jan 18 15:12:16 CET 2003


At 02:28 PM 1/17/03, Chris Wilkes wrote:

>I'm starting to get a lot of spam that looks like:
>         buy  to ner  car tri dg es
>where the bad words are split up into 2 or 3 letter words.  Since BF
>throws out those words it could get by.
>
>What can BF do to combat this?  Granted, most spam like that has to
>contain a URL, which can be caught.
>
>Maybe a simple frequency count of spam words vs larger ones would catch
>this?
>
>Chris

Chris,

Having followed this thread for the past day or so, I have a couple of 
ideas.  Adding code to process html comments, e.g. "<!--sdfjaldf-->", is 
easy enough.  They can be removed and/or counted.  Since it's not clear that 
_everyone_ will want this, config file options could turn the capabilities 
on and off.  If comments are counted, bogofilter could score multiple 
comments as multiple spammy tokens.  Some (or all) of the following options 
could be added:

remove_html_comments=boolean
count_html_comments=number
score_html_comments=value

where "boolean" is true or false (with aliases of 1, 0, on, off, yes, no), 
"number" is the maximum value for the count, and "value" is a spamicity 
score (0.0 to 1.0).
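
To make the three value types concrete, here is a rough sketch of how such 
options might be validated.  It is written in Python for brevity rather than 
bogofilter's C, and everything beyond the option names and value ranges given 
above (the function names, the defaults, the error handling) is an assumption 
made for illustration only.

```python
# Hypothetical validation of the three proposed options.
# Only the option names and value ranges come from the proposal;
# the helper names and defaults below are illustrative assumptions.
TRUE_WORDS = {"true", "1", "on", "yes"}
FALSE_WORDS = {"false", "0", "off", "no"}


def parse_boolean(text):
    """Map the proposed boolean aliases onto True/False."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_options(lines):
    """Parse 'name=value' lines into a dict of validated settings."""
    # Assumed defaults: nothing removed, nothing counted, no score set.
    opts = {
        "remove_html_comments": False,
        "count_html_comments": 0,
        "score_html_comments": None,
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or name not in opts:
            raise ValueError(f"unrecognized option line: {line!r}")
        if name == "remove_html_comments":
            opts[name] = parse_boolean(value)
        elif name == "count_html_comments":
            count = int(value)
            if count < 0:
                raise ValueError("count must not be negative")
            opts[name] = count
        else:
            score = float(value)
            if not 0.0 <= score <= 1.0:
                raise ValueError("score must be between 0.0 and 1.0")
            opts[name] = score
    return opts
```

For example, parse_options(["remove_html_comments=yes", 
"count_html_comments=5", "score_html_comments=0.9"]) would enable removal, 
cap the count at 5, and set the per-comment score to 0.9.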

If either of the first two options is used, the html parser would 
preprocess each input line to remove and/or count comments.  The count would 
be capped at the specified max value.  When scoring the message, each 
counted comment could be treated as a token with the supplied score.
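
The preprocessing step described above might look roughly like the following.  
This is only a sketch in Python, not bogofilter code: the function names are 
made up, it works on one string at a time, and it ignores real-world 
complications such as comments that span several input lines.

```python
import re

# Non-greedy match so that each "<!-- ... -->" is found separately.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_and_count(text, remove=True, max_count=0):
    """Return (text, counted): comments optionally removed, count capped.

    'remove' and 'max_count' stand in for the proposed
    remove_html_comments and count_html_comments options.
    """
    found = len(COMMENT_RE.findall(text))
    counted = min(found, max_count)  # cap at the configured maximum
    cleaned = COMMENT_RE.sub("", text) if remove else text
    return cleaned, counted


def comment_scores(counted, score):
    """One spamicity value per counted comment (hypothetical scoring hook)."""
    return [score] * counted


cleaned, counted = strip_and_count(
    "buy to<!--x-->ner car<!--y-->tri<!--z-->dges", remove=True, max_count=2
)
print(cleaned)                       # buy toner cartridges
print(comment_scores(counted, 0.9))  # [0.9, 0.9]
```

Removing the comments rejoins the split-up words so the lexer sees "toner" 
and "cartridges" again, while the capped count keeps a message stuffed with 
hundreds of comments from dominating the score.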

Default values can be assigned to all these parameters.  Current behavior 
corresponds to "remove...=no" and "count...=0", with "score..." unused.

Do people want a feature like this?  What should the defaults be?  Would 
someone like to do the implementation?

David




