HTML comment idea [was: How to avoid s p lit up wor ds?]
David Relson
relson at osagesoftware.com
Sat Jan 18 15:12:16 CET 2003
At 02:28 PM 1/17/03, Chris Wilkes wrote:
>I'm starting to get a lot of spam that looks like:
> buy to ner car tri dg es
>where the bad words are split up into 2 or 3 letter words. Since BF
>throws out those words it could get by.
>
>What can BF do to combat this? Granted, most spam has to contain a URL,
>and that can be caught.
>
>Maybe a simple frequency count of spam words vs larger ones would catch
>this?
>
>Chris
Chris,
Having followed this thread for the past day or so, I have had a couple of
ideas. Adding code to process html comments, e.g. "<!--sdfjaldf-->", is
easy enough. They can be removed and/or counted. As it's not clear that
_everyone_ will want this to happen, config file options could turn the
capabilities on and off. If it's decided to count them, bogofilter could
score multiple comments as multiple spammy tokens. Some (or all) of the
following options could be added:
remove_html_comments=boolean
count_html_comments=number
score_html_comments=value
where "boolean" is true or false (with aliases of 1, 0, on, off, yes, no),
"number" is max value for the count, and "value" is a spamicity score (0.0
to 1.0).
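
To make the proposed value types concrete, here is a minimal Python sketch of how the three options might be parsed. It is only an illustration: the helper names (parse_bool, parse_options) and the "name=value" line handling are assumptions, not bogofilter's actual config code, and the defaults shown are one guess at "current behavior".

```python
# Hypothetical parser for the three proposed options (not bogofilter code).
TRUE_WORDS = {"true", "1", "on", "yes"}
FALSE_WORDS = {"false", "0", "off", "no"}

# Assumed defaults: nothing removed, nothing counted, no score supplied.
DEFAULTS = {
    "remove_html_comments": False,
    "count_html_comments": 0,
    "score_html_comments": None,
}


def parse_bool(text):
    """Accept true/false and the aliases 1, 0, on, off, yes, no."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_options(lines):
    """Parse 'name=value' lines into a dict, starting from DEFAULTS."""
    opts = dict(DEFAULTS)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, sep, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or name not in opts:
            raise ValueError(f"unknown option line: {raw!r}")
        if name == "remove_html_comments":
            opts[name] = parse_bool(value)
        elif name == "count_html_comments":
            count = int(value)  # "number": max value for the count
            if count < 0:
                raise ValueError("count_html_comments must be >= 0")
            opts[name] = count
        else:
            score = float(value)  # "value": spamicity score, 0.0 to 1.0
            if not 0.0 <= score <= 1.0:
                raise ValueError("score_html_comments must be in [0.0, 1.0]")
            opts[name] = score
    return opts
```

For example, parse_options(["remove_html_comments=yes", "count_html_comments=5"]) would turn removal on, cap the count at 5, and leave the score unset.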
If either of the first two options is used, the html parser would
preprocess each input line to remove and count comments. The count would
be capped at the specified max value. When scoring the message, each
counted comment could be given the supplied score.
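
The remove/count/score steps could be sketched roughly as follows, again in Python for illustration only. The function names are hypothetical, and the sketch assumes the text is handled as one string with a simple regular expression, so it also catches comments that span line breaks; a real line-by-line html parser would need to track an unterminated comment across lines.

```python
import re

# Matches "<!-- ... -->" non-greedily; DOTALL lets a comment span newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_and_count(text, max_count):
    """Remove html comments and return (cleaned text, capped comment count)."""
    cleaned, found = COMMENT_RE.subn("", text)
    return cleaned, min(found, max_count)


def comment_scores(count, score):
    """One spamicity value per counted comment, as if each were a spammy token."""
    return [score] * count


cleaned, count = strip_and_count("to<!--x-->ner car<!--yz-->tri<!--q-->dges", 2)
print(cleaned)                      # toner cartridges
print(comment_scores(count, 0.9))   # [0.9, 0.9]
```

Removing the comments rejoins the split words, so the ordinary tokenizer sees "toner cartridges" again, while the capped count keeps a message stuffed with comments from dominating the score.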
Default values can be assigned to all these parameters, with the defaults
matching current behavior: "remove...=no", "count...=no", and "score..."
unused.
Do people want a feature like this? What should the defaults be? Would
someone like to do the implementation?
David