invalid html warfare
David Relson
relson at osagesoftware.com
Thu May 29 00:34:40 CEST 2003
At 06:19 PM 5/28/03, Sam Hills wrote:
>>If the spammers are attempting to pollute our databases by flooding them
>>with unique trash tokens (snip)
>>
>I suspect that's what they're trying to do. Much of the spam I've seen
>recently has a lot of gibberish at the end.
>
>However, does such pollution have any effect on how BF ranks messages? If
>I understand BF's operation correctly, lots of gibberish words in the db
>would merely bloat the db (and thereby possibly slow BF down slightly),
>but not have any effect on the accuracy of BF's evaluation of msg's that
>_don't_ contain any of these previously-seen gibberish "words".
Sam,
You've got it right. New gibberish is added to the database when '-s', '-n',
or '-u' is used. After that, if the same gibberish appears again, it's
recognized as ham or spam (depending on which wordlist it went into the
first time). The effect is: (a) none, for new gibberish; or (b) guilt by
association, for repeated gibberish.
Either way, it doesn't hurt bogofilter.
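A rough sketch of the two cases, as a toy model only. This is not bogofilter's actual scoring code; the wordlist contents, the averaging rule, and the names `token_spamicity` and `score` are all made up for illustration.

```python
# Toy model only: NOT bogofilter's real scoring algorithm.
# Each wordlist maps token -> number of times it was registered.
spam_list = {"viagra": 40, "xqzjw": 1}   # "xqzjw": gibberish registered once as spam
good_list = {"meeting": 30, "viagra": 1}

def token_spamicity(token):
    """Fraction of sightings that were spam, or None if the token is unseen."""
    s = spam_list.get(token, 0)
    g = good_list.get(token, 0)
    if s + g == 0:
        return None  # never seen: no evidence either way
    return s / (s + g)

def score(tokens):
    """Average spamicity of the known tokens; 0.5 (neutral) if none are known."""
    known = [p for p in map(token_spamicity, tokens) if p is not None]
    return sum(known) / len(known) if known else 0.5

base = score(["meeting", "viagra"])
with_new = score(["meeting", "viagra", "kfjqpz"])    # case (a): brand-new gibberish
with_repeat = score(["meeting", "viagra", "xqzjw"])  # case (b): repeated gibberish
```

In this toy model `with_new` equals `base`, because the unseen token is simply ignored, while `with_repeat` comes out higher, because the repeated gibberish is already in the spam list.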
Also, stuff like "vi<junk>ag<junk>ra" is processed nicely: the junk is
discarded and the letter pairs join into a nicely spammish six-letter word :-)
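To illustrate the idea (this is not bogofilter's lexer, just a minimal sketch), assume "<junk>" means anything shaped like an HTML tag or comment. The regular expression and the name `strip_tags` are my own choices for the example.

```python
import re

# Assumption: "junk" is any run that starts with "<" and ends at the next ">".
TAG = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Delete tag-shaped runs so the letters on either side join up."""
    return TAG.sub("", text)

joined = strip_tags("vi<!--a1-->ag<xyz>ra")  # "viagra"
```

Once the pieces are rejoined, the result is an ordinary token that can be looked up in the wordlists like any other word.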
>>What if we removed from the database each token occurring only once in
>>the database? (bogoutil -c 1 *.db??) This would only be practical if
>>done on a sufficiently infrequent interval for "good data" to accumulate
>>more than one hit, but often enough to prevent database pollution.
>How frequently should this be done? I think that would depend on the
>volume of spam you receive. (I get a few hundred per day.)
I'm running bogofilter on a P-133 with 64 MB of RAM - not a high-end machine
by any means. My wordlists are:
spam - 15,500 messages, 5.5 MB
good - 35,000 messages, 16 MB
So far, I've seen no need to use database maintenance. I wrote the code
and tested it, but haven't needed it. YMMV.
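For anyone who does want to prune, the idea suggested above (drop tokens seen only once) can be sketched on an in-memory dictionary. This is only an illustration of the idea, not the bogoutil maintenance code; the function name `prune`, the `max_count` parameter, and the sample counts are hypothetical.

```python
def prune(wordlist, max_count=1):
    """Return a copy of wordlist without tokens whose count is <= max_count."""
    return {tok: n for tok, n in wordlist.items() if n > max_count}

# Hypothetical spam wordlist: token -> count
spam_list = {"viagra": 40, "xqzjw": 1, "kfjqpz": 1, "mortgage": 2}
pruned = prune(spam_list)
```

Here `pruned` keeps only "viagra" and "mortgage". The catch Sam mentions still applies: a legitimate token that has been seen just once so far is dropped along with the gibberish, so the interval between prunings matters.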
David