invalid html warfare

David Relson relson at osagesoftware.com
Thu May 29 00:34:40 CEST 2003


At 06:19 PM 5/28/03, Sam Hills wrote:

>>If the spammers are attempting to pollute our databases by flooding them 
>>with unique trash tokens (snip)
>>
>I suspect that's what they're trying to do.  Much of the spam I've seen 
>recently has a lot of gibberish at the end.
>
>However, does such pollution have any effect on how BF ranks messages? If 
>I understand BF's operation correctly, lots of gibberish words in the db 
>would merely bloat the db (and thereby possibly slow BF down slightly), 
>but not have any effect on the accuracy of BF's evaluation of msg's that 
>_don't_ contain any of these previously-seen gibberish "words".

Sam,

You've got it right.  New gibberish is added to the database when '-s', '-n', 
or '-u' is used.  After that, if the same gibberish appears it's recognized 
as ham or spam (depending on which wordlist it went into the first 
time).  The effect is: (a) none, for the new gibberish; or (b) guilt by 
association, for repeated gibberish.

Either way, it doesn't hurt bogofilter.
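
To make the two cases concrete, here is a minimal sketch in Python. It is a toy model, not bogofilter's actual scoring: it assumes unseen tokens are simply skipped and that known tokens are averaged. The function names and sample wordlists are hypothetical.

```python
# Toy model only; bogofilter's real calculation is more elaborate.
# Assumption: a token absent from both wordlists contributes nothing.

def token_spamicity(token, spam_counts, good_counts, n_spam, n_good):
    """Return a spam probability for a known token, or None if unseen."""
    s = spam_counts.get(token, 0)
    g = good_counts.get(token, 0)
    if s == 0 and g == 0:
        return None  # case (a): new gibberish, no information
    ps = s / n_spam
    pg = g / n_good
    return ps / (ps + pg)

def score(tokens, spam_counts, good_counts, n_spam, n_good):
    """Average the spamicity of known tokens; 0.5 if none are known."""
    probs = []
    for t in tokens:
        p = token_spamicity(t, spam_counts, good_counts, n_spam, n_good)
        if p is not None:
            probs.append(p)
    return sum(probs) / len(probs) if probs else 0.5

# Hypothetical wordlists: "xqzt" is gibberish registered once as spam.
spam_counts = {"viagra": 10, "xqzt": 1}
good_counts = {"meeting": 10}
n_spam, n_good = 10, 10

base = score(["meeting", "viagra"], spam_counts, good_counts, n_spam, n_good)
with_new = score(["meeting", "viagra", "zzzkq"],
                 spam_counts, good_counts, n_spam, n_good)
with_repeat = score(["meeting", "viagra", "xqzt"],
                    spam_counts, good_counts, n_spam, n_good)
```

In this model `with_new` equals `base` (never-seen gibberish changes nothing), while `with_repeat` is higher than `base` (gibberish already registered as spam counts against the message).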

Also, stuff like "vi<junk>ag<junk>ra" is handled well: the junk is 
discarded and the remaining letter pairs join into a nicely spammish 
six-letter word :-)
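
As an illustration of the effect (not bogofilter's lexer, which is not written in Python), the following sketch assumes the "junk" is anything enclosed in angle brackets and removes it before tokenizing. The function name is hypothetical.

```python
import re

# Assumption: "junk" means any run of text enclosed in angle brackets.
TAG_RE = re.compile(r"<[^>]*>")

def strip_embedded_tags(text):
    """Delete bracketed junk so the surrounding letters join up."""
    return TAG_RE.sub("", text)

word = strip_embedded_tags("vi<junk>ag<junk>ra")
```

Here `word` ends up as the single token "viagra", which can then be looked up in the wordlists like any other word.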

>>What if we removed from the database each token occurring only once in 
>>the database?  (bogoutil -c 1 *.db??)  This would only be practical if 
>>done on a sufficiently infrequent interval for "good data" to accumulate 
>>more than one hit, but often enough to prevent database pollution.
>>
>How frequently should this be done?  I think that would depend on the 
>volume of spam you receive.  (I get a few hundred per day.)

I'm running bogofilter on a P-133 with 64MB ram - not a high end machine by 
any means.  My wordlists are:
spam - 15,500 messages, 5.5 MB
good - 35,000 messages,  16 MB

So far, I've seen no need to use database maintenance.  I wrote the code 
and tested it, but haven't needed it. YMMV.
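
For readers who want to see what the pruning idea quoted above amounts to, here is a minimal sketch in Python. It models a wordlist as a plain dict of token counts, which is an assumption for illustration only; it is not bogoutil's code and does not touch real .db files.

```python
def prune_rare_tokens(wordlist, max_count=1):
    """Return a copy of wordlist without tokens seen max_count times or fewer."""
    return {tok: n for tok, n in wordlist.items() if n > max_count}

# Hypothetical wordlist: "xqzt" is one-off gibberish.
wordlist = {"viagra": 12, "xqzt": 1, "meeting": 2}
pruned = prune_rare_tokens(wordlist)
```

After pruning, "xqzt" is gone and the other two tokens remain.  The trade-off is the one raised in the quoted text: prune too often and legitimate tokens never get the chance to accumulate a second hit.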

David




