invalid html warfare

Sam Hills rcb at bbll.com
Thu May 29 00:19:51 CEST 2003


>If the spammers are attempting to pollute our databases by 
>flooding them with unique trash tokens (snip)
>  
>
I suspect that's what they're trying to do.  Much of the spam I've seen 
recently has a lot of gibberish at the end.

However, does such pollution have any effect on how bogofilter (BF) ranks
messages?  If I understand BF's operation correctly, lots of gibberish
words in the db would merely bloat the db (and thereby possibly slow BF
down slightly), but would not affect the accuracy of BF's evaluation of
messages that _don't_ contain any of these previously seen gibberish "words".
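
To make that reasoning concrete, here is a rough Python sketch. It is not
bogofilter's code; it assumes a simplified model in which each token's spam
estimate depends only on that token's own counts plus the total spam and
non-spam message counts, and a message is scored using only the tokens it
actually contains. The names token_prob and score, the smoothing constants,
and the sample counts are all made up for illustration. The message totals
are held fixed to isolate the effect of the extra tokens.

```python
import math

def token_prob(spam_n, ham_n, spam_msgs, ham_msgs, s=1.0, x=0.5):
    """Smoothed spam estimate for one token (hypothetical model).

    spam_n / ham_n are how often the token was seen in spam / non-spam;
    unseen tokens fall back to the neutral prior x.
    """
    sf = spam_n / spam_msgs
    hf = ham_n / ham_msgs
    n = spam_n + ham_n
    p = sf / (sf + hf) if (sf + hf) > 0 else x
    return (s * x + n * p) / (s + n)

def score(message_tokens, db, spam_msgs, ham_msgs):
    """Combine the estimates of the tokens present in the message only."""
    logit = 0.0
    for tok in sorted(set(message_tokens)):
        spam_n, ham_n = db.get(tok, (0, 0))
        p = token_prob(spam_n, ham_n, spam_msgs, ham_msgs)
        logit += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up wordlist: token -> (spam count, non-spam count)
db = {"viagra": (40, 1), "meeting": (1, 30)}
msg = ["meeting", "viagra", "hello"]
before = score(msg, db, 100, 100)

# Pollute the wordlist with unique gibberish tokens, each seen once in spam.
polluted = dict(db)
for i in range(10000):
    polluted["xqzv%d" % i] = (1, 0)
after = score(msg, polluted, 100, 100)

print(before == after)
```

In this model the gibberish entries are never looked up for a message that
does not contain them, so the only cost is the larger wordlist.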

>What if we removed from the database each token occurring only once in the 
>database?  (bogoutil -c 1  *.db??)  This would only be practical if done on a 
>sufficiently infrequent interval for "good data" to accumulate more than one 
>hit, but often enough to prevent database pollution.
>
How frequently should this be done?  I think that would depend on the 
volume of spam you receive.  (I get a few hundred per day.)
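
The pruning idea itself can be sketched in a few lines of Python. This
only illustrates the trade-off; it does not claim to reproduce what the
bogoutil option quoted above does. The wordlist is modeled as a plain dict,
and the name prune_singletons, its threshold parameter, and the sample
counts are hypothetical.

```python
def prune_singletons(db, threshold=1):
    """Return a copy of db without tokens seen threshold times or fewer.

    db maps token -> (spam count, non-spam count); this layout is an
    assumption made for the example.
    """
    return {tok: (spam_n, ham_n)
            for tok, (spam_n, ham_n) in db.items()
            if spam_n + ham_n > threshold}

# Made-up wordlist: two gibberish tokens and one legitimate but rare word,
# each seen exactly once, plus one well-established spam token.
db = {
    "viagra": (40, 1),
    "xqzv17": (1, 0),
    "kjhw92": (1, 0),
    "sabbatical": (0, 1),
}

pruned = prune_singletons(db)
print(sorted(pruned))
```

Note that the rare legitimate word is dropped along with the gibberish,
which is why the interval matters: prune too often and good tokens never
get the chance to accumulate a second hit.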




