Honeytraps and garbage removal

Peter Bishop pgb at csr.city.ac.uk
Tue Apr 15 19:15:02 CEST 2003


I have installed a shared-wordlist bogofilter on our mailserver.
I have a "honeytrap" account (i.e. a non-person) that is known to
spammers and hence provides a continuous stream of "pure" spam that
is fed directly into the shared spamlist.db. Users can provide non-spam
examples (usually their false positives) for goodlist.db via another
dummy account.

This seems to work OK, but the spamlist.db is steadily expanding
(a megabyte so far).

At some stage it might be necessary to weed out junk,
e.g. like those random strings that get added to spam these days.
But my problem is this - how can I find out which words are junk?

I need to know how often a word is hit by my users in "filter" mode,
NOT how many messages the word appeared in in "store" mode
(as my spam source messages are not necessarily the same as my
users' spam).

I guess I could use -u in filter mode to get that count, but so far I have tried
to avoid auto-updates as they could potentially pollute my spamlist.

It might be nice to have a separate "hitlist" option that maintains a count of 
words hit, and this info could be used for pruning the database.
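
A minimal sketch of what such a hitlist could look like, kept entirely
outside bogofilter. Everything here is an assumption for illustration: the
tokenizer is a crude stand-in (not bogofilter's lexer), the wordlist is a
plain Python set rather than spamlist.db, and the function names are
hypothetical. It only shows the idea of counting filter-mode hits separately
from the stored message counts, then listing never-hit words as pruning
candidates.

```python
import re
from collections import Counter

# Assumed stand-in tokenizer: runs of 3+ alphanumerics, lowercased.
# bogofilter's real lexer is more elaborate; this is only for illustration.
TOKEN_RE = re.compile(r"[A-Za-z0-9]{3,}")


def tokens(message):
    """Return the set of distinct tokens in a message."""
    return {t.lower() for t in TOKEN_RE.findall(message)}


def record_hits(message, spam_words, hitlist):
    """Count each spam-list word that occurs in a filtered message.

    The spam list itself is never modified, so filtering cannot pollute it.
    """
    hits = tokens(message) & spam_words
    hitlist.update(hits)
    return hits


def prune_candidates(spam_words, hitlist, min_hits=1):
    """List spam-list words seen in fewer than min_hits filtered messages."""
    return sorted(w for w in spam_words if hitlist[w] < min_hits)


# Hypothetical data: two real spam words and one random junk string.
spam_words = {"viagra", "mortgage", "xqzjkw"}
hitlist = Counter()

record_hits("Cheap VIAGRA and mortgage deals", spam_words, hitlist)
record_hits("Refinance your mortgage today", spam_words, hitlist)

candidates = prune_candidates(spam_words, hitlist)
print(candidates)
```

Keeping the hit counts in a separate structure means the honeytrap-fed
spam list stays read-only during filtering; only the pruning step, run by
hand, would ever remove words from it.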
-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





