Honeytraps and garbage removal
David Relson
relson at osagesoftware.com
Tue Apr 15 20:26:02 CEST 2003
Hello Peter,
At 01:15 PM 4/15/03, Peter Bishop wrote:
>I have installed a shared wordlist bogofilter on our mailserver.
>+ I have a "honeytrap" account (i.e. a non-person) that is known to
>spammers and hence provides a continuous stream of "pure" spam that
>is fed direct into the shared spamlist.db. Users can provide non-spam
>examples for goodlist.db via another dummy account (usually their false
>positives).
I like your idea of the "honeytrap" account. Clever :-)
>This seems to work OK, but the spamlist.db is steadily expanding
>(a megabyte so far).
One megabyte still rates as quite small. You have a long ways to go before
you really need to start worrying about any size impacts.
>At some stage it might be necessary to weed out junk,
>e.g. like those random strings that get added to spam these days.
>But my problem is this - how can I find out which words are junk?
>
>I need to know how often a word is hit by my users in "filter" mode
>NOT how many messages the word appeared in in "store" mode
>(as my spam source messages are not necessarily the same as my
>user spam).
>
>I guess I could use -u in filter mode to get that count, but so far I
>have tried to avoid auto-updates as it could potentially pollute my
>spamlist.
>
>It might be nice to have a separate "hitlist" option that maintains a
>count of words hit, and this info could be used for pruning the database.
You're the first person to suggest/request a feature of this nature. It's
probably not too difficult to implement. There already exists code for
opening/locking/closing wordlists and updating them with counts and
timestamps. The main routine for getting counts is compute_probability()
in robinson.c. You could add another wordlist type, perhaps "history", and
then update that wordlist for each token read. That'd give you a
per-token hit history. You'd also need a routine to select the tokens to
discard from that list. You could probably use bogoutil's dump_file option ('-d') and
a script for that info. Why not write it yourself and submit a patch for
bogofilter?
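
As a rough illustration of the idea (this is not bogofilter code), here is a
minimal Python sketch of a "history" hit counter plus a selection step. The
function names are hypothetical, and the dump layout is an assumption: each
line is taken to be "token count [date]", so check what your bogoutil version
actually prints before relying on it.

```python
from collections import Counter


def record_hits(history, tokens):
    """Count each distinct token once per message seen in filter mode."""
    history.update(set(tokens))


def parse_dump(lines):
    """Parse dump lines ASSUMED to look like 'token count [date]'."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            counts[fields[0]] = int(fields[1])
        except ValueError:
            continue  # count field was not a number
    return counts


def select_discards(stored, history, min_hits=1):
    """Return stored tokens hit fewer than min_hits times while filtering."""
    return sorted(t for t in stored if history.get(t, 0) < min_hits)


if __name__ == "__main__":
    # Made-up sample data standing in for a real spamlist dump.
    dump = ["viagra 120 20030410", "xqzjwv 1 20030401", "mortgage 45 20030412"]
    history = Counter()
    record_hits(history, ["viagra", "cheap", "viagra"])
    record_hits(history, ["mortgage", "viagra"])
    print(select_discards(parse_dump(dump), history))
```

The tokens it prints are the ones stored from the honeytrap spam that no
user message ever hit, which makes them the candidates for pruning. You would
probably want a higher min_hits and some minimum age before really deleting
anything.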
Have fun!
David