Honeytraps and garbage removal

David Relson relson at osagesoftware.com
Tue Apr 15 20:26:02 CEST 2003


Hello Peter,

At 01:15 PM 4/15/03, Peter Bishop wrote:

>I have installed a shared wordlist bogofilter on our mailserver.
>+ I have a "honeytrap" account (i.e. a non-person) that is known to
>spammers and hence provides a continuous stream of "pure" spam that
>is fed direct into the shared spamlist.db. Users can provide non-spam
>examples for goodlist.db via another dummy account (usually their false
>positives).

I like your idea of the "honeytrap" account.  Clever :-)

>This seems to work OK, but the spamlist.db is steadily expanding
>(a megabyte so far).

One megabyte still rates as quite small.  You have a long ways to go before 
you really need to start worrying about any size impacts.

>At some stage it might be necessary to weed out junk,
>e.g. like those random strings that get added to spam these days.
>But my problem is this - how can I find out which words are junk?
>
>I need to know how often a word is hit by my users in "filter" mode,
>NOT how many messages contained the word in "store" mode
>(as my spam source messages are not necessarily the same as my
>user spam).
>
>I guess I could use -u in filter mode to get that count, but so far I have
>tried to avoid auto-updates as it could potentially pollute my spamlist.
>
>It might be nice to have a separate "hitlist" option that maintains a
>count of words hit, and this info could be used for pruning the database.

You're the first person to suggest/request a feature of this nature.  It's 
probably not too difficult to implement.  There already exists code for 
opening/locking/closing wordlists and updating them with counts and 
timestamps.  The main routine for getting counts is compute_probability() 
in robinson.c.  You could add another wordlist type, perhaps "history", and 
then update that wordlist for each token read.  That'd give you a 
history.  You'd also need a routine to select the tokens to discard from 
that list.  Probably you could use bogoutil's dump_file option ('-d') and 
a script for that info.  Why not write it yourself and submit a patch for 
bogofilter?
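
For the script side, something along these lines might be a starting 
point.  It's only a sketch: it assumes each dump is plain text with one 
whitespace-separated "token count date" record per line, and that the 
"history" list (which doesn't exist yet) would be dumped in the same 
format.  The function names, the two file arguments and the min_hits 
threshold are all made up for illustration.

```python
import sys


def parse_dump(lines):
    """Parse dump records into a dict of token -> count.

    Assumes each record looks like "token count [date]"; lines that
    don't fit that shape are skipped.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            counts[fields[0]] = int(fields[1])
        except ValueError:
            continue
    return counts


def select_junk(spam_counts, hit_counts, min_hits=1):
    """Return spamlist tokens seen fewer than min_hits times while filtering."""
    return sorted(t for t in spam_counts if hit_counts.get(t, 0) < min_hits)


if __name__ == "__main__" and len(sys.argv) == 3:
    # argv[1]: text dump of spamlist.db; argv[2]: text dump of the history list.
    # latin-1 is used so that odd bytes in tokens never cause a decode error.
    with open(sys.argv[1], encoding="latin-1") as f:
        spam = parse_dump(f)
    with open(sys.argv[2], encoding="latin-1") as f:
        hits = parse_dump(f)
    for token in select_junk(spam, hits):
        print(token)
```

You'd dump spamlist.db with "bogoutil -d spamlist.db > spam.txt", dump 
the history list the same way, and feed both files to the script.  What 
comes out is only a list of candidates; actually deleting them from the 
wordlist is a separate step.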

Have fun!

David




