Advice on group filtering

Mark Constable markc at renta.net
Thu Feb 19 00:00:32 CET 2004


Yes I know, "don't do it", but...

Potential target userbase is 4000 typical dialup ISP clients.

The aim is to reduce spam from the current level of no filtering
(anything is an improvement) while getting as close to zero false
positives as possible, with a view to eventually /dev/null'ing
whatever is detected as spam. We would accept letting 10% to 20%
of spam through if the equation needs to be skewed that way to
keep false positives as near to zero as possible.

4000 x ~25 MB wordlist.db files (after 3 months) = 100 GB and
upwards. 1/4 million messages per day = approx 2 GB per day, so
no, we don't really want to queue it all up and then trawl through
it looking for any user's false positives. Providing the hard drive
space and even a dedicated server is not an issue, but dealing
with the extra complexity is. Approx 50% of our users have
absolutely no idea about anything computer-wise, so we can't
expect them to use IMAP to train bogofilter (for instance via
ThisIsSpam and ThisIsNotSpam folders that a cron job uses to
retrain their own dbs = disaster), and about the same number
get 1% real mail compared to spam. Nearly all of this class of
users have to use POP because IMAP is too complicated for them,
and expecting them to forward emails for further training is
also a disaster in waiting.
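
As a rough check on the storage and volume figures above, here is
a minimal Python sketch. It assumes decimal units (1 GB = 1000 MB)
and the function names are made up purely for illustration.

```python
# Figures quoted in this post; decimal units are an assumption.
USERS = 4000
WORDLIST_MB_PER_USER = 25      # approx size after 3 months
MESSAGES_PER_DAY = 250_000     # "1/4 million"
MAIL_GB_PER_DAY = 2


def per_user_wordlists_gb(users, mb_per_user):
    """Total disk needed if every user keeps their own wordlist.db."""
    return users * mb_per_user / 1000


def average_message_kb(gb_per_day, messages_per_day):
    """Average message size implied by the daily volume."""
    return gb_per_day * 1_000_000 / messages_per_day


total_gb = per_user_wordlists_gb(USERS, WORDLIST_MB_PER_USER)
avg_kb = average_message_kb(MAIL_GB_PER_DAY, MESSAGES_PER_DAY)

print(f"per-user wordlists: {total_gb:.0f} GB")
print(f"average message size: {avg_kb:.0f} KB")
```

So the 2 GB per day figure implies an average message of about
8 KB, and the per-user approach costs about 100 GB before any
mail is queued at all.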

So, if we could get rid of even 75% of spam from a single
staff-managed wordlist.db of 100 MB or so using a single C program
binary (as opposed to any SpamAssassin-like Perl chains) and
eventually be confident enough to ditch the 75%+ we do detect,
then we would be really happy and would have added only a
relatively simple extra layer of complexity to our server system.

Current setup is no /etc/bogofilter.cf (absolute defaults)
and a test base of 200 mostly POP users. The initial wordlist.db
is 46 MB with a .MSG_COUNT of spam = 58265 and good = 1373, and
the invocation of bogofilter is simply -e -p -d, so it's only
being trained on errors ATM. At 1/4 million messages a day I, uhm,
hesitate to use -u.
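
To put the .MSG_COUNT numbers above in perspective, a small Python
sketch of how lopsided the training set is. The counts are the ones
quoted above; the function name is hypothetical, and the sketch
makes no claim about what balance bogofilter actually needs.

```python
# .MSG_COUNT values quoted in this post.
SPAM_COUNT = 58265
GOOD_COUNT = 1373


def corpus_balance(spam, good):
    """Return (spam-to-good ratio, good share of all trained messages)."""
    total = spam + good
    return spam / good, good / total


ratio, good_share = corpus_balance(SPAM_COUNT, GOOD_COUNT)

print(f"spam:good is about {ratio:.1f}:1")
print(f"good mail is {good_share:.1%} of the trained messages")
```

That works out to roughly 42 spam messages for every good one, so
good mail is only a little over 2% of what the wordlist has seen.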

I'm seeking any advice or pointers on bogofilter.cf settings
and on how best to manage the wordlist.db over time, with a view
to putting up a public document, maybe a Wiki, on how to do 
this -- if it's workable at all. I feel we have a good test
case to prove whether this is possible or not so I hope any 
feedback will be useful for others too.

--markc



