Advice on group filtering
relson at osagesoftware.com
Wed Feb 18 23:57:55 EST 2004
On Thu, 19 Feb 2004 09:00:32 +1000
Mark Constable wrote:
> Yes I know, "don't do it", but...
> Potential target userbase is 4000 typical dialup ISP clients.
> The aim is to reduce spam (with no filtering now, anything is an
> improvement) while getting as close as possible to zero false
> positives, with a view to /dev/null'ing detected spam eventually;
> we'll accept letting 10% to 20% of spam through if that's what it
> takes to keep the false positives as near to zero as possible.
> 4000 x ~25MB wordlist.db's (after 3 months) = 100GB and upwards.
> 1/4 million messages per day = approx 2GB per day, so no, we don't
> really want to queue it all up and trawl through it looking for any
> user's false positives. Providing the hard drive space, or even a
> dedicated server, is not an issue, but dealing with the extra
> complexity is. Approx 50% of our users have absolutely no idea
> about anything computer-wise, let alone could be expected to use
> IMAP to train bogofilter (e.g. ThisIsSpam and ThisIsNotSpam folders
> managed by a cron job to retrain their own dbs = disaster), and
> about the same number get 1% real mail compared to spam. Nearly all
> of this class of users have to use POP because IMAP is too
> complicated for them, and expecting these users to forward emails
> for further training is also a disaster in waiting.
> So, if we could get rid of even 75% of spam with a single
> staff-managed wordlist.db of 100MB or so, using a single C binary
> (as opposed to any spamassassin-like perl chains), and eventually
> be confident enough to ditch the 75%+ we do detect, then we would
> be really happy, having added only a relatively simple extra layer
> of complexity to our server system.
> Current setup is no /etc/bogofilter.cf (absolute defaults) and a
> test base of 200 mostly-POP users. The initial wordlist.db is 46MB,
> with a .MSG_COUNT of spam = 58265 and good = 1373, and bogofilter
> is invoked with just -e -p -d, so it's only being trained on errors
> ATM. At 1/4 million messages a day I, uhm, hesitate to use -u.
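For context, an -e -p -d invocation like the one described is typically wired into delivery with something along these lines. This is only a sketch: the wordlist directory and spam folder names are made up, and the exact X-Bogosity header text varies between bogofilter versions, so check yours.

```
# Sketch of a delivery-time hook in procmail (hypothetical paths).
# -p adds an X-Bogosity header, -e exits 0 for spam and ham alike,
# -d points at the shared wordlist directory. No -u, so nothing is
# auto-registered; training happens separately, on errors only.
:0fw
| bogofilter -e -p -d /var/lib/bogofilter

# File anything bogofilter tagged as spam into a holding folder.
:0:
* ^X-Bogosity: (Yes|Spam)
caught-spam
```

Leaving -u out at this volume seems prudent: with auto-update every misclassification silently reinforces itself unless someone corrects it.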
I believe there's at least one ISP already using bogofilter and dealing
with a comparable daily message load. They've modified bogofilter to
use word pairs and, last I knew, had a wordlist in the 300-500MB range.
> I'm seeking any advice or pointers on bogofilter.cf settings
> and how best to manage the wordlist.db over time, with a view
> to putting up a public document, maybe a Wiki, on how to do
> this -- if it's workable at all. I feel we have a good test
> case to prove whether this is possible, so I hope any
> feedback will be useful for others too.
I'd suggest deciding how big an acceptable wordlist is. Once it
reaches that size, generate some statistics on the quantities of old
tokens, little-used tokens (very low counts), and neutral tokens
(scoring near 0.5). With that information you'll have an idea of how
to choose tokens for deletion and can create a maintenance plan.
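As a starting point, here's a sketch of such a statistics pass. It assumes the textual dump that bogoutil -d produces from a combined wordlist has one token per line as "token spamcount goodcount yyyymmdd"; verify the field order against your bogofilter version before relying on it, and treat the count/date/score cutoffs as placeholders to tune.

```shell
# Summarize a bogoutil dump to size up deletion candidates.
# Assumed dump line format: token spamcount goodcount yyyymmdd
# (check your bogofilter version; field order may differ).
summarize_dump() {
    # $1 = dump file, $2 = "old" cutoff date (YYYYMMDD)
    awk -v cutoff="${2:-20031101}" '
        $1 == ".MSG_COUNT" { next }        # skip the count pseudo-token
        {
            total++
            spam = $2; good = $3; date = $4
            if (spam + good <= 2) rare++   # little-used: very low counts
            if (date < cutoff)    old++    # not seen since the cutoff
            p = spam / (spam + good + 0.0001)  # crude spamicity
            if (p > 0.4 && p < 0.6) neutral++  # scores near 0.5
        }
        END {
            printf "tokens: %d  rare: %d  old: %d  neutral: %d\n",
                   total, rare, old, neutral
        }' "$1"
}

# Typical use, assuming bogoutil from the bogofilter package:
#   bogoutil -d wordlist.db > dump.txt
#   summarize_dump dump.txt 20031101
```

Once the counts look sane, the same awk conditions can print the tokens themselves for feeding back to bogoutil's maintenance/deletion options instead of just counting them.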