Advice on group filtering

Thu Feb 19 07:46:01 CET 2004

On Wed, 2004-02-18 at 18:00, Mark Constable wrote:
> The aim is to reduce spam from no filtering (anything is an
> improvement) but try to get to near zero false positives with 
> a view to /dev/null'ing them eventually, so we're aiming for
> no false positives at the expense of 10% to 20% spam if the 
> equation needs to be skewed that way to keep the false positives 
> down as near to zero as possible.

I have the same goal.  Zero false positives.  And so far, for about five
months, I've been successful.

I would suggest setting up a spam trap address to collect emails you
absolutely know are spam, and training only with those, at least in the
beginning.  This way you don't mix in too much personal bias.  Spread
the address around a variety of online forums, etc., to get a wide
selection of spam not targeted toward any one group.

Then set your robx value well into the ham side (between 0.35 and
0.45)... this artificially biases tokens bogofilter has not seen before
toward ham.  Additionally, use a relatively high min_dev (e.g. 0.25 to
0.35) so that only very strongly classified tokens contribute to the
final spamicity score of each message.  Finally, set your spam_cutoff
very high (>0.9).
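
To illustrate how robx and min_dev interact, here is a rough Python
sketch of Robinson-style token smoothing.  It is a simplification, not
bogofilter's actual code: the function names are made up for this
example, and the robs value of 0.1 is only an assumed weight for the
robx prior.

```python
def smoothed_prob(p_raw, n, robx=0.40, robs=0.1):
    """Blend a token's raw spam probability with the robx prior.

    p_raw: fraction of the token's occurrences that were in spam
    n:     number of trained messages containing the token
    robs:  weight given to the prior (assumed value, illustration only)
    """
    return (robs * robx + n * p_raw) / (robs + n)


def is_used(f_w, min_dev=0.30):
    """A token counts toward the score only if it is far enough from 0.5."""
    return abs(f_w - 0.5) >= min_dev


# A token never seen before (n = 0) scores exactly robx, which leans
# toward ham, and with a high min_dev it is ignored altogether.
unseen = smoothed_prob(p_raw=0.0, n=0)
print(round(unseen, 2), is_used(unseen))

# A token seen in 20 messages, nearly always spam, stays strongly spammy.
strong = smoothed_prob(p_raw=0.99, n=20)
print(round(strong, 3), is_used(strong))
```

With settings like these, a message needs several well-trained spammy
tokens before its score can approach a high spam_cutoff.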

Using these settings probably won't meet your goal of only 10-20% spam,
but it will almost certainly meet the 0% fp requirement.  Once you're
comfortable with this initial level of filtering, you can tweak some of
these parameters to be more aggressive toward spam.  I've been slowly
nudging my spam_cutoff down every week or two... I still don't get any
false positives (even in my "unsures") at a spam_cutoff of 0.60 and
ham_cutoff of 0.20.  Nonetheless, I'm very careful about making any
drastic changes.  But at this point, I'd be comfortable /dev/null'ing
anything with a spamicity higher than 0.80, and I may start doing that
soon.
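
As a rough illustration of the thresholds above, the following Python
sketch sorts a spamicity score into ham, unsure, spam, or discard.  The
function name and the exact handling of scores that sit right on a
cutoff are assumptions made for this example, not taken from
bogofilter.

```python
def route(spamicity, ham_cutoff=0.20, spam_cutoff=0.60, discard_above=0.80):
    """Pick a destination for a message from its spamicity score."""
    if spamicity > discard_above:
        return "discard"   # the /dev/null candidates
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity <= ham_cutoff:
        return "ham"
    return "unsure"        # between the two cutoffs


for score in (0.05, 0.40, 0.70, 0.95):
    print(score, route(score))
```

Lowering spam_cutoff gradually, as described, only widens the "spam"
band; the discard threshold can stay conservative until the results
have been watched for a while.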

> absolutely no idea about anything computer-wise let alone
> expecting them to be able to use IMAP to train bogofilter
> (by using ThisIsSpam and ThisIsNotSpam folders managed by a
> cron job to retrain, for instance, their own dbs = disaster)
> and about the same number get 1% real mail compared to spam.
> Nearly all this class of users have to use POP because IMAP is
> too complicated for them and expecting these users to onsend
> emails for further training is also a disaster in waiting.

I've written a program called bfproxy which I've been testing with my
less savvy users.  They just forward any spam they receive to an
address-book contact I named "0spam" (0 so that it is at the top of the
list).  A client-side "filter" puts spam into a "spam" folder (this is
POP3, so by "folder" I mean client-side) based on the X-bogosity header,
and if anyone sees a false positive, they can forward it to "0notspam".
Nobody has needed to do this yet, since false positives just aren't
happening.  Forwarding an email is simple enough that everyone can grasp
it.  For more savvy users I use an "unsure" folder too, and I run
bogofilter with -u.  The script that handles the forwards can be
downloaded from here:
http://www.orderamidchaos.com/bogofilter/bfproxy (documentation
included).  No new aliases or users required.  It assumes individual
databases though.
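
The client-side filter rule can be pictured with a short Python sketch.
This is not bfproxy, only an illustration of sorting on the header; it
assumes the X-Bogosity value begins with a verdict word such as "Spam",
"Ham" or "Unsure" (or "Yes"/"No") followed by a comma, and the function
name and sample message are made up.

```python
from email import message_from_string


def folder_for(raw_message):
    """Choose a client-side folder from the X-Bogosity header."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity", "")          # header lookup is case-insensitive
    verdict = header.split(",", 1)[0].strip().lower()
    if verdict in ("spam", "yes"):
        return "spam"
    if verdict == "unsure":
        return "unsure"
    return "inbox"                               # ham, or no header at all


sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Bogosity: Spam, tests=bogofilter, spamicity=0.999\n"
    "\n"
    "body text\n"
)
print(folder_for(sample))
```

Messages without the header fall through to the inbox, so a filtering
outage never hides mail.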

Tom
