Detecting false-positives

Mon May 23 00:08:31 CEST 2005

On Sun, 2005-05-22 at 16:30, David Carmean wrote:
> How do those of you detect false-positives?

Some simple common-sense signals to search for false positives:

1) If it's really important they'll try again, or call.  When they do,
go back, find the original false positive, and register it as ham.

2) If you suddenly stop getting emails from a source you used to hear
from regularly (newsletters, etc.), or a particular email you're
expecting (confirmations, etc.) never arrives, search your spam for that
sender.

3) Normally before someone's emails start becoming false positives, they
first become unsures... when you get a ham in your unsures, search your
spams for similar emails.
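
Signal 2 is easy to script.  Here is a minimal Python sketch, assuming
your spam folder is in a format the standard mailbox module can read
(mbox in the usage comment).  The function names and the path are
hypothetical, and matching on the From header alone is a simplification:

```python
import mailbox
from email.utils import parseaddr


def matches_sender(from_header, wanted):
    """True if the From address equals `wanted` (case-insensitive).

    If `wanted` starts with '@', match any address in that domain instead.
    """
    addr = parseaddr(from_header)[1].lower()
    wanted = wanted.lower()
    if wanted.startswith("@"):
        return addr.endswith(wanted)
    return addr == wanted


def search_spam(messages, wanted):
    """Return (From, Subject) pairs for messages sent by `wanted`."""
    hits = []
    for msg in messages:
        from_header = str(msg.get("From", ""))
        if matches_sender(from_header, wanted):
            hits.append((from_header, str(msg.get("Subject", ""))))
    return hits


# Usage (hypothetical path):
#   for sender, subject in search_spam(mailbox.mbox("/path/to/spam"),
#                                      "@example.com"):
#       print(sender, "|", subject)
```

Anything it turns up is a candidate false positive to look at by hand
and register as ham if it really is one.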

Some ways to reduce the clutter you need to search through:

1) Use some objective DNSBLs to prefilter spams from servers that have
landed themselves on such blacklists.  Objective ones only list known
sources of spam, and allow easy removal, thus they are the most
accurate.  Don't use subjective ones, as they sometimes block for
political rather than strictly technical reasons.  The
nice thing about DNSBLs is that the sender will know they were rejected
rather than the spam just ending up in a bit bucket... this will prompt
any false positives (which would be extremely rare anyway) to be resent
another way.

2) You could have the highest spamicities (e.g. 0.99 - 1.0)
automatically deleted at the MDA (procmail, maildrop, etc.) so as to not
clog up your spam folder with certain spam... this would make looking at
the remainder easier.  I don't see how you could ever have a false
positive above 0.99 unless you've been extremely sloppy in how you
register your corrections.  In fact, I'd be comfortable as low as 0.8
for automatic deletion, maybe even lower.

3) Adjust your bogofilter settings.  Your spam cutoff is extremely high
IMHO.  Going by your graph, I'd set it to 0.7 at the most.  I set mine
(0.42) just above my robx (0.41), which is within my min_dev range
(0.4-0.6).  If you find you're getting lots of hams in the upper
spamicities, try increasing your min_dev to exclude some of the less
certain tokens.  You can also lower your robx slightly, but don't let it
go below your min_dev range.  This will help bias new tokens slightly
toward ham.

4) Use exhaustive registration.  I haven't received any false positives
ever that I know of (except one short period where I corrupted my
database, but that doesn't really count), and the last ham classified as
unsure was a few months ago.  That, with a ham cutoff at 0.1.  The main
tactic I use to achieve this is that I repetitively register those hams
classified as unsure until they are classified as ham.  And I
repetitively register spam unsures and false negatives until they
classify as spam.  This serves to really polarize tokens appropriately.

5) Prefilter your headers to remove extraneous noise.  I've discussed
this on this list over the past week or two.
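
Items 2 and 4 above can be sketched in a few lines of Python.  This is
only an illustration, not my actual setup: the function names, the
action labels and max_rounds are made up, the cutoffs are the example
values from this message, and classify/register stand in for whatever
calls your filter (wrappers around the bogofilter command line, for
instance).  I'm also assuming a score exactly on a cutoff counts toward
the more extreme class:

```python
def route_by_score(score, ham_cutoff=0.10, spam_cutoff=0.42,
                   delete_cutoff=0.99):
    """Map a spamicity to an action (item 2: delete only the near-certain)."""
    if score >= delete_cutoff:
        return "delete"   # near-certain spam, never reaches the spam folder
    if score >= spam_cutoff:
        return "spam"     # kept in the spam folder for review
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def train_until(message, target, classify, register, max_rounds=10):
    """Item 4: register `message` as `target` until the filter agrees.

    classify(message) returns "ham", "spam" or "unsure";
    register(message, target) trains the filter once.
    Returns the number of registrations that were needed.
    """
    rounds = 0
    while classify(message) != target:
        if rounds >= max_rounds:
            # Give up rather than loop forever on a stubborn message.
            raise RuntimeError("still not %s after %d registrations"
                               % (target, rounds))
        register(message, target)
        rounds += 1
    return rounds
```

The max_rounds guard is there because a message that refuses to flip
after many registrations usually points to a problem elsewhere (a
corrupted database, say) that more training won't fix.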

It's possible you might miss a false positive occasionally.  But if the
sender didn't bother to try again or to contact you in another way, and
if you didn't miss the email enough to wonder about not receiving it,
then it probably wasn't very important.

If you regularly need to get emails from new prospects or clients which
you wouldn't know to miss, create an alias or subaddress for each source
to which you distribute your contact info.  Put those into a separate
folder you check more closely.  If one such alias or subaddress becomes
the victim of lots of spam, phase it out.  If you use a web form, insert
some extra tokens into the header which will bias those emails toward
ham.  Often the alias, subaddress, or your webserver tokens will do so
anyway.
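
The subaddress routing could look something like the following Python
sketch.  It assumes "+" as the subaddress separator (some setups use
"-" instead), and the folder names and the retired set are purely
hypothetical:

```python
from email.utils import parseaddr


def subaddress_tag(address, sep="+"):
    """Return the tag from user+tag@domain, or None if there is none."""
    local = parseaddr(address)[1].rpartition("@")[0]
    _base, found, tag = local.partition(sep)
    return tag if found and tag else None


def folder_for(to_header, retired=frozenset()):
    """Pick a folder from the recipient address of a message."""
    tag = subaddress_tag(to_header)
    if tag is None:
        return "inbox"
    tag = tag.lower()
    if tag in retired:
        # Subaddress has been phased out because it attracts spam.
        return "spam-review"
    return "prospects/" + tag
```

For example, folder_for("Tom <tom+acme@example.org>") gives
"prospects/acme", and the same address gives "spam-review" once "acme"
has been added to retired.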

These are the tactics that I use, and they have been quite successful so
far.

Tom