New version

Wed Mar 17 14:04:14 CET 2004

On 20040316 (Tue) at 2259:30 -0500, Tom Anderson wrote:
> On Tue, 2004-03-16 at 12:55, Greg Louis wrote:

> I wouldn't call it a straw man, as that implies it is false.  It is not
> a false case, just a worst case.

I disagree here -- let's settle on "artificial" rather than "false" --
and that is the actual point of the discussion (no hostility or
contempt intended, I respect your position and am arguing in the hope
that we can reconcile our opinions).  My expectation is that nobody
will ever get a message consisting entirely of unknowns, once the
training database gets to a reasonable size.  Similarly, it would
greatly surprise me if anyone with a production training db ever got a
message with no tokens outside (0.4,0.6) or even outside (0.05,0.95) --
that would imply that the nonspam message had not one single token that
was present in fewer than one in 20 spams (roughly, since message
counts, s and x will alter the actual proportion a bit).

Statistical methods are all about likelihood and never about certainty.
Anyone who depends on bogofilter _never_ to misclassify a nonspam as
positive needs to use a spam cutoff of 1.

Maybe there will be some benefit (word chosen deliberately) in looking at
a different example.  We've been talking about unknown tokens and their
role in a possible misclassification.  I contend that this is no
different conceptually from the following scenario:

In my training database the token "benefit" occurs 466 times in spam
and 169 times in nonspam.  There are 23,435 spam and 21,659 nonspam
messages that have been used in training.  My x is 0.610612, s is 0.0178;
therefore the token score fw is 0.7181822.  It would not be impossible
to concoct a nonspam with a very significant number of such moderately
spammy words; but in any non-contrived nonspam, it's extremely unlikely
that there wouldn't be enough strong-valued tokens to override.  (My
training db has seen "contrived" in 7 nonspams and no spams; fw is
0.0015488.)  Sure, you could put min_dev up to 0.25 and be safe from
these moderates; but whether that would really pay in terms of better
classification needs to be determined with controlled experimentation.
Some find it does for them, some (including me) find it doesn't for us.
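
For anyone who wants to check the arithmetic, here is a minimal Python
sketch.  It assumes the Robinson-style token score
f(w) = (s*x + n*p) / (s + n), with p computed from per-message
frequencies.  That formula is not spelled out in this message, but it
does reproduce both fw values quoted above.  The function name
token_score is made up for illustration.

```python
def token_score(spam_count, ham_count, spam_msgs, ham_msgs, x, s):
    """Robinson-style f(w); assumed formula, hypothetical helper name."""
    n = spam_count + ham_count
    if n == 0:
        # A token never seen in training scores exactly x.
        return x
    spam_freq = spam_count / spam_msgs   # fraction of spams containing it
    ham_freq = ham_count / ham_msgs      # fraction of nonspams containing it
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)

X, S = 0.610612, 0.0178                  # x and s as quoted above
SPAM_MSGS, HAM_MSGS = 23435, 21659       # training message counts

benefit = token_score(466, 169, SPAM_MSGS, HAM_MSGS, X, S)
contrived = token_score(0, 7, SPAM_MSGS, HAM_MSGS, X, S)

print(f"benefit   fw = {benefit:.7f}, deviation = {abs(benefit - 0.5):.4f}")
print(f"contrived fw = {contrived:.7f}, deviation = {abs(contrived - 0.5):.4f}")
```

The deviation for "benefit" comes out a little under 0.25, which is why
a min_dev of 0.25 would drop it, while "contrived" sits almost at the
nonspam extreme and would still count.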

The point I wish to make is that bogofilter works by means, and
because, of having accumulated a large body of information about the
characteristics of _the_actual_message_population_ and anything we do
to distort that information has a strong chance of _worsening_
bogofilter's overall classification accuracy.  Forcing allowance for
hypothetical worst cases, that will "never" (in the statistical sense,
i.e. very, very improbably) be seen in practice, is just such a
distortion.

> I feel that full training is not a practical option for most users,
> especially in large deployments where users do not have ssh or terminal
> access to the mail server.  In such cases, they will start with an empty
> database or a minimal database, and therefore will necessarily receive
> all-unknown and all-ambiguous emails.  Bogofilter would not be an option
> if these were allowed to be discarded or even drowned in a spam box. 
> This claim is not so humble, but a firm testament of the reality for me
> and my users.  And it is my humble opinion that keeping robx within the
> min_dev range serves to prevent false positives in these cases.

s/serves/may help/ and I don't disagree.  In fact we are in agreement
that _if_one's_training_db_is_small_ one should keep x within 0.5 +/-
min_dev.  I would say that in such cases one should keep bogofilter's
default parameters as they are, except play cautiously with the
spam_cutoff value, altering the rest only very very gingerly and with
the aid of a test corpus.  I've drafted a recommendation about keeping
x inside at first, and sent it along to David.
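
To make "x within 0.5 +/- min_dev" concrete: a token never seen in
training has n = 0, so under the usual Robinson formula its score is
simply x, and (as I understand min_dev) tokens scoring closer than
min_dev to 0.5 are left out of the calculation.  A small Python sketch
follows; the function name and the two min_dev values tried are
illustrative only, not recommendations.

```python
def unknown_token_counts(x, min_dev):
    """True if an unseen token (score == x) is far enough from 0.5 to be used."""
    return abs(x - 0.5) >= min_dev

x = 0.610612  # the x value quoted earlier in this message
for min_dev in (0.1, 0.25):  # illustrative settings only
    used = unknown_token_counts(x, min_dev)
    state = "count toward the score" if used else "are ignored"
    print(f"min_dev={min_dev}: unknown tokens {state}")
```

With x inside the band, an all-unknown message contributes nothing in
either direction, which is the protection being argued for when the
training db is still small.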

> Out of tens of thousands of emails over the past few months, I've not
> received a single false positive.  That's how it should be.  Bill
> McClain boasted 0.08% fp rate.  And while that sounds low at face value,
> I think it is horrible.

It's fine if you only get six messages a day -- about one fp every 208
days.  It all depends on volume.  I've had about 150,000 nonspams in
the 8 weeks since I last had a false positive, and that contents me
fairly well, though I hope it'll be another 8 weeks at least before I
get the next one.  For an ISP where that's an hour's volume, however,
such an fp rate means 12 unhappy customers every day, and really is
intolerable; I spoke last April with a vendor of a commercial spam
filter who said he had to achieve one fp in a million, at the cost of
letting through 13% -- thirteen percent!! -- of spam.  You're not alone in
abhorring fp, as you see.
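
The "one fp every 208 days" figure is just the reciprocal of rate times
volume.  A minimal Python sketch, with hypothetical helper names, for
plugging in one's own numbers:

```python
def fps_per_day(fp_rate, nonspam_per_day):
    """Expected false positives per day at a given rate and volume."""
    return fp_rate * nonspam_per_day

def days_between_fps(fp_rate, nonspam_per_day):
    """Expected number of days between false positives."""
    return 1.0 / fps_per_day(fp_rate, nonspam_per_day)

# 0.08% fp rate at six nonspam messages a day
print(round(days_between_fps(0.0008, 6)))  # → 208
```

The same rate that is tolerable at six messages a day becomes a steady
stream of complaints once nonspam_per_day is in the millions.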

I get 300 to 500 spams a day and they're quarantined, not discarded.  I
scan the quarantine daily but that's not certain to detect every fp. 
Eventually, however, they get carefully sifted for use in training
and/or experimentation.  If there were fp's, I'd catch them then at the
latest.  So I really am getting fewer than 1/150,000 (and hoping for
1/300,000 but that's slow to measure :)

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |