New version

Greg Louis glouis at dynamicro.on.ca
Tue Mar 16 16:55:12 CET 2004


On 20040316 (Tue) at 0815:27 -0500, Tom Anderson wrote:
> On Tue, 2004-03-16 at 07:46, Greg Louis wrote:
> > robx        = 0.610600 (6.11e-01)
> > robs        = 0.017800 (1.78e-02)
> > min_dev     = 0.020000 (2.00e-02)
> > ham_cutoff  = 0.281000 (2.81e-01)
> > spam_cutoff = 0.532200 (5.32e-01)
> > 
> > gives me 1.1% fn and I haven't had an fp in 8 weeks now (150,000-odd
> > messages).  Same basic setup: a somewhat spammy robx and minimal
> > minimum deviation.  Unknowns bias the scoring spamward, which is ok,
> > because -- especially these days -- spams do contain more unknowns. 
> > Ok, at least, if you have enough registered nonspam to balance out that
> 
> Wow, that sounds incredibly dangerous.  If I sent you an email about a
> subject you've never received before, maybe about igpe atinle, or
> silghlty rareraegnd lteetrs, or lysdexia, or some obscure science or
> sport with a strange vernacular, then you would likely classify it as
> spam.

Only if you also carefully avoided using words strongly associated with
nonspam.

>  I'd much rather get it as unsure, and at least have a chance to
> register it as spam once.  Therefore, the robx ought to be less than the
> spam_cutoff

Sorry, but this betrays a fundamental misconception on your part.  The
values of x and the spam cutoff are not to be compared in that way,
because they are not linearly related _at_all_.  Remember, the score
that the spam cutoff is compared against is calculated by Fisher's
method of combining probabilities, not the old Robinson geometric-mean
thing; a message consisting of ten tokens with fw of 0.532 (smaller
than the spam_cutoff, although not much so) would still score 0.5637.
The value of robx is supposed to be a guess at how likely it is that an
unknown token is to be found in spam.  In my message corpus, that
likelihood really is around 0.6, so that's what the prior should be.
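
The ten-token figure can be checked with a short sketch.  It assumes
the score is combined the way Robinson's chi-square (Fisher) scheme is
usually described: two chi-square tail probabilities, one from the
f(w) values and one from their complements, averaged into a single
number.  The function names are hypothetical, not bogofilter's own.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variate, even df only.

    For df = 2n the tail has the closed form
    exp(-x/2) * sum_{i<n} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(fw_values):
    """Combine per-token f(w) values into one score in [0, 1].

    Assumption: score = (1 + spam_evidence - ham_evidence) / 2, with
    each evidence term a chi-square tail on 2n degrees of freedom.
    """
    n = len(fw_values)
    spam_evidence = chi2_sf_even(
        -2.0 * sum(math.log(f) for f in fw_values), 2 * n)
    ham_evidence = chi2_sf_even(
        -2.0 * sum(math.log(1.0 - f) for f in fw_values), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0

# Ten tokens, each with f(w) = 0.532, just under spam_cutoff = 0.5322.
score = fisher_score([0.532] * 10)
print(f"{score:.4f}")
```

Under these assumptions the result should land close to the 0.5637
quoted above, that is, over the spam cutoff even though every
individual f(w) is under it.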

> if not within the min_dev range.  Biasing unknowns strongly
> toward spam (above the cutoff and min_dev) is crazy IMHO.

Strong words, whether your O be H or not.

If you actually sent me away essagemay otallytay inway igpay atinlay,
it might indeed get billed as spam -- but remember that your email
headers and such like are probably in my training database with fairly
high nonspam counts by now, so Fisher might well get it right even
then.  And frankly, the chance of a total stranger sending me
ibberishgay I'm interested in is small enough not to frighten me much.
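
A sketch with made-up numbers shows how a few well-known header tokens
could offset a body full of unknowns.  Everything specific here is an
assumption for illustration: thirty unknown body tokens, six header
tokens each seen 200 times in nonspam and never in spam, equal-sized
spam and nonspam corpora, Robinson's f(w) = (s*x + n*p) / (s + n), and
the same chi-square combining as in Robinson's scheme.  The function
names are hypothetical.

```python
import math

ROBX, ROBS, MIN_DEV = 0.6106, 0.0178, 0.02   # values from this thread

def token_fw(spam_count, ham_count):
    """Robinson's f(w); an unseen token (n = 0) comes out at ROBX.

    Simplification: p is the raw spam fraction, which assumes the spam
    and nonspam corpora are the same size.
    """
    n = spam_count + ham_count
    p = spam_count / n if n else 0.0
    return (ROBS * ROBX + n * p) / (ROBS + n)

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability, closed form for even df."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(fw_values):
    """Score in [0, 1]; tokens within MIN_DEV of 0.5 are ignored."""
    used = [f for f in fw_values if abs(f - 0.5) >= MIN_DEV]
    if not used:
        return 0.5
    n = len(used)
    spam_evidence = chi2_sf_even(
        -2.0 * sum(math.log(f) for f in used), 2 * n)
    ham_evidence = chi2_sf_even(
        -2.0 * sum(math.log(1.0 - f) for f in used), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0

unknown_body = [token_fw(0, 0)] * 30      # hypothetical gibberish body
known_headers = [token_fw(0, 200)] * 6    # hypothetical nonspam headers

unknown_only = fisher_score(unknown_body)
mixed = fisher_score(unknown_body + known_headers)
print(f"unknowns only: {unknown_only:.3f}")
print(f"with headers:  {mixed:.3f}")
```

With these invented counts the unknowns alone score above spam_cutoff,
while adding the six header tokens pulls the score below ham_cutoff.
How far real headers would pull depends entirely on what is actually
in the training database.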

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



