min_dev vs spam_cutoff [was: spam cutoff less than neutral?]

Greg Louis glouis at dynamicro.on.ca
Tue Feb 24 23:22:39 CET 2004


On 20040224 (Tue) at 08:57:05 -0500, Tom Anderson wrote:
> On Tue, 2004-02-24 at 07:43, David Relson wrote:
> > I think Tom Allison is on track here.  There is little or no relation
> > between min_dev and spam_cutoff.  min_dev says ignore neutrally scored
> > tokens.  spam_cutoff says to label as spam those messages with lots of
> > high scoring tokens.  These parameters address different realms.

You will, however, find that varying min_dev does alter the optima for
robs and robx (robs is the weight given to robx when a token has been
seen before, but only a few times).  It does that by letting in more
or fewer of the tokens to which robs and robx chiefly apply.  And
altering robs and robx can produce quite a variation in the
distribution of message scores, and hence in the cutoff values needed.
(If there's any value left in the pre-bogotune howto I wrote, it's
that it explains the _order_ in which manual tuning should proceed.)
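
To make the interplay concrete, here is a rough Python sketch of
Robinson's f(w) smoothing and the min_dev filter as I understand them
(the names, and the robs and min_dev defaults, are illustrative, not
bogofilter's actual code):

    def f_w(p_w, n, robx=0.415, robs=0.001):
        """Robinson's smoothed token probability.
        p_w  -- raw spam probability of the token
        n    -- number of messages the token has appeared in
        robx -- score assumed for a never-seen token
        robs -- weight given to robx when n is small
        """
        return (robs * robx + n * p_w) / (robs + n)

    def scorable(fw, min_dev=0.1):
        """Drop near-neutral tokens: only values of f(w) at least
        min_dev away from 0.5 contribute to the message score."""
        return abs(fw - 0.5) >= min_dev

Lowering min_dev admits more near-neutral tokens, and the rarely-seen
tokens (small n) are exactly the ones whose f(w) is pulled hardest
toward robx -- hence the interaction.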

> > The robx parameter _does_ relate (somewhat) to spam_cutoff.  If you
> > recall, it's the value given to unknown words and has a default of
> > 0.415.  _If_ you ever got a message made up of never-before-seen
> > tokens, you'd expect its score to be 0.415.  Since we prefer false
> > negatives to false positives, we want robx to be less than
> > spam_cutoff.  If robx > spam_cutoff, a message of all unknown words
> > would be scored as spam.
> 
> So I'll continue reducing my spam_cutoff until my spam unsures are
> reduced, but not beyond my robx.  Thanks.
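
Right, and the f_w sketch above shows why robx bounds that case: with
n = 0, the smoothing formula collapses to robx, so a message made
entirely of unseen tokens scores about robx regardless of anything
else.

    # f(w) = (robs*robx + 0*p_w) / (robs + 0) = robx
    assert abs(f_w(p_w=0.9, n=0) - 0.415) < 1e-9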

Gary points out that the starting value of robx should be the average
f(w) over all tokens seen some reasonable minimum number of times.  You
know where 0.415 came from?  It was my average f(w) rather more than a
year ago; it got put into the code of that time and hasn't been updated
since.  With the increase in spam these days, I'm now at about 0.52 at
home and 0.6 at work.  What's more, and this is why bogotune plays
around with it, that average really is only a good starting point from
which to determine the optimum.
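
Computing that starting value is straightforward.  A hypothetical
sketch, assuming a token_counts mapping of token -> (spam count, ham
count) and corpus sizes spam_msgs and ham_msgs (placeholders, not
bogofilter internals):

    def initial_robx(token_counts, spam_msgs, ham_msgs, min_count=10):
        """Average f(w) over tokens seen at least min_count times."""
        fws = []
        for spam_n, ham_n in token_counts.values():
            if spam_n + ham_n < min_count:
                continue
            p_spam = spam_n / spam_msgs  # relative frequency in spam
            p_ham = ham_n / ham_msgs     # relative frequency in ham
            fws.append(p_spam / (p_spam + p_ham))
        return sum(fws) / len(fws)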

As pi points out, the final score combines the likelihood that the
message _is_ spam with the likelihood that it _isn't_.
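
One well-known way of making that combination is Robinson's
geometric-mean rule; bogofilter's actual Robinson-Fisher calculation
differs in detail, so treat this sketch as illustrative:

    import math

    def combine(fws):
        """P estimates the evidence for spam, Q the evidence against;
        S folds them into a single score between 0 and 1."""
        n = len(fws)
        P = 1.0 - math.prod(1.0 - f for f in fws) ** (1.0 / n)
        Q = 1.0 - math.prod(fws) ** (1.0 / n)
        return (1.0 + (P - Q) / (P + Q)) / 2.0

(Note that if every f(w) equals robx, S works out to exactly robx,
consistent with the unknown-token case above.)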

> Seeing as I'm getting lots of unsures near 0.5, is there any way to
> pass in options that will skew the scores toward 1.0 a little more?

If I felt such a need, I would probably raise robx above 0.5 and lower
min_dev to something minimal like 0.02 -- but I don't promise that
doing that will actually achieve the desired effect!  Changing
parameter values can shift the distributions of spam and nonspam scores
up or down the scoring scale, but it's also possible to flatten them
and increase the overlap -- something one doesn't want to do.
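
In config-file terms, that experiment would be something like the
following fragment of ~/.bogofilter.cf (values chosen to illustrate
the idea, not to recommend):

    # illustrative settings for the experiment described above
    robx=0.6
    min_dev=0.02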

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
