[glouis at dynamicro.on.ca: Re: New vs Old]

Greg Louis glouis at dynamicro.on.ca
Fri Mar 26 00:29:56 CET 2004


On 20040325 (Thu) at 2244:14 +0100, Pavel Kankovsky wrote:
> On Thu, 25 Mar 2004, Greg Louis wrote:
> 
> > No, in fact it's the spam cutoff that determines that balance. 
> > Unknowns are excluded by both sets, and the tiny s values ensure that
> > no significant prior weight is given to low-count tokens.
> 
> Hmmm...doesn't a tiny value of s ensure the exact opposite?
> 
> The lower the value of s is, the weaker is the "pull" towards x, ergo
> the lower the value of s is, the more significant low-count tokens are
> (assuming x is near 0.5).

What you say is right but what you missed was that x is known, in
statistical circles, as "the prior" -- so the confusion is my fault for
being geeky in the wrong context ;)  Sorry.

x is a guess a priori at what fraction of the time an unknown token
will appear in spam.  s is a weighting factor that determines how much
weight is given to x in the case where a token is not completely
unknown but has been seen very few times before.  A tiny s will mean that
the low-count tokens affect the score as if x didn't matter; a large
value of s means that x, more strongly than the p(w) ratio of the
token, will influence the f(w) score that is used (or not, depending on
min_dev) in calculating the overall message score.

So what I meant was that when s is small no significant weight is given
to x.
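
To make that concrete, here is a rough sketch in Python of the Robinson
formula f(w) = (s*x + n*p(w)) / (s + n), where n is the number of times
the token has been seen.  The function name f_w and the sample numbers
are made up for illustration; this is not bogofilter's actual code.

```python
def f_w(p_w, n, s, x):
    """Blend the prior x (weighted by s) with the observed ratio p(w)
    (weighted by the token count n)."""
    return (s * x + n * p_w) / (s + n)

# Hypothetical low-count token: seen twice, both times in spam, so p(w) = 1.0.
p_w, n, x = 1.0, 2, 0.5

tiny_s = f_w(p_w, n, s=0.01, x=x)    # x barely matters; f(w) stays near p(w)
large_s = f_w(p_w, n, s=10.0, x=x)   # x dominates; f(w) is pulled toward 0.5

print(f"s=0.01: f(w) = {tiny_s:.4f}")
print(f"s=10:   f(w) = {large_s:.4f}")
```

With these assumed values the tiny s leaves f(w) at about 0.998, while
the large s drags it down to about 0.583, i.e. most of the way to x.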

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



