Spam / ham registration issue
David Relson
relson at osagesoftware.com
Wed Mar 3 17:53:36 CET 2004
On Wed, 3 Mar 2004 10:58:08 -0500
Greg Louis wrote:
> On 20040303 (Wed) at 1346:21 +0100, Boris 'pi' Piwinger wrote:
...[snip]...
> I did the calculation manually, with R. It's conceivable that I blew
> it, since I used the wrong distribution function; the spamicities for
> 2 and 3 are higher than I reported initially. For a correct
> calculation see below. The per-token f(w) value for n=2 (one spam one
> nonspam) is within 0.1 (min_dev) of 0.5; that for n=3 is not, nor is
> that for n=4.
>
> > s <- 0.01
> > x <- 0.415
> > b <- 1
> > g <- 1
> > pw <- b/(b+g)
Hi Greg,
You're using token counts, without normalizing for message counts.
Bogofilter uses:
    bad_cnt  = (double) max(1, msgs_bad);
    good_cnt = (double) max(1, msgs_good);
    pw = ((b / bad_cnt) / (b / bad_cnt + g / good_cnt));
    fw = (robs * robx + n * pw) / (robs + n);
which is different :-)
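
To see how much the normalization matters, here is a minimal standalone sketch of the two formulas above. The function names calc_pw and calc_fw and the helper max_d are made up for illustration and are not taken from the bogofilter source; n is assumed to be b + g, as in Greg's "n=2 (one spam one nonspam)". The message counts in the note that follows are hypothetical.

```c
/* Illustrative helper: larger of two doubles. */
static double max_d(double a, double b)
{
    return a > b ? a : b;
}

/*
 * Per-token probability, normalized by message counts.
 * b, g: number of times the token was seen in spam / nonspam.
 * msgs_bad, msgs_good: number of spam / nonspam messages registered.
 * Assumes b + g > 0; otherwise the division is 0/0.
 */
double calc_pw(double b, double g, double msgs_bad, double msgs_good)
{
    double bad_cnt  = max_d(1.0, msgs_bad);
    double good_cnt = max_d(1.0, msgs_good);

    return (b / bad_cnt) / (b / bad_cnt + g / good_cnt);
}

/*
 * Robinson's f(w): pulls pw toward robx with strength robs.
 * n is assumed to be the total token count, b + g.
 */
double calc_fw(double pw, double n, double robs, double robx)
{
    return (robs * robx + n * pw) / (robs + n);
}
```

With robs = 0.01, robx = 0.415 and b = g = 1, equal message counts give pw = 0.5 and fw of about 0.4996, the same as the unnormalized b/(b+g). With, say, 100 spam and 200 nonspam messages registered, pw becomes 2/3 and fw about 0.665, which is outside min_dev of 0.5.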
David