Spam / ham registration issue

David Relson relson at osagesoftware.com
Wed Mar 3 17:53:36 CET 2004


On Wed, 3 Mar 2004 10:58:08 -0500
Greg Louis wrote:

> On 20040303 (Wed) at 1346:21 +0100, Boris 'pi' Piwinger wrote:

...[snip]...

> I did the calculation manually, with R.  It's conceivable that I blew
> it, since I used the wrong distribution function; the spamicities for
> 2 and 3 are higher than I reported initially.  For a correct
> calculation see below.  The per-token f(w) value for n=2 (one spam one
> nonspam) is within 0.1 (min_dev) of 0.5; that for n=3 is not, nor is
> that for n=4.
> 
> > s <- 0.01
> > x <- 0.415
> > b <- 1
> > g <- 1
> > pw <- b/(b+g)

Hi Greg,

You're using raw token counts (b and g) without normalizing them by the
number of spam and non-spam messages registered.

Bogofilter uses:

	bad_cnt  = (double) max(1, msgs_bad);
	good_cnt = (double) max(1, msgs_good);
	pw = ((b / bad_cnt) / (b / bad_cnt + g / good_cnt));
	fw = (robs * robx + n * pw) / (robs + n);

which is different :-)
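
To make the difference concrete, here is a small sketch that puts the two calculations side by side. The function names (pw_raw, pw_normalized, fw) are made up for illustration and are not Bogofilter's own. Taking n = b + g is an assumption, and the sketch assumes b + g > 0.

```c
/* Raw token-count ratio, as in the R session quoted above.
 * Assumes b + g > 0. */
double pw_raw(double b, double g)
{
    return b / (b + g);
}

/* Ratio normalized by message counts, following the snippet above.
 * msgs_bad and msgs_good are the numbers of spam and non-spam
 * messages registered; each is floored at 1 to avoid dividing by 0. */
double pw_normalized(double b, double g, long msgs_bad, long msgs_good)
{
    double bad_cnt  = (double) (msgs_bad  > 1 ? msgs_bad  : 1);
    double good_cnt = (double) (msgs_good > 1 ? msgs_good : 1);

    return (b / bad_cnt) / (b / bad_cnt + g / good_cnt);
}

/* Robinson's f(w): pulls pw toward robx with strength robs.
 * n is the token's total count (assumed here to be b + g). */
double fw(double robs, double robx, double n, double pw)
{
    return (robs * robx + n * pw) / (robs + n);
}
```

With b = g = 1 and hypothetical totals of 100 spam and 300 non-spam messages, pw_raw gives 0.5 and pw_normalized gives 0.75. Fed through fw with robs = 0.01 and robx = 0.415, that is roughly 0.4996 versus 0.7483, so with those counts the n=2 token is no longer within min_dev (0.1) of 0.5. The two versions agree only when the spam and non-spam message counts are equal.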

David



