Testing fisher

Greg Louis glouis at dynamicro.on.ca
Wed Jan 29 21:33:01 CET 2003


On 20030129 (Wed) at 1625:09 +0100, Boris 'pi' Piwinger wrote:
> Greg Louis wrote:
> 
> > Any chance you could repeat the experiment, just for
> > 0.025, 0.020 and 0.015, with the exact same training db and message
> > corpora, with both robx values?  We could be onto something big here.
> 
> algorithm    min_dev    spam_cutoff    test.spam    test.ham
>                                        total  F-N   total F-P
> robx=0.415
> fisher-2     0.015      0.6            4403   103   15605 3
> fisher-2     0.020      0.6            4403   104   15605 3
> fisher-2     0.025      0.6            4403   105   15605 3
> fisher-2     0.15       0.6            4403   184   15605 1
> fisher-2     0.20       0.6            4403   205   15605 1
> fisher-2     0.25       0.6            4403   203   15605 1
> 
> robx=0.48
> fisher-2     0.015      0.6            4403   103   15605 4
> fisher-2     0.020      0.6            4403   104   15605 3
> fisher-2     0.025      0.6            4403   105   15605 3
> fisher-2     0.15       0.6            4403   184   15605 1
> fisher-2     0.20       0.6            4403   205   15605 1
> fisher-2     0.25       0.6            4403   203   15605 1
> 
> Yes, the values do agree almost perfectly.
> 
And the difference is understandable ;-)  Thanks very much for taking
the trouble.

I'm running a large-scale test on another machine today; I don't know how
long it will take to finish.  It's very similar to yours, except that for
every combination of min_dev and robs (I'm varying robs too) I'm
picking a spam_cutoff just high enough to yield zero false positives
(0.000001 larger than the highest-scoring nonspam in a batch of unsures).
So the experimental datum is the number of false negatives for the given
min_dev and robs combination.  I'm staying with robx=0.415 for this test
even though the calculated value for my training db is 0.762; my instinct
is that it's not likely to be helpful to go outside the range 0.4-0.6,
and that it's safer to keep 0.5 >= robx > 0.4 -- that way, little-known
words move the classification away from spam, which I think helps avoid
false positives.
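
For anyone wanting to reproduce the procedure, here is a rough Python
sketch of the two ideas above; it is not bogofilter's code.  The function
names and the sample scores are made up for illustration.  The first
function assumes the usual form of Robinson's smoothing,
f(w) = (robs*robx + n*p(w)) / (robs + n), to show why a rarely seen word
lands near robx; the other two pick the zero-false-positive cutoff and
count the resulting false negatives.

```python
# Illustrative sketch only; names and sample scores are hypothetical.

EPSILON = 0.000001  # margin above the highest-scoring nonspam


def robinson_f(p, n, robs, robx):
    """Smoothed word score, assuming f = (robs*robx + n*p) / (robs + n).

    p is the raw spamminess estimate for the word and n the number of
    times it has been seen.  For small n the result stays close to robx.
    """
    return (robs * robx + n * p) / (robs + n)


def zero_fp_cutoff(ham_scores, epsilon=EPSILON):
    """Cutoff just above the highest nonspam score, so no nonspam is flagged."""
    return max(ham_scores) + epsilon


def count_false_negatives(spam_scores, cutoff):
    """Spam scoring below the cutoff is not flagged: a false negative."""
    return sum(1 for score in spam_scores if score < cutoff)


# A word never seen before (n = 0) scores robx; with robx < 0.5 it leans
# toward nonspam.
unknown_word = robinson_f(p=0.9, n=0, robs=0.01, robx=0.415)

# Made-up message scores for one (min_dev, robs) combination.
ham_scores = [0.01, 0.12, 0.58]
spam_scores = [0.40, 0.59, 0.97, 1.00]

cutoff = zero_fp_cutoff(ham_scores)
false_negatives = count_false_negatives(spam_scores, cutoff)
print(unknown_word, cutoff, false_negatives)
```

Repeating the last three steps for each min_dev and robs combination
gives one false-negative count per combination, which is the datum
described above.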

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



