Testing fisher
Greg Louis
glouis at dynamicro.on.ca
Wed Jan 29 21:33:01 CET 2003
On 20030129 (Wed) at 16:25:09 +0100, Boris 'pi' Piwinger wrote:
> Greg Louis wrote:
>
> > Any chance you could repeat the experiment, just for
> > 0.025, 0.020 and 0.015, with the exact same training db and message
> > corpora, with both robx values? We could be onto something big here.
>
> algorithm   min_dev   spam_cutoff   test.spam        test.ham
>                                     total    F-N     total    F-P
> robx=0.415
> fisher-2    0.015     0.6           4403     103     15605    3
> fisher-2    0.020     0.6           4403     104     15605    3
> fisher-2    0.025     0.6           4403     105     15605    3
> fisher-2    0.15      0.6           4403     184     15605    1
> fisher-2    0.20      0.6           4403     205     15605    1
> fisher-2    0.25      0.6           4403     203     15605    1
>
> robx=0.48
> fisher-2    0.015     0.6           4403     103     15605    4
> fisher-2    0.020     0.6           4403     104     15605    3
> fisher-2    0.025     0.6           4403     105     15605    3
> fisher-2    0.15      0.6           4403     184     15605    1
> fisher-2    0.20      0.6           4403     205     15605    1
> fisher-2    0.25      0.6           4403     203     15605    1
>
> Yes, the values do agree almost perfectly.
>
And the difference is understandable ;-) Thanks very much for taking
the trouble.
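
For anyone reading along who hasn't dug into the parameters: here's a
rough Python sketch of how I understand robx, robs and min_dev entering
the per-token scoring (Robinson-style smoothing). It's illustrative
only, not bogofilter's actual code, and the counts, helper names and
defaults shown are just examples.

def smoothed_prob(spam_hits, ham_hits, spam_msgs, ham_msgs,
                  robs=0.010, robx=0.415):
    """Robinson-style smoothing (illustrative, not bogofilter's code)."""
    if spam_hits + ham_hits == 0:
        return robx                      # never-seen token scores exactly robx
    s = spam_hits / float(spam_msgs)     # fraction of spams containing the token
    h = ham_hits / float(ham_msgs)       # fraction of nonspams containing it
    raw = s / (s + h)                    # raw per-token spam probability
    n = spam_hits + ham_hits
    # robs controls how hard rare tokens are pulled toward robx;
    # with robx below 0.5, little-known words lean the score toward ham.
    return (robs * robx + n * raw) / (robs + n)

def discriminating(token_probs, min_dev=0.1):
    """min_dev drops tokens whose score is within min_dev of neutral 0.5."""
    return [p for p in token_probs if abs(p - 0.5) >= min_dev]

That pull toward robx for rarely seen words is exactly why I'd rather
keep robx at or below 0.5, as I explain below.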
I'm running a large-scale test on another machine today; don't know how
long it will take to finish. It's very similar to yours, except that for
every combination of min_dev and robs (I'm varying robs too) I'm
picking a spam_cutoff just high enough to yield zero false positives
(0.000001 larger than the highest nonspam score in a batch of unsures).
So the experimental datum is the number of false negatives for the given
min_dev and robs combination. I'm staying with robx=0.415 for this test
even though the calculated value for my training db is 0.762; my
instinct is that it's not likely to be helpful to go outside the range
0.4-0.6, and that it's safer to keep 0.5 >= robx > 0.4 -- that way,
little-known words move the classification away from spam, which I
think helps avoid false positives.
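
If the cutoff-picking step isn't clear, here's a minimal Python sketch
of it; the score lists are placeholders for whatever you extract from
bogofilter's verbose output, and the function names are mine, not
bogofilter's.

def zero_fp_cutoff(nonspam_scores):
    # Put spam_cutoff 0.000001 above the highest-scoring nonspam,
    # so this parameter combination yields zero false positives.
    return max(nonspam_scores) + 0.000001

def false_negatives(spam_scores, cutoff):
    # The experimental datum: spams that score below the cutoff.
    return sum(1 for score in spam_scores if score < cutoff)

# e.g., for one (min_dev, robs) combination:
#   cutoff = zero_fp_cutoff(nonspam_scores)
#   fn     = false_negatives(spam_scores, cutoff)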
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |