Testing fisher

Wed Jan 29 14:05:25 CET 2003

On 20030129 (Wed) at 1149:34 +0100, Boris 'pi' Piwinger wrote:
> David Relson wrote:
> 
> > On the other hand, _you_ are using ROBX=0.415 for unknown words, but with 
> > min_dev=0.025 giving a discard range of 0.475 to 0.525, those unknown words 
> > are contributing to the message's spam score.  I'm wondering what effect 
> > that has on your test results.
> > 
> > Observation: "ROBX < EVEN_ODDS - min_dev" includes hammish words in the 
> > message score.
> > Hypothesis:  This contributes to the false negative count.
> > Experiment:  Change ROBX so that it's closer to EVEN_ODDS, say 0.48, and 
> > then rerun your test with min_dev=0.015, 0.020, and 0.025.
> > Expected result:  0.015 will have more false negatives than 0.025 (due to 
> > including/excluding unknown words).
> 
> 
> algorithm    min_dev    spam_cutoff    test.spam    test.ham
>                                        total  F-N   total F-P
> robx=0.415
> fisher-2        0.10          0.95     4186   364   15140  1
> fisher-2        0.25          0.60     4335   191   15362  0
> fisher-2        0.20          0.60     4221   184   15140  0
> fisher-2        0.15          0.60     4237   170   15251  0
> fisher-2        0.10          0.60     4221   139   15140  1
> fisher-2        0.075         0.60     4237   132   15251  1
> fisher-2        0.05          0.60     4237   116   15251  1
> fisher-2        0.035         0.60     4262   101   15251  1
> fisher-2        0.025         0.60     4262    89   15251  1
> fisher-2        0.02          0.60     4297    92   15362  1
> fisher-2        0.015         0.60     4295    92   15361  1
> fisher-2        0.00          0.60     4221   140   15140  1
> 
> robx=0.48
> fisher-2        0.025         0.60     4367   198   15479  0
> fisher-2        0.020         0.60     4367   204   15479  0
> fisher-2        0.015         0.60     4367   182   15479  0
> 
> So your expectation does not hold for me. Interestingly enough, I
> get significantly *more* FNs.

It is interesting.  Given a relatively high robs of 0.001, words with
low counts will have their probabilities pulled toward robx.  You have
raised robx from 0.415 to 0.48, and you'd think, as David did, that
this would (1) raise the average spamicity for messages with low-count
words, and (2) pull unknown words into the min_dev black hole at 0.025,
but not at 0.020 nor 0.015.  Your results tend to imply that the issue
here is not with unknown words.
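
To make the expectation concrete, here is a minimal sketch. It assumes
Robinson's smoothing f(w) = (robs*robx + n*p) / (robs + n) and a discard
test of the form |f(w) - EVEN_ODDS| < min_dev.  The function names and the
strict comparison are illustrative assumptions, not bogofilter's actual code.

```python
EVEN_ODDS = 0.5


def smoothed_prob(p, n, robx, robs):
    """Robinson's f(w): pull the raw probability p toward robx.

    n is the number of times the word was seen in training.
    """
    return (robs * robx + n * p) / (robs + n)


def is_discarded(fw, min_dev):
    """True when fw lies within min_dev of EVEN_ODDS (strict '<' assumed)."""
    return abs(fw - EVEN_ODDS) < min_dev


def unknown_word_counts(robx, min_dev, robs=0.001):
    """True when a never-seen word (n = 0) contributes to the message score."""
    fw = smoothed_prob(0.0, 0, robx, robs)  # with n = 0, f(w) reduces to robx
    return not is_discarded(fw, min_dev)


if __name__ == "__main__":
    for robx in (0.415, 0.48):
        for min_dev in (0.025, 0.015):
            state = "scored" if unknown_word_counts(robx, min_dev) else "discarded"
            print(f"robx={robx} min_dev={min_dev}: unknown words {state}")
```

Under these assumptions an unknown word is scored at robx=0.415 for every
min_dev in the table, while at robx=0.48 it is discarded at 0.025 and scored
at 0.015.  The 0.020 case sits exactly on the boundary, so it depends on
whether the real comparison is strict and on floating-point rounding.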

I note that your totals have grown; one improbable but not impossible
artefact could be that you've introduced a bunch of hard spams in the
most recent batch.  "Hard" in the sense of being hard to classify and
producing spamicities below 0.6.  Given that you're getting a hundred
more fn's with only 70 more spams under test, this doesn't seem a
plausible explanation.
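
As a rough check on that argument, the sketch below takes the six comparable
rows from the table above and asks how many FNs the robx=0.48 runs would
show if every additional spam had been misclassified.  The helper names are
made up for illustration; the numbers are copied from the table.

```python
# (robx, min_dev) -> (spam messages tested, false negatives), from the table
RUNS = {
    (0.415, 0.025): (4262, 89),
    (0.415, 0.020): (4297, 92),
    (0.415, 0.015): (4295, 92),
    (0.48, 0.025): (4367, 198),
    (0.48, 0.020): (4367, 204),
    (0.48, 0.015): (4367, 182),
}


def fn_rate(robx, min_dev):
    """False-negative rate for one run, as a fraction of the spams tested."""
    total, fn = RUNS[(robx, min_dev)]
    return fn / total


def worst_case_fn(robx_old, robx_new, min_dev):
    """FN count for the new run if every added spam had been missed."""
    old_total, old_fn = RUNS[(robx_old, min_dev)]
    new_total, _ = RUNS[(robx_new, min_dev)]
    return old_fn + (new_total - old_total)


if __name__ == "__main__":
    for min_dev in (0.025, 0.020, 0.015):
        observed = RUNS[(0.48, min_dev)][1]
        bound = worst_case_fn(0.415, 0.48, min_dev)
        print(f"min_dev={min_dev}: observed {observed} FNs, "
              f"worst case from new spams alone {bound}")
```

In all three cases the observed FN count at robx=0.48 exceeds that worst
case, so the new messages alone cannot account for the increase.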

Trouble is, I can't think of a mechanism by which raising robx, in the
absence of any other parameter changes, could lower resulting
spamicities.  Any chance you could repeat the experiment, just for
0.025, 0.020 and 0.015, with the exact same training db and message
corpora, with both robx values?  We could be onto something big here.
Gary, do you have any suggestions?

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |