Accuracy is lacking

Mon Feb 17 21:02:14 CET 2003

On 20030217 (Mon) at 0810:15 -0500, David Relson wrote:
> The Robinson-GM (geometric mean) method followed as a 
> refinement that used all words of the message to generate the ham/spam 
> score.  More recently, the Fisher algorithm has added a chi-square test 
> that provides even better discrimination.

Strictly speaking, that last sentence lacks accuracy :)  With careful
tuning, you should get identical optimal discrimination with G-M or
Fisher.  If, as in this case, the token "probabilities" are calculated
in the same way - the f(w) calculation - then no matter what valid
method is used to combine those token "probabilities," the
discrimination capability cannot vary.  ("Valid" -- if, for
example, you combined them by multiplying them all by zero and then
adding the results, that might make it a bit worse ;)
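
A minimal Python sketch of the two combining steps, applied to the same
list of f(w) values, may make this concrete.  The formulas follow the
commonly published descriptions of Robinson's geometric-mean and
Fisher (chi-square) methods; the function names and the sample f(w)
values are hypothetical, and this is not bogofilter's actual code.

```python
import math

def chi2q(x2, df):
    """Upper-tail chi-square probability; valid only for even df."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def robinson_gm(fw):
    """Geometric-mean combining; every f(w) must lie strictly in (0, 1)."""
    n = len(fw)
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fw) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in fw) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def fisher(fw):
    """Fisher (chi-square) combining of the same f(w) values."""
    n = len(fw)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in fw), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in fw), 2 * n)
    return (1.0 + s - h) / 2.0

# Made-up token scores, for illustration only.
spammy = [0.9, 0.95, 0.8, 0.99]
hammy = [0.1, 0.05, 0.2, 0.01]
mixed = [0.9, 0.95, 0.1, 0.05]

for name, fw in (("spammy", spammy), ("hammy", hammy), ("mixed", mixed)):
    print(name, round(robinson_gm(fw), 4), round(fisher(fw), 4))
```

Both functions see exactly the same evidence and report it on
different scales, so the spam-cutoff has to be tuned separately for each.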

What Fisher buys us is a (slightly) less fussy spam-cutoff and a much
broader grey area (the unsures) that helps with interpretation when
"difficult" classifications are encountered.  ("Difficult"
classifications are often nonspams with a lot of spammy words in them;
I have users who subscribe to newsletters that are a problem that way.)
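
As a rough sketch of how the grey area is used, a score can be turned
into a three-way verdict with two cutoffs.  The function name and both
cutoff values below are hypothetical and untuned; real values would
have to come from testing against your own mail.

```python
def classify(score, ham_cutoff=0.10, spam_cutoff=0.95):
    """Three-way verdict; both default cutoffs are made up, not tuned."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Everything in between is the grey area (the unsures).
    return "unsure"

for score in (0.99, 0.50, 0.02):
    print(score, classify(score))
```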

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |