Accuracy is lacking
Greg Louis
glouis at dynamicro.on.ca
Mon Feb 17 21:02:14 CET 2003
On 20030217 (Mon) at 0810:15 -0500, David Relson wrote:
> The Robinson-GM (geometric mean) method followed as a
> refinement that used all words of the message to generate the ham/spam
> score. More recently, the Fisher algorithm has added a chi-square test
> that provides even better discrimination.
Strictly speaking, that last sentence lacks accuracy :) With careful
tuning, you should get identical optimal discrimination with G-M or
Fisher. If, as in this case, the token "probabilities" are calculated
in the same way - the f(w) calculation - then no matter what valid
method is used to combine those token "probabilities," the
discrimination capability cannot vary. ("Valid" matters here: if, for
example, you combined them by multiplying them all by zero and then
adding the results, that might make it a bit worse ;)
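
To make the point above concrete, here is a rough Python sketch of the two combining steps, both fed the same f(w) values. It follows Gary Robinson's published descriptions as I understand them and is not bogofilter's actual code; the function names, the parameters s and x, and the sample token values are all assumptions made up for illustration.

```python
import math

def f_w(p_w, n, s, x):
    """Robinson-style smoothed token value (assumed form): pull the raw
    estimate p_w toward the prior x, weighting the prior as s observations
    against n real ones."""
    return (s * x + n * p_w) / (s + n)

def chi2_sf_even(stat, dof):
    """Upper-tail chi-square probability, closed form for even dof."""
    m = stat / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def gm_score(fw):
    """Geometric-mean combining; every f must lie strictly between 0 and 1."""
    n = len(fw)
    P = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fw) / n)
    Q = 1.0 - math.exp(sum(math.log(f) for f in fw) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

def fisher_score(fw):
    """Fisher (chi-square) combining of the same token values."""
    n = len(fw)
    a = chi2_sf_even(-2.0 * sum(math.log(f) for f in fw), 2 * n)
    b = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in fw), 2 * n)
    return (1.0 + a - b) / 2.0

# Hypothetical token values for three messages (not real data).
messages = {
    "hammy": [0.01, 0.05, 0.10],
    "mixed": [0.90, 0.20, 0.60, 0.40],
    "spammy": [0.99, 0.95, 0.90],
}
for name, fw in messages.items():
    print(name, round(gm_score(fw), 4), round(fisher_score(fw), 4))
```

Both functions take the identical list of token values; only the combining arithmetic differs. The sketch does not prove the equal-discrimination claim; it only shows where the two methods diverge.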
What Fisher buys us is a (slightly) less fussy spam-cutoff and a much
broadened grey area (the unsures) that helps with interpretation when
"difficult" classifications are encountered. ("Difficult"
classifications are often nonspams with a lot of spammy words in them;
I have users who subscribe to newsletters that are a problem that way.)
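
The grey area can be pictured as a three-way decision on the combined score. The sketch below is only an illustration: the function name, the parameter names ham_cutoff and spam_cutoff, and the numeric cutoffs are assumptions chosen for the example, not recommended settings.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision: scores strictly between the cutoffs are 'unsure'."""
    if not 0.0 <= ham_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("need 0 <= ham_cutoff <= spam_cutoff <= 1")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

# Hypothetical cutoffs and scores, for illustration only.
for score in (0.02, 0.58, 0.998):
    print(score, classify(score, ham_cutoff=0.10, spam_cutoff=0.95))
```

A newsletter full of spammy words would be the kind of message expected to land in the middle band rather than being forced into ham or spam.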
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |