fisher algorithm

Mon Nov 25 12:45:39 CET 2002

On 20021124 (Sun) at 1747:58 -0600, Graham Wilson wrote:
> why is the spam_cutoff so high for the fisher algorithm?
> 
> i remember reading that for mails that fisher knows are spam it usually
> returns very high or very low spamicity numbers? is that why the cutoff
> is so high?

Yes.

> that is, to only catch mails it knows are definitely spam?

The middle ground contains both spams and nonspams, and nonspams with
values up to 0.95 are seen occasionally (rarely) in my traffic.  I find
that 0.952 gives me on the order of 0.35% to 0.5% false positives, so
that's what I'm using.  (I'm almost ready to bump it up a bit; the level
of spam getting through is hovering around 1% and I could tolerate a bit
more in the interest of fewer false positives.)

> i also remember reading [1] that the fisher algorithm has a `middle
> ground'. what are the bounds, with regard to spamicity values, and how
> should i treat emails in that range? spam? non-spam?

I'm using 0.1 as the lower bound; i.e., below that I say we know it's
nonspam.  Between 0.1 and 0.952 we are uncertain, so the only way to
avoid lots of false positives is to treat such mail as nonspam.  Both
of those values need to be tuned to the training data, though they seem
less critical than with the geometric-mean method.
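
A minimal sketch of that three-way split, in Python.  The function
names and the "unsure" label are made up for illustration and are not
bogofilter options; the two cutoffs are simply the values quoted above,
and whether the upper boundary is inclusive is an assumption.

```python
# Cutoffs quoted in this message; both need tuning to your own training data.
HAM_CUTOFF = 0.1      # below this we are confident the mail is nonspam
SPAM_CUTOFF = 0.952   # at or above this we are confident it is spam (assumed inclusive)


def classify(spamicity, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity in [0, 1] to 'nonspam', 'unsure' or 'spam'."""
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity < ham_cutoff:
        return "nonspam"
    return "unsure"


def deliver_as_spam(spamicity):
    """Only confident spam is filed as spam; 'unsure' mail is kept as nonspam."""
    return classify(spamicity) == "spam"
```

With these values a message scoring 0.5 is "unsure" and is still
delivered as nonspam, which is what keeps the false-positive count low.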

> so far, most of the
> mails that i have received with spamicity greater than 0.0 (using the
> fisher method) have been spam, so i am inclined to lean toward spam as
> the answer to that question.
> 
That depends on your tolerance for spam vs. false positives.  I'm hoping
to get the false-positive rate down below 0.1%, but if I do that now
I'll get more false negatives than I want.
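
One way to see that tradeoff is to score a set of already-classified
messages and count the errors at a few candidate cutoffs.  The sketch
below is Python; the helper name and the sample scores are invented for
illustration, and the rates are computed per class (fraction of nonspam
misfiled, fraction of spam missed), which is an assumption about how you
want to count.

```python
def error_rates(ham_scores, spam_scores, cutoff):
    """Return (false_positive_rate, false_negative_rate) at `cutoff`.

    A nonspam scoring >= cutoff is a false positive;
    a spam scoring < cutoff is a false negative.
    """
    false_pos = sum(1 for s in ham_scores if s >= cutoff)
    false_neg = sum(1 for s in spam_scores if s < cutoff)
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


# Invented scores, only to show the shape of the tradeoff.
ham = [0.0, 0.001, 0.02, 0.40, 0.96]
spam = [0.30, 0.97, 0.999, 1.0]

for cutoff in (0.5, 0.952, 0.99):
    fp, fn = error_rates(ham, spam, cutoff)
    print(f"cutoff={cutoff}: false positives {fp:.0%}, false negatives {fn:.0%}")
```

Raising the cutoff can only lower the false-positive rate and raise the
false-negative rate, so the choice comes down to which error costs you
more.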

Mail in the middle ground, with the Fisher combination method, is mail
that possesses both spammishness and nonspammishness, and the algorithm
is telling us that it has no basis for a real decision and we might as
well roll dice to decide.  Over 90% of mail should be either very close
to zero (as in less than 0.01) or very close to 1 (greater than 0.99).
That percentage may rise as training improves; it's said to be very
effective to train on the middle-ground messages exclusively (of course
only after building a large enough training set).
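
To illustrate why mixed evidence lands in the middle, here is a short
Python sketch of chi-square (Fisher) combining as it is commonly
described for this family of filters.  It is not bogofilter's actual
code: the function names are mine, the per-token probabilities are
assumed to be already computed and strictly between 0 and 1, and a real
implementation would need to guard against underflow for long messages.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-square variable with `dof` degrees of
    freedom (dof must be even) is >= x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine per-token spam probabilities (each in (0, 1)) into one score."""
    n = len(probs)
    # Strength of the spam evidence: near 1 when many tokens look spammy.
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Strength of the nonspam evidence: near 1 when many tokens look hammy.
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # The two indicators cancel when both are strong (or both weak).
    return (1.0 + spam_ind - ham_ind) / 2.0


print(fisher_spamicity([0.99] * 10))              # all-spammy tokens: close to 1
print(fisher_spamicity([0.01] * 10))              # all-hammy tokens: close to 0
print(fisher_spamicity([0.99] * 5 + [0.01] * 5))  # strong evidence both ways: about 0.5
```

In the third case both indicators are close to 1, so the score sits at
about 0.5: the mail is strongly spammish and strongly nonspammish at
once, and the combined number carries no basis for a decision.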

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |