min_dev

Thu Jul 1 12:04:58 CEST 2004

Tom Anderson wrote:
> On Tue, 2004-06-29 at 21:52, David Relson wrote:
> 
>>>min_dev     = 0.465
>>>ham_cutoff  = 0.15
>>>spam_cutoff = 0.51
>>
>>With that min_dev, all the tokens going into the final score will have
>>extreme scores.  I run with a comparable min_dev and often see message
>>scores based on just 5 or 6 tokens because the large min_dev excludes
>>everything else.  Today I had a false negative with just 3 tokens used.
> 
> 
> That's exactly why I suggested removing the dependence on 0.5 in the
> min_dev calculation.  When your cutoffs are nowhere near centered around
> 0.5, you need to have a huge min_dev in order to encompass your actual
> unsure center, or a tiny one to make it insignificant.
> 
> In the case of Tom's numbers above, his unsure zone is from 0.15 to
> 0.51, the center of which would be at 0.33 with a min_dev of 0.18. 
> There's no reason to ignore a token score of 0.6, because that would be
> very spammy given his numbers above.  But for a moderately sized min_dev
> such as 0.18, such a token would be ignored if centered from 0.5 instead
> of 0.33, contributing to false negatives.  Conversely, a token that
> scores 0.3 should actually be unsure, whereas bogofilter would currently
> score it somewhat hammy with a min_dev of 0.18, again contributing to
> false negatives.  The effect would be even more severe if his spam
> cutoff were lower.
> 
> In order to reduce false negatives, users are forced to increase the
> size of their min_dev so that false hammy tokens are not added, or they
> are required to trivialize the min_dev so that spammy tokens aren't
> ignored.  The effect is that min_dev is ineffective either way.  To
> again give min_dev its intended purpose, and to allow moderately sized
> min_devs to be effective, I believe that the centering of it at 0.5 must
> be changed.  We can either go to an exclusion min and max, or provide
> for a parameter to change the center.
> 
> The least effect on existing users would be to add a parameter to the
> configuration file to change the center, having it default to 0.5. 
> Anyone who does not want to change it could leave it as is with no
> change to their scoring, while those who prefer to experiment could
> modify it.
> 
> Tom
> 
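
The centering argument quoted above can be checked numerically.  What
follows is a minimal Python sketch, not bogofilter's actual code: the
`center` parameter stands for the hypothetical configuration option being
proposed, the function names are made up for illustration, and treating
the boundary as inclusive (>=) is an assumption.

```python
def unsure_center(ham_cutoff, spam_cutoff):
    """Midpoint and half-width of the unsure zone between the two cutoffs."""
    center = (ham_cutoff + spam_cutoff) / 2.0
    half_width = (spam_cutoff - ham_cutoff) / 2.0
    return center, half_width


def token_is_used(score, min_dev, center=0.5):
    """True if the token score lies at least min_dev away from center.

    center=0.5 models the fixed centering described in the thread; any
    other value is the hypothetical configurable center (an assumption).
    """
    return abs(score - center) >= min_dev


# Cutoffs from the quoted settings: ham_cutoff = 0.15, spam_cutoff = 0.51
center, half_width = unsure_center(0.15, 0.51)   # about 0.33 and 0.18

for score in (0.6, 0.3):
    print(score,
          "centered at 0.5:", token_is_used(score, 0.18),
          "centered at 0.33:", token_is_used(score, 0.18, center=0.33))
```

With a min_dev of 0.18, the 0.6 token is dropped when the band is
centered at 0.5 but kept when it is centered at 0.33, and the 0.3 token
behaves the other way around, which matches the two cases described in
the quoted text.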
robx        = 0.560000  # (5.60e-01)
robs        = 0.031600  # (3.16e-02)
min_dev     = 0.465000  # (4.65e-01)
ham_cutoff  = 0.150000  # (1.50e-01)
spam_cutoff = 0.510000  # (5.10e-01)
ns_esf      = 1.000000  # (1.00e+00)
sp_esf      = 1.000000  # (1.00e+00)

hapaxes:  ham   69877 (16.20%), spam  185978 (43.13%)
    pure:  ham  143816 (33.35%), spam  259955 (60.28%)
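
To see how little room the min_dev listed above leaves, here is a minimal
Python sketch.  It assumes the band is centered on 0.5, as the thread
describes, and that scores exactly on the boundary still count;
`exclusion_band` is a made-up helper, not part of bogofilter.

```python
def exclusion_band(min_dev, center=0.5):
    """Return (low, high); scores strictly between them are ignored.

    The bounds are clipped to the valid score range [0, 1].
    """
    low = max(0.0, center - min_dev)
    high = min(1.0, center + min_dev)
    return low, high


# min_dev taken from the settings listed above
low, high = exclusion_band(0.465)
print(f"tokens count only at or below {low:.3f} or at or above {high:.3f}")
```

For min_dev = 0.465 the bounds come out at roughly 0.035 and 0.965, which
fits the earlier observation in this thread that only a handful of tokens
end up contributing to a message score.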

I think all this discussion is centered around a hypothesis that the 
variety of tokens that contribute to Spam is much greater than the 
variety of tokens that contribute to Ham.  And when you consider that 
the occurrence of each Ham token will therefore be much greater, the 
certainty of a token being Ham is higher than the certainty of a token 
being Spam, simply on the basis of how often it's been seen.

Spammers, by their use of spelling variations, are merely attempting to 
confuse the filters enough to register as Uncertain and therefore 
deliverable.

My guess here is that enough review of wordlists and spam will show that 
much of what we are doing with these "off-center" scores is not trying to 
detect what is spam, but trying to detect what isn't ham.

I can identify what is ham by merely scanning Subject lines and Senders.