min_dev

Tom Anderson tanderso at oac-design.com
Thu Jul 1 13:28:12 CEST 2004


On Thu, 2004-07-01 at 06:04, Tom Allison wrote:
> I think all this discussion is centered around a hypothesis that the 
> variety of tokens that contribute to Spam is much greater than the 
> variety of tokens that contribute to Ham.  And when you consider that 
> the occurrence of the Ham tokens will therefore be much greater, the 
> certainty of a token being Ham is higher than the certainty of a token 
> being Spam simply on the basis of how often it's been seen.
> 
> Spammers, by their use of spelling variations, are attempting to merely 
> confuse the filters enough to register as Uncertain and therefore 
> deliverable.
> 
> My guess here is that enough review of wordlists and spam will show that 
> much of what we are doing with these "offcenter" scores is not trying to 
> detect what is spam, but detect what isn't ham.

Yes.  That's exactly it.

"Microsoft", "Office", "Windows", "Adobe", etc., are things I rarely
discuss but am constantly offered in spam.  These tokens are scored
well under 0.5, but they are not ham.
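
To make the point above concrete, here is a rough sketch in Python.  It
is not bogofilter's actual code.  It assumes min_dev acts as a cutoff on
how far a token's score must sit from the neutral 0.5 before the token
is counted at all, and the token scores are invented for illustration.

```python
def counted_tokens(scores, min_dev):
    """Keep tokens whose score is at least min_dev away from the neutral 0.5."""
    return {tok: s for tok, s in scores.items() if abs(s - 0.5) >= min_dev}

# Hypothetical scores, made up for this example (not from a real wordlist).
scores = {
    "Microsoft": 0.30,  # rarely discussed, often offered in spam
    "Adobe": 0.35,
    "meeting": 0.05,    # genuinely hammy
    "the": 0.48,        # too close to 0.5 to matter
    "v1agra": 0.99,
}

kept = counted_tokens(scores, min_dev=0.1)

# Every kept token below 0.5 pulls the message toward ham, including
# the ones that are "not ham" in any meaningful sense.
ham_side = sorted(tok for tok, s in kept.items() if s < 0.5)
print(ham_side)
```

With these made-up numbers, "Microsoft" and "Adobe" clear the cutoff
and count as ham evidence right alongside "meeting".  Only "the" is
ignored.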

> I can identify what is ham by merely scanning Subject lines and Senders.

We have bogofilter so that scanning by hand is not necessary.  The idea
is to improve bogofilter's scoring to prevent this problem.  I'm not
sure how to do so.

Tom




