Spam / ham registration issue

David Relson relson at osagesoftware.com
Wed Mar 3 15:16:43 CET 2004


On 03 Mar 2004 08:38:02 -0500
Tom Anderson wrote:

> On Wed, 2004-03-03 at 08:25, David Relson wrote:
> > Pretty much.  The basic principle is comparing the likelihood of the
> > word being in spam to the word being in ham.  You've maxed out both
> > of them :-)
> 
> So registering _other_ hams and spams not having these tokens would
> tend to have more effect than registering this same one over and over?
> 
> > An alternate view of the world would use message counts rather than
> > percents of words in messages.  The alternate view could give us "I
> > get 5 times as much spam as ham, so the odds are 5::1 that the next
> > message is spam."
> 
> Although it sounds almost reasonable, it fails for the same reason as
> racial profiling.  The innocent ones get harassed unduly.  Biasing 5:1
> toward spam on each email would lead to an inordinate number of false
> positives.

Which is why bogofilter uses word::message ratios, rather than simpler
word counts (and their ratio).
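
To make the difference concrete, here is a rough sketch of the three quantities discussed above. It is an illustration only, not bogofilter's actual code: the function names and the sample counts are made up, and bogofilter's real calculation includes further steps that are not shown here.

```python
def word_message_ratio(word_spam_msgs, total_spam_msgs,
                       word_ham_msgs, total_ham_msgs):
    """Compare how likely a word is to appear in spam vs. in ham.

    Each side is normalized by the number of messages registered for
    that class, so the overall spam/ham volume does not bias the result.
    (Hypothetical helper, for illustration only.)
    """
    spam_rate = word_spam_msgs / total_spam_msgs if total_spam_msgs else 0.0
    ham_rate = word_ham_msgs / total_ham_msgs if total_ham_msgs else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


def raw_count_ratio(word_spam_msgs, word_ham_msgs):
    """The 'simpler' view: ratio of raw counts, ignoring corpus sizes."""
    total = word_spam_msgs + word_ham_msgs
    return word_spam_msgs / total if total else 0.5


def message_count_prior(total_spam_msgs, total_ham_msgs):
    """The 'alternate view': odds that the next message is spam,
    based only on how much spam and ham has been received."""
    return total_spam_msgs / (total_spam_msgs + total_ham_msgs)


# A token present in every registered spam and every registered ham is
# "maxed out" on both sides; registering the same message again on both
# sides changes nothing.
once = word_message_ratio(1, 1, 1, 1)
twice = word_message_ratio(2, 2, 2, 2)

# Registering four other hams that lack the token lowers its ham rate.
after_other_hams = word_message_ratio(1, 1, 1, 5)

# With 5 times as much spam as ham, a word that appears in 10% of each
# class is neutral by word::message ratio but looks spammy by raw counts.
neutral = word_message_ratio(50, 500, 10, 100)
biased = raw_count_ratio(50, 10)
prior = message_count_prior(500, 100)
```

With these made-up numbers, `once` and `twice` are both 0.5, `after_other_hams` moves to about 0.83, `neutral` stays at 0.5, and `biased` and `prior` both come out at about 0.83, which is the 5::1 bias described above.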



