Spam / ham registration issue
David Relson
relson at osagesoftware.com
Wed Mar 3 15:16:43 CET 2004
On 03 Mar 2004 08:38:02 -0500
Tom Anderson wrote:
> On Wed, 2004-03-03 at 08:25, David Relson wrote:
> > Pretty much. The basic principle is comparing the likelihood of the
> > word being in spam to the word being in ham. You've maxed out both
> > of them :-)
>
> So registering _other_ hams and spams not having these tokens would
> tend to have more effect than registering this same one over and over?
>
> > An alternate view of the world would use message counts rather than
> > percents of words in messages. The alternate view could give us "I
> > get 5 times as much spam as ham, so the odds are 5::1 that the next
> > message is spam."
>
> Although it sounds almost reasonable, it fails for the same reason as
> racial profiling. The innocent ones get harassed unduly. Biasing 5:1
> toward spam on each email would lead to an inordinate amount of false
> positives.
Which is why bogofilter uses word::message ratios, rather than simpler
word counts (and their ratio).
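
To make the distinction concrete, here is a rough sketch in Python. It is
not bogofilter's actual code; the function names and the sample counts are
made up for illustration, and it leaves out the smoothing and the combining
of per-word scores that a real filter needs. It only shows how dividing
each word count by the number of registered messages differs from using
raw counts, and what the "message count" view from the quoted text would
give.

```python
def word_spamicity(bad_count, good_count, bad_msgs, good_msgs):
    """Per-word spam estimate from word::message ratios (illustrative only)."""
    # Normalize each count by the number of messages registered in that class.
    bad_ratio = bad_count / bad_msgs if bad_msgs else 0.0
    good_ratio = good_count / good_msgs if good_msgs else 0.0
    total = bad_ratio + good_ratio
    if total == 0.0:
        return 0.5  # word never seen: no evidence either way
    return bad_ratio / total


def raw_count_ratio(bad_count, good_count):
    """The simpler alternative: ignore how many messages were registered."""
    total = bad_count + good_count
    return bad_count / total if total else 0.5


def prior_from_message_counts(spam_msgs, ham_msgs):
    """The 'alternate view': odds based only on how much spam vs. ham arrives."""
    return spam_msgs / (spam_msgs + ham_msgs)


if __name__ == "__main__":
    # A word seen in 10 of 50 spams and 2 of 10 hams is equally common in both.
    print(word_spamicity(10, 2, 50, 10))   # neutral
    print(raw_count_ratio(10, 2))          # leans spam only because more spam was registered
    # A word seen only in spam is already at the maximum; registering the
    # same spam again does not move it.
    print(word_spamicity(1, 0, 50, 10), word_spamicity(2, 0, 51, 10))
    print(prior_from_message_counts(5, 1))  # the 5::1 odds from the quoted text
```

With these made-up numbers, the word::message version rates the word as
neutral, while the raw count ratio leans toward spam purely because five
times as many spams were registered. A word that has never appeared in ham
scores 1.0 no matter how many more times the same spam is registered, which
is the "maxed out" case above.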