What is spam?
relson at osagesoftware.com
Wed May 12 07:52:24 EDT 2004
On 12 May 2004 07:30:15 -0400
Tom Anderson wrote:
> On Tue, 2004-05-11 at 19:27, David Relson wrote:
> > is valid. I've got some doubts about the "over-sensitivity" clause.
> It was my understanding that this was part of the impetus behind
> implementing this in the first place. Registering the same tokens
> over and over again gives them more weight than they perhaps deserve,
> giving the wordlist increased "momentum" toward the same
> classifications. Desensitizing the wordlist allows quicker evolution
> in the face of new spam. I'd imagine the space-saving effect is
> minimal since incrementing a counter does not require much, if any,
> additional space.
thresh_update was implemented for database size reasons. I don't worry
about momentum/inertia/sensitivity in the database and didn't even think
about that when I implemented it.
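
A rough sketch of the idea, not bogofilter's actual code: assuming the option works by skipping the auto-update when a message's score is already within thresh_update of 0.0 or 1.0, the decision looks roughly like this. The function name and the 0.01 value are hypothetical, chosen only for illustration.

```python
def should_register(score: float, thresh_update: float = 0.01) -> bool:
    """Decide whether an auto-update should register a message's tokens.

    Assumption: messages scoring within thresh_update of 0.0 (ham) or
    1.0 (spam) are skipped, so the wordlist does not keep growing with
    messages the filter already classifies with near certainty.
    """
    too_hammy = score < thresh_update
    too_spammy = score > 1.0 - thresh_update
    return not (too_hammy or too_spammy)


# A near-certain spam is skipped; a less certain message is registered.
print(should_register(0.999))  # skipped
print(should_register(0.5))    # registered
```

Skipping the near-certain messages is what keeps the database smaller; any effect on sensitivity is a side effect of the same rule.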
As of 0700 this morning, my wordlist has 1,335,216 tokens from 61,382
spam and 75,096 ham. I accept the fact that messages from certain
sources are likely to score Unsure and this is unlikely to change.
Since bogofilter is correctly classifying over 99% of my incoming mail
as spam or ham (with less than 1% Unsures), I'm happy with its