scaling and learning [was Re: Inline image based spam]

Chris Wilkes cwilkes-bf at ladro.com
Sat Oct 7 01:54:55 CEST 2006


On Fri, Oct 06, 2006 at 07:31:44PM -0400, David Relson wrote:
>
> Bogofilter _is_ catching some of the messages, but not all.  The
> messages commonly have a passage from a book (or some such) in hopes of
> fooling filters.  Since those passages rarely match my ham email, I
> anticipate that bogofilter will eventually come to recognize the new
> words as spammish.

The English (and French, German, etc.) dictionaries are longer than my
patience.  The spammers are using random words and passages that might
never repeat.

> It'll take a lot of messages for their score to change noticeably.  To
> lessen the wordlist's inertia, I may scale the numbers so
> that .MSG_COUNT is 1000//1000 and the others are correspondingly
> small.  It'll be interesting to see how this affects the ability to
> learn.
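
For what it's worth, here is how I picture that scaling: a rough Python
sketch, not anything bogofilter actually ships.  It assumes the wordlist
is just a dict of token -> (spam count, ham count), that .MSG_COUNT is a
(spam, ham) pair, and that every count is scaled by the same factor so
the larger message count lands on 1000.  The name scale_wordlist and the
sample numbers are made up for illustration.

```python
def scale_wordlist(wordlist, msg_count, target=1000):
    """Scale all counts uniformly so the larger message count equals target.

    wordlist:  dict of token -> (spam_count, ham_count)  (hypothetical layout)
    msg_count: (spam_messages, ham_messages), i.e. the .MSG_COUNT pair
    """
    biggest = max(msg_count)
    if biggest <= 0:
        raise ValueError("message counts must be positive")

    def scale(n):
        # Integer arithmetic with round-half-up, to avoid float surprises.
        return (n * target + biggest // 2) // biggest

    scaled = {}
    for token, (spam, ham) in wordlist.items():
        s, h = scale(spam), scale(ham)
        if s or h:  # tokens whose counts both round to zero drop out
            scaled[token] = (s, h)
    return (scale(msg_count[0]), scale(msg_count[1])), scaled


# Made-up numbers, purely for illustration.
wordlist = {"viagra": (20000, 50), "meeting": (100, 15000), "zephyr": (3, 0)}
new_msg_count, new_wordlist = scale_wordlist(wordlist, (50000, 40000))
print(new_msg_count)
print(new_wordlist)
```

After scaling like this, each newly registered message is a much larger
share of the totals, so scores should move faster.  The side effect in
this sketch is that rarely seen tokens round down to zero and disappear.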

Is there a way to tell bogofilter to skip a word when calculating the
spamicity unless it has appeared at least X times?  That could get rid
of the random word issue: unless a word shows up in, say, a dozen emails
out of 1000, it isn't counted in the score.
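
To make the idea concrete, here is a minimal Python sketch of the kind
of cutoff I mean.  It is not bogofilter code and I don't know of an
existing option that does this.  The wordlist layout (a dict of token ->
(spam count, ham count)), the function name usable_tokens, and the
min_count parameter are all hypothetical.

```python
def usable_tokens(message_tokens, wordlist, min_count):
    """Return the message's tokens that have been seen at least min_count times.

    wordlist is a hypothetical dict of token -> (spam_count, ham_count);
    tokens missing from it are treated as never seen.
    """
    kept = []
    for token in sorted(set(message_tokens)):
        spam, ham = wordlist.get(token, (0, 0))
        if spam + ham >= min_count:
            kept.append(token)
    return kept


# Made-up counts, purely for illustration.
wordlist = {"viagra": (40, 0), "meeting": (1, 55), "zephyr": (1, 0)}
tokens = ["viagra", "zephyr", "meeting", "quixotic"]
kept = usable_tokens(tokens, wordlist, min_count=12)
print(kept)  # ['meeting', 'viagra']
```

Only the surviving tokens would then be handed to the spamicity
calculation, so one-off dictionary words like "zephyr" never get a vote.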

Chris


