scaling and learning [was Re: Inline image based spam]
Chris Wilkes
cwilkes-bf at ladro.com
Sat Oct 7 01:54:55 CEST 2006
On Fri, Oct 06, 2006 at 07:31:44PM -0400, David Relson wrote:
>
> Bogofilter _is_ catching some of the messages, but not all. The
> messages commonly have a passage from a book (or some such) in hopes of
> fooling filters. Since those passages rarely match my ham email, I
> anticipate that bogofilter will eventually come to recognize the new
> words as spammish.
The length of the English (and French, German, etc.) dictionaries exceeds
my patience. They are using random words / passages that might never
repeat themselves.
> It'll take a lot of messages for their score to change noticeably. To
> lessen the wordlist's inertia, I may scale the numbers so
> that .MSG_COUNT is 1000//1000 and the others are correspondingly
> small. It'll be interesting to see how this affects the ability to
> learn.
Is there a way to tell bogofilter to skip a word when calculating the
spamicity unless it has appeared X times? That could get rid of the
random word issue -- unless a word shows up in, say, a dozen emails out
of 1000, it isn't counted in the score.
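
To make the idea concrete, here is a rough sketch of such a cutoff. It is
not bogofilter's code and does not correspond to any bogofilter option I
know of; the function name, the (spam_count, ham_count) layout, and the
default threshold of 12 are all made up for illustration.

```python
# Hypothetical sketch of a minimum-occurrence cutoff; not bogofilter's API.
def scorable_tokens(message_tokens, counts, min_count=12):
    """Return the message tokens that have been seen often enough to score.

    counts maps token -> (spam_count, ham_count) from training (assumed layout).
    Tokens seen fewer than min_count times in total are left out of scoring.
    """
    kept = []
    for tok in message_tokens:
        spam_n, ham_n = counts.get(tok, (0, 0))  # unseen tokens count as 0/0
        if spam_n + ham_n >= min_count:
            kept.append(tok)
    return kept


# Made-up training counts: one rare "random passage" word among common ones.
counts = {"viagra": (40, 0), "meeting": (1, 55), "heathcliff": (1, 0)}
print(scorable_tokens(["viagra", "heathcliff", "meeting", "moor"], counts))
# prints ['viagra', 'meeting']
```

The rare word and the never-seen word are simply left out, so only tokens
with some history would feed the spamicity calculation.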
Chris