scaling and learning [was Re: Inline image based spam]
David Relson
relson at osagesoftware.com
Sat Oct 7 01:31:44 CEST 2006
On Fri, 6 Oct 2006 16:13:09 -0700
Chris Wilkes wrote:
...[snip]...
>
> Anyway I'm open for other ideas, this is very annoying.
>
> Chris
Hi Chris,
I agree. 'Tis annoying. I'm seeing a few such Unsures each day.
Bogofilter _is_ catching some of the messages, but not all. The
messages commonly have a passage from a book (or some such) in hopes of
fooling filters. Since those passages rarely match my ham email, I
anticipate that bogofilter will eventually come to recognize the new
words as spammish.
My wordlist is about 4 years old, which means the message count is high
and some of the tokens have very high counts. That produces a type of
inertia and slows down learning. For example, here are 2 token counts:
    bogoutil -p $BOGOFILTER_DIR osagesoftware.com to:osagesoftware.com
                            spam    good    Fisher
    .MSG_COUNT            350984  120977  0.500000
    osagesoftware.com      53543   11119  0.624030
    to:osagesoftware.com  322413   39974  0.735452
It'll take a lot of messages for their score to change noticeably. To
lessen the wordlist's inertia, I may scale the numbers so
that .MSG_COUNT is 1000//1000 and the others are correspondingly
small. It'll be interesting to see how this affects the ability to
learn.
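
To put rough numbers on the inertia, here is a minimal Python sketch. It
assumes the Fisher column is Robinson's smoothed per-token probability; that
formula does reproduce the two scores listed above, but the ROBS and ROBX
values are assumptions, not settings read from any real configuration. The
helper names (token_score, scale, shift_after_ham) are hypothetical, and
"1000//1000" is read here as scaling the spam and good columns separately so
that each message count becomes 1000.

```python
# Sketch only: ROBS and ROBX are assumed values, not read from a real config.
ROBS = 0.0178  # assumed strength of the prior
ROBX = 0.52    # assumed score for a token never seen before


def token_score(spam, good, msgs_spam, msgs_good, robs=ROBS, robx=ROBX):
    """Robinson-smoothed spam probability for a single token."""
    n = spam + good
    if n == 0:
        return robx
    p_spam = spam / msgs_spam   # fraction of spam messages with the token
    p_good = good / msgs_good   # fraction of good messages with the token
    p = p_spam / (p_spam + p_good)
    return (robs * robx + n * p) / (robs + n)


def scale(count, msg_count, target=1000):
    """Shrink a token count as if its column's .MSG_COUNT were `target`."""
    return round(count * target / msg_count)


def shift_after_ham(spam, good, msgs_spam, msgs_good, new_ham=100):
    """Score drop after learning `new_ham` good messages containing the token."""
    before = token_score(spam, good, msgs_spam, msgs_good)
    after = token_score(spam, good + new_ham, msgs_spam, msgs_good + new_ham)
    return before - after


# Counts for osagesoftware.com, taken from the listing above.
MSGS_SPAM, MSGS_GOOD = 350984, 120977
SPAM, GOOD = 53543, 11119

full = shift_after_ham(SPAM, GOOD, MSGS_SPAM, MSGS_GOOD)
scaled = shift_after_ham(scale(SPAM, MSGS_SPAM), scale(GOOD, MSGS_GOOD),
                         1000, 1000)
print(f"score drop, full counts:   {full:.4f}")
print(f"score drop, scaled counts: {scaled:.4f}")
```

Under these assumptions, 100 new ham messages containing the token barely
move the score with the full counts, while the same 100 messages move it
substantially once the counts are scaled down. The price is that rounding
small counts discards information about rare tokens.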
Regards,
David