scaling and learning [was Re: Inline image based spam]

David Relson relson at osagesoftware.com
Sat Oct 7 01:31:44 CEST 2006


On Fri, 6 Oct 2006 16:13:09 -0700
Chris Wilkes wrote:

...[snip]...

> 
> Anyway I'm open for other ideas, this is very annoying.
> 
> Chris

Hi Chris,

I agree.  'Tis annoying.  I'm seeing a few such Unsures each day.
Bogofilter _is_ catching some of the messages, but not all.  The
messages commonly have a passage from a book (or some such) in hopes of
fooling filters.  Since those passages rarely match my ham email, I
anticipate that bogofilter will eventually come to recognize the new
words as spammish.

My wordlist is about 4 years old, which means the message count is high
and some of the tokens have very high counts.  That produces a kind of
inertia and slows down learning.  For example, here are 2 token counts:

bogoutil -p $BOGOFILTER_DIR osagesoftware.com to:osagesoftware.com 
                          spam    good  Fisher
.MSG_COUNT              350984  120977  0.500000
osagesoftware.com        53543   11119  0.624030
to:osagesoftware.com    322413   39974  0.735452

It'll take a lot of messages for their scores to change noticeably.  To
lessen the wordlist's inertia, I may scale the numbers so that
.MSG_COUNT is 1000/1000 and the other counts are correspondingly
small.  It'll be interesting to see how this affects the ability to
learn.
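
To make the inertia concrete, here is a rough Python sketch of the
scaling idea.  It is not bogofilter's code.  The function names are
made up for illustration, and the score is just the plain
relative-frequency ratio, which happens to reproduce the Fisher column
above; any smoothing bogofilter applies to low-count tokens is ignored.

```python
# Illustrative sketch only; helper names are hypothetical, not bogofilter APIs.

def token_score(spam, good, spam_msgs, good_msgs):
    """Relative-frequency score: near 1.0 is spammish, near 0.0 is hammish."""
    spam_rate = spam / spam_msgs
    good_rate = good / good_msgs
    return spam_rate / (spam_rate + good_rate)


def scale_counts(spam, good, spam_msgs, good_msgs, target=1000):
    """Scale each column so that its message count becomes `target`."""
    return (round(spam * target / spam_msgs),
            round(good * target / good_msgs),
            target, target)


def shift_after_ham(spam, good, spam_msgs, good_msgs, new_ham=100):
    """How far the score drops after `new_ham` ham messages with the token."""
    before = token_score(spam, good, spam_msgs, good_msgs)
    after = token_score(spam, good + new_ham, spam_msgs, good_msgs + new_ham)
    return before - after


# Counts taken from the bogoutil output above.
SPAM_MSGS, GOOD_MSGS = 350984, 120977
TOKENS = {
    "osagesoftware.com": (53543, 11119),
    "to:osagesoftware.com": (322413, 39974),
}

if __name__ == "__main__":
    for name, (spam, good) in TOKENS.items():
        full = (spam, good, SPAM_MSGS, GOOD_MSGS)
        scaled = scale_counts(*full)
        print(name)
        print("  score, full counts:  %.6f" % token_score(*full))
        print("  score, scaled counts: %.6f" % token_score(*scaled))
        print("  drop after 100 ham, full:   %.6f" % shift_after_ham(*full))
        print("  drop after 100 ham, scaled: %.6f" % shift_after_ham(*scaled))
```

Because the score depends only on each column's relative frequency,
scaling the spam and good columns separately leaves it nearly unchanged
(apart from rounding), while the same number of new messages moves it
much further.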

Regards,

David


