scaling and learning [was Re: Inline image based spam]

Dwayne Hottinger dhottinger at harrisonburg.k12.va.us
Sat Oct 7 02:55:56 CEST 2006


My wordlist is around 2 years old.  Would a fresh list be better?

Quoting David Relson <relson at osagesoftware.com>:

> On Fri, 6 Oct 2006 16:13:09 -0700
> Chris Wilkes wrote:
>
> ...[snip]...
>
> >
> > Anyway I'm open for other ideas, this is very annoying.
> >
> > Chris
>
> Hi Chris,
>
> I agree.  'Tis annoying.  I'm seeing a few such Unsures each day.
> Bogofilter _is_ catching some of the messages, but not all.  The
> messages commonly have a passage from a book (or some such) in hopes of
> fooling filters.  Since those passages rarely match my ham email, I
> anticipate that bogofilter will eventually come to recognize the new
> words as spammish.
>
> My wordlist is about 4 years old, which means the message count is high
> and some of the tokens have very high counts.  That produces a type of
> inertia and slows down learning.  For example, here are 2 token counts:
>
> bogoutil -p $BOGOFILTER_DIR osagesoftware.com to:osagesoftware.com
>                           spam    good  Fisher
> .MSG_COUNT              350984  120977  0.500000
> osagesoftware.com        53543   11119  0.624030
> to:osagesoftware.com    322413   39974  0.735452
>
> It'll take a lot of messages for their scores to change noticeably.  To
> lessen the wordlist's inertia, I may scale the numbers so that
> .MSG_COUNT is 1000/1000 and the others are correspondingly small.
> It'll be interesting to see how this affects the ability to learn.
>
> Regards,
>
> David
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter
>

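For anyone who wants to see the inertia David describes in numbers, here is a
rough Python sketch.  It is not bogofilter code.  It assumes the score is the
plain ratio spam_rate / (spam_rate + good_rate), which reproduces the Fisher
column above for these two tokens; bogofilter's own calculation also applies a
small smoothing term that is left out here.  It also assumes "scaling" means
multiplying the spam and good columns by separate factors so that each
.MSG_COUNT becomes 1000.  The function names and the 100-message experiment
are made up for illustration.

```python
def token_score(spam, good, msg_spam, msg_good):
    """Unsmoothed spam probability of a token (assumed formula)."""
    spam_rate = spam / msg_spam
    good_rate = good / msg_good
    return spam_rate / (spam_rate + good_rate)


def scale_column(count, msg_count, target=1000):
    """Scale one column's count so that its .MSG_COUNT becomes `target`."""
    return round(count * target / msg_count)


def shift_after_ham(spam, good, msg_spam, msg_good, new_ham=100):
    """Drop in score after training `new_ham` good messages with the token."""
    before = token_score(spam, good, msg_spam, msg_good)
    after = token_score(spam, good + new_ham, msg_spam, msg_good + new_ham)
    return before - after


# Counts taken from the bogoutil output quoted above.
MSG_SPAM, MSG_GOOD = 350984, 120977
TOKENS = {
    "osagesoftware.com": (53543, 11119),
    "to:osagesoftware.com": (322413, 39974),
}

for token, (spam, good) in TOKENS.items():
    s_spam = scale_column(spam, MSG_SPAM)
    s_good = scale_column(good, MSG_GOOD)
    full = shift_after_ham(spam, good, MSG_SPAM, MSG_GOOD)
    scaled = shift_after_ham(s_spam, s_good, 1000, 1000)
    print(f"{token}: scaled counts {s_spam}/{s_good}, "
          f"shift after 100 ham: full={full:.4f} scaled={scaled:.4f}")
```

Rounding the scaled counts to integers moves the starting score slightly, and
very rare tokens would round to zero, so a real rescaling would need to decide
what to do with those.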

--
Dwayne Hottinger
Network Administrator
Harrisonburg City Public Schools


