bogotrain

David Relson relson at osagesoftware.com
Wed Aug 4 19:17:59 CEST 2004


On Wed, 4 Aug 2004 13:04:59 -0400
Bob Vincent wrote:

> On Wed, Aug 04, 2004 at 12:44:30PM -0400, David Relson wrote:
> > On Wed, 4 Aug 2004 12:27:19 -0400
> > Bob Vincent wrote:
> > > I don't understand why the suggested spam_cutoff is lower than the
> > > suggested ham_cutoff.  Can anyone explain?
> <snip!>
> > 
> > How large are your message samples (spam and non-spam)?
> 
> About 2,400 each.

That's just over the minimum number of messages for running bogotune.  I
like to have 10,000 each of ham and spam.  I use 20-30% of each for
creating a new wordlist and use the remaining 70-80% for tuning.
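
As a rough illustration of that split, here is a minimal Python sketch.
The function name split_corpus, the 25% default, and the fixed seed are
all assumptions made for the example; none of this is part of bogofilter
or bogotune, and it only partitions a list of messages.

```python
import random


def split_corpus(messages, wordlist_fraction=0.25, seed=0):
    """Split messages into (wordlist_part, tuning_part).

    wordlist_fraction is the share used to build the wordlist
    (hypothetical default of 25%, inside the 20-30% range).
    """
    shuffled = list(messages)
    # Shuffle with a fixed seed so the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * wordlist_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    ham = [f"ham-{i}" for i in range(10000)]  # placeholder message names
    wordlist_part, tuning_part = split_corpus(ham)
    print(len(wordlist_part), len(tuning_part))
```

The same call would be made separately for the spam sample, so that both
the wordlist and the tuning set contain ham and spam.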

One scenario I've used is to divide the ham/spam into quartiles (like
dealing a deck of cards into 4 piles).  Then I do 4 bogotune runs.

Run 1 uses ham.1 and spam.1 for the wordlist and uses *.[234] for
tuning.
Run 2 uses *.2 for wordlist and *.[134] for tuning.
Run 3 is *.3 for wordlist, *.[124] for tuning.
...

Ideally the same results would come from all 4 runs.  Unfortunately,
I've not seen that happen.  Instead I look at the 4 sets of results and
pick what looks good to me.  It's not fully scientific, but it is
satisfying.
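
The quartile scheme above can be sketched in Python as follows.  This is
only an illustration under stated assumptions: deal_into_piles and
plan_runs are hypothetical helper names, the piles are numbered 1-4 to
match the ham.1/spam.1 naming, and the sketch does not invoke bogotune
itself; it only shows which pile feeds the wordlist and which piles are
left for tuning in each run.

```python
def deal_into_piles(messages, n_piles=4):
    """Deal messages round-robin into n_piles piles, like dealing cards."""
    return [messages[i::n_piles] for i in range(n_piles)]


def plan_runs(n_piles=4):
    """Return (wordlist_pile, tuning_piles) for each run, piles numbered from 1."""
    plans = []
    for k in range(1, n_piles + 1):
        # Every pile except the wordlist pile is used for tuning.
        tuning = [j for j in range(1, n_piles + 1) if j != k]
        plans.append((k, tuning))
    return plans


if __name__ == "__main__":
    ham = [f"ham-{i}" for i in range(10)]  # placeholder message names
    print([len(pile) for pile in deal_into_piles(ham)])
    for wordlist_pile, tuning_piles in plan_runs():
        print(f"Run {wordlist_pile}: wordlist from *.{wordlist_pile}, "
              f"tuning on piles {tuning_piles}")
```

Each pile serves as the wordlist source exactly once, so no message is
ever used both to build a wordlist and to tune against it in the same
run.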

...[snip]...

> Thanks.  Will double-check.  Bogofilter seems to have peaked at about
> 99.5% accuracy. I'd like to exceed that, as 0.5% of 4000+ messages
> still means that I see roughly 20 spams a day.

When I recheck, I usually find some messages I now consider
misclassified.  For example, in addition to spam about body parts,
mortgages, etc., I want bogofilter to catch virus messages and "bounced
message" notices from mailing lists.

With your volume of messages, you should be able to do a large bogotune
run :-)

David



-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800


