bogotrain
David Relson
relson at osagesoftware.com
Wed Aug 4 19:17:59 CEST 2004
On Wed, 4 Aug 2004 13:04:59 -0400
Bob Vincent wrote:
> On Wed, Aug 04, 2004 at 12:44:30PM -0400, David Relson wrote:
> > On Wed, 4 Aug 2004 12:27:19 -0400
> > Bob Vincent wrote:
> > > I don't understand why the suggested spam_cutoff is lower than the
> > > suggested ham_cutoff. Can anyone explain?
> <snip!>
> >
> > How large are your message samples (spam and non-spam)?
>
> About 2,400 each.
That's just over the minimum number of messages for running bogotune. I
like to have 10,000 each of ham and spam. I use 20-30% of each for
creating a new wordlist and use the remaining 70-80% for tuning.
One scenario I've used is to divide the ham/spam into quartiles (like
dealing a deck of cards into 4 piles). Then I do 4 bogotune runs.
Run 1 uses ham.1 and spam.1 for the wordlist and uses *.[234] for
tuning.
Run 2 uses *.2 for wordlist and *.[134] for tuning.
Run 3 is *.3 for wordlist, *.[124] for tuning.
...
Ideally the same results would come from all 4 runs. Unfortunately,
I've not seen that happen. Instead I look at the 4 sets of results and
pick what looks good to me. It's not fully scientific, but it is
satisfying.
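
The four-way split described above can be sketched roughly as follows.
This is only an illustration in Python: the function names (deal,
plan_runs) are made up for this example and are not part of bogofilter
or bogotune, and the sketch only plans which messages go where. It does
not build a wordlist or invoke bogotune.

```python
def deal(items, piles=4):
    """Deal items round-robin into `piles` lists, like dealing cards."""
    return [items[i::piles] for i in range(piles)]


def plan_runs(ham, spam, piles=4):
    """Return one plan per run: pile k builds the wordlist and the
    remaining piles are pooled for tuning (hypothetical helper)."""
    ham_piles = deal(ham, piles)
    spam_piles = deal(spam, piles)
    runs = []
    for k in range(piles):
        runs.append({
            "run": k + 1,
            "wordlist_ham": ham_piles[k],
            "wordlist_spam": spam_piles[k],
            # Everything not used for the wordlist is used for tuning.
            "tuning_ham": [m for j, pile in enumerate(ham_piles)
                           if j != k for m in pile],
            "tuning_spam": [m for j, pile in enumerate(spam_piles)
                            if j != k for m in pile],
        })
    return runs


if __name__ == "__main__":
    # Placeholder message names standing in for real mail files.
    ham = [f"ham-{i}" for i in range(2400)]
    spam = [f"spam-{i}" for i in range(2400)]
    for plan in plan_runs(ham, spam):
        print(plan["run"], len(plan["wordlist_ham"]), len(plan["tuning_ham"]))
```

With four piles, each run uses 25% of the messages for the wordlist and
75% for tuning, which falls inside the 20-30% range mentioned earlier.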
...[snip]...
> Thanks. Will double-check. Bogofilter seems to have peaked at about
> 99.5% accuracy. I'd like to exceed that, as 0.5% of 4000+ messages
> still means that I see roughly 20 spams a day.
When I recheck, I usually find some messages I now consider
mis-classified. For example, in addition to spam about body parts,
mortgages, etc, I want bogofilter to catch virus messages and "bounced
message" messages from mailing lists.
With your volume of messages, you should be able to do a large bogotune
run :-)
David
--
David Relson Osage Software Systems, Inc.
relson at osagesoftware.com Ann Arbor, MI 48103
www.osagesoftware.com tel: 734.821.8800