Is bogotune helpful?

David Relson relson at osagesoftware.com
Wed Dec 3 00:47:06 CET 2003


On Tue, 02 Dec 2003 10:56:02 -0800
Bill Wohler <wohler at newt.com> wrote:

> David and Greg,
> 
> Thanks for the discussion.
> 
> As it so happens, I'm finishing up a perl script which, among other
> things, takes all the senders in my ham pile, as well as the
> recipients in my outbox and the addresses in my aliases file and
> compares them against the senders in my spam pile. This helped a lot
> to find false positives as well as false negatives which had been
> automatically filed away by procmail.
> 
> That script then builds a training set with 10,000 messages each of
> spam and ham. I'll be modifying it to set aside a set for tuning too.
> 
> Questions: I have about 30,000 messages in my ham corpus and an
> infinite number in my spam corpus (natch) and a script that does all
> the work. What numbers would you recommend I use for the
> $training_set_size and $tuning_set_size variables? Is there an optimum
> minimum number for $training_set_size? I think I remember Greg saying
> 10,000 which is why I've been using it. I imagine that the
> $tuning_set_size should be as large as my patience, correct? If my
> patience were infinite, is there a number where it would begin to have
> minimal returns? Also, is there a proportional relationship between
> the two sizes as well?

Hi Bill,

Greg and I are testing bogotune with data sets of approximately 40,000
messages (more or less evenly divided between ham and spam).  For speed
of processing and privacy protection, the messages are converted to the
"message count" format (which is an alphabetized list of tokens with ham
and spam counts).  The conversion to .mc files is time-consuming, but
pays off in lower memory consumption.
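
To illustrate the idea only, here is a minimal Python sketch of turning
one message into an alphabetized list of tokens with ham and spam
counts.  It does not reproduce the real .mc file layout or bogofilter's
lexer; the tokenizer, the function name to_message_counts, and the
in-memory wordlist (a dict mapping token to a ham/spam count pair) are
all assumptions made for the example.

```python
import re

# Hypothetical tokenizer; bogofilter's real lexer is far more elaborate.
TOKEN_RE = re.compile(r"[a-z0-9$'-]+")

def to_message_counts(message, wordlist):
    """Return an alphabetized list of (token, ham_count, spam_count).

    `wordlist` is an assumed dict mapping token -> (ham_count, spam_count).
    Tokens missing from the wordlist get counts of (0, 0).
    """
    # Each token is kept once per message; word order is discarded.
    unique_tokens = set(TOKEN_RE.findall(message.lower()))
    return [(tok,) + tuple(wordlist.get(tok, (0, 0)))
            for tok in sorted(unique_tokens)]

# Made-up counts, purely for demonstration.
wordlist = {"free": (2, 40), "meeting": (25, 1), "money": (3, 30)}
rows = to_message_counts("Free money! Free meeting?", wordlist)
for tok, ham, spam in rows:
    print(tok, ham, spam)
```

Because only the sorted tokens and their counts survive, the original
wording of the message cannot be read back, which is where the privacy
benefit comes from.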

Greg and I are looking to generate a better set of default parameters
for bogofilter.  Given a number of 40,000-message data sets from a
variety of people, we think bogotune can find a better universal (one
size fits all, reasonably well) set of parameters.  Would you be
interested in contributing to that effort?

> > Bogotune can be run without a wordlist using the "-D" (no database)
> > option.  It reads in all the messages, splits them in half, and uses
> > the first half to build a wordlist (in ram) and uses the second half
> > for tuning.  This process is reasonably memory efficient, but too
> > many messages and too little ram can still cause problems.
> 
> *That's* what -D does. I wasn't sure. You should use the second
> sentence in the paragraph above to replace the first sentence in the
> description of -D in the manual. It is much more clear.

Thanks for the suggestion.  The man page now says:

    The -D option tells bogotune to build a wordlist in memory using
    the input messages.  The messages are read and divided into two
    groups.  The first group is used to build a wordlist (in ram) and
    the second is used for tuning.  To meet the minimum requirements
    of 2000 messages in the wordlist and 500 messages for testing,
    when -D is used, there must be 2500 non-spam and 2500 spam in the
    input files.  If there are enough messages (more than 4000), they
    will be split evenly between wordlist and testing.  Otherwise,
    they will be split proportionately.
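
To make the arithmetic concrete, here is a minimal Python sketch of
that split rule for one class of messages (ham or spam).  The minimums
and the 4000 threshold come from the text above.  The function name
split_sizes is made up for the example, and reading "proportionately"
as "keep the 2000:500 ratio" is an assumption, not something the man
page states.

```python
MIN_WORDLIST = 2000      # minimum messages for the wordlist (man page text)
MIN_TESTING = 500        # minimum messages for testing (man page text)
EVEN_SPLIT_ABOVE = 4000  # more than this many messages: split evenly

def split_sizes(n):
    """Return (wordlist_size, testing_size) for n messages of one class."""
    if n < MIN_WORDLIST + MIN_TESTING:
        raise ValueError("need at least 2500 messages, got %d" % n)
    if n > EVEN_SPLIT_ABOVE:
        wordlist = n // 2
    else:
        # Assumption: "proportionately" means keeping the 2000:500 ratio.
        wordlist = n * MIN_WORDLIST // (MIN_WORDLIST + MIN_TESTING)
    return wordlist, n - wordlist

for n in (2500, 4000, 10000):
    print(n, split_sizes(n))
```

Under that assumption, 2500 messages give exactly the minimum 2000/500
split, and anything above 4000 is simply halved.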

Cheers,

David



