bogotune [was: Radical lexers]

David Relson relson at osagesoftware.com
Thu Dec 11 13:55:50 CET 2003


Organization: Osage Software Systems, Inc.
X-Mailer: Sylpheed version 0.9.7claws6 (GTK+ 1.2.10; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On 11 Dec 2003 16:00:39 +1100
michael at optusnet.com.au wrote:

> David Relson <relson at osagesoftware.com> writes:
> [..]
> > Hi Michael,
> > 
> > Greg and I have just such a project for you to participate in :-) 
> > He's collecting corpora from several people and is planning to run
> > them all through bogotune.  The goal is to generate new default
> > parameters for bogofilter -- parameters that do a demonstrably good
> > job on a wide variety of messages.  Bogofilter's current defaults
> > are based solely on our (Greg's and mine) corpora of a year ago.  We
> > want something based on bogofilter's current parsing and scoring. 
> > Would you care to participate?  If so, I'll be glad to send you the
> > needed details.
> 
> Sure. I tried to use bogotune some time ago, but it took a lot of
> effort to get it running, and then it was looking at a multi-day run
> time so I skipped it. :)
> 
> If it's a bit faster now (read: Doesn't exec() a new process for each
> message!) then deal me in.
> 
> Michael.

Hi Michael,

Greg and I would be pleased to have you join us in our tuning effort :-)

bogotune is a _lot_ faster now.  It is now (more or less) a
super-charged version of bogofilter.   The perl script has been recoded
in C, so it has access to all the functionality and speed of bogofilter.
 On my 500 MHz workstation it can make a single scoring pass over 20,000
ham and 20,000 spam in approx 8 seconds.   A bogotune run usually needs
330 to 500 passes over the messages so a 40,000 message tuning run takes
me about 45 minutes.
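
As a rough check of the arithmetic above, here is a small sketch that
multiplies the quoted figures together.  It uses only the numbers from
this message (8 seconds per pass, 330 to 500 passes); the function name
is made up for illustration and nothing here is measured output from
bogotune.

```python
# Back-of-the-envelope estimate built only from the figures quoted above.
SECONDS_PER_PASS = 8                  # one pass over 20,000 ham + 20,000 spam
PASSES_LOW, PASSES_HIGH = 330, 500    # typical number of passes per run


def run_minutes(passes, seconds_per_pass=SECONDS_PER_PASS):
    """Return the estimated wall-clock time of a tuning run, in minutes."""
    return passes * seconds_per_pass / 60


low = run_minutes(PASSES_LOW)
high = run_minutes(PASSES_HIGH)
print(f"estimated run time: {low:.0f} to {high:.0f} minutes")
```

The low end of that range matches the "about 45 minutes" figure; a run
that needs closer to 500 passes would take correspondingly longer.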

The tuning process consists of several steps.  First the wordlist is
read to calculate the initial robx value.  Then the messages are read. 
If they're in message-count format, for each message bogotune creates an
array that stores the ham and spam counts for each token of the message.
 Since just the counts are needed for computing the spamicity, bogotune
doesn't need to save the text of the token.  Then bogotune runs a coarse
scan (using 5 values of robs, 5 of robx, and 9 of min_dev, with large
deltas between values), finds the best result, runs a fine scan (using
smaller deltas), finds the best result, and prints its recommendations.
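
The coarse-then-fine scan described above can be sketched roughly as
follows.  This is an illustration of the general idea only, not
bogotune's actual code: the scoring function, the starting values, and
the deltas are all hypothetical, and reusing the 5/5/9 value counts for
the fine scan is an assumption.

```python
import itertools


def grid(center, delta, count):
    """Return `count` evenly spaced values centred on `center`."""
    half = (count - 1) / 2
    return [center + (i - half) * delta for i in range(count)]


def best_point(score, robs_vals, robx_vals, min_dev_vals):
    """Evaluate every combination and return the one with the lowest score."""
    return min(itertools.product(robs_vals, robx_vals, min_dev_vals),
               key=lambda p: score(*p))


def tune(score, start, coarse_delta, fine_delta):
    """Coarse scan around `start`, then a fine scan around the coarse winner."""
    counts = (5, 5, 9)  # 5 values of robs, 5 of robx, 9 of min_dev
    coarse = [grid(c, d, n) for c, d, n in zip(start, coarse_delta, counts)]
    winner = best_point(score, *coarse)
    # Assumption: the fine scan uses the same counts with smaller deltas.
    fine = [grid(c, d, n) for c, d, n in zip(winner, fine_delta, counts)]
    return best_point(score, *fine)


def toy_score(robs, robx, min_dev):
    """Hypothetical stand-in for 'errors made with these parameters'."""
    return (robs - 0.03) ** 2 + (robx - 0.52) ** 2 + (min_dev - 0.09) ** 2


best = tune(toy_score, start=(0.1, 0.5, 0.2),
            coarse_delta=(0.04, 0.02, 0.04), fine_delta=(0.01, 0.005, 0.01))
print("recommended robs, robx, min_dev:", best)
```

Each scan in this sketch evaluates 5 x 5 x 9 = 225 parameter
combinations, which is why a fast scoring pass matters so much.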

Using the message-count format provides privacy and speed.  Converting
to message-count format involves parsing a message, looking up the
tokens in the wordlist, and printing the tokens and their ham and spam
counts.  The tokens are alphabetized, which effectively obscures the
meaning of the message and preserves privacy.  Since the file format is
simple and contains the counts, bogotune can read it quickly and saves
time by not having to look up the tokens in the wordlist.  The conversion
to message-count format used to be done by a script and was somewhat
slow.  As of version 0.15.10, bogotune can do the conversion and
generate properly formatted output.
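
To make the idea concrete, here is a minimal sketch of the conversion
step: parse, look up, sort, print.  It is not bogotune's real lexer or
file format.  Splitting on whitespace, the in-memory dictionary standing
in for the wordlist, the collapsing of repeated tokens, and the printed
layout are all assumptions made for illustration.

```python
def to_message_count(text, wordlist):
    """Return sorted (token, spam_count, ham_count) records for one message.

    `wordlist` maps token -> (spam_count, ham_count); unknown tokens get 0, 0.
    Whitespace splitting is a stand-in for bogofilter's real lexer, and
    collapsing repeated tokens is an assumption of this sketch.
    """
    tokens = set(text.lower().split())
    return [(tok, *wordlist.get(tok, (0, 0))) for tok in sorted(tokens)]


# Hypothetical wordlist contents, for illustration only.
wordlist = {"cheap": (120, 3), "meeting": (2, 95), "pills": (80, 0)}

records = to_message_count("Cheap pills cheap MEETING today", wordlist)
for token, spam, ham in records:
    print(f'"{token}" {spam} {ham}')
```

Because only the sorted tokens and their counts survive, the original
word order of the message cannot be read back from the output.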

To prepare a message collection for bogotune, start with about 40,000
each of ham and spam.  Randomly split each in half.  Use the first half
to create the wordlist.  Convert the second half to message-count format.
Run bogotune.  The commands for preparation are (roughly):

   mkdir tune
   split        # placeholder: randomly divide ham and spam into part1 and part2 files
   bogofilter -d tune -n -I ham.part1.mbx
   bogofilter -d tune -s -I spam.part1.mbx
   bogotune -M -I ham.part2.mbx > ham.mc
   bogotune -M -I spam.part2.mbx > spam.mc

The above commands assume mbox format.  For Maildir or MH folder, use
"-B directory" rather than "-I ...mbx".
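
The "split" step in the commands above is left unspecified.  One
possible way to do it for mbox files is sketched below, using Python's
standard mailbox module.  The function name and the seed parameter are
made up for this example; it is not a tool that ships with bogofilter.

```python
import mailbox
import random


def split_mbox(src_path, part1_path, part2_path, seed=None):
    """Randomly divide the messages of one mbox file into two halves.

    Returns the number of messages written to each part.  If the output
    files already exist, messages are appended to them.
    """
    src = mailbox.mbox(src_path)
    keys = list(src.keys())
    random.Random(seed).shuffle(keys)   # random order, repeatable with a seed
    half = len(keys) // 2

    part1 = mailbox.mbox(part1_path)
    part2 = mailbox.mbox(part2_path)
    for i, key in enumerate(keys):
        target = part1 if i < half else part2
        target.add(src[key])

    # Write the new mailboxes to disk and release all three files.
    part1.flush()
    part2.flush()
    for box in (src, part1, part2):
        box.close()
    return half, len(keys) - half
```

For example, split_mbox("ham.mbx", "ham.part1.mbx", "ham.part2.mbx")
would produce the two ham files used in the commands above, assuming
the full ham collection is in a file called ham.mbx; repeat for spam.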

To run bogotune, do:

   bogotune -d tune -n ham.mc -s spam.mc

To prepare the collection for Greg and me, do:

   tar jcf tune.tar.bz2 tune/wordlist.db ham.mc spam.mc

and let us know where we can fetch the tarball by ftp.

As you know, the C implementation of bogotune is fairly new.  Its
development stage is best described as "gamma test".  It's working, and
working well, but still undergoing refinement.  I'm sure you'll find it
usable and useful, though you may encounter some glitches.

We're looking forward to hearing from you :-)

David




