Is bogotune helpful?

Bill Wohler wohler at newt.com
Tue Dec 2 19:56:02 CET 2003


David and Greg,

Thanks for the discussion.

As it so happens, I'm finishing up a perl script which, among other
things, takes all the senders in my ham pile, the recipients in my
outbox, and the addresses in my aliases file, and compares them against
the senders in my spam pile. This has helped a lot in finding false
positives as well as false negatives which had been automatically filed
away by procmail.
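
For readers who want to try the same cross-check, here is a minimal
sketch in Python. It is not the perl script described above; the
function names and the idea of passing plain lists of header values are
assumptions made for illustration. It reports every address that shows
up both among known correspondents and among spam senders, which is only
a hint that a message deserves a manual look.

```python
from email.utils import parseaddr


def normalize(address):
    """Return the bare, lower-cased address from a header value."""
    return parseaddr(address)[1].lower()


def suspect_spam(known_addresses, spam_senders):
    """Return spam-pile senders that are also known correspondents.

    known_addresses: ham senders, outbox recipients, alias entries
                     (hypothetical inputs, already extracted from mail).
    spam_senders:    From addresses taken from the spam pile.
    """
    known = {normalize(a) for a in known_addresses}
    known.discard("")  # ignore header values that could not be parsed
    return sorted({normalize(s) for s in spam_senders} & known)
```

Any address returned here marks messages worth reviewing by hand; the
overlap alone does not say which pile the message belongs in.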

That script then builds a training set with 10,000 messages each of spam
and ham. I'll be modifying it to set aside a set for tuning too.
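
Setting aside a tuning set that does not overlap the training set could
look roughly like the following Python sketch. The function name, the
parameter names, and the fixed random seed are assumptions for
illustration, not a description of the actual script.

```python
import random


def split_corpus(messages, training_set_size, tuning_set_size, seed=0):
    """Randomly pick disjoint training and tuning sets from one corpus."""
    if training_set_size + tuning_set_size > len(messages):
        raise ValueError("corpus too small for the requested set sizes")
    shuffled = list(messages)              # leave the caller's list alone
    random.Random(seed).shuffle(shuffled)  # seeded so runs are repeatable
    training = shuffled[:training_set_size]
    tuning = shuffled[training_set_size:training_set_size + tuning_set_size]
    return training, tuning
```

The same function would be called once for the ham corpus and once for
the spam corpus, so that each class contributes its own training and
tuning messages.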

Questions: I have about 30,000 messages in my ham corpus and an infinite
number in my spam corpus (natch) and a script that does all the work.
What numbers would you recommend I use for the $training_set_size and
$tuning_set_size variables? Is there a recommended minimum for
$training_set_size? I think I remember Greg saying 10,000, which is why
I've been using it. I imagine that the $tuning_set_size should be as
large as my patience, correct? If my patience were infinite, is there a
number beyond which the returns would diminish? Also, should the two
sizes be in some proportion to each other?

> Bogotune can be run without a wordlist using the "-D" (no database)
> option.  It reads in all the messages, splits them in half, and uses the
> first half to build a wordlist (in ram) and uses the second half for
> tuning.  This process is reasonably memory efficient, but too many
> messages and too little ram can still cause problems.

*That's* what -D does. I wasn't sure. You should use the second sentence
in the paragraph above to replace the first sentence in the description
of -D in the manual. It is much clearer.
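
To make the quoted description concrete, here is a toy Python sketch of
the "split in half" idea. It is not bogotune's code: the function name
is made up, whitespace splitting stands in for real tokenizing, and a
single counter stands in for a real wordlist.

```python
from collections import Counter


def split_for_tuning(messages):
    """Build an in-memory word count from the first half of the messages
    and return it together with the second half, kept for tuning."""
    half = len(messages) // 2
    wordlist = Counter()
    for text in messages[:half]:
        # Toy tokenizer: lower-case and split on whitespace.
        wordlist.update(text.lower().split())
    return wordlist, messages[half:]
```

Because the counts live only in memory, the memory needed grows with the
number of distinct words in the first half, which fits the caution in
the quoted text about too many messages and too little ram.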

> P.S.  If you're up for compiling source code and being a beta tester,
> let me know.

Hmmm, I'm falling behind in my other open source projects as it is. I'd
better not commit to more ;-).

-- 
Bill Wohler <wohler at newt.com>  http://www.newt.com/wohler/  GnuPG ID:610BD9AD
Maintainer of comp.mail.mh FAQ and MH-E. Vote Libertarian!
If you're passed on the right, you're in the wrong lane.



