training is SLOW

David Relson relson at osagesoftware.com
Sun Aug 10 19:30:59 CEST 2003


At 12:41 PM 8/10/03, Lane P. Lester wrote:
>"Rodney D. Myers" <rdmyers at pe.net> wrote:
> > Before it would take a few hours to run through 60,000+ email, and
> > less than 1000 spam. It was still churning along after 24 hours, and
> > not done yet.
>
>It seems you're not using the re-training method I was shown:
>http://linux.oreillynet.com/lpt/a/3167

Rodney,

We need more details on what you're doing ...  While thinking about your 
problem, a possible cause came to mind:

The big change in 0.14.x is to use one database, i.e. wordlist.db, for 
holding both spam and ham tokens.  Previously, bogofilter always used two 
databases - spamlist.db and goodlist.db.

With the change, we've noticed that the size of BerkeleyDB's cache can have a 
significant effect on system performance.  If you want to experiment, try 
something like:

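#!/bin/sh
# Time registering test.mbx as ham (-n) at several cache sizes (-k, in MB).
for cache in 4 8 12 16 ; do
   rm -f wordlist.db    # start every run from an empty wordlist
   echo "cache size: $cache"
   time -p bogofilter -n -d . -k "$cache" < test.mbx
done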

where test.mbx is a mailbox with 1,000 messages.  If the times are too 
short to show a difference, try using 10,000 messages.

Anyhow, the script should help you find a cache size that works well for 
your machine.
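
If you don't have a 1,000-message mailbox handy, something along these lines 
could cut one out of a larger mbox file.  This is only a sketch: the function 
name first_n_messages and the file name big.mbx are made up for illustration, 
and it assumes every message starts with a "From " line and that "From " 
lines inside message bodies are already escaped (e.g. as ">From ").

```shell
#!/bin/sh
# Hypothetical helper: print the first N messages of an mbox file.
# Assumes each message begins with a line starting with "From " and
# that body lines starting with "From " are escaped (mbox convention).
first_n_messages() {
    # $1 = number of messages to keep, $2 = path to the mbox file
    awk -v n="$1" '
        /^From / { count++ }     # a new message starts here
        count > n { exit }       # stop once message N+1 begins
        { print }
    ' "$2"
}

# Example use (big.mbx is an assumed file name):
#   first_n_messages 1000 big.mbx > test.mbx
```

For the longer run, change 1000 to 10000.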

David




