training is SLOW

Greg Louis glouis at dynamicro.on.ca
Sun Aug 10 20:09:15 CEST 2003


On 20030810 (Sun) at 1330:59 -0400, David Relson wrote:

> The big change in 0.14.x is to use one database, i.e. wordlist.db, for 
> holding both spam and ham tokens.  Previously, bogofilter always used two 
> databases - spamlist.db and goodlist.db.
> 
> With the change, we've noticed that the size of BerkeleyDB's cache can have a 
> significant effect on system performance.  If you want to experiment, try 
> something like:
> 
> #!/bin/sh
> for cache in 4 8 12 16 ; do
>   rm -f wordlist.db
>   echo "cache size: $cache"
>   time -p bogofilter -n -d . -k "$cache" < test.mbx
> done
> 
> where test.mbx is a mailbox with 1,000 messages.   If the times are too 
> short to compare, try using 10,000 messages.
> 
> Anyhow, the script should help you find a cache size that works well for 
> your machine.

The problem is that the optimal cache size depends on the size of the
wordlist (or the larger of spamlist and goodlist if you use two lists).
Since those lists grow during retraining, it's quite possible to get
into the slow region unless you set the cache size just over the
expected final size of the larger list.  For example, suppose you're
making a 30-MB wordlist: a cache of 11 MB works well with that, but for
sizes between about 15 and 28 MB, bogofilter can be awfully slow.  So,
at least for retraining, it's better to set the cache to 31 MB.
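
As a rough sketch of that rule of thumb, the helper below turns an
expected final wordlist size in bytes into a cache size one megabyte
above it, rounding up to whole megabytes first.  The function name
cache_mb_for_bytes and the file name old_wordlist.db are made up for
illustration, the 1-MB margin is simply read off the 30 -> 31 example,
and it assumes that -k takes its argument in megabytes.

```shell
#!/bin/sh
# Hypothetical helper, not part of bogofilter.
# Given an expected final wordlist size in bytes, print a cache size in MB
# that is just over it: round up to whole MB, then add 1 MB of margin
# (the margin is an assumption taken from the 30 MB -> 31 MB example).
cache_mb_for_bytes() {
    bytes=$1
    mb=$(( (bytes + 1048575) / 1048576 ))   # ceiling division by 1 MiB
    echo $(( mb + 1 ))
}

# Example use, assuming an earlier wordlist of about the expected final size
# is available as old_wordlist.db (hypothetical file name):
#   cache=$(cache_mb_for_bytes "$(wc -c < old_wordlist.db)")
#   time -p bogofilter -n -d . -k "$cache" < test.mbx
```

This only picks a starting value; timing a few cache sizes on your own
machine is still the way to confirm it.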

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



