training is SLOW

David Relson relson at osagesoftware.com
Sun Aug 10 21:33:55 CEST 2003


Hi Rodney,

How big has your wordlist.db grown to be?  Greg's suggestion of using "-k 
31" to set the cache size might be very helpful.

When bogofilter has a message to register, it parses the message, sorts the 
tokens (and removes duplicates), then reads/writes the database for each 
token.  Having sorted the tokens makes life easier for the database and 
makes the process much faster.

When registering a mailbox, after sorting each message's tokens, a 
cumulative (master) list of tokens is created.  At the end of the mailbox, 
the cumulative list is sorted and then bogofilter reads/writes the 
database.  This minimizes database access and improves performance.

Bogofilter's code for '-b' uses the first (per-message) method above.  It 
doesn't take advantage of the cumulative list of the second (mailbox) method.
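
To make the difference concrete, here is a rough shell sketch that only 
counts how many token writes each method would need.  It is not bogofilter's 
code: the function names are made up for illustration, and "tokens" are 
simply whitespace-separated words, which is far cruder than the real lexer.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; none of this is bogofilter code.

# Print the sorted, de-duplicated "tokens" (whitespace-separated words)
# of one message file.
message_tokens() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort -u
}

# Per-message method: each message's sorted list is written on its own,
# so a token shared by several messages is touched once per message.
per_message_accesses() {
    total=0
    for msg in "$@"; do
        n=$(message_tokens "$msg" | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

# Cumulative method: merge every message's tokens into one master list,
# sort it once, and touch each distinct token a single time.
cumulative_accesses() {
    count=$(for msg in "$@"; do message_tokens "$msg"; done | sort -u | wc -l)
    echo "$((count))"
}
```

Calling both functions with the same list of message files shows the gap: 
the more tokens the messages share, the fewer database accesses the 
cumulative count needs compared with the per-message count.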

So, the slowness you're seeing is probably caused by a bogofilter 
inefficiency (in not collecting _all_ the tokens and sorting them) and by 
Berkeley DB look-up speed (which is suffering from insufficient cache).

What to do?

One idea is to create an on-the-fly mbox and feed it to bogofilter. I'm 
thinking along the following lines:

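for dir in /home/rodney/Mail/Computer/* ; do
    # -exec cannot run a quoted command list directly, so hand it to sh -c;
    # each file is emitted as a "From " line, the message, and a blank line.
    find "$dir" -type f -exec sh -c 'echo "From " ; cat "$1" ; echo' sh {} \; |
        /usr/local/bin/bogofilter -b -nvvv -PI
done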

In the above, the "find" command creates the on-the-fly mbox, which is 
piped to bogofilter.
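
Before feeding a large tree to bogofilter, it may be worth checking that the 
generated stream splits the way you expect.  The sketch below is only an 
illustration: make_mbox and check_mbox are names I made up, and the check 
assumes that a line starting with "From " is what separates messages.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of bogofilter.

# Emit every regular file under directory $1 as one mbox-style stream:
# a bare "From " separator line, the message itself, then a blank line.
make_mbox() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            echo "From "
            cat "$f"
            echo
        done' sh {} +
}

# Compare the number of files with the number of lines starting with "From ".
# More separators than files suggests that some message body contains its
# own line beginning with "From ", which would split that message in two.
check_mbox() {
    files=$(find "$1" -type f | wc -l)
    seps=$(make_mbox "$1" | grep -c '^From ')
    echo "files=$((files)) separators=$((seps))"
}
```

Running check_mbox on one of the mail directories prints both counts; if 
they differ, the offending body lines would need quoting (the usual mbox 
convention is ">From ") before the stream is piped to bogofilter.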

David

By the way, the "-PI" is unnecessary since it specifies case-sensitive 
parsing and that's the default.




