training is SLOW
David Relson
relson at osagesoftware.com
Sun Aug 10 21:33:55 CEST 2003
Hi Rodney,
How big has your wordlist.db grown to be? Greg's suggestion of using "-k
31" to set the cache size might be very helpful.
When bogofilter has a message to register, it parses the message, sorts the
tokens (and removes duplicates), then reads/writes the database for each
token. Having sorted the tokens makes life easier for the database and
makes the process much faster.
When registering a mailbox, after sorting each message's tokens, a
cumulative (master) list of tokens is created. At the end of the mailbox,
the cumulative list is sorted and then bogofilter reads/writes the
database. This minimizes database access and improves performance.
Bogofilter's code for '-b' uses method 1 above (database access after each
message). It doesn't take advantage of the cumulative list of method 2
(one sorted pass at the end of the mailbox).
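To make the difference between the two approaches concrete, here is a rough shell sketch. It is not bogofilter's actual code: the two sample "messages" and the tokens helper are made up for illustration, and a "database access" is simply counted as one per distinct token written.

```shell
# Hypothetical sample messages, already reduced to whitespace-separated tokens.
msg1="free money now free"
msg2="money back now"

# tokens: illustrative helper that splits, sorts and de-duplicates one message.
tokens() { printf '%s\n' $1 | sort -u; }

# Per-message approach: each message's unique tokens hit the database separately.
per_message=$(( $(tokens "$msg1" | wc -l) + $(tokens "$msg2" | wc -l) ))

# Cumulative approach: merge all tokens, sort once, one access per distinct token.
cumulative=$(( $( { tokens "$msg1"; tokens "$msg2"; } | sort -u | wc -l ) ))

echo "per-message accesses: $per_message"
echo "cumulative accesses:  $cumulative"
```

With these two toy messages the per-message count is 6 and the cumulative count is 4. The gap grows with the number of messages, since common tokens recur in most of them.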
So, the slowness you're seeing is probably caused by a bogofilter
inefficiency (in not collecting _all_ the tokens and sorting them) and
BerkeleyDB look-up speed (which is suffering from insufficient cache).
What to do?
One idea is to create an on-the-fly mbox and feed it to bogofilter. I'm
thinking along the following lines:
    for dir in /home/rodney/Mail/Computer/* ; do
        # Emit each file as an mbox entry: "From " line, message, blank line.
        find "$dir" -type f -exec sh -c 'echo "From " ; cat "$1" ; echo' sh {} \; |
            /usr/local/bin/bogofilter -b -nvvv -PI
    done
In the above, the "find" command creates the on-the-fly mbox, which is
piped to bogofilter.
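Before registering anything, it may be worth checking that the generated mbox has one "From " separator per file. The sketch below does that against a throwaway directory. The sample messages and the make_mbox helper name are assumptions for illustration only, not part of bogofilter.

```shell
# Hypothetical sample data standing in for a real mail directory.
maildir=$(mktemp -d)
printf 'Subject: one\n\nfirst body\n'  > "$maildir/1"
printf 'Subject: two\n\nsecond body\n' > "$maildir/2"

# make_mbox: illustrative helper that writes every file under a directory
# as an mbox entry ("From " line, message text, blank line) to stdout.
make_mbox() {
    find "$1" -type f -exec sh -c 'echo "From "; cat "$1"; echo' sh {} \;
}

# Compare the number of files with the number of separator lines.
files=$(( $(find "$maildir" -type f | wc -l) ))
seps=$(( $(make_mbox "$maildir" | grep -c '^From ') ))
echo "files: $files, From lines: $seps"

rm -rf "$maildir"
```

On real mail the two counts can differ if a message body contains a line that begins with "From ", so a small mismatch is not necessarily a problem with the loop.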
David
By the way, the "-PI" is unnecessary since it specifies case-sensitive
parsing and that's the default.