training is SLOW

Rodney D. Myers rdmyers at pe.net
Sun Aug 10 22:21:47 CEST 2003


On Sun, 10 Aug 2003 15:33:55 -0400
David Relson <relson at osagesoftware.com> wrote:

> Hi Rodney,
> 
> How big has your wordlist.db grown to be?  Greg's suggestion of using
> "-k 31" to set the cache size might be very helpful.

When I aborted this morning, it was at:

ls -la /home/rodney/.bogofilter/wordlist.db
-rw-r--r--    1 rodney   rodney    1470464 Aug 10 10:55   /home/rodney/.bogofilter/wordlist.db

Will try the "-k 31" tonight.

> When bogofilter has a message to register, it parses the message,
> sorts the tokens (and removes duplicates), then reads/writes the
> database for each token.  Having sorted the tokens makes life easier
> for the database and makes the process much faster.
> 
> When registering a mailbox, after sorting each message's tokens, a 
> cumulative (master) list of tokens is created.  At the end of the
> mailbox, the cumulative list is sorted and then bogofilter
> reads/writes the database.  This minimizes database access and
> improves performance.
> 
> Bogofilter's code for '-b' uses method 1 above.  It doesn't take
> advantage of the cumulative list of method 2.
> 
> So, the slowness you're seeing is probably caused by a bogofilter 
> inefficiency (in not collecting _all_ the tokens and sorting them) and
> BerkeleyDB look-up speed (which is suffering from insufficient cache).
> 
> What to do?
> 
> One idea is to create an on-the-fly mbox and feed it to bogofilter.
> I'm thinking along the following lines:
> 
> for dir in /home/rodney/Mail/Computer/* ; do
>     find $dir -type f -exec "echo 'From ' ; cat {} ; echo " 
> /usr/local/bin/bogofilter -b -nvvv -PI
> done

Created a "script" using the above method, but I get:

find: missing argument to `-exec'

not sure what it's missing, but I will tinker.
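
For reference, the error most likely comes from two things in the quoted loop: "-exec" has no terminating "\;", and find passes its argument to a single command rather than to a shell, so the quoted "echo ... ; cat {} ; echo" string is not run as a script. A minimal sketch of one way around this, assuming a POSIX sh and find, is to hand the snippet to "sh -c". The function name make_mbox is made up for illustration, and the bare "From " separator is taken as-is from the suggestion above; whether bogofilter accepts it as a message boundary is not confirmed here.

```shell
# Hypothetical helper (name is an assumption, not part of bogofilter):
# print every regular file under the directory "$1" as one mbox-like stream.
make_mbox() {
    # find runs exactly one command per -exec, so the three steps are
    # wrapped in "sh -c"; the file name arrives in the inner shell as $1.
    # The trailing \; is the terminator that -exec requires.
    find "$1" -type f -exec sh -c '
        echo "From "    # bare separator line, as in the quoted suggestion
        cat "$1"        # the message itself
        echo            # blank line between messages
    ' sh {} \;
}

# Intended use (not run here): pipe the stream into bogofilter with
# whatever registration options apply, for example
#   make_mbox "$dir" | /usr/local/bin/bogofilter <options>
```

Wrapping the steps in a function keeps the quoting in one place; the per-directory loop from the quoted message can then call it once per folder and pipe the result onward.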
 
> In the above, the "find" command creates the on-the-fly mbox, which is
> piped to bogofilter.
> 
> David
> 
> By the way, the "-PI" is unnecessary since it specifies case-sensitive
> parsing and that's the default.

Must be a "new" feature. I removed it, and replaced it with the "-k 31" argument above.

I will see what I can do about converting, for training, to mbox format.

Thanks

-- 
Rodney D. Myers <rdmyers at pe.net>	Registered Linux User #96112
ICQ#:     AIM#:       YAHOO:
18002350  mailman452  mailman42_5

They that can give up essential liberty to obtain a 
little temporary safety deserve neither liberty nor safety.
        Ben Franklin - 1759