training is SLOW
Rodney D. Myers
rdmyers at pe.net
Sun Aug 10 22:21:47 CEST 2003
On Sun, 10 Aug 2003 15:33:55 -0400
David Relson <relson at osagesoftware.com> wrote:
> Hi Rodney,
>
> How big has your wordlist.db grown to be? Greg's suggestion of using
> "-k 31" to set the cache size might be very helpful.
When I aborted the run this morning, it was:
ls -la /home/rodney/.bogofilter/wordlist.db
-rw-r--r-- 1 rodney rodney 1470464 Aug 10 10:55 /home/rodney/.bogofilter/wordlist.db
Will try "-k 31" tonight.
> When bogofilter has a message to register, it parses the message,
> sorts the tokens (and removes duplicates), then reads/writes the
> database for each token. Having sorted the tokens makes life easier
> for the database and makes the process much faster.
>
> When registering a mailbox, after sorting each message's tokens, a
> cumulative (master) list of tokens is created. At the end of the
> mailbox, the cumulative list is sorted and then bogofilter
> reads/writes the database. This minimizes database access and
> improves performance.
>
> Bogofilter's code for '-b' uses method 1 above. It doesn't take
> advantage of the cumulative list of method 2.
>
> So, the slowness you're seeing is probably caused by a bogofilter
> inefficiency (in not collecting _all_ the tokens and sorting them) and
> BerkeleyDB look-up speed (which is suffering from insufficient cache).
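
To make the difference between the two methods concrete, here is a rough sketch that only counts how many database accesses each approach would need. It is not bogofilter's code: the tokens() splitter is a crude stand-in for the real parser, and all three function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical tokenizer: one distinct token per line, sorted.
# (A crude stand-in; bogofilter's real parser is far more involved.)
tokens() {
    tr -cs '[:alnum:]' '\n' < "$1" | sed '/^$/d' | sort -u
}

# Method 1: each message is sorted and written on its own, so a token
# shared by several messages costs one database access per message.
accesses_per_message() {
    for msg in "$@"; do tokens "$msg"; done | wc -l | tr -d ' '
}

# Method 2: per-message lists are merged into a cumulative list first;
# "uniq -c" keeps the per-token counts, and each distinct token is
# then accessed once.
accesses_cumulative() {
    for msg in "$@"; do tokens "$msg"; done | sort | uniq -c | wc -l | tr -d ' '
}
```

Running both functions over the same set of message files shows how much overlap there is: the more tokens the messages share, the larger the gap between the two counts.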
>
> What to do?
>
> One idea is to create an on-the-fly mbox and feed it to bogofilter.
> I'm thinking along the following lines:
>
> for dir in /home/rodney/Mail/Computer/* ; do
> find $dir -type f -exec "echo 'From ' ; cat {} ; echo "
> /usr/local/bin/bogofilter -b -nvvv -PI
> done
Created a "script" using the above method, but I get:
find: missing argument to `-exec'
Not sure what it's missing, but I will tinker.
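
For reference, find reports "missing argument to `-exec'" when the -exec command is not terminated with an escaped ";" or with "+". The quoted loop also has two further gaps: find does not run its -exec argument through a shell, so a sequence of commands has to be wrapped in "sh -c", and there is no pipe between find and bogofilter. A minimal sketch of the same idea, where make_mbox is a made-up helper name and the bare "From " separator is kept as in the quoted script:

```shell
#!/bin/sh
# Emit every regular file under directory $1 as one mbox-style stream
# on stdout. make_mbox is a hypothetical name, not part of bogofilter.
make_mbox() {
    find "$1" -type f -exec sh -c '
        for msg in "$@"; do
            echo "From "    # separator line, as in the quoted script
            cat "$msg"
            echo            # blank line after each message
        done
    ' sh {} +
}

# Possible usage, with the bogofilter flags copied unchanged from the
# quoted script (left commented out; adjust the flags before running):
# for dir in /home/rodney/Mail/Computer/* ; do
#     make_mbox "$dir" | /usr/local/bin/bogofilter -b -nvvv -PI
# done
```

The "+" terminator hands many file names to a single sh invocation, which avoids starting one shell per message.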
> In the above, the "find" command creates the on-the-fly mbox, which is
> piped to bogofilter.
>
> David
>
> By the way, the "-PI" is unnecessary since it specifies case-sensitive
> parsing and that's the default.
Must be a "new" feature. I removed it and replaced it with the "-k 31" argument above.
I will see what I can do about converting, for training, to mbox format.
Thanks
--
Rodney D. Myers <rdmyers at pe.net> Registered Linux User #96112
ICQ#: AIM#: YAHOO:
18002350 mailman452 mailman42_5
They that can give up essential liberty to obtain a
little temporary safety deserve neither liberty nor safety.
Ben Franklin - 1759