training is SLOW

David Relson relson at osagesoftware.com
Sun Aug 10 22:46:49 CEST 2003


At 04:21 PM 8/10/03, Rodney D. Myers wrote:
>On Sun, 10 Aug 2003 15:33:55 -0400
>David Relson <relson at osagesoftware.com> wrote:
>
> > Hi Rodney,
> >
> > How big has your wordlist.db grown to be?  Greg's suggestion of using
> > "-k 31" to set the cache size might be very helpful.
>
>When I aborted this morning, it's currently;
>
>ls -la /home/rodney/.bogofilter/wordlist.db
>-rw-r--r--    1 rodney   rodney    1470464 Aug 10 10:55 /home/rodney/.bogofilter/wordlist.db

That's pretty darn small.  I'd been expecting something large - say tens or 
hundreds of megabytes.  I wonder if you got hung up on something.  Could you 
tell if bogofilter was actually running?  Could you hear the hard drive 
working away?  Does "top" indicate activity?  When I tested with your flags, 
specifically "-nvvv", bogofilter was printing a status message for every 
email.  Are those appearing?

>Will try the "k -31" tonight.

Good.  (Note that the option is spelled "-k 31".)

> > When bogofilter has a message to register, it parses the message,
> > sorts the tokens (and removes duplicates), then reads/writes the
> > database for each token.  Having sorted the tokens makes life easier
> > for the database and makes the process much faster.
> >
> > When registering a mailbox, after sorting each message's tokens, a
> > cumulative (master) list of tokens is created.  At the end of the
> > mailbox, the cumulative list is sorted and then bogofilter
> > reads/writes the database.  This minimizes database access and
> > improves performance.
> >
> > Bogofilter's code for '-b' uses method 1 above.  It doesn't take
> > advantage of the cumulative list of method 2.
> >
> > So, the slowness you're seeing is probably caused by a bogofilter
> > inefficiency (in not collecting _all_ the tokens and sorting them) and
> > BerkeleyDB look-up speed (which is suffering from insufficient cache).
> >
> > What to do?
> >
> > One idea is to create an on-the-fly mbox and feed it to bogofilter.
> > I'm thinking along the following lines:
> >
> > for dir in /home/rodney/Mail/Computer/* ; do
> >     find $dir -type f -exec "echo 'From ' ; cat {} ; echo "
> > /usr/local/bin/bogofilter -b -nvvv -PI
> > done
>
>Created a "script" using the above method, but I get;
>
>find: missing argument to `-exec'
>
>not sure what it's missing, but I will tinker.
>
> > In the above, the "find" command creates the on-the-fly mbox, which is
> > piped to bogofilter.
> >
> > David
> >
> > By the way, the "-PI" is unnecessary since it specifies case-sensitive
> > parsing and that's the default.
>
>Must be a "new" feature. I removed it, and replaced it with the "k -31" 
>argument above.
>
>I will see what I can do about converting, for training, to mbox format.
>
>Thanks

Next idea (sweet and simple; undoubtedly can be improved):

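#!/bin/sh
#
# file_to_mbx.sh
#
# Write each file named on the command line as one mbox message:
# a "From " separator line, the file itself, then a blank line.
for file in "$@" ; do
    echo "From " ; cat "$file" ; echo ""
done

#!/bin/sh
#
# dir_to_mbx.sh
#
# xargs passes the file names as arguments to file_to_mbx.sh,
# which must be executable and on the PATH.
find "$1" -type f | xargs file_to_mbx.sh

#!/bin/sh
#
# train.sh
#
for dir in ... ; do
    dir_to_mbx.sh "$dir" | bogofilter -d . -nvvv
done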
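
The same idea can also be folded into a single file.  What follows is only a sketch under a few assumptions: the function names file_to_mbx and dir_to_mbx are illustrative and just mirror the script names above, each file under the directory is assumed to hold exactly one message, and the bare "From " line is assumed to be enough of a separator, as in the scripts above.  It only writes the mbox stream to standard output and does not call bogofilter itself.

```shell
#!/bin/sh
# Sketch: build an on-the-fly mbox from a directory holding one message per file.
# The function names are illustrative; they mirror the script names above.

# Print one file as an mbox message: separator line, contents, blank line.
file_to_mbx() {
    echo "From "
    cat "$1"
    echo ""
}

# Print every regular file under directory $1 as a single mbox stream.
# Reading names line by line keeps file names with spaces intact
# (names containing newlines are not handled).
dir_to_mbx() {
    find "$1" -type f | while IFS= read -r f ; do
        file_to_mbx "$f"
    done
}
```

With these functions defined in train.sh, the body of the loop would become: dir_to_mbx "$dir" | bogofilter -d . -nvvv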





