ext3fs slowness -- how things proceed

David Relson relson at osagesoftware.com
Wed Feb 5 13:21:05 CET 2003


At 07:01 AM 2/5/03, Greg Louis wrote:


>When working with a 20-million-byte (19 Mb or so) database, I found
>that a cache of 17Mb was sufficient to support the minimum execution
>time of 18 seconds (building from scratch using bogoutil with tokens in
>random order).  16.5 Mb cache, 1 minute 5 seconds.  16.0 Mb cache, 3
>minutes odd.  15 Mb cache, six and a half minutes.  10 Mb, six and a
>half minutes (maybe a couple seconds longer than 15Mb).  256 Kb, the
>default, took 26 minutes.  All this on ext3, data=ordered.
>
>I suspect that without write ordering, the scatter is too great for
>anything but a near-full-size cache.  I'm running my production
>bogofilter at work with a 25Mb cache because the goodlist is over 30
>million bytes there.

Greg,

At present, bogofilter builds an unsorted set of words, discarding
duplicates as it goes.  It then walks this unordered set to compute spam
scores.  For each word, bogofilter gets the count from the spamlist, then
from the goodlist, i.e. it alternates between wordlists.  That looks like
a worst-case scenario for the database.

At the moment, I'm thinking of a two-part patch.  First, sort the
tokens, so that bogofilter accesses the database in an ordered manner.
Second, do all the work for the spamlist, then all the work for the
goodlist.  That should minimize cache needs.
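
Here is a rough sketch of the two-part idea, written in Python rather than
bogofilter's C just to show the order of access.  The dicts, the trace list
and the function name are stand-ins made up for illustration; they are not
bogofilter's actual database interface.  Sorting only helps if the database
keeps neighbouring keys on neighbouring pages, which is an assumption here.

```python
# Hypothetical stand-ins: plain dicts play the role of the on-disk wordlists,
# and `trace` records the order in which (wordlist, token) lookups happen.

def lookup_counts(tokens, spamlist, goodlist, trace):
    """Look up every token in the spamlist first, then in the goodlist."""
    ordered = sorted(set(tokens))      # part 1: drop duplicates, sort tokens

    spam_counts = {}
    for tok in ordered:                # part 2a: all spamlist reads, in order
        trace.append(("spam", tok))
        spam_counts[tok] = spamlist.get(tok, 0)

    good_counts = {}
    for tok in ordered:                # part 2b: all goodlist reads, in order
        trace.append(("good", tok))
        good_counts[tok] = goodlist.get(tok, 0)

    # Pair up the two counts per token for the later scoring step.
    return {tok: (spam_counts[tok], good_counts[tok]) for tok in ordered}


# Made-up sample data, only to exercise the function.
spamlist = {"viagra": 40, "free": 25, "meeting": 1}
goodlist = {"meeting": 30, "free": 5, "patch": 12}
trace = []
counts = lookup_counts(["patch", "free", "viagra", "free", "meeting"],
                       spamlist, goodlist, trace)
```

With this ordering the trace shows one sorted sweep over the spamlist
followed by one sorted sweep over the goodlist, instead of hopping between
the two lists for every word.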

If I build it, will you test it?

David






