ext3fs slowness -- how things proceed
David Relson
relson at osagesoftware.com
Wed Feb 5 13:21:05 CET 2003
At 07:01 AM 2/5/03, Greg Louis wrote:
>When working with a 20-million-byte (19 Mb or so) database, I found
>that a cache of 17Mb was sufficient to support the minimum execution
>time of 18 seconds (building from scratch using bogoutil with tokens in
>random order). 16.5 Mb cache, 1 minute 5 seconds. 16.0 Mb cache, 3
>minutes odd. 15 Mb cache, six and a half minutes. 10 Mb, six and a
>half minutes (maybe a couple seconds longer than 15Mb). 256 Kb, the
>default, took 26 minutes. All this on ext3, data=ordered.
>
>I suspect that without write ordering, the scatter is too great for
>anything but a near-full-size cache. I'm running my production
>bogofilter at work with a 25Mb cache because the goodlist is over 30
>million bytes there.
Greg,
At present, bogofilter builds an unsorted set of words, discarding
duplicates as it goes. It then walks this unordered word set to compute
spam scores. For each word, bogofilter gets the count from the spamlist,
then from the goodlist, i.e. it alternates between wordlists. That seems
like a worst-case scenario for the database.
At the moment, I'm thinking of a two-part patch. First, sort the
tokens. That will allow bogofilter to access the database in key
order. Second, do all the lookups for the spamlist, then all the lookups
for the goodlist. That should minimize cache needs.
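
To make the two access patterns concrete, here is a rough sketch in Python
(not bogofilter's own C code). Plain dicts stand in for the on-disk
wordlists, and the function names lookup_interleaved and
lookup_sorted_batched are made up for illustration. Each function returns
the per-token counts plus a trace of which list was touched for which
token, so the order of database accesses is visible.

```python
def lookup_interleaved(tokens, spamlist, goodlist):
    """Current pattern: arbitrary token order, alternating between lists."""
    counts = {}
    trace = []  # records (list name, token) for every lookup, in order
    for tok in tokens:
        trace.append(("spam", tok))
        spam_count = spamlist.get(tok, 0)
        trace.append(("good", tok))
        good_count = goodlist.get(tok, 0)
        counts[tok] = (spam_count, good_count)
    return counts, trace


def lookup_sorted_batched(tokens, spamlist, goodlist):
    """Proposed pattern: sort tokens, finish one list before the other."""
    ordered = sorted(set(tokens))  # duplicates dropped, keys in order
    trace = []

    # Pass 1: every spamlist lookup, in key order.
    spam_counts = {}
    for tok in ordered:
        trace.append(("spam", tok))
        spam_counts[tok] = spamlist.get(tok, 0)

    # Pass 2: every goodlist lookup, in key order.
    good_counts = {}
    for tok in ordered:
        trace.append(("good", tok))
        good_counts[tok] = goodlist.get(tok, 0)

    counts = {tok: (spam_counts[tok], good_counts[tok]) for tok in ordered}
    return counts, trace


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    spamlist = {"free": 25, "meeting": 1, "offer": 40}
    goodlist = {"free": 5, "meeting": 30, "patch": 12}
    tokens = ["patch", "offer", "meeting", "free"]

    for fn in (lookup_interleaved, lookup_sorted_batched):
        counts, trace = fn(tokens, spamlist, goodlist)
        print(fn.__name__)
        print("  counts:", counts)
        print("  access order:", trace)
```

Both functions yield the same counts; only the order of lookups differs.
Whether the sorted order actually helps depends on the database keeping
neighbouring keys on the same pages, which this sketch does not model.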
If I build it, will you test it?
David