Growth [was: Unsures]

David Relson relson at osagesoftware.com
Thu Jun 24 14:37:38 CEST 2004


On 24 Jun 2004 08:13:43 -0400
Tom Anderson wrote:

...[snip]...

> While I'm sure this feature is great, I haven't upgraded to that
> version yet.  But I still find that my wordlist is growing more slowly
> with time.  It's currently 37M.  Two months ago it was 30M.  I started
> this wordlist about 8 months ago, so the average monthly growth is
> around 4.6M/month, while the current growth is around 3.5M/month. 
> That's a deceleration of about 0.14M/month.  If that rate continued,
> there'd be zero growth in about 25 months with a wordlist somewhere
> around 100M.  I believe the slowing of growth is due to the fact that
> many of the tokens are already in there and now they're mostly just
> getting their counts incremented.  I'd imagine that growth will
> eventually become asymptotic to some upper limit, which can be lowered
> via trimming of hapaxes, reordering the database, and whatnot.  With
> careful pruning, the upper limit may be around 50M.  With excessive
> pruning, perhaps much lower.
> 
> Tom

Hi Tom,

Wordlist growth is affected by several factors, notably additions to the
wordlist and whether it's "packed" or not.

Running "bogoutil -d wordlist.db | bogoutil -l wordlist.db.new" will
"pack" the wordlist.  The new list can easily be 1/3 smaller than the
old.

Bogofilter stores its wordlist in a Berkeley DB btree.  Under the best
circumstances, the records are stored alphabetically in consecutive
blocks with very little wasted space per block.  The dump-and-reload
pipeline above creates such a wordlist.

When a word needs to be added to a full block, the block is split in two
so there's a place for the new entry.  The splitting leaves room for
additional entries in the new blocks.  Adding another word that's
alphabetically close will use the newly available space.  

In the worst case, the blocks of the wordlist will be only about 50%
full.  A wordlist that's close to this level of inefficiency has a lot
of free space inside its blocks, so many words can be added without any
apparent change in file size.
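
To make the split behaviour concrete, here is a toy model in Python.  It
is only a sketch under simplifying assumptions, not Berkeley DB's actual
code: every leaf block holds a fixed number of keys (CAPACITY is a
made-up figure, real pages hold variable-size records), a full block is
always split exactly in half, and only leaf blocks are counted.  All
function names are hypothetical.

```python
import bisect
import random

CAPACITY = 100  # keys per leaf block (hypothetical, for illustration only)


def packed_load(keys, capacity=CAPACITY):
    """Build leaf blocks from sorted keys, filling each block completely."""
    keys = sorted(keys)
    return [keys[i:i + capacity] for i in range(0, len(keys), capacity)]


def insert(blocks, key, capacity=CAPACITY):
    """Insert a key into the block covering it; split a full block in half."""
    if not blocks:
        blocks.append([key])
        return
    # Pick the last block whose first key is <= key (or the first block).
    firsts = [block[0] for block in blocks]
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    if len(blocks[i]) >= capacity:
        block = blocks[i]
        mid = len(block) // 2
        blocks[i:i + 1] = [block[:mid], block[mid:]]
        if key >= blocks[i + 1][0]:
            i += 1
    bisect.insort(blocks[i], key)


def fill(blocks, capacity=CAPACITY):
    """Fraction of the allocated key slots that are actually used."""
    return sum(len(block) for block in blocks) / (len(blocks) * capacity)


def inserts_until_growth(blocks, new_keys, capacity=CAPACITY):
    """Count how many keys fit before the number of blocks grows."""
    start = len(blocks)
    count = 0
    for key in new_keys:
        insert(blocks, key, capacity)
        if len(blocks) > start:
            break
        count += 1
    return count


def demo(seed=0, n_words=20_000, capacity=CAPACITY):
    rng = random.Random(seed)
    numbers = rng.sample(range(10_000_000), 2 * n_words)
    words = [f"w{n:08d}" for n in numbers]  # fixed-width fake "words"
    initial, later = words[:n_words], words[n_words:]

    blocks = packed_load(initial, capacity)
    packed_fill = fill(blocks, capacity)

    # Worst case: one new word lands in every full block, so each one splits.
    for key in [block[0] + "a" for block in blocks]:
        insert(blocks, key, capacity)
    split_fill = fill(blocks, capacity)

    # Now add random new words until the block count changes again.
    free_inserts = inserts_until_growth(blocks, later, capacity)
    return packed_fill, split_fill, free_inserts


if __name__ == "__main__":
    packed_fill, split_fill, free_inserts = demo()
    print(f"fill after packing:       {packed_fill:.3f}")
    print(f"fill after every split:   {split_fill:.3f}")
    print(f"inserts with no growth:   {free_inserts}")
```

In this model the packed list is 100% full, one unlucky insert per block
drops it to just over 50%, and after that a run of new words goes in
without the block count changing at all.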

Looking at my wordlist size, it's grown from 73MB to 84MB since the
first of this month.  Packing reduces it to 51MB.
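
As a rough check on those numbers, the arithmetic can be written out as
below.  The "fill" estimate assumes a freshly packed list has completely
full blocks, which is only approximately true; the function name is made
up for this example.

```python
def packing_stats(current_mb, packed_mb):
    """Return (fraction saved by packing, estimated average block fill).

    Assumes the packed file is close to 100% full, so the ratio of
    packed size to current size approximates the current fill level.
    """
    fill_estimate = packed_mb / current_mb
    saved = 1 - fill_estimate
    return saved, fill_estimate


if __name__ == "__main__":
    saved, fill_estimate = packing_stats(current_mb=84, packed_mb=51)
    print(f"packing saves about {saved:.0%} of the file")
    print(f"blocks are roughly {fill_estimate:.0%} full before packing")
```

On that estimate my list sits between the packed best case and the 50%
worst case.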

I think a large part of the decreased rate of growth comes from earlier
block splits: new words are now filling in the partially empty blocks
those splits left behind, which doesn't make the file any bigger.

Regards,

David


