Database maintenance with combined wordlist.

David Relson relson at osagesoftware.com
Sat Sep 27 14:53:23 CEST 2003


On Fri, 26 Sep 2003 21:00:23 -0700
"Greg McCann" <greg at cambria.com> wrote:

> On 9/26/2003 at 6:22 PM David Relson <relson at osagesoftware.com> wrote:
> 
> >An interesting idea -- one that never occurred to me.  I don't know
> >if what you suggest is possible.  With separate wordlists, a token
> >common to both wordlists has two timestamps.  With the combined
> >wordlist, there's only one.
> 
> Thanks for clearing that up, David.  I wasn't quite sure how that
> worked - whether the database contained a record for each occurrence
> of each token, or just one record per token, containing the total
> number of occurrences and the date of the most recent one.  So pruning
> the database with "-a30 -m" does not subtract everything older than 30
> days from the totals - it just deletes tokens that have not had any
> activity in the past 30 days?
> 
> I'll have to reconsider how I am maintaining the database, since I
> think I misunderstood how it operates.
> 
> >After a year of running bogofilter, my wordlist.db is 41MB,
> >containing 830,000 tokens, and 260,000 messages...
> 
> Doesn't having that large of a database put a drag on your mail
> server?  Bogofilter has to read 41MB every time you scan a new email,
> doesn't it?
> 
> 
> Greg

Not a problem :-)

BerkeleyDB is smart enough that it doesn't have to read the whole
database in order to look up a couple hundred tokens (as might be found
in a typical message).  Also, with Linux's caching, recently read pages
are likely to still be in memory.
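
To put rough numbers on that, here is a toy model in Python.  It is
only a sketch and not BerkeleyDB's actual on-disk format: it assumes
the tokens are kept sorted in fixed-size pages with a small index of
each page's first key, which is roughly the idea behind a B-tree.  The
100 keys per page, the made-up token names, and the function names are
all assumptions chosen for illustration.

```python
import bisect

KEYS_PER_PAGE = 100  # assumed page capacity, not a BerkeleyDB figure


def build_pages(sorted_keys, keys_per_page=KEYS_PER_PAGE):
    """Split sorted keys into pages and index each page by its first key."""
    pages = [sorted_keys[i:i + keys_per_page]
             for i in range(0, len(sorted_keys), keys_per_page)]
    index = [page[0] for page in pages]
    return pages, index


def lookup(pages, index, key, touched):
    """Find key by reading one page only; record that page in `touched`."""
    page_no = bisect.bisect_right(index, key) - 1
    if page_no < 0:
        return False  # key sorts before the first page
    touched.add(page_no)
    page = pages[page_no]
    pos = bisect.bisect_left(page, key)
    return pos < len(page) and page[pos] == key


def pages_touched(total_tokens, message_tokens):
    """Look up `message_tokens` evenly spread keys; count pages read."""
    # Zero-padded names so string order matches numeric order.
    keys = [f"tok{i:07d}" for i in range(total_tokens)]
    pages, index = build_pages(keys)
    touched = set()
    step = total_tokens // message_tokens
    for i in range(message_tokens):
        assert lookup(pages, index, keys[i * step], touched)
    return len(touched), len(pages)


if __name__ == "__main__":
    touched, total = pages_touched(830_000, 200)
    print(f"{touched} of {total} pages touched")
```

Under those assumptions, 200 lookups against 830,000 tokens read at
most 200 of the 8300 pages, and that is the worst case where every
token lands on a different page.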
