Database maintenance with combined wordlist.

David Relson relson at osagesoftware.com
Sat Sep 27 00:22:37 CEST 2003


On Fri, 26 Sep 2003 14:46:25 -0700
"Greg McCann" <greg at cambria.com> wrote:

> Greetings,
> 
> I recently converted from using separate spamlist.db/goodlist.db
> wordlists to using the combined wordlist, wordlist.db.  (I am now
> running bogofilter 0.15.4.)
> 
> A useful feature of the separate wordlists is that I could run a daily
> cron job to prune old entries, with different expiration times for
> each list.  Here is my cron job - bogomaint.cron:
> 
> # remove records older than 30 days from spamlist.db
> /usr/local/bin/bogoutil -a30 -m /home/bogofilter/spamlist.db
> # remove records older than 60 days from goodlist.db
> /usr/local/bin/bogoutil -a60 -m /home/bogofilter/goodlist.db
> 
> I do this to keep my wordlists fresh and to keep them from growing too
> large.  The reason I used different expiration times is that I have
> far more spam than ham in my wordlists.  I want to keep the wordlists
> roughly the same size, though even allowing ham to age twice as long
> as spam, the spam list was still twice as large as the ham list.
> 
> However with the combined wordlist, there does not seem to be any way
> to specify different expiration times for spam and ham.  Is there a
> feature I am not aware of that would allow me to do this?  If not,
> would it be possible/practical/reasonable to provide one?  In addition
> to the other selection criteria (-a, -c, -s), it would be nice if I
> could specify whether a maintenance action applied only to spam or
> only to ham (with "both" being the default).
> 
> 
> Greg McCann

Greg,

An interesting idea -- one that never occurred to me.  I don't know if
what you suggest is possible.  With separate wordlists, a token common
to both wordlists has two timestamps.  With the combined wordlist,
there's only one per token, so the spam and ham counts can't be aged
independently.

Have you considered using 'bogoutil -d' to create a text file, operating
on it with a shell script, and then using 'bogoutil -l' to create a new
database?  This would give you the maximum flexibility.
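
A minimal sketch of what such a script might look like, not a tested
recipe.  It assumes each dumped line has the form "token spamcount
goodcount YYYYMMDD" and that internal records (for example a message
count record) have tokens starting with a dot; check both against the
output of your own 'bogoutil -d' first.  The function name prune_dump
and the "more spam hits than ham hits" rule for picking a cutoff are
made up for illustration, since the combined list has only one date
per token.

```shell
#!/bin/sh
# prune_dump SPAM_CUTOFF HAM_CUTOFF   (hypothetical helper, not part of bogofilter)
# Reads a wordlist dump on stdin and writes the records to keep on stdout.
# ASSUMED line format: token spamcount goodcount YYYYMMDD
prune_dump() {
    awk -v spam_cutoff="$1" -v ham_cutoff="$2" '
        # Keep internal records (token begins with a dot) untouched.
        substr($1, 1, 1) == "." { print; next }
        # Pass through anything that does not have the assumed four fields.
        NF < 4 { print; next }
        {
            spam = $(NF - 2) + 0
            good = $(NF - 1) + 0
            # Heuristic (an assumption): tokens seen more often in spam
            # age out on the spam schedule, all others on the ham schedule.
            cutoff = (spam > good) ? spam_cutoff : ham_cutoff
            # Dates are YYYYMMDD, so numeric comparison orders them correctly.
            if ($NF + 0 >= cutoff + 0) print
        }
    '
}

# Possible usage (paths are hypothetical; "date -d" is GNU-specific):
#   spam_cutoff=$(date -d '30 days ago' +%Y%m%d)
#   ham_cutoff=$(date -d '60 days ago' +%Y%m%d)
#   bogoutil -d /home/bogofilter/wordlist.db \
#       | prune_dump "$spam_cutoff" "$ham_cutoff" > wordlist.txt
#   bogoutil -l /home/bogofilter/wordlist.new.db < wordlist.txt
```

Loading into a new file rather than over the original leaves the old
database in place until the result has been checked.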

After a year of running bogofilter, my wordlist.db is 41MB, containing
830,000 tokens from 260,000 messages.  When I run "bogoutil -d |
bogoutil -l" the database size shrinks to approx 30MB.  I've not yet
felt the need for maintenance.  Of course your numbers will vary, as
your usage is different.

HTH,

David



