massive disk space leak vs thresh_update

Tom Allison tallison at tacocat.net
Sat Dec 11 13:50:22 CET 2004


David Relson wrote:
> Greetings,
> 
> Debian bug #284452 brings up a problem of high disk usage when
> using '-u' (autoupdate) with 0.93.x versions of bogofilter.  When traffic
> is high, '-u' causes many wordlist updates and that causes lots of
> logfiles to be created.
> 


<snip>


> 
> As thresh_update only affects folks using '-u' and as it has distinct
> benefits, I've been thinking that "thresh_update=0.01" should become
> part of bogofilter's default configuration.
> 

I would leave it alone.  Bogofilter isn't broken.  Don't try to fix 
anything.

When a bogofilter wordlist is new, you can use every input you can get 
to shore up the accuracy of your filtering.  I think running everything 
with '-u' and no thresh_update is important for at least the first 
1,000 emails.  This brings about high effectiveness ASAP.

I would rather suggest addressing the potential problem inherent in 
the database.  You have to routinely clean up the logs and manage the 
database environment much more than you did in the past.  That's the 
nature of the beast.  The alternative would be to disable transactions 
entirely and use something like procmail lockfiles to manage things.

Now, if you turn on '-u' and thresh_update you should greatly reduce the 
amount of writing you do to the database anyway.  Personally, I only 
train on error now, and since this is done from a periodic cron job, I 
don't need transactional support at all.  There's only one writer at a time.
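
For reference, a minimal sketch of the setting under discussion as it 
might appear in a bogofilter configuration file.  The value 0.01 is the 
one David proposes above; the comment describes the option as it is 
commonly documented (skip the '-u' update when the score is within 
thresh_update of 0.0 or 1.0) and should be checked against the 
bogofilter.cf.example shipped with your version.  The file location is 
an assumption.

```
# ~/.bogofilter.cf (assumed location; a system-wide file may be used instead)
# Only relevant when bogofilter is run with '-u' (autoupdate).
# Assumption: a score within 0.01 of 0.0 or 1.0 counts as "very sure",
# and the wordlist update is skipped for such messages.
thresh_update=0.01
```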

Rather than trying to adjust the bogofilter fundamentals to accommodate a 
database environment, I think it would be more effective to use the 
database tools available to properly manage that environment to begin 
with.  I fear you are starting down a slope of adjusting bogofilter to 
make up for shortcomings that are not in the database, but in a database 
environment that is not properly cared for.

The decision was made to use database transactions.  Along with that 
comes the necessity (believe me I know!) of learning how to properly 
manage that environment.  If you were to switch to postgresql or oracle 
you would still have to contend with the database maintenance, which 
would have nothing to do with bogofilter.

Learning the hard way to manage the new database has caused me great 
anguish, but what helped me most was education.  Better docs would 
help.  I would go so far as to give some primitive examples of 
management strategies.

For example:
I still do a bogoutil -d periodically to dump my wordlist into a text 
file.  But I do it much less since I'm only training on error.
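
A minimal sketch of such a periodic dump, written as a shell function 
so it can be called from a cron script.  The function name, the backup 
directory and the wordlist path in the comments are assumptions for 
illustration; only 'bogoutil -d' comes from the description above.

```shell
#!/bin/sh
# Sketch: dump a bogofilter wordlist to a dated text file.
# dump_wordlist is a hypothetical helper name, not part of bogofilter.

dump_wordlist() {
    wordlist="$1"      # e.g. $HOME/.bogofilter/wordlist.db (assumed path)
    backup_dir="$2"    # hypothetical directory for the text dumps
    out="$backup_dir/wordlist-$(date +%Y%m%d).txt"

    mkdir -p "$backup_dir" || return 1

    # bogoutil -d writes the wordlist as text on standard output.
    # Write to a temporary name first so a failed dump never replaces
    # a good one.
    if bogoutil -d "$wordlist" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}
```

Called as 'dump_wordlist "$HOME/.bogofilter/wordlist.db" "$HOME/backup"', 
this leaves one text file per day in the backup directory.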

I've also added a crontab entry for 'db_archive -d' to clean up those 
logs.  I suppose this could be refined to: db_verify && db_archive, to 
keep things from getting really ugly.
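
The "db_verify && db_archive" idea can be sketched as a shell function.  
This is only an illustration under stated assumptions: the function 
name and the default file name wordlist.db are hypothetical, the 
database environment is assumed to live in the directory passed as the 
first argument, and some distributions install the Berkeley DB 
utilities under versioned names, in which case the command names need 
adjusting.  db_verify should be run while nothing is writing to the 
database.

```shell
#!/bin/sh
# Sketch: remove unneeded Berkeley DB log files only if the wordlist
# passes verification. clean_logs is a hypothetical helper name.

clean_logs() {
    db_home="$1"                 # database environment directory
    db_file="${2:-wordlist.db}"  # assumed wordlist file name

    # db_verify exits non-zero if the database file fails verification.
    if db_verify -h "$db_home" "$db_file"; then
        # db_archive -d removes log files that are no longer needed.
        db_archive -d -h "$db_home"
    else
        echo "db_verify failed for $db_home/$db_file; logs left in place" >&2
        return 1
    fi
}
```

A cron script could call 'clean_logs "$HOME/.bogofilter"' at whatever 
interval keeps the log directory at a comfortable size.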

In a way, I see much of this as a discussion similar to postgres VACUUM 
requirements.  Do it when you need to; if you never do it, don't expect 
much performance.


