massive disk space leak vs thresh_update
Tom Allison
tallison at tacocat.net
Sat Dec 11 13:50:22 CET 2004
David Relson wrote:
> Greetings,
>
> Debian bug #284452 brings up a problem of high disk usage when
> using '-u' (autoupdate) with 0.93.x versions of bogofilter. When traffic
> is high, '-u' causes many wordlist updates and that causes lots of
> logfiles to be created.
>
<snip>
>
> As thresh_update only affects folks using '-u' and as it has distinct
> benefits, I've been thinking that "thresh_update=0.01" should become
> part of bogofilter's default configuration.
>
I would leave it alone. Bogofilter isn't broken. Don't try to fix
anything.
When a bogofilter wordlist is new, you can use every input you can get
to shore up the accuracy of your filtering. I think running with '-u'
and no thresh_update is important for at least the first 1,000 emails.
That gets the filter to high effectiveness ASAP.
I would rather suggest addressing the potential problem inherent in
the database. You have to routinely clean up the logs and manage the
database environment much more than you did in the past. That's the
nature of the beast. The alternative would be to disable transactions
entirely and use something like procmail lock files to manage things.
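As a rough illustration of the lock-file alternative, the sketch below serializes writers with the lockfile(1) program that ships with procmail. The function name with_lock, the retry count, and the lock path in the usage note are assumptions made for the example, not anything bogofilter provides.

```shell
#!/bin/sh
# Sketch: run a command while holding a procmail-style lock file,
# so that only one process writes to the wordlist at a time.

with_lock() {
    wl_lock=$1
    shift

    # lockfile(1) comes with procmail; -r 10 gives up after ten retries.
    # If the lock cannot be taken, do not run the command at all.
    lockfile -r 10 "$wl_lock" || return 1

    # Run the command and remember its exit status.
    wl_status=0
    "$@" || wl_status=$?

    # Release the lock even if the command failed.
    rm -f "$wl_lock"
    return "$wl_status"
}
```

A hypothetical call would look like `with_lock "$HOME/.bogofilter/update.lock" bogofilter -s < message`, where the lock path is only an example.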
Now, if you turn on '-u' and thresh_update you should greatly reduce the
amount of writing you do to the database anyways. Personally, I only
train on error now and since this is done on a periodic cronjob, I don't
need transactional support at all. There's only one writer at a time.
Rather than trying to adjust the bogofilter fundamentals to accommodate a
database environment, I think it would be more effective to use the
database tools available to properly manage that database environment to
begin with. I fear you are starting down a slope of adjusting bogofilter
to make up for shortcomings not in the database, but in a database
environment that is not properly cared for.
The decision was made to use database transactions. Along with that
comes the necessity (believe me I know!) of learning how to properly
manage that environment. If you were to switch to PostgreSQL or Oracle
you would still have to contend with database maintenance, which
would have nothing to do with bogofilter.
Learning the hard way to manage the new database has caused me great
anguish. What helped me most was education. Better docs would help. I
would go so far as to give some primitive examples of management
strategies.
For example:
I still do a bogoutil -d periodically to dump my wordlist into a text
file. But I do it much less often since I'm only training on error.
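A periodic dump along these lines could be scripted roughly as follows. This is only a sketch: the function name dump_wordlist, the dated file naming, and the example paths are assumptions, and the wordlist location depends on your setup.

```shell
#!/bin/sh
# Sketch: dump a bogofilter wordlist to a dated text file with bogoutil -d.
# Paths are assumptions; adjust to where your wordlist actually lives.

dump_wordlist() {
    dw_wordlist=$1       # e.g. "$HOME/.bogofilter/wordlist.db" (assumed location)
    dw_backup_dir=$2     # directory that holds the text dumps
    dw_out="$dw_backup_dir/wordlist-$(date +%Y%m%d).txt"

    mkdir -p "$dw_backup_dir" || return 1

    # Write to a temporary file first, so a failed dump never replaces
    # an earlier good dump with a truncated one.
    if bogoutil -d "$dw_wordlist" > "$dw_out.tmp"; then
        mv "$dw_out.tmp" "$dw_out"
    else
        rm -f "$dw_out.tmp"
        return 1
    fi
}
```

Called as, say, `dump_wordlist "$HOME/.bogofilter/wordlist.db" "$HOME/bogofilter-dumps"` from a cron job. A text dump like this can be loaded back into a fresh wordlist with `bogoutil -l`.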
I've also added a crontab for 'db_archive -d' to clean up those logs. I
suppose this could be better refined to: db_verify && db_archive to keep
things from getting really ugly.
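The "db_verify && db_archive" idea might look something like the sketch below. The function name clean_db_logs and the example directory are assumptions. It also assumes nothing is writing to the wordlist while the check runs, and on some systems the Berkeley DB utilities may be installed under version-suffixed names.

```shell
#!/bin/sh
# Sketch: verify the wordlist, then remove Berkeley DB log files
# that are no longer needed. Directory and file names are assumptions.

clean_db_logs() {
    cl_home=$1           # database environment, e.g. "$HOME/.bogofilter"
    cl_file=$2           # wordlist file inside it, e.g. "wordlist.db"

    # Only prune logs if the database verifies cleanly; otherwise keep
    # them, since they may be needed for recovery.
    if db_verify -h "$cl_home" "$cl_file"; then
        db_archive -d -h "$cl_home"
    else
        echo "db_verify failed for $cl_home/$cl_file; logs left in place" >&2
        return 1
    fi
}
```

A crontab entry could then run a small script that calls `clean_db_logs "$HOME/.bogofilter" wordlist.db` at whatever interval keeps the log directory manageable.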
In a way, I see much of this as a discussion similar to postgres VACUUM
requirements. Do it when you need to; if you never do it, don't expect
much performance.