Database maintenance with combined wordlist.

Greg McCann greg at cambria.com
Sat Sep 27 06:00:23 CEST 2003


On 9/26/2003 at 6:22 PM David Relson <relson at osagesoftware.com> wrote:

>An interesting idea -- one that never occurred to me.  I don't know if
>what you suggest is possible.  With separate wordlists, a token common
>to both wordlists has two timestamps.  With the combined wordlist,
>there's only one.

Thanks for clearing that up, David.  I wasn't quite sure how that worked - whether the database contained a record for each occurrence of each token, or just one record per token, holding the total number of occurrences and the date of the most recent one.  So pruning the database with "-a30 -m" does not subtract everything older than 30 days from the totals - it just deletes tokens that have had no activity in the past 30 days?
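
To make the question concrete, here is a minimal sketch of the "one record per token" model described above. It is an illustration only, not bogofilter's actual storage code: the in-memory dict, the record layout, and the helper names register() and prune() are all hypothetical. It assumes each token keeps running spam/ham counts plus a single last-seen date, and that age-based pruning drops whole records rather than subtracting old occurrences.

```python
from datetime import date, timedelta

# Hypothetical stand-in for a combined wordlist:
#   token -> [spam_count, ham_count, last_seen]
# One record per token; only the most recent date is kept.


def register(wordlist, token, is_spam, today):
    """Count one occurrence of a token and refresh its single timestamp."""
    record = wordlist.setdefault(token, [0, 0, today])
    record[0 if is_spam else 1] += 1
    record[2] = today


def prune(wordlist, today, max_age_days):
    """Delete tokens not seen within max_age_days; return the removed tokens.

    Counts of surviving tokens are left untouched, because the record
    does not remember when each individual occurrence happened.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [tok for tok, rec in wordlist.items() if rec[2] < cutoff]
    for tok in stale:
        del wordlist[tok]
    return stale


wordlist = {}
register(wordlist, "refinance", True, date(2003, 8, 1))
register(wordlist, "meeting", False, date(2003, 8, 1))
register(wordlist, "meeting", True, date(2003, 9, 20))

removed = prune(wordlist, date(2003, 9, 27), 30)
print(removed)   # ['refinance']
print(wordlist)  # {'meeting': [1, 1, datetime.date(2003, 9, 20)]}
```

Under these assumptions, "meeting" keeps its August count even though that occurrence is older than 30 days, because the record carries only the latest date. "refinance" disappears entirely, since nothing has touched it since August 1.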

I'll have to reconsider how I am maintaining the database, since I think I misunderstood how it operates.

>After a year of running bogofilter, my wordlist.db is 41MB, containing
>830,000 tokens, and 260,000 messages...

Doesn't having a database that large put a drag on your mail server?  Bogofilter has to read 41MB every time you scan a new email, doesn't it?


Greg