Database maintenance with combined wordlist.
David Relson
relson at osagesoftware.com
Sat Sep 27 14:53:23 CEST 2003
On Fri, 26 Sep 2003 21:00:23 -0700
"Greg McCann" <greg at cambria.com> wrote:
> On 9/26/2003 at 6:22 PM David Relson <relson at osagesoftware.com> wrote:
>
> >An interesting idea -- one that never occurred to me. I don't know
> >if what you suggest is possible. With separate wordlists, a token
> >common to both wordlists has two timestamps. With the combined
> >wordlist, there's only one.
>
> Thanks for clearing that up, David. I wasn't quite sure how that
> worked - whether the database contained a record for each occurrence
> of each token, or just one record per token, containing the total
> number of occurrences and the date of the most recent one. So pruning
> the database with "-a30 -m" does not subtract everything older than 30
> days from the totals - it just deletes tokens that have not had any
> activity in the past 30 days?
>
> I'll have to reconsider how I am maintaining the database, since I
> think I misunderstood how it operates.
>
> >After a year of running bogofilter, my wordlist.db is 41MB,
> >containing 830,000 tokens, and 260,000 messages...
>
> Doesn't having that large a database put a drag on your mail
> server? Bogofilter has to read 41MB every time you scan a new email,
> doesn't it?
>
>
> Greg
Not a problem :-)
BerkeleyDB is smart enough that it doesn't have to read the whole
database in order to look up the couple hundred tokens that might be
found in a typical message. Also, with Linux's caching, recently read
pages are likely to still be in memory.
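
To give a rough feel for why keyed lookups are cheap, here is a small
Python sketch. It is not BerkeleyDB's code and does not touch a real
wordlist.db; it assumes only that the keys are kept in sorted order, and
it uses a plain binary search as a stand-in for an indexed lookup. The
token names and the helper probes_for_lookup are made up for the
illustration.

```python
def probes_for_lookup(sorted_keys, key):
    """Binary search; return (found, number of keys examined)."""
    lo, hi = 0, len(sorted_keys)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == key:
            return True, probes
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return False, probes


# Hypothetical wordlist with 830,000 token keys (the count quoted above).
# Zero-padded names keep the list in sorted order.
tokens = [f"tok{i:07d}" for i in range(830_000)]

# A made-up message containing 200 distinct tokens from the wordlist.
message = tokens[::4150][:200]

results = [probes_for_lookup(tokens, t) for t in message]
total = sum(probes for _, probes in results)
print(f"{total} keys examined for {len(message)} lookups "
      f"in a list of {len(tokens)} keys")
```

Each lookup examines at most 20 keys, so 200 lookups touch a few
thousand entries rather than all 830,000. A real database reads whole
pages rather than single keys, so the actual numbers differ, but the
idea is the same.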