t.bulkmode problem

David Relson relson at osagesoftware.com
Wed Nov 24 00:38:46 CET 2004


On Tue, 23 Nov 2004 15:39:25 +0100
Matthias Andree wrote:

> Tom Anderson <tanderso at oac-design.com> writes:
> 
> > I haven't really been following this too much, but I hope my
> > suggestion is useful.  Would it be possible to access separate
> > environments sequentially instead of concurrently, and would that
> > solve the multiple locking problem?  In other words, do the changes
> > in multiple environments need to be a single atomic transaction, or
> > could it be split into one atomic transaction per environment? 
> > Maybe even have the program call itself recursively?
> 
> Good idea, thank you.
> 
> The general problem is, in short:
> 
> 1. a transactional or concurrent database file needs a _writable_
>    environment (__db.*, log.*, lockfile-*) even for read-only access
> 
> 2. we need to read a token and the corresponding .MSG_COUNT token in the
>    same transaction, or with a database that cannot change.
> 
> 3. we are currently reading all tokens, sorting them lexicographically
>    (to profit from B-whatever-tree locality of lexicographically
>    short-distance tokens, with proven significant benefits)
>    and then for each token trying all lists in order of their
>    preference. Lists at same preference are accessed in order of
>    appearance in the configuration, don't ask me if forward or
>    reverse, I haven't checked.
> 
> What you suggest would mean that we:
> 
> a. read all tokens and sort the list
> b. open the first wordlist/environment for reading
> c. gather spam/ham probabilities for all tokens listed in that list
>    and delete them from the sorted list
> d. close the wordlist
> e. repeat b - d for subsequent wordlists until the list is exhausted.
> f. if -u mode is effective, re-open first ("default") wordlist for update
> 
> This should be doable, but I cannot yet estimate if that would be more
> or less effort than finishing multiple-environment support.
> 
> For full multiple-environment support, the locking scheme will have to
> be rewritten to some extent; it currently supports exactly one
> environment per process.
> 
> David, your opinion is also solicited :)

Tom & Matthias,

It could be done.  I foresee two drawbacks.  

First, when scoring multiple messages, there will be multiple database
opens and closes.  This will affect performance, though the amount may
be insignificant.  Also, the code would need to check whether one or
several databases are configured, so single-wordlist setups wouldn't suffer.

Second, extra status info would be needed.  Given multiple wordlists,
there are multiple passes over the data.  For the second and subsequent
passes, bogofilter will need to check whether it has the info it needs
for a token (hence can avoid an extra database lookup).
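
A rough sketch of the sequential approach with that status info, in
Python rather than bogofilter's C to keep it short.  Everything here is
hypothetical: the Wordlist class is an in-memory stand-in for a database
environment, lookup_sequential is an invented name, and it assumes the
first wordlist (in preference order) that knows a token wins.

```python
from contextlib import contextmanager


class Wordlist:
    """Hypothetical stand-in for one wordlist in its own environment."""

    def __init__(self, name, counts):
        self.name = name
        self.counts = counts  # token -> (spam_count, ham_count)
        self.opens = 0        # how often this list has been opened

    @contextmanager
    def opened(self):
        # Models "open the environment, read, close it again".
        self.opens += 1
        try:
            yield self.counts
        finally:
            pass  # a real implementation would close the database here


def lookup_sequential(tokens, wordlists):
    """Visit wordlists one at a time, in order of preference."""
    pending = sorted(set(tokens))  # sorted token list, duplicates dropped
    found = {}                     # status info: token -> (spam, ham)
    for wl in wordlists:
        if not pending:
            break                  # everything resolved, skip later opens
        with wl.opened() as counts:
            still_pending = []
            for tok in pending:
                if tok in counts:
                    found[tok] = counts[tok]
                else:
                    still_pending.append(tok)
        pending = still_pending    # later passes only see unresolved tokens
    return found, pending


user = Wordlist("user", {"free": (9, 1), "viagra": (20, 0)})
system = Wordlist("system", {"free": (100, 50), "hello": (3, 30)})
found, missing = lookup_sequential(["hello", "free", "zzz", "free"],
                                   [user, system])
```

The pending list is the extra status info: a token resolved in an
earlier pass is never looked up again, and once the list is empty the
remaining wordlists are not opened at all.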

It adds complication, but _might_ be simpler than the complex database
environment/locking code.

David


