t.bulkmode problem

Tom Anderson tanderso at oac-design.com
Wed Nov 24 06:28:02 CET 2004


On Tue, 2004-11-23 at 18:38, David Relson wrote:
> On Tue, 23 Nov 2004 15:39:25 +0100
> Matthias Andree wrote:
> 
> > What you suggest would mean that we:
> > 
> > a. read all tokens and sort the list
> > b. open the first wordlist/environment for reading
> > c. gather spam/ham probabilities for all tokens listed in that list
> >    and delete them from the sorted list
> > d. close the wordlist
> > e. repeat b - d for subsequent wordlists until the list is exhausted.
> > f. if -u mode is effective, re-open first ("default") wordlist for
> > update
> > 
> First, when scoring multiple messages, there will be multiple database
> opens and closes.  This will affect performance, though the amount may
> be insignificant.  Also, there would need to be a check of one database
> vs several so single wordlists wouldn't suffer. 

I'd imagine this would be easily handled in step "e" above.  An
exhausted list would result in ending the loop, so a single database
would just use one pass of b - d.
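
Something like this is what I have in mind for steps a-e (just a rough
sketch -- the wordlist_open/lookup/close helpers and struct names are
made up for illustration, not the actual bogofilter API; the token
array is assumed to be already sorted, per step a):

#include <stdbool.h>
#include <stddef.h>

typedef struct token {
    const char *text;
    unsigned    spam_count;
    unsigned    ham_count;
    bool        resolved;   /* true once counts were found in some list */
} token_t;

extern void *wordlist_open(const char *path);                /* assumed */
extern bool  wordlist_lookup(void *db, const char *text,
                             unsigned *spam, unsigned *ham); /* assumed */
extern void  wordlist_close(void *db);                       /* assumed */

static void gather_counts(token_t *tokens, size_t ntokens,
                          const char *const *lists, size_t nlists)
{
    size_t remaining = ntokens;
    size_t l, i;

    for (l = 0; l < nlists && remaining > 0; l++) {
        void *db = wordlist_open(lists[l]);           /* step b */

        for (i = 0; i < ntokens; i++) {               /* step c */
            if (tokens[i].resolved)
                continue;                             /* already found */
            if (wordlist_lookup(db, tokens[i].text,
                                &tokens[i].spam_count,
                                &tokens[i].ham_count)) {
                tokens[i].resolved = true;            /* "delete" from list */
                remaining--;
            }
        }

        wordlist_close(db);                           /* step d */
    }                                                 /* step e: next list */

    /* step f: in -u mode, re-open lists[0] for update here */
}

With one wordlist, the outer loop runs exactly once and every token
resolves on that pass, so the single-wordlist case shouldn't suffer.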

> Second, extra status info would be needed.  Given multiple wordlists,
> there are multiple passes over the data.  For the second and subsequent
> passes, bogofilter will need to check whether it has the info it needs
> for a token (hence can avoid an extra database lookup).

Each token in the sorted tree has a corresponding spamicity, right? 
What's the default?  The "extra" info could simply be a variation from
the default.  If the default is null, then a non-null value indicates
that no further lookup is needed for that token.  No extra information
is required, just some extra logic on the existing info.
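
For example (again just a sketch with made-up names, using a sentinel
spamicity to stand in for "null"):

/* A sentinel default spamicity means "not looked up yet". */
#define SPAMICITY_UNSET (-1.0)   /* impossible probability == "null" */

typedef struct {
    const char *text;
    double      spamicity;       /* initialized to SPAMICITY_UNSET */
} tok_t;

/* Second and later passes skip any token whose spamicity has moved
 * off the default; only unresolved tokens hit the database. */
static int needs_lookup(const tok_t *t)
{
    return t->spamicity == SPAMICITY_UNSET;
}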

> It adds complication, but _might_ be simpler than complex database
> environment/locking code.

Well, I'll leave you guys to figure that out.  I'm glad my suggestion
was useful.

Tom
