t.bulkmode problem
Tom Anderson
tanderso at oac-design.com
Wed Nov 24 06:28:02 CET 2004
On Tue, 2004-11-23 at 18:38, David Relson wrote:
> On Tue, 23 Nov 2004 15:39:25 +0100
> Matthias Andree wrote:
>
> > What you suggest would mean that we:
> >
> > a. read all tokens and sort the list
> > b. open the first wordlist/environment for reading
> > c. gather spam/ham probabilities for all tokens listed in that list
> > and delete them from the sorted list
> > d. close the wordlist
> > e. repeat b - d for subsequent wordlists until the list is exhausted.
> > f. if -u mode is effective, re-open first ("default") wordlist for
> > update
> >
> First, when scoring multiple messages, there will be multiple database
> opens and closes. This will affect performance, though the amount may
> be insignificant. Also, bogofilter would need to check whether there is
> one database or several, so that single-wordlist setups don't suffer.
I'd imagine this would be easily handled in step "e" above. An
exhausted list would result in ending the loop, so a single database
would just use one pass of b - d.
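A minimal sketch in C of steps a-e, including the early exit Tom describes: each wordlist is consulted in turn for tokens still unresolved, and the loop ends as soon as the sorted list is exhausted, so a single wordlist naturally gets one pass. All names and data structures here are hypothetical, not bogofilter's actual ones, and the "wordlists" are in-memory stand-ins for the real databases.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of steps a-e: tokens are collected and sorted,
 * then each wordlist is opened in turn and consulted only for tokens
 * still unresolved; the loop ends early once every token has counts. */

struct token {
    const char *text;
    int spam_count, ham_count;
    int resolved;            /* non-zero once some wordlist supplied counts */
};

/* A toy "wordlist": parallel arrays of words and counts. */
struct wordlist {
    const char **words;
    const int *spam, *ham;
    int n;
};

/* Step c: gather counts for every still-unresolved token from one list.
 * Returns the number of tokens still unresolved afterwards. */
static int gather(struct token *toks, int ntoks, const struct wordlist *wl)
{
    int pending = 0;
    for (int i = 0; i < ntoks; i++) {
        if (toks[i].resolved)
            continue;               /* "deleted" from the sorted list */
        for (int j = 0; j < wl->n; j++) {
            if (strcmp(toks[i].text, wl->words[j]) == 0) {
                toks[i].spam_count = wl->spam[j];
                toks[i].ham_count  = wl->ham[j];
                toks[i].resolved   = 1;
                break;
            }
        }
        if (!toks[i].resolved)
            pending++;
    }
    return pending;
}

/* Steps b-e: open each wordlist in turn; stop when the list is exhausted.
 * Returns how many wordlists were actually consulted. */
int score_tokens(struct token *toks, int ntoks,
                 const struct wordlist *lists, int nlists)
{
    int opens = 0;
    for (int k = 0; k < nlists; k++) {
        opens++;                    /* steps b/d: open, read, close */
        if (gather(toks, ntoks, &lists[k]) == 0)
            break;                  /* step e: sorted list exhausted */
    }
    return opens;
}
```

With this shape, the single-database case needs no separate check: the first pass resolves everything and the loop simply ends with one open.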
> Second, extra status info would be needed. Given multiple wordlists,
> there are multiple passes over the data. For the second and subsequent
> passes, bogofilter will need to check whether it has the info it needs
> for a token (hence can avoid an extra database lookup).
Each token in the sorted tree has a corresponding spamicity, right?
What's the default? The "extra" info could simply be variation from the
default. If the default is null, then a non-null value indicates that no
further lookup is needed for that token. No extra information is
required, just some extra logic on the existing info.
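A sketch of that idea, assuming spamicity is stored as a double with NAN as the "null" default (a hypothetical representation, not necessarily bogofilter's): later passes can tell from the existing field alone whether a token still needs a database lookup.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical sketch: NAN serves as the "null" default spamicity, so a
 * second or subsequent pass can decide, from the existing field alone,
 * whether a token already has counts and the lookup can be skipped. */

struct tree_node {
    double spamicity;   /* NAN = not yet looked up in any wordlist */
};

static int needs_lookup(const struct tree_node *n)
{
    /* Null (NAN) default => no value gathered yet => lookup still needed. */
    return isnan(n->spamicity);
}
```

No extra field is added to the node; the check piggybacks on the value that has to be stored anyway.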
> It adds complication, but _might_ be simpler than complex database
> environment/locking code.
Well, I'll leave you guys to figure that out. I'm glad my suggestion
was useful.
Tom
More information about the bogofilter-dev mailing list