New "register if needed" feature?

Fri Jul 29 15:11:35 CEST 2005

Randall Nortman wrote:
> I'm considering implementing a new feature in bogofilter, but I
> figured I'd run it by you folks first.  Apologies if this has been
> discussed before; a quick scan through the archives didn't turn
> anything up.
> 
> Something I find myself (or rather, my scripts) doing a lot is to run
> a message (or a whole mailbox) known to be either ham or spam through
> bogofilter for classification, and then if it is misclassified, to
> register that message as ham or spam, as appropriate.  This is the
> basic "train on error" process.  To do this in a script requires
> bogofilter to be invoked at least once for each message, and twice on
> error, so if this is to be done on a large mailbox, the startup costs
> seem to dominate performance.
> 
> If bogofilter had a mode causing it to classify and then register only
> misclassified messages, I think it would be a major performance boost.

To get real performance gains, this would require some amount of code
rewrite to cache the tokens used for scoring, because "score"
(read-only) and "register" (read-write) are implemented separately, and
we currently just look up a result and forget it right after we've
stuffed it into our (iterative) calculation or after we've updated the
tokens.

Don't get me wrong, the required changes aren't earth-shattering, but
they are non-trivial and certainly not self-contained in a new function
with a set of options added.
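
As a rough illustration of the caching idea (not bogofilter's actual
code), the Python sketch below keeps the counts fetched during scoring so
that a follow-up registration does not look them up again. TokenDB,
score, register and train_on_error are hypothetical names, the word list
is a toy in-memory dict, and the scoring formula is a crude placeholder
rather than bogofilter's real calculation.

```python
class TokenDB:
    """Toy in-memory stand-in for the word list; counts every lookup."""

    def __init__(self):
        self.counts = {}   # token -> [spam_count, ham_count]
        self.lookups = 0   # number of (simulated) database reads

    def get(self, token):
        self.lookups += 1
        spam, ham = self.counts.get(token, (0, 0))
        return spam, ham

    def put(self, token, spam, ham):
        self.counts[token] = [spam, ham]


def score(db, tokens, cache):
    """Return a crude spam score in [0, 1]; remember each lookup in cache."""
    if not tokens:
        return 0.5
    total = 0.0
    for tok in tokens:
        spam, ham = db.get(tok)
        cache[tok] = (spam, ham)          # keep the result for register()
        total += (spam + 0.5) / (spam + ham + 1.0)
    return total / len(tokens)


def register(db, tokens, is_spam, cache):
    """Bump the counts, reusing cached lookups where available."""
    for tok in tokens:
        spam, ham = cache[tok] if tok in cache else db.get(tok)
        if is_spam:
            spam += 1
        else:
            ham += 1
        db.put(tok, spam, ham)


def train_on_error(db, text, is_spam, cutoff=0.5):
    """Score one message; register it only if it was misclassified."""
    tokens = set(text.lower().split())
    cache = {}
    predicted_spam = score(db, tokens, cache) > cutoff
    if predicted_spam != is_spam:
        register(db, tokens, is_spam, cache)
        return True    # message was registered
    return False       # classification was already correct
```

In this toy model a misclassified three-token message costs three
lookups instead of six, because register() never goes back to the
database for tokens that score() has already fetched.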

> Obviously, you'd want to be able to combine this with the -M, -b, or
> -B flag in order to only need to invoke bogofilter once when doing
> this on a large collection of files.  It might work like this:
> 
> bogofilter -M -s --if-misclassified < spam.mbox
> bogofilter -M -n --if-misclassified < ham.mbox
> 
> The exit code and/or output should indicate the number of messages
> that were registered (and perhaps the message IDs/numbers/filenames,
> if a -v flag is given)

This doesn't work well, as the exit code is (1) limited to a small
range (0 to 255 on POSIX, with shells giving special meaning to values
above 125) and (2) such a scheme would overthrow our existing exit code
conventions. Convenient as it would be for this particular purpose, I'd
rather not fiddle with exit codes if I can help it.
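
To illustrate the range problem, here is a small Python sketch, assuming
a POSIX system, that starts a child process exiting with an arbitrary
"count" and returns the status the parent actually sees.
status_seen_by_parent is a hypothetical helper and has nothing to do
with bogofilter itself.

```python
import subprocess
import sys


def status_seen_by_parent(count):
    """Run a child that exits with `count`; return what the parent observes."""
    child = [sys.executable, "-c", f"import sys; sys.exit({count})"]
    return subprocess.run(child).returncode
```

Small values come through unchanged, but on a POSIX system a count of
300 comes back as 44, since only the low 8 bits of the exit status reach
the parent. A count of registered messages would therefore wrap around
silently on any large mailbox.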

> I haven't peeked inside the bogofilter code at all, so I don't know
> how difficult this would be to implement.  I also wonder if you (the
> maintainers) would consider it feature bloat, since this sort of thing
> is already handled by scripts (e.g.  contrib/bogominitrain.pl).  I
> notice that a libbogofilter.a gets built as part of bogofilter; if
> this has all the necessary core functions (mailbox reading, tokenize,
> classify, train), then it should be pretty easy to create a separate
> executable that handles this sort of training, to avoid bloat in the
> main executable.  In that case, it would be like writing a script to
> do it, except in C, and without spawning sub-processes.  It could be
> distributed in contrib.

libbogofilter.a is mostly for maintainer efficiency: we just lump
everything into the archive and let the linker pick the parts that it
needs. It has saved us a lot of Makefile.am tweaking while we tossed
code around in bogofilter.

All in all, I think the feature is doable with reasonable effort, can
bring performance gains if done right, and the update operation might
also benefit. AFAIR, we'd need to cache the per-token results from the
lookups done in score() so that we save the lookups in the register() or
update() process. Whether it's a really big performance gain we'll only
know after implementing it; one might assume that the OS caches the
database pages reasonably well. In that case the real hogs are (1) the
seeks while looking up tokens and (2) the synchronous writes when the
database is updated. Most databases hide the latter nicely in their
journals (the actual database file is updated asynchronously from the
journal, i.e. the log), and we can't do much about the seeks.

It may, however, reduce contention if the system is under memory
pressure (so we get capacity-induced page faults) and runs many
concurrent processes, where script B's score process forces script A's
database pages out of the cache before script A can register the
corrections.

I'm wondering if a separate "run-under-bogofilter-lock" kind of program
might be of more general use and fix this particular performance concern.
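
A minimal sketch of what such a wrapper could look like, in Python and
assuming a POSIX system with flock(): it holds an exclusive advisory
lock on a file while a command runs, so cooperating scripts take turns
instead of evicting each other's database pages. run_under_lock and the
lock path are hypothetical; the lock here is a private one shared only
among the scripts that use the wrapper, not bogofilter's own database
lock.

```python
import fcntl
import subprocess


def run_under_lock(lock_path, argv):
    """Hold an exclusive advisory lock on lock_path while argv runs."""
    with open(lock_path, "a") as lock_file:
        # Blocks until no other wrapper instance holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return subprocess.run(argv).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A training script would then be started through the wrapper, for example
run_under_lock("/tmp/train.lock", ["sh", "train-on-error.sh"]), where
both the path and the script name are placeholders. Its classify and
register passes then run back to back without another wrapped script
getting in between.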

-- 
Matthias Andree