New "register if needed" feature?

Randall Nortman bogofilterlist at wonderclown.com
Fri Jul 29 14:53:53 CEST 2005


I'm considering implementing a new feature in bogofilter, but I
figured I'd run it by you folks first.  Apologies if this has been
discussed before; a quick scan through the archives didn't turn
anything up.

Something my scripts do a lot is run a message (or a whole mailbox)
known to be either ham or spam through bogofilter for classification
and then, if it is misclassified, register that message as ham or
spam, as appropriate.  This is the basic "train on error" process.
Doing this in a script means invoking bogofilter at least once for
each message, and twice on error, so on a large mailbox the process
startup costs seem to dominate performance.

If bogofilter had a mode that classified each message and then
registered only the misclassified ones, I think it would be a major
performance boost.  Obviously, you'd want to be able to combine this
with the -M, -b, or -B flag, so that bogofilter only needs to be
invoked once for a large collection of messages.  It might work like
this:

bogofilter -M -s --if-misclassified < spam.mbox
bogofilter -M -n --if-misclassified < ham.mbox

The exit code and/or output should indicate the number of messages
that were registered (and perhaps the message IDs/numbers/filenames,
if a -v flag is given) so that a script can continue this process
until nothing is misclassified.  I suppose bogofilter could handle the
iterate-until-perfect part as well, but that, I think, would be going
too far.
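
As a sketch of the script side of this, the outer loop could look
something like the following.  It assumes some command that performs
one training pass and prints the number of messages it registered on
stdout; no such bogofilter mode exists today, so that command is a
stand-in for the proposed feature, and the function name, the ROUNDS
variable and the round limit are all invented for the example.

```shell
#!/bin/sh
# train_until_clean MAX_ROUNDS COMMAND [ARG...]
#   Runs COMMAND repeatedly. COMMAND is a hypothetical training pass that
#   must print the number of messages it registered on stdout.
#   Returns 0 once a pass registers nothing, 1 if MAX_ROUNDS passes were
#   not enough, 2 if COMMAND itself failed.
#   Leaves the number of passes that were run in ROUNDS.
train_until_clean() {
    max_rounds=$1
    shift
    ROUNDS=0
    while [ "$ROUNDS" -lt "$max_rounds" ]; do
        ROUNDS=$((ROUNDS + 1))
        count=$("$@") || return 2
        echo "pass $ROUNDS: $count message(s) registered" >&2
        if [ "$count" -eq 0 ]; then
            return 0    # nothing was misclassified in this pass
        fi
    done
    return 1            # gave up: the round limit was reached
}
```

The round limit is there because nothing guarantees that the process
converges; a mailbox containing near-identical messages in both ham
and spam could keep flipping indefinitely.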

I haven't peeked inside the bogofilter code at all, so I don't know
how difficult this would be to implement.  I also wonder if you (the
maintainers) would consider it feature bloat, since this sort of thing
is already handled by scripts (e.g.  contrib/bogominitrain.pl).  I
notice that a libbogofilter.a gets built as part of bogofilter; if
this has all the necessary core functions (mailbox reading, tokenize,
classify, train), then it should be pretty easy to create a separate
executable that handles this sort of training, to avoid bloat in the
main executable.  In that case, it would be like writing a script to
do it, except in C, and without spawning sub-processes.  It could be
distributed in contrib.

Comments?  Opinions?  Flames?

Randall


