New "register if needed" feature?

David Relson relson at osagesoftware.com
Sat Jul 30 01:27:27 CEST 2005


On Fri, 29 Jul 2005 15:11:35 +0200
Matthias Andree wrote:

> Randall Nortman wrote:
> > I'm considering implementing a new feature in bogofilter, but I
> > figured I'd run it by you folks first.  Apologies if this has been
> > discussed before; a quick scan through the archives didn't turn
> > anything up.
> > 
> > Something I find myself (or rather, my scripts) doing a lot is to run
> > a message (or a whole mailbox) known to be either ham or spam through
> > bogofilter for classification, and then if it is misclassified, to
> > register that message as ham or spam, as appropriate.  This is the
> > basic "train on error" process.  To do this in a script requires
> > bogofilter to be invoked at least once for each message, and twice on
> > error, so if this is to be done on a large mailbox, the startup costs
> > seem to dominate performance.
> > 
> > If bogofilter had a mode causing it to classify and then register only
> > misclassified messages, I think it would be a major performance boost.
> 
> To get real performance gains, this would require rewriting some code
> to cache the tokens used for scoring, because "score" (read-only) and
> "register" (read-write) are implemented separately, and we currently
> look up each result and forget it right after we've fed it into our
> (iterative) calculation or after we've updated the tokens.
> 
> Don't get me wrong, the required changes aren't earth-shaking, but they
> are non-trivial and certainly can't be confined to a new function plus
> a set of added options.

Much of the needed logic is already present.  The autoupdate option,
"-u", is pretty similar.  It scores the message and then, if the
message is classified as ham or spam, adds its tokens to the wordlist.
Messages classified as unsure aren't registered.  Additionally, there's
the thresh_update option, which further restricts autoupdating so that
"easy" messages aren't registered.  For example, with
"thresh_update=0.01", messages scoring below 0.01 or above 0.99 aren't
registered.  The proposed feature would (I suspect) fit nicely near the
thresh_update processing.
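
To make the comparison concrete, here is a rough sketch of the two
decision rules side by side.  It is Python rather than bogofilter's C,
and it is not the actual bogofilter code: the function names, the
Cutoffs class, and the cutoff values are all made up for illustration.
It also assumes that an "unsure" verdict on a message of known type
counts as an error, which the proposal doesn't spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cutoffs:
    # Illustrative values only; not bogofilter's defaults.
    ham_cutoff: float = 0.1
    spam_cutoff: float = 0.9
    thresh_update: float = 0.01


def classify(score: float, c: Cutoffs) -> str:
    """Map a score in [0, 1] to 'spam', 'ham', or 'unsure'."""
    if score >= c.spam_cutoff:
        return "spam"
    if score <= c.ham_cutoff:
        return "ham"
    return "unsure"


def autoupdate_label(score: float, c: Cutoffs) -> Optional[str]:
    """Autoupdate-style rule: register under the verdict, except for
    unsure messages and "easy" ones within thresh_update of 0 or 1."""
    verdict = classify(score, c)
    if verdict == "unsure":
        return None
    if score < c.thresh_update or score > 1.0 - c.thresh_update:
        return None  # too easy; registering it adds little
    return verdict


def train_on_error_label(score: float, known: str, c: Cutoffs) -> Optional[str]:
    """Proposed rule (hypothetical): register under the known label
    only when the verdict disagrees with it.  Assumption: 'unsure'
    counts as a disagreement."""
    verdict = classify(score, c)
    return known if verdict != known else None


if __name__ == "__main__":
    c = Cutoffs()
    for score in (0.005, 0.05, 0.5, 0.95, 0.995):
        print(score, autoupdate_label(score, c),
              train_on_error_label(score, "spam", c))
```

Both rules need only the score that has just been computed, which is
why the new check could plausibly sit next to the thresh_update test.
The token caching Matthias mentions is a separate question that this
sketch doesn't touch.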

Regards,

David




More information about the bogofilter-dev mailing list