New "register if needed" feature?

Mon Aug 1 16:16:06 CEST 2005

Randall,

I wrote a script called bfproxy which does some training-on-error processing 
similar to what you are trying to achieve in bogofilter.  It takes two 
primary parameters.  R registers as ham (-n) anything previously classified 
as spam, and registers as spam (-s) anything previously classified as ham. 
C does the same as R, but also unregisters the previous classification 
(-Sn, -Ns) if autoregistration is being used.  Either R or C may be followed 
by a secondary parameter: n to register unsures as ham, or s to register 
unsures as spam.  In this way, you can pass in a mailbox of hams, spams, and 
unsures which were "errors" and have them corrected, with unsures going to 
either ham (-Rn, -Cn) or spam (-Rs, -Cs).
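
As a rough illustration of the R/C scheme described above (not the actual 
bfproxy code), the mapping from a message's previous classification to a 
bogofilter registration flag might be sketched in Python as follows.  The 
function and argument names are invented for this sketch.  It also assumes 
that C always unregisters (that is, autoregistration is in use) and that an 
unsure has no earlier registration to undo.

```python
from typing import Optional


def correction_flag(mode: str, previous: str,
                    unsure_as: Optional[str] = None) -> Optional[str]:
    """Return the bogofilter flag used to correct one misfiled message.

    mode      -- "R" (register only) or "C" (also unregister the old class)
    previous  -- "spam", "ham" or "unsure": how the message was classified
    unsure_as -- "n" or "s", the secondary parameter for unsures, or None
    """
    if mode not in ("R", "C"):
        raise ValueError("mode must be 'R' or 'C'")
    if previous == "spam":
        # Was filed as spam, so it is really ham.
        return "-n" if mode == "R" else "-Sn"
    if previous == "ham":
        # Was filed as ham, so it is really spam.
        return "-s" if mode == "R" else "-Ns"
    if previous == "unsure":
        # Assumption: nothing was registered for an unsure, so R and C
        # behave the same and only the secondary parameter matters.
        if unsure_as is None:
            return None  # no secondary parameter: leave the unsure alone
        if unsure_as in ("n", "s"):
            return "-" + unsure_as
        raise ValueError("unsure_as must be 'n', 's' or None")
    raise ValueError("previous must be 'spam', 'ham' or 'unsure'")
```

For example, correction_flag("C", "spam") gives "-Sn", and 
correction_flag("R", "unsure", "s") gives "-s".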

I've been using this method for over a year, and it's been working great. 
It is possible to begin training on error from scratch, with no previous 
registrations.  At first, all email will arrive as unsure -- simply correct 
it as spam or ham, then email will begin to be filtered one way or the 
other.  Just continue training on errors.  Bogofilter becomes reliable 
(>90%) within just a day or two.

In this script, I also provide recursive training on error (until unsures 
are classified as ham or spam), which tends to improve filtering results. 
This too might be integrated directly into bogofilter.
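
The recursive idea can be sketched as a simple loop: keep registering the 
message until the classifier agrees with the correction.  This is only a 
sketch under assumptions, not bfproxy's implementation.  The classify and 
register callables are hypothetical stand-ins for whatever invokes 
bogofilter, and the max_rounds cap is an added safeguard so the loop cannot 
run forever.

```python
from typing import Callable


def train_until_classified(message: bytes, target: str,
                           classify: Callable[[bytes], str],
                           register: Callable[[bytes, str], None],
                           max_rounds: int = 10) -> int:
    """Register `message` as `target` until classify() returns `target`.

    Returns the number of registrations performed.  Stops after
    `max_rounds` registrations even if the message is still misfiled.
    """
    rounds = 0
    while rounds < max_rounds and classify(message) != target:
        register(message, target)
        rounds += 1
    return rounds
```

A message that is already classified correctly costs no registrations, 
which keeps this consistent with plain training on error.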

If you're interested, the code is here: 
http://orderamidchaos.com/bogofilter/bfproxy

Tom

----- Original Message ----- 
From: "Randall Nortman" <bogofilterlist at wonderclown.com>
To: <bogofilter-dev at bogofilter.org>
Sent: Sunday, July 31, 2005 9:36 PM
Subject: Re: New "register if needed" feature?

> I've created a *preliminary* patch implementing the feature I
> proposed.  The patch against 0.95.2 is attached; I haven't tried
> applying it to the CVS code yet.  The code is minimally tested, and I
> wasn't very careful about matching the existing indentation style,
> which I can fix.  As David Relson said, most of the logic works in
> much the same way the thresh_update logic works.  I added two options:
>
>  --train-on-error                  Train only if message is misclassified
>  --output-only-misclassified       Output only misclassified messages
>
> Any other opinions on option names are welcome.  I hate naming
> things.
>
> The --train-on-error option enables the train on error feature.  It
> must be combined with either -n or -s (and cannot be combined with -N,
> -S, or -u), and it works in single-message mode as well as with -M,
> -B, or -b.  If -v is specified, a line is printed to stderr (actually,
> dbgout) saying how many messages were misclassified.
>
> The --output-only-misclassified option is so that scripts can
> determine which messages in the set were misclassified (and
> subsequently registered in the wordlist).  If combined with -p on an
> mbox, the output will be an mbox with only the misclassified messages
> (I think; this is untested).  If combined with -v on a maildir or in
> bulk mode, you'll get the filenames of only the misclassified
> messages, plus the X-Bogosity information (just like -v normally
> does).
>
> While thinking of how I'm going to integrate this new feature into my
> training scripts, another new feature occurred to me: I'd like to add
> a REG_MIXED mode, which would be combined with bulk mode (-b or -B),
> to allow registration of both ham and spam in a single run.  Instead
> of just providing filenames of objects to be classified, you would
> provide "spam:<filename>" or "ham:<filename>".  Then, each time
> bogofilter moves on to the next object, it would switch into REG_SPAM
> or REG_GOOD mode as appropriate.  This could be combined with
> train-on-error or used independently.  This would primarily be useful
> for initial training, so that the training script can alternate
> between ham and spam messages.  From what I can tell, you need to
> register some of each before bogofilter will start classifying
> anything, right?  So if train-on-error is going to be useful for
> initial training, you need to be able to switch between ham and spam.
> (This is how bogominitrain.pl works.)  I intend to use this on
> maildirs, and give bogofilter one filename at a time on stdin using
> -b, picking files alternately from the ham and spam maildirs.
>
> I think this is rather easy to implement by having the _next_mailstore
> functions set the right bit (REG_SPAM or REG_GOOD) in the global
> run_type variable each time a mailstore is opened (after parsing out
> the "spam:" or "ham:" prefix).  You could also allow it to work in
> mbox mode (-M) rather than bulk mode if the messages had a header to
> tell bogofilter whether to register them as ham or spam, but that
> might be a bit trickier to implement.
>
> But before I do any more on this, I figured I'd share my work so far
> in case anybody thinks I'm taking the wrong approach.  Ultimately, I'd
> like to see my patch integrated into the mainstream code, if others
> consider it useful.  Comments welcome.
>
> Randall Nortman
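
For reference, the "spam:<filename>" / "ham:<filename>" bulk input proposed 
in the quoted message could be parsed along these lines.  This is a Python 
sketch of the idea only, not bogofilter code: the function name is invented, 
and the strings "REG_SPAM" and "REG_GOOD" merely stand in for the run_type 
bits mentioned above.  Splitting on the first colon only is a deliberate 
choice, since maildir filenames often contain colons themselves.

```python
from typing import Tuple


def parse_mixed_line(line: str) -> Tuple[str, str]:
    """Split one bulk-mode input line into (registration mode, filename)."""
    line = line.rstrip("\n")
    # partition() splits at the first ":" only, so colons inside the
    # filename (common in maildir names) are preserved.
    prefix, sep, filename = line.partition(":")
    if sep and filename and prefix in ("spam", "ham"):
        mode = "REG_SPAM" if prefix == "spam" else "REG_GOOD"
        return mode, filename
    raise ValueError("expected 'spam:<filename>' or 'ham:<filename>'")
```

A training script could then feed alternating "ham:" and "spam:" lines on 
stdin and switch mode per line.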