repetitive training

Tom Anderson tanderso at oac-design.com
Tue Mar 9 02:08:55 CET 2004


On Mon, 2004-03-08 at 10:12, Greg Louis wrote:
> Unlike Pi's results, mine seem to show that there are diminishing
> returns, at best, from repetitions of training on error with the same
> message sets, though perhaps one to four repetitions may boost accuracy
> somewhat.  The question is certainly not yet closed; Pi and I have
> discussed methodology and theory, but IMHO what's really needed is more
> experimentation.

FWIW, my experience with repetitive training has been excellent.  Mind
you, I don't necessarily follow quite the same procedure as previously
noted.  It is my philosophy that one oughtn't need to maintain a corpus
of thousands of emails in order to properly train bogofilter; therefore
all training is "spur of the moment", if you will.  I think this fits
with a more usable paradigm that average people can work with.  It's
like training a dog... you don't need to wipe its memory and start from
scratch every time, just correct bad behavior and future behavior will
improve.  Keep correcting until behavior does improve.  I.e., keep telling
him to "sit" until he does.

Following this analogy pretty closely, I set up my users with an empty
database (newborn puppy).  There is no learned behavior yet.  They may
go without training for as long as they want, until they get fed up with the
spam (bad behavior) and start sending in some corrections
(reward/punishment).  To facilitate this, they forward incorrectly
classified emails as attachments to bfproxy
(http://www.orderamidchaos.com/bogofilter/bfproxy).  I built an
address-line parameter "x" (for exhaustive training) into bfproxy to
repeatedly correct until the desired behavior is achieved
("sit"->cookie, "sit"->cookie, etc ;).  If the first registration does
not move the classification into the cutoff zone, then it registers
again and again either until it does classify correctly, or until an
arbitrary maximum is reached (default 10 recursions) in case it never
converges.  BTW, it rarely reaches the rmax before going over the
cutoff.
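
For illustration, the exhaustive-training loop described above might
look roughly like the following Python sketch.  This is not bfproxy's
actual code: the ToyFilter class, the name train_to_exhaustion, and the
cutoff values are all hypothetical stand-ins.  Only the
register-then-recheck loop and the default cap of 10 come from the
description above.

```python
from typing import Callable, Dict


class ToyFilter:
    """Hypothetical stand-in for a real filter: per-token spam/ham counts."""

    def __init__(self) -> None:
        self.spam: Dict[str, int] = {}
        self.ham: Dict[str, int] = {}

    def register(self, message: str, is_spam: bool) -> None:
        # Count every token of the message in the chosen word list.
        counts = self.spam if is_spam else self.ham
        for token in message.split():
            counts[token] = counts.get(token, 0) + 1

    def classify(self, message: str) -> float:
        # Score in [0, 1]: mean of smoothed per-token spam ratios.
        tokens = message.split()
        if not tokens:
            return 0.5
        total = 0.0
        for token in tokens:
            s = self.spam.get(token, 0)
            h = self.ham.get(token, 0)
            total += (s + 0.5) / (s + h + 1.0)
        return total / len(tokens)


def train_to_exhaustion(
    classify: Callable[[str], float],
    register: Callable[[str, bool], None],
    message: str,
    is_spam: bool,
    spam_cutoff: float,
    ham_cutoff: float,
    rmax: int = 10,
) -> int:
    """Register the correction until the message classifies as intended,
    or until rmax registrations have been made.  Returns the count."""
    count = 0
    while count < rmax:
        register(message, is_spam)
        count += 1
        score = classify(message)
        if is_spam and score >= spam_cutoff:
            break
        if not is_spam and score <= ham_cutoff:
            break
    return count
```

In a real setup, classify and register would wrap calls to the actual
filter rather than ToyFilter, and the cutoffs would be whatever the
filter is configured with.  The returned count shows how many
registrations a given correction needed, which makes it easy to see how
often the cap is hit.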

So far, results have been great.  Unsures have been reduced
substantially.  It does not seem to have contributed to any false
positives, as I haven't received any myself, nor have any been reported.
False negatives hover around 1-2 per day, unsures 8-10 (down from 30-40
a few weeks ago).  I haven't done any whole-corpus tests to determine
hard numbers, as I don't keep a corpus, so this may be construed as
hearsay, but it is my testimonial nonetheless.  I highly recommend it.

Tom
