repetitive-training experiments

Mon Mar 15 12:43:26 CET 2004

On 20040314 (Sun) at 1024:31 -0500, Greg Louis wrote:
(http://www.bgl.nu/bogofilter/reptrain2.html)

> I could find no way to get the false-positive count low enough without
> allowing far too many false negatives.  (One thing I didn't try, that
> might have helped, was to use a rather high spam_cutoff and low
> ham_cutoff during training, thus defining "unsure" more broadly than is
> done in production and so using more marginal-scoring messages for
> training.  I intend to try that and add the results to this writeup.)

This has now been done.  Widening the "unsure" range during training,
together with setting the minimum deviation large enough to exclude
unknown tokens, greatly improves the performance of the pure
train-on-error databases.  Nevertheless, for comparable numbers of
false positives, pure training on error -- with or without repetition --
still produced, at best, almost twice as many false negatives as full
training.
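
A minimal sketch of the procedure described above, in Python, may help
make the idea concrete.  It is not bogofilter's algorithm and not the
code used for the experiments: ToyFilter, train_on_error, the sample
corpus and all numeric values are hypothetical, chosen only for
illustration.  The names ham_cutoff, spam_cutoff and min_dev simply
mirror the terms used in the text.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in scorer (NOT bogofilter's algorithm).

    A message's score is the plain average of per-token spam ratios.
    """

    def __init__(self, min_dev=0.0):
        self.spam = Counter()
        self.ham = Counter()
        self.min_dev = min_dev  # minimum deviation from the neutral 0.5

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            p = 0.5 if s + h == 0 else s / (s + h)
            # Unknown tokens sit at exactly 0.5, so any min_dev > 0
            # excludes them from the score.
            if abs(p - 0.5) >= self.min_dev:
                probs.append(p)
        return sum(probs) / len(probs) if probs else 0.5


def train_on_error(db, corpus, ham_cutoff, spam_cutoff, max_passes=10):
    """Repeat passes over the corpus, training only on messages that are
    misclassified or fall in the "unsure" band between the two cutoffs.
    Returns the total number of messages trained."""
    total = 0
    for _ in range(max_passes):
        trained = 0
        for tokens, is_spam in corpus:
            s = db.score(tokens)
            confident = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not confident:
                db.train(tokens, is_spam)
                trained += 1
        total += trained
        if trained == 0:  # a clean pass: nothing left to learn from
            break
    return total


# Illustrative data only.
corpus = [
    (["viagra", "cheap", "offer"], True),
    (["meeting", "agenda", "offer"], False),
    (["cheap", "pills", "click"], True),
    (["lunch", "meeting", "friday"], False),
]

# Training cutoffs set wider apart than a production setup might use,
# so that more marginal-scoring messages are selected for training.
db = ToyFilter(min_dev=0.1)
trained = train_on_error(db, corpus, ham_cutoff=0.2, spam_cutoff=0.8)
```

Moving ham_cutoff and spam_cutoff closer together in this sketch narrows
the "unsure" band, so fewer messages are selected for training.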

I do not intend to take part in any further investigation of pure
training on error.  Of course, this should not discourage those who find
it useful from characterizing it further.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |