training on error with Fisher

Greg Louis glouis at dynamicro.on.ca
Mon Dec 9 20:02:23 CET 2002


I've done another comparison, with a much bigger training corpus, of
full training vs. training-on-error.  Summarizing, and generalizing
recklessly:

For small training corpora, as you might expect, full training gives
lower error rates; but as the corpus approaches 50,000 messages, the
error rates tend to converge.

Building a training database by full training on 10,000 spams and
10,000 nonspams, and then switching to training on error (by which I
mean training with both misclassified and uncertainly-classified
messages), may be more effective than continuing with full training
past that size.  This could be because we've collected most of the
characteristic tokens by that time, and more full training tends to
dilute them -- I don't really know.
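The training-on-error scheme described above can be sketched in a few lines. This is a toy illustration, not bogofilter's actual code: the `ToyFilter` class, its crude per-token scoring, and the `MIN_DEV` cutoff are all hypothetical stand-ins for the real implementation.

```python
# Toy sketch of train-on-error: train only on messages that are
# misclassified or whose score falls inside the uncertain band.
# All names here are illustrative, not bogofilter's API.

from collections import Counter

MIN_DEV = 0.1  # scores within MIN_DEV of 0.5 count as "uncertain"

class ToyFilter:
    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def score(self, tokens):
        # Crude averaged per-token spam probability; the real
        # Robinson-Fisher combining is omitted for brevity.
        if self.nspam == 0 or self.nham == 0:
            return 0.5
        probs = []
        for t in tokens:
            s = self.spam_counts[t] / self.nspam
            h = self.ham_counts[t] / self.nham
            if s + h > 0:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5

def train_on_error(filt, tokens, is_spam):
    """Train only when the message is misclassified or uncertain."""
    p = filt.score(tokens)
    misclassified = (p >= 0.5) != is_spam
    uncertain = abs(p - 0.5) < MIN_DEV
    if misclassified or uncertain:
        filt.train(tokens, is_spam)
        return True
    return False
```

After the initial full-training phase, each new message would pass through `train_on_error`; confidently correct classifications are skipped, so only the informative messages update the database.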

The comparison is written up at
http://www.bgl.nu/bogofilter/training2.html

Coming soon: a comparison of the Robinson-Fisher calculation method
with a modified version that Gary suggests may work better with
non-zero min_dev values.
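For readers unfamiliar with it, the Robinson-Fisher combining being compared can be sketched as below. This is an illustration based on Gary Robinson's published description, not bogofilter's source; the function names are my own, and the per-token probabilities are assumed to be already computed.

```python
# Sketch of Robinson-Fisher combining: per-token spam probabilities
# are merged via Fisher's chi-square method, applied in both the
# "spammy" and "hammy" directions, then folded into one indicator.

from math import exp, log

def chi2q(x2, df):
    """Chi-square survival function for even df (closed form)."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_spamicity(probs):
    """Combine per-token spam probabilities into one score in [0,1]."""
    n = len(probs)
    # Near 1 when the tokens look spammy (probs near 1):
    h = chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
    # Near 1 when the tokens look hammy (probs near 0):
    s = chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0
```

A message whose tokens all score near 0.5 comes out at exactly 0.5, which is why the min_dev cutoff (which discards such tokens before combining) interacts with this calculation.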

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



