repetitive-training experiments

Mon Mar 15 16:10:06 CET 2004

On 20040314 (Sun) at 10:24:31 -0500, Greg Louis wrote:
> A new experiment comparing results I get with full training, half full
> and half train-on-error, and train-on-error with three different
> maximum registration limits is written up at
> http://www.bgl.nu/bogofilter/reptrain2.html
> 
> The previous experiment, described at
> http://www.bgl.nu/bogofilter/reptrain.html
> seemed to indicate that a limited number of rounds of repetitive
> training might help in the situation where training on error is used
> after an initial period of full training.  I may follow that up in a
> bit more detail as well.

This I have done, sort of.  Yet one more section has been added to
reptrain2.html in which I show that where training on error is used
after initial full training, message-by-message repetition adds no
value.  This contrasts with reptrain.html's report, where a different
method was used: the registration took place just once per
misclassified message, but the process was repeated several times with
the whole corpus; this seemed to be helpful, provided the number of
iterations was kept low.
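
To make the difference between the two repetition schemes concrete,
here is a minimal Python sketch.  It is an illustration only: ToyFilter
is a hypothetical token-count stand-in, not bogofilter's algorithm, and
the function names and the max_registrations / max_passes parameters
are assumptions of this sketch rather than anything taken from the
experiments.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a statistical filter (not bogofilter)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def register(self, text, label):
        # Count each distinct token once per registered message.
        self.counts[label].update(set(text.lower().split()))

    def classify(self, text):
        tokens = set(text.lower().split())
        spam = sum(self.counts["spam"][t] for t in tokens)
        ham = sum(self.counts["ham"][t] for t in tokens)
        # Ties (including unseen tokens) fall to "ham".
        return "spam" if spam > ham else "ham"


def train_per_message(filt, corpus, max_registrations):
    """Message-by-message repetition: re-register one misclassified
    message until it classifies correctly or the limit is reached."""
    total = 0
    for text, label in corpus:
        n = 0
        while n < max_registrations and filt.classify(text) != label:
            filt.register(text, label)
            n += 1
        total += n
    return total


def train_whole_corpus(filt, corpus, max_passes):
    """Whole-corpus repetition: at most one registration per
    misclassified message per pass, repeated up to max_passes times."""
    total = 0
    for _ in range(max_passes):
        errors = 0
        for text, label in corpus:
            if filt.classify(text) != label:
                filt.register(text, label)
                errors += 1
        total += errors
        if errors == 0:
            break  # a clean pass: nothing left to train on
    return total
```

Both functions return the number of registrations performed, so the
two schemes can be compared on the same corpus after the same initial
full training.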

I also corrected one experimental (rounding) error in the evaluation of
the results with full training.

I'm not advising anyone to use any form of repeated training in
production as yet; further investigation is needed.  From what I have
seen, I would advise against training on error "from scratch", with or
without repetition; however, others have reported better experience
with it than I have encountered.  (With phrase-based filters such as
CRM114, one really has no alternative to training on error, because of
database size considerations.)  As mentioned in my previous posting to
the list, I won't be investigating training on error from scratch any
further, though I may do a bit more work on training on error after
initial full training.

For those who have lots of disk space and reasonably muscular
processors, it's worth noting that in my experience, as in the
experiment reported in reptrain2.html, full training (train once with
every message you get, after verifying classification) works best of
all.  To me, this is reassuring, because that's what the theory
predicts.
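
For readers unsure of the terminology, the two basic policies can be
sketched as follows.  This is a hedged illustration, not bogofilter
code: classify and register are hypothetical callbacks standing in for
whatever the filter provides, and each message is assumed to arrive as
a (text, verified_label) pair.

```python
def full_training(messages, register):
    """Register every verified message exactly once."""
    count = 0
    for text, label in messages:
        register(text, label)
        count += 1
    return count


def train_on_error(messages, classify, register):
    """Register a verified message only if the filter got it wrong."""
    count = 0
    for text, label in messages:
        if classify(text) != label:
            register(text, label)
            count += 1
    return count
```

The returned counts show where the disk-space cost comes from: full
training registers every message, while training on error registers
only the misclassified ones.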

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |