incomplete experiment re repetitive tuning (still longish)

Greg Louis glouis at dynamicro.on.ca
Fri Feb 13 00:49:01 CET 2004


On 20040212 (Thu) at 0906:18 -0500, Greg Louis wrote:

> The incomplete experiment is being restarted but it will be several
> days before conclusive results are obtained.

Well, contrary to my initial fears, I was able to revive the moribund
training database and complete the experiment, so here is a revised
writeup.  Something definitive will go up on www.bgl.nu/bogofilter in
the next few days.

===================

It's not uncommon for longer-term bogofilter users to build new
training databases by first training with large and approximately equal
numbers of spam and nonspam ("full training"), and thereafter training
on errors and unsures (known as "training on error"), from time to time
topping up with spam or nonspam to keep the numbers roughly equal.  I
became interested in the question whether, after switching to training
on error, one should train "to exhaustion" or "repetitively" --
techniques in which, after each bout, one reclassifies the messages
that were used in training, and retrains with any messages that are
still wrongly classified.  Theoretically this enhances bogofilter's
ability to classify the messages used in training, at the expense of
degrading accuracy with respect to messages that are similar to them,
but not identical.  Some bogofilter users feel that there is value in
repetitive training.  I didn't, but I may have been wrong, as will be
seen below.

The purpose of the experimentation, therefore, was to determine
whether, after one has begun by building a good-sized training database
without repetition, training repetitively helps or harms bogofilter's
ability to classify hitherto unseen messages correctly.  Two separate
experiments were run, both using the following protocol:

The available spam and nonspam corpora were each split into two parts,
one of which contained a third of the messages and the other the
remaining two thirds.  This was done by "dealing out" the messages as
one would deal a deck of cards, putting two cards in one pile and one
in the other, then repeating the process till the cards had all been
dealt.  The smaller parts should thus have been closely representative
of the total spam and nonspam.
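The deal-out split described above can be sketched as follows (an illustrative stand-in, not the actual scripts used; the message objects are placeholders):

```python
def deal_out(messages):
    """Split a corpus 2:1 by dealing the messages out in turn,
    two into the larger pile and one into the smaller, so both
    piles stay representative of the whole."""
    larger, smaller = [], []
    for i, msg in enumerate(messages):
        if i % 3 == 2:          # every third "card" goes to the small pile
            smaller.append(msg)
        else:
            larger.append(msg)
    return larger, smaller

# e.g. the 30,198-message nonspam corpus from the first experiment
larger, smaller = deal_out(list(range(30198)))
print(len(smaller), len(larger))
```

With 30,198 nonspam this yields the 10,066-message training part reported below.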

The smaller parts were used to create a training database by "full
training" and the optimal bogofilter parameters were determined using
bogotune, with the newly created training database and the larger parts
of the original corpora for testing.  These test messages remained
"new" in that none of them were ever used for training.

The following steps were then iterated: the smaller parts were
classified using the newly determined parameters, and the messages that
were classified wrongly or as unsure were used to train the database
further.  Then bogotune was run again (training usually alters the
optimal parameter values), still with the larger parts as test
messages.  At the beginning, and each time through this loop,
bogotune's expected false-negative count was recorded, using the cutoff
setting for the lowest false-positive count in the bogotune
recommendations.  Also recorded were the numbers of spams and nonspams
to be used in training for the next round.
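The control flow of that loop can be sketched as follows. This is only an illustration of the protocol: classify, train and retune stand in for the actual bogofilter and bogotune invocations, and the toy "classifier" below merely memorizes what it has been trained on so the sketch runs:

```python
def train_on_error_loop(train_msgs, classify, train, retune, rounds=6):
    """Repeat: classify the training messages, retrain on the
    errors/unsures, then re-run parameter tuning (training usually
    shifts the optimal cutoffs)."""
    history = []
    for r in range(rounds):
        errors = [m for m in train_msgs if classify(m) != m["label"]]
        for m in errors:
            train(m)              # register the misclassified message
        params = retune()         # record the freshly tuned parameters
        history.append((r, len(errors), params))
    return history

# Toy stand-ins: a "classifier" that gets a message right once trained on it.
seen = set()
msgs = [{"id": i, "label": i % 2} for i in range(10)]
hist = train_on_error_loop(
    msgs,
    classify=lambda m: m["label"] if m["id"] in seen else -1,
    train=lambda m: seen.add(m["id"]),
    retune=lambda: None,
)
print(hist[0][1], hist[1][1])  # 10 errors in round 0, 0 in round 1
```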

According to the hypothesis under test, some initial improvement in
discrimination might be observed.  I would then expect the training
database to become too closely representative of the idiosyncrasies of
the training set (the smaller parts of the corpora), at which point
bogofilter's classification accuracy should begin to degrade.  These
experiments were designed to test that latter expectation.

There were two experiments.  The first one used messages taken from my
personal mailbox, 30,198 nonspam and 30,155 spam.  The training set
thus comprised 10,066 of the nonspam and 10,051 of the spam.  There
were six iterations of repetitive training-on-error; the first four
showed the expected improvement in accuracy, but the fifth and sixth
displayed significant degradation. (The training leading to the fifth
round was carefully verified in the log to make sure there was no error
in manipulation.)  The results are as follows (the columns show the
number of iterations of training, the expected false negatives reported
by bogotune, and the numbers of nonspam and spam that were wrongly or
uncertainly classified and were therefore used in the next round of
training; the right-hand three columns show the same results in
percentage form):

 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0    520       1     160     2.59   0.00993      1.59
   1    505       1     147     2.51   0.00993      1.46
   2    481       1     126     2.39   0.00993      1.25
   3    510       0     127     2.54   0.00000      1.26
   4    477       1     113     2.37   0.00993      1.12
   5    649      54     173     3.23   0.53646      1.72
   6    724       9     174     3.60   0.08941      1.73
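The percentage columns follow from the corpus sizes given above: the test false-negative rate is taken over the held-out spams (30,155 minus the 10,051 used in training), and the training-set rates over the 10,066 nonspam and 10,051 spam in the training part. Row 0, for example, works out as:

```python
test_spam = 30155 - 10051        # spams never used in training
train_ns, train_sp = 10066, 10051

testfnpc  = 100 * 520 / test_spam    # test false negatives
trainfppc = 100 * 1   / train_ns     # training false positives
trainfnpc = 100 * 160 / train_sp     # training false negatives
print(round(testfnpc, 2), round(trainfppc, 5), round(trainfnpc, 2))
```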

The percentages of false negatives in test and training are shown on
the left panel of the figure.

So far, the working hypothesis seemed to be holding up.  While this
first experiment was running, a second one was being run as well, this
time with messages from an organization of about 80 users with diverse
interests.  There were 54,229 nonspam and 59,371 spam altogether, so
the training set included 18,076 nonspam and 19,791 spam.  The
discrimination was, in this case, superior to that observed in the
first experiment, though a higher proportion of false positives was
encountered; given that the starting training database was larger, and
the number of messages used in each round of training-on-error was
smaller, it was to be expected that this experiment would need more
iterations before a conclusion was reached:

 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0   1065      67      21     2.69     0.371     0.106
   1   1022      67      21     2.58     0.371     0.106
   2    941      67      21     2.38     0.371     0.106
   3    678      67      21     1.71     0.371     0.106
   4    619      67      21     1.56     0.371     0.106
   5    643      67      21     1.62     0.371     0.106
   6    589      67      21     1.49     0.371     0.106
   7    567      67      21     1.43     0.371     0.106
   8    576      67      21     1.46     0.371     0.106
   9    521      67      21     1.32     0.371     0.106
  10    521      67      21     1.32     0.371     0.106
  11    505     204      69     1.28     1.129     0.349
  12    544     204      69     1.37     1.129     0.349
  13    578     204      69     1.46     1.129     0.349

The right-hand panel of the figure shows the corresponding data for
this experiment.  The accuracy of classifying the test data at first
improved quite significantly (the error rate being more than halved),
but eventually degradation began to set in.

The conclusion can be drawn that some number of repetitions of training
-- perhaps four would be a safe choice -- can indeed "boost"
bogofilter's accuracy; there may, however, be an optimum that varies
with the size and homogeneity of the message corpus.

Note that the sensitivity of bogofilter's accuracy to overtraining
should decrease as the size of the training database increases.  I
would expect, though I have not tested this and some people report
otherwise, that training on error "ab initio" would be yet more
vulnerable to overtraining than the present results indicate.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reptrain.jpeg
Type: image/jpeg
Size: 23790 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040212/58baded0/attachment.jpeg>

