incomplete experiment re repetitive training (longish)

Thu Feb 12 15:06:18 CET 2004

I'm sorry to report that an accident has caused the second of my two
experiments on repetitive training to terminate early.  The available
results are consistent with my original working hypothesis but
inadequate to confirm it.  The incomplete experiment is being restarted,
but it will be several days before conclusive results are obtained.  In
the meantime, here's a report of what's been seen so far.

The purpose of the experimentation was to determine whether "training
to exhaustion" helps or harms bogofilter's ability to classify hitherto
unseen messages correctly.  Two separate experiments were begun, both
using the following protocol:

The available spam and nonspam corpora were each split into two parts,
one of which contained a third of the messages and the other the
remaining two thirds.  This was done by "dealing out" the messages as
one would deal a deck of cards, putting two cards in one pile and one
in the other, then repeating the process till the cards had all been
dealt.  The smaller parts should thus have been closely representative
of the total spam and nonspam.
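
As a rough illustration of the dealing procedure, here is a minimal
Python sketch.  The function name deal_split and the choice of which
pile receives the first two cards of each three are assumptions made
for the example; the actual split was not necessarily done this way.

```python
def deal_split(messages):
    """Deal messages 2:1 into a larger pile and a smaller pile.

    Assumption: within each group of three, the first two messages go
    to the larger pile and the third goes to the smaller pile.
    """
    larger, smaller = [], []
    for position, message in enumerate(messages):
        if position % 3 == 2:
            smaller.append(message)
        else:
            larger.append(message)
    return larger, smaller


# Example with stand-in message ids rather than real mail.
test_part, training_part = deal_split(range(9))
```

Because consecutive messages are spread across both piles, any drift in
the corpus over time ends up in both parts rather than in only one.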

The smaller parts were used to create a training database by "full
training" and the optimal bogofilter parameters were determined using
bogotune, with the newly created training database and the larger parts
of the original corpora for testing.  These test messages remained
"new" in that none of them were ever used for training.

The following steps were then iterated: the smaller parts were
classified using the newly determined parameters, and the messages that
were classified wrongly or as unsure were used to train the database
further.  Then bogotune was run again, still with the larger parts as
test messages.  At the beginning, and each time through this loop,
bogotune's expected false-negative count was recorded, using the cutoff
setting for the lowest false-positive count in the bogotune
recommendations.  Also recorded were the numbers of spams and nonspams
to be used in training for the next round.
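
The loop just described can be sketched roughly as follows.  This is
not the script that was actually used: classify, train and
expected_test_fn are hypothetical callables standing in for the
bogofilter and bogotune runs, and stopping when no errors remain is an
assumption of the sketch.

```python
def repeat_training(train_set, classify, train, expected_test_fn, max_rounds):
    """Run training-on-error rounds and record one table row per round.

    train_set         -- list of (message, label), label "spam" or "nonspam"
    classify(msg)     -- returns "spam", "nonspam" or "unsure"
    train(msg, label) -- adds one message to the training database
    expected_test_fn  -- callable returning the false-negative count on
                         the held-out test messages (stand-in for the
                         tuning step)
    """
    rows = []
    for run in range(max_rounds + 1):
        test_fn = expected_test_fn()
        # Wrong and unsure verdicts both count as errors.
        errors = [(m, lab) for m, lab in train_set if classify(m) != lab]
        train_fp = sum(1 for _, lab in errors if lab == "nonspam")
        train_fn = sum(1 for _, lab in errors if lab == "spam")
        rows.append((run, test_fn, train_fp, train_fn))
        if run == max_rounds or not errors:
            break
        # Train only on the messages that were not classified correctly.
        for m, lab in errors:
            train(m, lab)
    return rows
```

Each row has the same shape as a line of the result tables: the run
number, the expected false negatives on the test messages, and the
counts of nonspam and spam queued for the next round of training.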

According to the hypothesis under test, some initial improvement in
discrimination might be observed (pi has demonstrated this effect
clearly, and I've seen it as well).  I would then expect the training
database to become too closely representative of just the training set
(the smaller parts of the corpora), at which point bogofilter's
classification accuracy should begin to degrade.  These experiments
were designed to test that latter expectation.

There were two experiments.  The first one, which completed, used
messages taken from my personal mailbox, 30,198 nonspam and 30,155
spam.  The training set thus comprised 10,066 of the nonspam and 10,051
of the spam.  There were six iterations of repetitive
training-on-error; the first four showed the expected improvement in
accuracy (apart from a temporary setback at the third), but the fifth
and sixth displayed significant degradation.  (The training leading to
the fifth round was carefully verified in the log to make sure there
was no error in manipulation.)  The results are as follows:

  run        number of iterations of training
  testfn     expected false negatives reported by bogotune
  trainfp    nonspam wrongly or uncertainly classified, and therefore
             used in the next round of training
  trainfn    spam wrongly or uncertainly classified, and therefore
             used in the next round of training
  testfnpc, trainfppc, trainfnpc
             the same three results in percentage form

 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0    520       1     160     2.59   0.00993      1.59
   1    505       1     147     2.51   0.00993      1.46
   2    481       1     126     2.39   0.00993      1.25
   3    510       0     127     2.54   0.00000      1.26
   4    477       1     113     2.37   0.00993      1.12
   5    649      54     173     3.23   0.53646      1.72
   6    724       9     174     3.60   0.08941      1.73

The percentages of false negatives in test and training are shown in
the left-hand panel of the attached jpeg.
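
For anyone wanting to check the percentage columns, the following
sketch recomputes them from the counts.  The denominators are inferred
from the figures (test spam for testfnpc, training nonspam for
trainfppc, training spam for trainfnpc), and the function name is made
up for the example.

```python
def percentage_columns(testfn, trainfp, trainfn,
                       spam_total, nonspam_train, spam_train):
    """Return (testfnpc, trainfppc, trainfnpc) for one table row."""
    # Spam never used for training forms the spam half of the test set.
    spam_test = spam_total - spam_train
    testfnpc = round(100 * testfn / spam_test, 2)
    trainfppc = round(100 * trainfp / nonspam_train, 5)
    trainfnpc = round(100 * trainfn / spam_train, 2)
    return testfnpc, trainfppc, trainfnpc


# Run 0 of the first experiment: 520 / (30155 - 10051) and so on.
row0 = percentage_columns(520, 1, 160, 30155, 10066, 10051)
# row0 == (2.59, 0.00993, 1.59)
```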

So far, the working hypothesis seemed to be holding up.  While this
first experiment was running, however, a second one was being run as
well, this time with messages from an organization of about 80 users
with diverse interests.  There were 54,229 nonspam and 59,371 spam
altogether, so the training set included 18,076 nonspam and 19,791
spam.  The discrimination was, in this case, superior to that observed
in the first experiment, though a higher proportion of false positives
was encountered.  Given that the starting training database was larger
and the number of messages used in each round of training-on-error was
smaller, it was to be expected that this experiment would have to run
to more iterations before a conclusion was reached.  Unfortunately, a
manipulation error corrupted the training database after the ninth
round, so it will have to be rebuilt from scratch before the experiment
can be continued.  Results obtained thus far are shown in the next
table:

 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0   1065      67      21     2.69     0.371     0.106
   1   1022      67      21     2.58     0.371     0.106
   2    941      67      21     2.38     0.371     0.106
   3    678      67      21     1.71     0.371     0.106
   4    619      67      21     1.56     0.371     0.106
   5    643      67      21     1.62     0.371     0.106
   6    589      67      21     1.49     0.371     0.106
   7    567      67      21     1.43     0.371     0.106
   8    576      67      21     1.46     0.371     0.106
   9    521      67      21     1.32     0.371     0.106

The right-hand panel of the jpeg figure shows the corresponding data
for this experiment.  Note that nine iterations weren't enough even to
begin improving discrimination in the training set itself, though the
accuracy of classifying the test set improved impressively.
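
To compare the two tables at a glance, this small sketch picks out the
run with the lowest expected false-negative count in each.  The testfn
values are copied from the tables above; the helper name best_run is
only for illustration.

```python
exp1_testfn = [520, 505, 481, 510, 477, 649, 724]
exp2_testfn = [1065, 1022, 941, 678, 619, 643, 589, 567, 576, 521]


def best_run(testfn):
    """Return (run number, count) for the lowest false-negative count."""
    run = min(range(len(testfn)), key=testfn.__getitem__)
    return run, testfn[run]


best1 = best_run(exp1_testfn)  # (4, 477)
best2 = best_run(exp2_testfn)  # (9, 521)
```

In the first experiment the minimum falls in the middle of the series,
whereas in the second it is the last run recorded, so no turning point
has been seen there yet.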

For the moment, then, it looks as though some number of repetitions of
training can indeed "boost" bogofilter's accuracy; there may, however,
be an optimum that varies with the size and homogeneity of the message
corpus.  Completion of the second experiment should elucidate this
point further.  Once I have the final results, I'll post a writeup on
the www.bgl.nu/bogofilter site and inform the list.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reptrain.jpeg
Type: image/jpeg
Size: 23322 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter/attachments/20040212/35c2e296/attachment.jpeg>