Security margins in training (on error and to exhaustion)

Wed Dec 10 12:35:48 CET 2003

Hi!

I ran an extensive test of security margins for training-on-error
and training-to-exhaustion. The setup is simple. I split my mail
collection in half, one half for training (t) and one half for rating
(r), with ns being non-spam and sp being spam:
t.ns:15671
t.sp:13613
r.ns:15670
r.sp:13612

I trained using bogominitrain.pl. The first table uses only one run,
which should be representative of any train-on-error scheme; the
second table uses -fn for training-to-exhaustion. My spam_cutoff is
0.5, and I widened the security margin in both directions in steps of
0.1. Rating, of course, is done without that margin.
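
A minimal sketch of the margin rule in Python, for readers who want to
see the logic spelled out. This is not the code of bogominitrain.pl;
the function names and the score/train callbacks are hypothetical, and
it assumes that a message counts as spam when its score is at or above
spam_cutoff and as ham when it is at or below ham_cutoff. A message is
trained whenever its score is not safely on the correct side of the
training cutoffs, and training-to-exhaustion repeats the pass until a
pass trains nothing.

```python
def needs_training(score, is_spam, ham_cutoff, spam_cutoff):
    """Return True if the message is misrated or inside the security margin."""
    if is_spam:
        # Spam must score at or above the (raised) training spam cutoff.
        return score < spam_cutoff
    # Ham must score at or below the (lowered) training ham cutoff.
    return score > ham_cutoff


def train_pass(messages, score, train, ham_cutoff, spam_cutoff):
    """One train-on-error run; returns the number of messages trained."""
    trained = 0
    for text, is_spam in messages:
        if needs_training(score(text), is_spam, ham_cutoff, spam_cutoff):
            train(text, is_spam)
            trained += 1
    return trained


def train_to_exhaustion(messages, score, train, ham_cutoff, spam_cutoff,
                        max_runs=50):
    """Repeat passes until one pass trains nothing; returns the run count."""
    for run in range(1, max_runs + 1):
        if train_pass(messages, score, train, ham_cutoff, spam_cutoff) == 0:
            return run
    return max_runs
```

With ham_cutoff = spam_cutoff = 0.5 this reduces to plain
train-on-error; moving the two cutoffs apart is what the tables below
vary.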

What the information in the tables means:
The first row shows the ham_cutoff values used for training, so the
column marked 0.3 lists results for ham_cutoff=0.3 combined with the
spam_cutoff given in the first column. Each table cell has three
lines. The first gives the size of the database; in the second table
it is preceded by the number of runs needed. The second line (w:s/h)
gives the number of messages used for training (spam/ham). The third
line (fn/fp) gives the false negatives and false positives on my
rating files.

Training one run of train-on-error:

sc\hc|   0.5   |   0.4   |   0.3   |   0.2   |   0.1
-----+---------+---------+---------+---------+---------
 0.5 |    1172k|    1672k|    1892k|    1960k|    2120k
w:s/h| 159/  87| 233/ 138| 243/ 181| 272/ 205| 279/ 238
fn/fp| 207/  78| 244/  28| 260/  17| 284/  13| 296/  10
-----+---------+---------+---------+---------+---------
 0.6 |    1548k|    2420k|    2372k|    2492k|    2712k
w:s/h| 228/ 123| 391/ 191| 370/ 229| 398/ 254| 420/ 307
fn/fp|  86/ 109| 137/  49| 166/  21| 146/  26| 150/  13
-----+---------+---------+---------+---------+---------
 0.7 |    1748k|    2572k|    2696k|    2752k|    2800k
w:s/h| 283/ 127| 441/ 208| 455/ 239| 469/ 270| 480/ 312
fn/fp|  79/ 100| 101/  26|  97/  19| 118/  34| 141/  17
-----+---------+---------+---------+---------+---------
 0.8 |    2100k|    2668k|    2848k|    2928k|    3064k
w:s/h| 362/ 142| 536/ 210| 537/ 242| 549/ 282| 557/ 308
fn/fp|  64/ 142| 100/  27|  93/  16|  92/  15|  93/  10
-----+---------+---------+---------+---------+---------
 0.9 |    2176k|    2912k|    2976k|    3048k|    3028k
w:s/h| 456/ 144| 651/ 212| 666/ 239| 655/ 270| 631/ 325
fn/fp|  58/ 125|  65/  30|  77/  17|  99/  16|  72/  19

Training-to-exhaustion:

sc\hc|   0.5   |   0.4   |   0.3   |   0.2   |   0.1
-----+---------+---------+---------+---------+---------
 0.5 |6   1540k|7   2116k|5   2308k|4   2384k|3   2548k
w:s/h| 302/ 137| 377/ 202| 388/ 234| 401/ 279| 423/ 308
fn/fp|  67/  39| 100/  16| 118/   9| 128/   8| 125/   4
-----+---------+---------+---------+---------+---------
 0.6 |4   2056k|4   2796k|4   2764k|5   3020k|5   3136k
w:s/h| 390/ 163| 530/ 234| 512/ 287| 546/ 308| 552/ 370
fn/fp|  32/  47|  64/  15|  58/  16|  75/  12|  83/   8
-----+---------+---------+---------+---------+---------
 0.7 |4   2228k|4   2904k|3   3068k|4   3208k|5   3308k
w:s/h| 458/ 173| 590/ 245| 613/ 277| 635/ 330| 648/ 383
fn/fp|  29/  50|  56/  21|  54/  12|  64/   9|  67/  12
-----+---------+---------+---------+---------+---------
 0.8 |5   2508k|6   3120k|4   3240k|4   3300k|4   3484k
w:s/h| 568/ 182| 721/ 258| 724/ 282| 729/ 333| 734/ 375
fn/fp|  39/  45|  47/  19|  56/  13|  55/  10|  71/   7
-----+---------+---------+---------+---------+---------
 0.9 |5   2656k|4   3352k|5   3496k|5   3552k|7   3488k
w:s/h| 674/ 182| 870/ 247| 876/ 286| 879/ 328| 858/ 382
fn/fp|  30/  52|  51/  18|  57/  17|  51/  17|  60/  10

Full training for comparison:
     27060k
13613/15671
  210/   16
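
To compare cells more easily, the fn/fp counts can be turned into
percentages of the rating half. The sketch below assumes that false
negatives are counted against the rated spam (r.sp) and false
positives against the rated non-spam (r.ns); the helper name and the
selection of cells are only for illustration.

```python
RATED_SPAM = 13612  # r.sp
RATED_HAM = 15670   # r.ns


def error_rates(fn, fp):
    """Return (false-negative %, false-positive %) for the rating half."""
    return 100.0 * fn / RATED_SPAM, 100.0 * fp / RATED_HAM


# A few (fn, fp) pairs copied from the tables above.
cells = {
    "full training": (210, 16),
    "train-on-error, sc=0.5 hc=0.5": (207, 78),
    "train-on-error, sc=0.8 hc=0.1": (93, 10),
    "to-exhaustion, sc=0.5 hc=0.5": (67, 39),
    "to-exhaustion, sc=0.8 hc=0.1": (71, 7),
}

for label, (fn, fp) in cells.items():
    fn_pct, fp_pct = error_rates(fn, fp)
    print(f"{label:32s} fn {fn_pct:5.2f}%  fp {fp_pct:5.2f}%")
```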

Results:

1) Security margins reduce the error rate remarkably for both
   training-on-error and training-to-exhaustion.

2) In general, bigger is better. At some point we seem to reach a kind
   of saturation where not much changes anymore, and results can even
   get slightly worse again.

3) As you would expect, if you increase the security margin on only
   one side, that side clearly benefits. The other side has to pay a
   price, though.

4) Clearly, security margins do increase database size. For a single
   train-on-error run this can be a factor of up to about 2.6 (1172k
   to 3064k), but compared with full training, the increase is still
   moderate even for huge margins. For training-to-exhaustion the
   relative size increase is even smaller.

5) Training-to-exhaustion performs significantly better than a single
   run of training-on-error. For huge margins the difference becomes
   smaller.

6) The number of runs needed to finish seems unrelated to the choice
   of the margins. It may be a surprise that bigger margins don't
   require more runs.

7) Training-to-exhaustion is not much more expensive (in database
   size) than training-on-error. In this experiment the extra size
   seems to be roughly constant (as a function of the margins) at
   about 400-500k (it will depend on the mail collection). So for
   small margins the database is about 40% bigger and for big margins
   only about 15%. For the number of messages used, the ratio is
   bigger. It is always below two and gets smaller for bigger margins.

8) Best performance in this test (I would not want to generalize this,
   but it should give a hint about values):
   a) training-on-error:
      ham-cutoff = general-spam-cutoff minus 0.2 to 0.4
      spam-cutoff = general-spam-cutoff plus 0.3 to 0.4
      Explicitly: hc=0.1 and sc=0.8 or hc=0.3 and sc=0.9
   b) training-to-exhaustion:
      ham-cutoff = general-spam-cutoff minus 0.3 to 0.4
      spam-cutoff = general-spam-cutoff plus 0.3
      Explicitly: hc=0.1 and sc=0.8 or hc=0.2 and sc=0.8
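
The size comparison in point 7 can be checked with a few lines of
Python. The sketch below only covers the cells with symmetric margins
(ham_cutoff and spam_cutoff equally far from 0.5); the sizes are
copied from the two tables, and the dictionary layout and helper name
are just for illustration.

```python
# (spam_cutoff, ham_cutoff): (train-on-error size, to-exhaustion size), in k
SIZES_K = {
    (0.5, 0.5): (1172, 1540),
    (0.6, 0.4): (2420, 2796),
    (0.7, 0.3): (2696, 3068),
    (0.8, 0.2): (2928, 3300),
    (0.9, 0.1): (3028, 3488),
}


def overhead(toe_k, tte_k):
    """Return (extra size in k, extra size in % of the train-on-error size)."""
    extra = tte_k - toe_k
    return extra, 100.0 * extra / toe_k


for (sc, hc), (toe_k, tte_k) in SIZES_K.items():
    extra, pct = overhead(toe_k, tte_k)
    print(f"sc={sc} hc={hc}: +{extra}k ({pct:.1f}%)")
```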

pi