reg/unreg testing

David Relson relson at osagesoftware.com
Thu Sep 2 03:22:13 CEST 2004


Greetings,

After my recent message about using a unique message tag to prevent
duplicate message registration and invalid message unregistration, Greg
suggested an experiment:

Deliberately perform bad registrations and unregistrations, to create a
wordlist that's more b0rked than is likely to happen accidentally.  Then
score a bunch of messages to measure the effect.

Here's what he actually wrote:

# The concept could be tested: take a wordlist with tokens from 20,000
# or so messages, make two copies.  To one copy add a hundred randomly
# chosen messages three times each.  To the other, unregister a hundred
# randomly chosen messages three times each.  This is far more extreme
# corruption than should ever be encountered in practice.  Then measure
# fp and fn on five thousand more messages (all these numbers should be
# half spam and half nonspam) using each of the three wordlists, and see
# if the difference is significant.  I'd guess the multiple
# unregistrations might hurt a bit but the multiple registrations will
# be hardly noticeable.
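
To make the procedure concrete, here is a minimal Python sketch of the
corruption step on a toy wordlist.  It is not bogofilter's code and not
the attached script: the names register, unregister and corrupt_copies
are hypothetical, the wordlist is just a token counter, and clamping
counts at zero on unregistration is an assumption made for illustration.

```python
import random
from collections import Counter

def register(wordlist, tokens, times=1):
    """Count each distinct token of one message, repeated `times` times."""
    for _ in range(times):
        wordlist.update(set(tokens))

def unregister(wordlist, tokens, times=1):
    """Remove one message's tokens `times` times (assumption: clamp at zero)."""
    for _ in range(times):
        for tok in set(tokens):
            wordlist[tok] = max(0, wordlist[tok] - 1)

def corrupt_copies(messages, sample_size, times, seed=0):
    """Build a clean wordlist plus over-registered and over-unregistered copies."""
    rng = random.Random(seed)
    base = Counter()
    for msg in messages:
        register(base, msg)
    chosen = rng.sample(messages, sample_size)  # randomly chosen messages
    reg, unreg = Counter(base), Counter(base)   # two independent copies
    for msg in chosen:
        register(reg, msg, times)
        unregister(unreg, msg, times)
    return base, reg, unreg

# Toy corpus: each message is already split into tokens.
messages = [["cheap", "pills"], ["meeting", "agenda"], ["cheap", "flights"]]
base, reg, unreg = corrupt_copies(messages, sample_size=1, times=3)
```

Scoring the same test messages against base, reg and unreg then gives
the three rows to compare.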

The attached script, test.corruption.sh, implements such a test.  As the
script uses a random subset of my 2004 messages for each run, I've run
it three times (to test for variation based on different message
subsets).  The three runs are labeled 0901.1, 0901.2, and 0901.3.  The
1x, 2x, and 3x subsections indicate how many times the ham/spam messages
were registered/unregistered.

The rows in each group are labeled:
  orig - wordlist without changes
  reg  - wordlist with additional registrations
  unreg- wordlist after unregistering

The columns in each group are labeled:
  CNT - number of ham and spam messages in wordlist
  HH  - ham messages scoring as ham
  FP  - ham messages scoring as spam
  FN  - spam messages scoring as ham
  SS  - spam messages scoring as spam
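
As an illustration of how these four columns relate, here is a small
Python sketch that tallies them from (actual, predicted) pairs.  The
function name tally and the input format are assumptions made for this
example; the attached script may compute the counts differently.

```python
from collections import Counter

def tally(results):
    """Count outcomes from (actual, predicted) pairs of 'ham'/'spam'."""
    counts = Counter(results)
    return {
        "HH": counts[("ham", "ham")],    # ham scored as ham
        "FP": counts[("ham", "spam")],   # ham scored as spam
        "FN": counts[("spam", "ham")],   # spam scored as ham
        "SS": counts[("spam", "spam")],  # spam scored as spam
    }

sample = [("ham", "ham"), ("ham", "ham"), ("spam", "ham"), ("spam", "spam")]
print(tally(sample))  # {'HH': 2, 'FP': 0, 'FN': 1, 'SS': 1}
```

In every row, HH + FP is the number of ham test messages and FN + SS is
the number of spam test messages.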

Here are the results of the 3 runs:

######## 0901.1 ########

#### 1x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  406  4594
reg   10050  5000   0  405  4595
unreg  9950  5000   0  406  4594

#### 2x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  406  4594
reg   10100  5000   0  405  4595
unreg  9900  5000   0  408  4592

#### 3x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  406  4594
reg   10150  5000   0  404  4596
unreg  9850  5000   0  411  4589

######## 0901.2 ########

#### 1x ####
        CNT   HH   FP   FN   SS
orig  10000  4999   1  390  4610
reg   10050  4999   1  387  4613
unreg  9950  4999   1  393  4607

#### 2x ####
        CNT   HH   FP   FN   SS
orig  10000  4999   1  390  4610
reg   10100  4999   1  388  4612
unreg  9900  4999   1  391  4609

#### 3x ####
        CNT   HH   FP   FN   SS
orig  10000  4999   1  390  4610
reg   10150  4999   1  389  4611
unreg  9850  4999   1  393  4607

######## 0901.3 ########

#### 1x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  364  4636
reg   10050  5000   0  360  4640
unreg  9950  5000   0  364  4636

#### 2x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  364  4636
reg   10100  5000   0  359  4641
unreg  9900  5000   0  370  4630

#### 3x ####
        CNT   HH   FP   FN   SS
orig  10000  5000   0  364  4636
reg   10150  5000   0  359  4641
unreg  9850  5000   0  377  4623

As you can see from the above, registering/unregistering didn't affect
the ham scoring.  The effect on scoring spam messages varied, but wasn't
large.  The worst case was run 0901.3 at 3x, where "reg" lowered FN by 5
and "unreg" raised it by 13.  With 5000 spam messages in the sample,
these changes are fractions of a percent of the messages scored.
Relative to orig's 364 FN, the changes (5 and 13) are roughly 1.4% and
3.6% of the FN count.
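
The percentages can be rechecked from the 0901.3 3x rows.  The sketch
below uses only numbers from the table above; the function name
fn_change is made up for this example.

```python
ORIG_FN = 364      # orig row, run 0901.3
SPAM_TOTAL = 5000  # FN + SS in every row of that run

def fn_change(fn, orig_fn=ORIG_FN, total=SPAM_TOTAL):
    """Return (delta, percent of all spam scored, percent of orig FN)."""
    delta = fn - orig_fn
    return delta, 100.0 * delta / total, 100.0 * delta / orig_fn

for label, fn in (("reg", 359), ("unreg", 377)):
    delta, pct_total, pct_fn = fn_change(fn)
    print(f"{label:5s} delta={delta:+d}  {pct_total:+.2f}% of spam  {pct_fn:+.1f}% of orig FN")
# reg   delta=-5  -0.10% of spam  -1.4% of orig FN
# unreg delta=+13  +0.26% of spam  +3.6% of orig FN
```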

Looking at these results, it's reasonable to say that protection against
bad registrations and unregistrations matters a little, but isn't
critical.

Regards,

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.corruption.sh
Type: application/x-sh
Size: 3870 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040901/b60512dc/attachment.sh>