reg/unreg testing
David Relson
relson at osagesoftware.com
Thu Sep 2 03:22:13 CEST 2004
Greetings,
After my recent message about using a unique message tag to prevent
duplicate message registration and invalid message unregistration, Greg
suggested an experiment:
Deliberately perform bad registrations and unregistrations, to create a
wordlist that's more b0rked than is likely to happen accidentally. Then
score a bunch of messages to measure the effect.
Here's what he actually wrote:
# The concept could be tested: take a wordlist with tokens from 20,000 or
# so messages, make two copies. To one copy add a hundred randomly
# chosen messages three times each. To the other, unregister a hundred
# randomly chosen messages three times each. This is far more extreme
# corruption than should ever be encountered in practice. Then measure
# fp and fn on five thousand more messages (all these numbers should be
# half spam and half nonspam) using each of the three wordlists, and see
# if the difference is significant. I'd guess the multiple
# unregistrations might hurt a bit but the multiple registrations will be
# hardly noticeable.
The attached script, test.corruption.sh, implements such a test. As the
script uses a random subset of my 2004 messages for each run, I've run
it three times (to test for variation based on different message
subsets). The three runs are labeled 0901.1, 0901.2, and 0901.3. The
1x, 2x, and 3x subsections indicate how many times the ham/spam messages
were registered/unregistered.
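For readers without the attachment, the corruption step can be sketched roughly as follows. This is a toy model in Python, not the contents of test.corruption.sh: the wordlist is a plain token counter, and the names register, unregister, and corrupt are hypothetical. Counting each token once per message and dropping tokens whose count reaches zero are both assumptions of the sketch.

```python
import random
from collections import Counter

def register(wordlist, tokens):
    """Add one message's tokens to the wordlist (once per message, an assumption)."""
    wordlist.update(set(tokens))

def unregister(wordlist, tokens):
    """Remove one message's tokens; counts never go below zero (an assumption)."""
    for tok in set(tokens):
        if wordlist[tok] > 1:
            wordlist[tok] -= 1
        else:
            # Count is 0 or 1: drop the token instead of going negative.
            wordlist.pop(tok, None)

def corrupt(wordlist, messages, sample_size, times, op, seed=0):
    """Return a copy of wordlist with op applied `times` times to a random sample."""
    rng = random.Random(seed)
    damaged = Counter(wordlist)  # leave the original wordlist untouched
    for msg in rng.sample(messages, sample_size):
        for _ in range(times):
            op(damaged, msg)
    return damaged
```

Calling corrupt(orig, messages, 50, 3, register) and corrupt(orig, messages, 50, 3, unregister) would give toy counterparts of the "reg" and "unreg" wordlists at 3x.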
The rows in each group are labeled:
orig  - wordlist without changes
reg   - wordlist with additional registrations
unreg - wordlist after unregistering
The columns in each group are labeled:
CNT - number of ham and spam messages in wordlist
HH - ham messages scoring as ham
FP - ham messages scoring as spam
FN - spam messages scoring as ham
SS - spam messages scoring as spam
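As a reading aid, the columns can be turned into error rates. The snippet below is only a sketch: Row and rates are hypothetical names, and the sample values are the "orig" row of run 0901.1 from the results in this message.

```python
from typing import NamedTuple

class Row(NamedTuple):
    cnt: int  # CNT: messages in the wordlist
    hh: int   # ham scored as ham
    fp: int   # ham scored as spam
    fn: int   # spam scored as ham
    ss: int   # spam scored as spam

def rates(row):
    """Return (false-positive rate, false-negative rate) as fractions."""
    ham_total = row.hh + row.fp    # all ham test messages
    spam_total = row.fn + row.ss   # all spam test messages
    return row.fp / ham_total, row.fn / spam_total

orig = Row(cnt=10000, hh=5000, fp=0, fn=406, ss=4594)
fp_rate, fn_rate = rates(orig)
print(f"FP {fp_rate:.2%}  FN {fn_rate:.2%}")  # prints "FP 0.00%  FN 8.12%"
```

Note that HH + FP and FN + SS each add up to 5000 in every row, i.e. the test set is 5000 ham plus 5000 spam.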
Here are the results of the 3 runs:
######## 0901.1 ########
#### 1x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  406  4594
reg    10050  5000   0  405  4595
unreg   9950  5000   0  406  4594
#### 2x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  406  4594
reg    10100  5000   0  405  4595
unreg   9900  5000   0  408  4592
#### 3x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  406  4594
reg    10150  5000   0  404  4596
unreg   9850  5000   0  411  4589
######## 0901.2 ########
#### 1x ####
         CNT    HH  FP   FN    SS
orig   10000  4999   1  390  4610
reg    10050  4999   1  387  4613
unreg   9950  4999   1  393  4607
#### 2x ####
         CNT    HH  FP   FN    SS
orig   10000  4999   1  390  4610
reg    10100  4999   1  388  4612
unreg   9900  4999   1  391  4609
#### 3x ####
         CNT    HH  FP   FN    SS
orig   10000  4999   1  390  4610
reg    10150  4999   1  389  4611
unreg   9850  4999   1  393  4607
######## 0901.3 ########
#### 1x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  364  4636
reg    10050  5000   0  360  4640
unreg   9950  5000   0  364  4636
#### 2x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  364  4636
reg    10100  5000   0  359  4641
unreg   9900  5000   0  370  4630
#### 3x ####
         CNT    HH  FP   FN    SS
orig   10000  5000   0  364  4636
reg    10150  5000   0  359  4641
unreg   9850  5000   0  377  4623
As you can see from the above, registering/unregistering didn't affect
the ham scoring at all. The effect on scoring spam messages varied, but
wasn't large. The worst case was test 0901.3 at 3x, where "reg" lowered
FN by 5 and "unreg" raised it by 13. With 5000 spam messages in the
sample, these numbers are fractions of a percent (0.1% and 0.26%).
Compared to orig's 364 FN, they represent changes of approximately 1.4%
and 3.6% in FNs.
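The arithmetic for the worst case can be checked with a few lines of Python. The helper name fn_change is hypothetical; the inputs are the 3x rows of run 0901.3, and the spam sample size of 5000 is taken from FN + SS in those rows.

```python
def fn_change(orig_fn, new_fn, spam_total=5000):
    """Return (absolute delta, share of spam sample, change relative to orig FN)."""
    delta = new_fn - orig_fn
    return delta, delta / spam_total, delta / orig_fn

# Run 0901.3 at 3x: orig FN = 364, reg FN = 359, unreg FN = 377
for label, new_fn in (("reg", 359), ("unreg", 377)):
    delta, of_sample, relative = fn_change(364, new_fn)
    print(f"{label:5s} delta={delta:+d}  sample={of_sample:+.2%}  relative={relative:+.1%}")
```

This works out to about -0.10% of the spam sample (-1.4% relative to orig) for "reg", and about +0.26% (+3.6% relative) for "unreg".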
Looking at these results, it's reasonable to say that protection against
bad registrations and unregistrations does matter -- a little -- but
isn't critical.
Regards,
David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.corruption.sh
Type: application/x-sh
Size: 3870 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040901/b60512dc/attachment.sh>