partial wordlist test results
David Relson
relson at osagesoftware.com
Sat Feb 5 22:40:10 CET 2005
Greetings,
I've run my partial wordlist tests -- tests to see how scoring is
affected when part of the wordlist is removed through maintenance or
when the wordlist is b0rked and only part of it can be recovered.
From my original wordlist (dating back to Oct 2002 and containing
1,491,699 tokens), I created the following wordlists:
All - using complete, original wordlist
1k - using first 1,000 tokens of wordlist
10k - using first 10,000 tokens of wordlist
100k - using first 100,000 tokens of wordlist
1000k - using first 1,000,000 tokens of wordlist
25pct - using first 25% of wordlist tokens
50pct - using first 50% of wordlist tokens
75pct - using first 75% of wordlist tokens
2y - discarding tokens 2 yrs old (or older)
1y - discarding tokens 1 yr old (or older)
hap - discarding hapaxes (tokens with
ham+spam count equal to 1)
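For anyone wanting to repeat this, the subsets above can be sketched roughly as follows. This is only an illustration, not the procedure actually used: it assumes the wordlist has already been read into a list in its stored order, and the Record type, its field names, and the function names are all made up for the example.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One wordlist entry (hypothetical layout for this sketch)."""
    token: str
    spam: int   # spam count
    ham: int    # ham count
    date: int   # last-seen date as YYYYMMDD


def first_n(records: List[Record], n: int) -> List[Record]:
    # 1k, 10k, 100k, 1000k: keep the first n tokens.
    return records[:n]


def first_fraction(records: list, fraction: float) -> list:
    # 25pct, 50pct, 75pct: keep the leading fraction, rounding down.
    return records[: int(len(records) * fraction)]


def drop_hapaxes(records: List[Record]) -> List[Record]:
    # hap: discard tokens whose ham+spam count equals 1.
    return [r for r in records if r.spam + r.ham != 1]


def drop_older_than(records: List[Record], cutoff: int) -> List[Record]:
    # 1y, 2y: discard tokens dated on or before the cutoff (YYYYMMDD).
    return [r for r in records if r.date > cutoff]
```

Rounding down in first_fraction is a guess, but it does reproduce the 25pct token count in the results (1,491,699 * 0.25 gives 372,924).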
I then scored 2000 ham and 2989 spam (from a bogotune test corpus I
already had) and got these results:
Counts:
         Tokens    HH    HU    HS    SH    SU    SS
All     1491699  1990     0    10     1     0  2988
1k         1000     0  2000     0     0  2989     0
10k       10000     2  1913    85    21  2508   460
100k     100000  1363   588    49    40   825  2124
1000k   1000000  1988     2    10     6     8  2975
25pct    372924  1826   155    19    17   159  2813
50pct    745849  1987     3    10     7     9  2973
75pct   1118774  1987     3    10     1     4  2984
2y      1331912  1990     0    10     1     0  2988
1y       341461  1988     1    11     1     3  2985
hap      464932  1988     2    10     3     4  2982
Percents:
         Tokens     HH      HU     HS     SH      SU     SS
All     1491699  99.50    0.00   0.50   0.03    0.00  99.97
1k         1000   0.00  100.00   0.00   0.00  100.00   0.00
10k       10000   0.10   95.65   4.25   0.70   83.91  15.39
100k     100000  68.15   29.40   2.45   1.34   27.60  71.06
1000k   1000000  99.40    0.10   0.50   0.20    0.27  99.53
25pct    372924  91.30    7.75   0.95   0.57    5.32  94.11
50pct    745849  99.35    0.15   0.50   0.23    0.30  99.46
75pct   1118774  99.35    0.15   0.50   0.03    0.13  99.83
2y      1331912  99.50    0.00   0.50   0.03    0.00  99.97
1y       341461  99.40    0.05   0.55   0.03    0.10  99.87
hap      464932  99.40    0.10   0.50   0.10    0.13  99.77
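The percents are just each count divided by the total for its class (ham or spam). A small sketch of the arithmetic, using the 100k row as input; the function name is invented for the example:

```python
def to_percents(counts):
    """Convert one class's three counts (e.g. HH, HU, HS) to percentages."""
    total = sum(counts)
    return [round(100.0 * c / total, 2) for c in counts]


# 100k row from the Counts table
ham_pct = to_percents([1363, 588, 49])     # HH, HU, HS
spam_pct = to_percents([40, 825, 2124])    # SH, SU, SS

print(ham_pct)
print(spam_pct)
```

The ham counts in every row sum to 2000 and the spam counts to 2989, so those are the divisors throughout.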
Conclusion:
With my site's wordlist and scoring parameters, deleting old
tokens, trimming up to 50% of the wordlist, or removing hapaxes
leaves the scoring results virtually unchanged.
Given a b0rked wordlist from which half (or better) can be
recovered, using the recovered tokens is nearly as good as using
the un-b0rked wordlist.
Additional Info:
Notes:
Each column title gives a message's proper classification followed by
its test classification, i.e.
HH indicates ham scoring as Ham,
HU indicates ham scoring as Unsure,
...
The poor results of the 1k and 10k tests are caused by:
1) first 1,880 wordlist tokens are money amounts
2) next 14,578 wordlist tokens are Asian spam tokens
(processed with replace_nonascii_characters=Y)
Together these make up the first 16,458 tokens, so the 1k and 10k
wordlists contain nothing else.
scoring parameters used:
robs=0.0100
min_dev=0.090
robx=0.549006
sp_esf=0.487139
ns_esf=0.421875
ham_cutoff=0.20
spam_cutoff=0.65
block_on_subnets=yes
replace_nonascii_characters=Y
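
To make the column titles concrete, here is a rough sketch of how a score maps to Ham/Unsure/Spam with the two cutoffs above. It assumes a message counts as ham at or below ham_cutoff and as spam at or above spam_cutoff, with everything in between unsure; the function names are invented for the example.

```python
HAM_CUTOFF = 0.20    # ham_cutoff from the parameters above
SPAM_CUTOFF = 0.65   # spam_cutoff from the parameters above


def classify(score: float) -> str:
    """Return 'H', 'U' or 'S' for a spamicity score in [0, 1]."""
    # Assumption: both cutoffs are inclusive.
    if score >= SPAM_CUTOFF:
        return "S"
    if score <= HAM_CUTOFF:
        return "H"
    return "U"


def column(true_class: str, score: float) -> str:
    """Build a table column title such as 'HU' (ham scoring as Unsure)."""
    return true_class + classify(score)


print(column("H", 0.05), column("H", 0.40), column("S", 0.99))
```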