partial wordlist test results

David Relson relson at osagesoftware.com
Sat Feb 5 22:40:10 CET 2005


Greetings,

I've run my partial wordlist tests -- tests to see how scoring is
affected when part of the wordlist is removed through maintenance or
when the wordlist is b0rked and only part of it can be recovered.

From my original wordlist (dating back to Oct 2002 and containing
1,491,699 tokens), I created the following wordlists:

    All    - using complete, original wordlist
    1k     - using first 1,000 tokens of wordlist
    10k    - using first 10,000 tokens of wordlist
    100k   - using first 100,000 tokens of wordlist
    1000k  - using first 1,000,000 tokens of wordlist
    25pct  - using first 25% of wordlist tokens
    50pct  - using first 50% of wordlist tokens
    75pct  - using first 75% of wordlist tokens
    2yr    - discarding tokens 2 yrs old (or older)
    1yr    - discarding tokens 1 yr  old (or older)
    hap    - discarding hapaxes (tokens with 
	     ham+spam count equal to 1)
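For anyone wanting to reproduce the cuts above, here's a rough Python
sketch.  It assumes the wordlist has been dumped to plain text (e.g. with
bogoutil -d), one token per line; the "token spam ham yyyymmdd" field
layout in the sample data is my illustration, not bogoutil's actual
dump format.

```python
# Sketch of the partial-wordlist cuts, assuming a text dump with
# one "token spam_count ham_count yyyymmdd" record per line.

def first_n(lines, n):
    """Nk cuts: keep the first n tokens of the dump."""
    return lines[:n]

def first_pct(lines, pct):
    """NNpct cuts: keep the first pct percent of tokens."""
    return lines[: len(lines) * pct // 100]

def drop_hapaxes(lines):
    """hap cut: discard tokens whose ham+spam count equals 1."""
    kept = []
    for line in lines:
        token, spam, ham, date = line.split()
        if int(spam) + int(ham) > 1:
            kept.append(line)
    return kept

dump = [
    "cheap 5 0 20041001",
    "meeting 0 3 20050101",
    "v1agra 1 0 20030101",   # hapax: ham+spam count == 1
]
print(len(first_pct(dump, 50)))   # -> 1
print(len(drop_hapaxes(dump)))    # -> 2
```

The age-based cuts (1yr/2yr) would filter on the date field the same
way, comparing against a cutoff date.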

I then scored 2000 ham and 2989 spam (from a bogotune test corpus I
already had) and got these results:

Counts:

	    Tokens      HH     HU     HS     SH     SU     SS
    All    1491699    1990      0     10      1      0   2988
    1k        1000       0   2000      0      0   2989      0
    10k      10000       2   1913     85     21   2508    460
    100k    100000    1363    588     49     40    825   2124
    1000k  1000000    1988      2     10      6      8   2975
    25pct   372924    1826    155     19     17    159   2813
    50pct   745849    1987      3     10      7      9   2973
    75pct  1118774    1987      3     10      1      4   2984
    2yr    1331912    1990      0     10      1      0   2988
    1yr     341461    1988      1     11      1      3   2985
    hap     464932    1988      2     10      3      4   2982

Percents:

	    Tokens      HH     HU     HS     SH     SU     SS
    All    1491699   99.50   0.00   0.50   0.03   0.00  99.97 
    1k        1000    0.00 100.00   0.00   0.00 100.00   0.00 
    10k      10000    0.10  95.65   4.25   0.70  83.91  15.39 
    100k    100000   68.15  29.40   2.45   1.34  27.60  71.06 
    1000k  1000000   99.40   0.10   0.50   0.20   0.27  99.53 
    25pct   372924   91.30   7.75   0.95   0.57   5.32  94.11 
    50pct   745849   99.35   0.15   0.50   0.23   0.30  99.46 
    75pct  1118774   99.35   0.15   0.50   0.03   0.13  99.83 
    2yr    1331912   99.50   0.00   0.50   0.03   0.00  99.97
    1yr     341461   99.40   0.05   0.55   0.03   0.10  99.87
    hap     464932   99.40   0.10   0.50   0.10   0.13  99.77 
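The percentage table follows directly from the counts: ham columns are
divided by the 2000 scored ham messages, spam columns by the 2989
scored spam (each spam row sums to 2989).  A quick check on the 25pct
row:

```python
# Convert one row of the counts table into the percents table.
# Ham columns (HH, HU, HS) divide by the ham total; spam columns
# (SH, SU, SS) divide by the spam total.

N_HAM, N_SPAM = 2000, 2989

def pct(count, total):
    return round(100.0 * count / total, 2)

# 25pct row of the counts table:
# HH=1826, HU=155, HS=19, SH=17, SU=159, SS=2813
row = [pct(c, N_HAM) for c in (1826, 155, 19)] + \
      [pct(c, N_SPAM) for c in (17, 159, 2813)]
print(row)   # -> [91.3, 7.75, 0.95, 0.57, 5.32, 94.11]
```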

Conclusion:

   With my site's wordlist and scoring parameters, deleting old
   tokens, trimming up to 50% of the wordlist, or removing hapaxes
   leaves the scoring results virtually unchanged.

   Given a b0rked wordlist from which half (or better) can be
   recovered, using the recovered tokens is nearly as good as using
   the un-b0rked wordlist.

Additional Info:

Notes:

    Column titles give the correct class and the assigned class of messages, e.g.
    HH indicates ham scoring as Ham,
    HU indicates ham scoring as Unsure,
    ...

    The poor results of the 1k and 10k tests are caused by:
        1) the first 1,880 wordlist tokens are money amounts
        2) the next 14,578 wordlist tokens are Asian spam tokens
           (processed with replace_nonascii_characters=Y)

scoring parameters used:
    robs=0.0100
    min_dev=0.090
    robx=0.549006
    sp_esf=0.487139
    ns_esf=0.421875
    ham_cutoff=0.20
    spam_cutoff=0.65
    block_on_subnets=yes
    replace_nonascii_characters=Y
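For reference, the ham_cutoff/spam_cutoff values above imply the
three-way Ham/Unsure/Spam split counted in the tables.  A minimal
sketch (the function name and the exact boundary handling -- whether a
score equal to a cutoff is inclusive -- are my assumptions, not
bogofilter's documented behavior):

```python
# Illustrative three-way split implied by ham_cutoff=0.20 and
# spam_cutoff=0.65; boundary semantics are assumed, not verified.

HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.65

def classify(score):
    if score < HAM_CUTOFF:
        return "Ham"
    if score >= SPAM_CUTOFF:
        return "Spam"
    return "Unsure"

print(classify(0.05))   # -> Ham
print(classify(0.40))   # -> Unsure
print(classify(0.90))   # -> Spam
```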


