testing partial wordlists

David Relson relson at osagesoftware.com
Sat Feb 5 19:51:43 CET 2005


Last week I did some counts of the tokens in my wordlist.  Of the 1.5M
tokens I have, approx 1/3 have timestamps more than 2 yrs old and
another 1/3 are more than 1 yr old.  I'm giving thought to removing some
(or all) of those oldies.
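
A count like the one above could be reproduced roughly as follows. This is only a sketch: it assumes each line of a text dump of the wordlist has the form "token spamcount hamcount YYYYMMDD", and the function name age_buckets and the 365/730 day thresholds are illustrative, not taken from bogofilter.

```python
from datetime import date

def age_buckets(dump_lines, today):
    """Count tokens by timestamp age: over 2 years, 1 to 2 years, under 1 year."""
    counts = {"over_2yr": 0, "1_to_2yr": 0, "under_1yr": 0}
    for line in dump_lines:
        fields = line.rsplit(None, 3)  # assumed layout: token spam ham YYYYMMDD
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        stamp = fields[3]
        try:
            seen = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
        except ValueError:
            continue  # skip tokens without a usable timestamp
        age_days = (today - seen).days
        if age_days > 730:
            counts["over_2yr"] += 1
        elif age_days > 365:
            counts["1_to_2yr"] += 1
        else:
            counts["under_1yr"] += 1
    return counts
```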

This morning an unrecoverable database error was reported and I
suggested making a new wordlist with as much as bogoutil can recover.
I've long thought that if, for example, a wordlist suddenly lost all
words beginning with A (or B, or C), there would be little net effect on
scoring.  Unfortunately I have no test results to verify if that's true
or not.

So, I've decided to run some tests.

I've got a test corpus of approx 2000 ham and 2000 spam I've used with
bogotune.  Using this test corpus, I can run a series of tests, with
each test scoring all the messages, then reporting ham scored as
ham/unsure/spam and spam scored as ham/unsure/spam.
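
The per-test report described above amounts to a small tally. A minimal sketch, assuming the scores for each message are already available as floats; the names classify and tally are hypothetical, and the cutoff values and the handling of scores exactly on a cutoff are placeholders rather than the settings used for these tests.

```python
from collections import Counter

def classify(score, ham_cutoff, spam_cutoff):
    """Map a spamicity score to 'ham', 'unsure' or 'spam'."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

def tally(ham_scores, spam_scores, ham_cutoff=0.45, spam_cutoff=0.99):
    """Count how the known ham and known spam messages were classified."""
    return {
        "ham": Counter(classify(s, ham_cutoff, spam_cutoff) for s in ham_scores),
        "spam": Counter(classify(s, ham_cutoff, spam_cutoff) for s in spam_scores),
    }
```

Running tally once per test wordlist gives one row of ham and spam counts per wordlist, which makes the runs easy to compare side by side.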

I'm going to test the following wordlists:

1 - full current wordlist
2 - wordlist less tokens more than 2 yrs old
3 - wordlist less tokens more than 1 yr old
4 - first 3/4 of wordlist (based on bogoutil output)
5 - first 2/4 ...
6 - first 1/4 ...
7 - first 1000 tokens ...
8 - first 10,000 tokens
9 - first 100,000 tokens ...
10 - wordlist with all hapaxes removed, i.e. without tokens having 
     ham & spam counts of 1,0 or 0,1
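
The reduced wordlists in the list above could be derived from a text dump along these lines. This is a sketch under assumptions: each dump line is taken to be "token spamcount hamcount YYYYMMDD", the helper names are hypothetical, and the cutoff date is whatever "1 yr" or "2 yrs" before the test date works out to.

```python
def parse(line):
    """Split an assumed 'token spam ham YYYYMMDD' dump line into its fields."""
    token, spam, ham, stamp = line.rsplit(None, 3)
    return token, int(spam), int(ham), stamp

def newer_than(lines, cutoff):
    """Keep tokens whose timestamp is on or after cutoff ('YYYYMMDD')."""
    # Eight-digit date strings compare correctly as plain strings.
    return [l for l in lines if parse(l)[3] >= cutoff]

def first_n(lines, n):
    """Keep the first n tokens in dump order."""
    return lines[:n]

def first_fraction(lines, num, den):
    """Keep the first num/den of the tokens in dump order."""
    return lines[: len(lines) * num // den]

def without_hapaxes(lines):
    """Drop tokens whose (spam, ham) counts are (1, 0) or (0, 1)."""
    keep = []
    for l in lines:
        _, spam, ham, _ = parse(l)
        if (spam, ham) not in {(1, 0), (0, 1)}:
            keep.append(l)
    return keep
```

Each function takes and returns a list of dump lines, so the filtered lines can be written to a file and used to build a separate test wordlist.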

I'll report on the results when I have them!


