testing partial wordlists
relson at osagesoftware.com
Sat Feb 5 13:51:43 EST 2005
Last week I did some counts of the tokens in my wordlist. Of the 1.5M
tokens I have, approx 1/3 have timestamps more than 2 yrs old and
another 1/3 are more than 1 yr old. I'm giving thought to removing some
(or all) of those oldies.
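
For anyone wanting to do a similar count, here is a rough sketch in Python. It assumes the wordlist has been dumped to text with `bogoutil -d` and that each line reads `token spamcount hamcount yyyymmdd`; check that against your own dump before trusting the numbers. The function names (`parse_dump_line`, `age_buckets`) are made up for this illustration and are not part of bogofilter.

```python
from datetime import date

def parse_dump_line(line):
    """Split one dump line into (token, spam_count, ham_count, yyyymmdd).

    Assumed layout: token spamcount hamcount yyyymmdd
    """
    token, spam, ham, stamp = line.rsplit(None, 3)
    return token, int(spam), int(ham), int(stamp)

def age_buckets(lines, today):
    """Count tokens by timestamp age relative to `today` (a datetime.date)."""
    # Cutoffs as yyyymmdd integers so they compare directly with the stamps.
    one_year = (today.year - 1) * 10000 + today.month * 100 + today.day
    two_years = (today.year - 2) * 10000 + today.month * 100 + today.day
    counts = {"over_2_years": 0, "1_to_2_years": 0, "under_1_year": 0}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, _, stamp = parse_dump_line(line)
        if stamp < two_years:
            counts["over_2_years"] += 1
        elif stamp < one_year:
            counts["1_to_2_years"] += 1
        else:
            counts["under_1_year"] += 1
    return counts
```

Feeding it the lines of a dump file together with the current date gives the three bucket sizes directly.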
This morning an unrecoverable database error was reported and I
suggested making a new wordlist with as much as bogoutil can recover.
I've long thought that if, for example, a wordlist suddenly lost all
words beginning with A (or B, or C), there would be little net effect on
scoring. Unfortunately I have no test results to verify whether that's true.
So, I've decided to run some tests.
I've got a test corpus of approx 2000 ham and 2000 spam I've used with
bogotune. Using this test corpus, I can run a series of tests, with
each test scoring all the messages, then reporting ham scored as
ham/unsure/spam and spam scored as ham/unsure/spam.
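
The per-test report can be sketched as a small tally, shown here in Python. This is only an illustration: it assumes each message has already been scored, that a score at or above the spam cutoff counts as spam and a score at or below the ham cutoff counts as ham, and the default cutoff values are placeholders rather than anyone's actual configuration. The names `classify` and `tally` are hypothetical.

```python
from collections import Counter

def classify(score, ham_cutoff, spam_cutoff):
    """Map a score to 'ham', 'unsure' or 'spam' (assumed cutoff semantics)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

def tally(scored, ham_cutoff=0.45, spam_cutoff=0.99):
    """Tally (true_class, score) pairs into ham/unsure/spam counts.

    `scored` is an iterable of ("ham" | "spam", score) pairs.
    The cutoff defaults are placeholder values for the example.
    """
    table = {"ham": Counter(), "spam": Counter()}
    for truth, score in scored:
        table[truth][classify(score, ham_cutoff, spam_cutoff)] += 1
    return table
```

Running `tally` once per test wordlist yields one small table per wordlist, which makes the runs easy to compare side by side.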
I'm going to run the tests with the following wordlists:
1 - full current wordlist
2 - wordlist less tokens more than 2 yrs old
3 - wordlist less tokens more than 1 yr old
4 - first 3/4 of wordlist (based on bogoutil output)
5 - first 2/4 ...
6 - first 1/4 ...
7 - first 1000 tokens ...
8 - first 10,000 tokens ...
9 - first 100,000 tokens ...
10 - wordlist with all hapaxes removed, i.e. without tokens having
ham & spam counts of 1,0 or 0,1
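
One way the reduced wordlists above could be produced is by filtering a text dump of the full wordlist. The Python sketch below assumes dump lines of the form `token spamcount hamcount yyyymmdd` (as `bogoutil -d` is believed to produce) held in a list in dump order; all function names are hypothetical helpers for this example.

```python
def parse(line):
    """Split a dump line into (token, spam_count, ham_count, yyyymmdd)."""
    token, spam, ham, stamp = line.rsplit(None, 3)
    return token, int(spam), int(ham), int(stamp)

def drop_older_than(lines, cutoff):
    """Keep tokens whose timestamp is on or after `cutoff` (yyyymmdd int)."""
    return [line for line in lines if parse(line)[3] >= cutoff]

def first_fraction(lines, numerator, denominator):
    """Keep the first numerator/denominator of the dump, in dump order."""
    return lines[: len(lines) * numerator // denominator]

def first_n(lines, n):
    """Keep the first n tokens of the dump."""
    return lines[:n]

def drop_hapaxes(lines):
    """Remove tokens whose (spam, ham) counts are exactly (1, 0) or (0, 1)."""
    kept = []
    for line in lines:
        _, spam, ham, _ = parse(line)
        if (spam, ham) not in ((1, 0), (0, 1)):
            kept.append(line)
    return kept
```

Each filtered list can be written back out one line per token and loaded into a fresh wordlist (for example with `bogoutil -l`) before scoring the corpus against it.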
I'll report on the results when I have them!