training set growth
David Relson
relson at osagesoftware.com
Wed Nov 20 21:45:05 CET 2002
Prompted by the suggestion to delete tokens with counts of 0 or 1, I
took a look at my wordlists. It appears that I could cut my storage by
60% or so by deleting those tokens. Below are the counts:
               all      >= 1      >= 2
goodlist   344,860   323,626   141,575
spamlist   123,678    78,271    47,228
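
The saving implied by these counts can be checked with a short
calculation. This is only a sketch: it measures the reduction in the
number of tokens, which is assumed here to be a rough proxy for storage,
and the names `counts` and `reduction` are made up for illustration.

```python
# Token counts taken from the table above.
counts = {
    "goodlist": {"all": 344_860, "ge1": 323_626, "ge2": 141_575},
    "spamlist": {"all": 123_678, "ge1": 78_271, "ge2": 47_228},
}


def reduction(all_tokens, kept):
    """Percentage of tokens removed when only `kept` tokens remain."""
    return 100.0 * (all_tokens - kept) / all_tokens


# Keeping only tokens with count >= 2 drops those with counts of 0 or 1.
for name, c in counts.items():
    print(f"{name}: {reduction(c['all'], c['ge2']):.1f}% fewer tokens")

total_all = sum(c["all"] for c in counts.values())
total_ge2 = sum(c["ge2"] for c in counts.values())
print(f"combined: {reduction(total_all, total_ge2):.1f}% fewer tokens")
```

With the numbers above this prints 58.9% for goodlist, 61.8% for
spamlist, and 59.7% combined.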
An earlier test indicated that my spamlist contains over 20,000 Korean
tokens. Using an idea from spambayes, i.e. mapping unreadable characters
to '?', I determined that I could cut those 20,000 tokens to about 500.
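
The mapping idea can be sketched roughly as follows. This is not the
spambayes or bogofilter code; it assumes that "unreadable" means any
byte outside printable ASCII, and `mask_unreadable` and the sample
tokens are hypothetical. Tokens that differ only in their unreadable
bytes collapse to the same string, which is why the number of distinct
tokens shrinks.

```python
def mask_unreadable(token: bytes) -> bytes:
    """Replace every byte outside printable ASCII (0x20-0x7E) with '?'."""
    return bytes(b if 0x20 <= b < 0x7F else ord("?") for b in token)


# Hypothetical tokens: three made of high-bit bytes (as in a multi-byte
# Korean encoding) and one plain ASCII word.
tokens = [
    b"\xb0\xa1\xb3\xaa",
    b"\xb4\xd9\xb6\xf3",
    b"\xb8\xb6!",
    b"free",
]

# A set keeps only the distinct tokens that remain after masking.
masked = {mask_unreadable(t) for t in tokens}
print(len(tokens), "tokens before,", len(masked), "distinct after")
```

Here the first two tokens both become "????", so four tokens reduce to
three distinct ones. ASCII tokens such as "free" pass through unchanged.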