training set growth

David Relson relson at osagesoftware.com
Wed Nov 20 21:45:05 CET 2002


Given the suggestion of deleting tokens with counts of 0 or 1, I thought
I'd take a look at my wordlists.  It appears that I could cut my storage
by about 60% by deleting those tokens.  Below are the token counts, where
.ge.N means a count of at least N:

                  all      .ge.1    .ge.2
    goodlist    344,860  323,626  141,575
    spamlist    123,678   78,271   47,228
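
As a quick check of the savings, the counts in the table can be turned
into percentages.  This is only a sketch: it assumes storage scales
roughly with the number of tokens, and the helper name fraction_removed
is made up for illustration.

```python
# Token counts copied from the table above: (all, count >= 1, count >= 2).
counts = {
    "goodlist": (344_860, 323_626, 141_575),
    "spamlist": (123_678, 78_271, 47_228),
}


def fraction_removed(total, kept):
    """Fraction of tokens dropped when only `kept` of `total` remain."""
    return (total - kept) / total


# Per-list share of tokens whose count is 0 or 1 (everything not in .ge.2).
for name, (all_tokens, _ge1, ge2) in counts.items():
    share = fraction_removed(all_tokens, ge2)
    print(f"{name}: {share:.1%} of tokens have a count of 0 or 1")

# Combined share across both wordlists.
total_all = sum(c[0] for c in counts.values())
total_ge2 = sum(c[2] for c in counts.values())
overall = fraction_removed(total_all, total_ge2)
print(f"combined: {overall:.1%}")
```

With these numbers the share comes out to roughly 59% for the goodlist,
62% for the spamlist, and 60% for the two combined.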

An earlier test indicated that my spamlist contains over 20,000 Korean
tokens.  Using an idea from spambayes, i.e. mapping unreadable characters
to '?', I determined that I could cut those 20,000 tokens to about 500.
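
To illustrate why the mapping collapses so many tokens, here is a minimal
sketch.  It is not the spambayes or bogofilter code.  It assumes that
"unreadable" means any character outside printable ASCII, and it works
on decoded text rather than raw bytes.  The function name
mask_unreadable and the sample tokens are made up for the example.

```python
def mask_unreadable(token: str) -> str:
    """Replace every character outside printable ASCII with '?'.

    Assumption: "unreadable" means anything that is not a printable
    ASCII character (code points 0x20 through 0x7E).
    """
    return "".join(c if " " <= c <= "~" else "?" for c in token)


# Hypothetical sample: three distinct Korean words plus one ASCII word.
tokens = ["안녕하세요", "감사합니다", "사랑합니다", "viagra"]

# Distinct tokens remaining after masking; the three Korean words have
# the same length, so they all collapse to the same string of '?'.
masked = {mask_unreadable(t) for t in tokens}
print(sorted(masked))  # ['?????', 'viagra']
```

After masking, tokens differ only in their length and in any ASCII
characters they contain, which is why thousands of distinct tokens can
shrink to a few hundred.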





More information about the Bogofilter mailing list