older training sets

Greg Louis glouis at dynamicro.on.ca
Wed Nov 20 14:20:24 CET 2002


Now that bogofilter's been around a while, some training sets may be
beginning to show their age.  I suspect (though I haven't thought this
through in detail yet) that, with time, training sets fill up with old
tokens that take up space without contributing much, if anything, to
current classification.  It would probably be smart to apply some sort
of ceiling to training-set size, and start throwing out tokens when the
limit is reached.  My current training set (9360 spam and 7462 nonspam
this morning) has roughly 288,000 tokens in the spamlist and 106,000 in
the goodlist; it might make sense to cap them at 300,000 or so.

Ideally, one would throw away the least-recently-accessed token when
adding a new token to a "full" list.  That, however, would require a
database write for every token checked, which is clearly not
practical.  Throwing out a token with a low count wouldn't be the
wisest choice either; the lowest-count tokens could be the newest
ones.  Throwing out the oldest token in the list that has a p(w)
close to 0.5 sounds attractive, but those are likely to be words like
"you" that will promptly get added back.  Maybe we could throw away
the oldest token in the list with a count below some threshold, or
the lowest-count token among those older than some minimum age whose
p(w) deviates from 0.5 by less than some maximum.  That last choice
sounds good, except that there'd be a heck of an overhead involved in
finding the token to discard; a rough sketch follows.
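
To make that last policy concrete, here's a rough Python sketch.  It
assumes, hypothetically, that each token record carries a count, a
last-seen timestamp, and a stored p(w) -- bogofilter's wordlists keep
only counts today -- and the threshold values are invented for
illustration:

    MIN_AGE = 180 * 86400   # hypothetical: ~6 months, in seconds
    MAX_DEV = 0.1           # hypothetical: |p(w) - 0.5| below this
                            # counts as "uninformative"

    def pick_victim(tokens, now):
        # tokens maps token -> (count, last_seen, p_w); pick the
        # lowest-count token that is old and sits near p(w) = 0.5.
        candidates = [(count, tok)
                      for tok, (count, last_seen, p_w) in tokens.items()
                      if now - last_seen > MIN_AGE
                      and abs(p_w - 0.5) < MAX_DEV]
        return min(candidates)[1] if candidates else None

Note the full scan of the list on every deletion; that's exactly the
overhead complained about above.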

Or we could try Mother Nature's approach: kill a token at random. 
Sounds drastic, but in fact important tokens will quickly get
re-established, and less important ones are... well... less important.
That's a mechanism animal memory uses to eliminate corrupt and/or
disused information (reckless oversimplification here) and it's been
shown to work fairly well.  The nicest thing about it is not having to
worry about adding timestamps and other such overhead to the data
storage mechanism.  Less nice is that an unhelpful token like 10.0.1.1
(the spamlist has a lot of such, all with counts in the single digits)
has as good a chance as any of living on, in the short term at least.
(Animal memory has that problem too: I can still remember tunes from
a few TV commercial jingles of the fifties.)  :(
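
Here's a minimal sketch of the random approach in Python, with a
plain dict of token -> count standing in for bogofilter's Berkeley DB
wordlist (the function name and the dict are made up; the 300,000 cap
is the figure suggested above):

    import random

    def random_prune(wordlist, cap=300000):
        # Delete uniformly-chosen tokens until the list fits the cap.
        excess = len(wordlist) - cap
        if excess > 0:
            for victim in random.sample(list(wordlist), excess):
                del wordlist[victim]

No timestamps, no scans: anything important that gets killed will
simply be re-registered the next time it's seen.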

On the other hand, precisely because there are so many of those
unhelpful low-count tokens, random deletion has a significant
probability of hitting one.  Look at this histogram printout of my
lists (the right columns give the number of tokens with counts in the
ranges given on the left):

                  spam   nonspam
1:              220137     63244
2-10:            55322     30340
11-100:          10448      9854
101-1000:         2096      2300
1001-10000:        257       206
total tokens:   288260    105834

If I do random deletions on the spamlist, I'm going to hit a token with
a count of 1 more than 3 times out of 4, and one with a count under 11
more than 19 times out of 20.  The goodlist is a bit riskier, but still
not bad: 60% and 88% respectively.
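
The arithmetic is plain division on the histogram; in Python:

    spam_total, good_total = 288260, 105834
    print(220137 / spam_total)             # 0.764: count 1, spamlist
    print((220137 + 55322) / spam_total)   # 0.956: count < 11, spamlist
    print(63244 / good_total)              # 0.598: count 1, goodlist
    print((63244 + 30340) / good_total)    # 0.884: count < 11, goodlist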

(Mind you, it might be as productive just to purge all the tokens
with counts of 1 every so often ;-)
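
That variant is a one-liner against the same hypothetical dict of
token -> count:

    # Drop every singleton; everything else survives untouched.
    wordlist = {tok: n for tok, n in wordlist.items() if n > 1}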

Ideas or suggestions, anyone?
-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



