what happens if I discard tokens that occur only once?

David Relson relson at osagesoftware.com
Fri Jun 3 23:47:04 CEST 2005


On Fri, 3 Jun 2005 07:19:48 -0500
Bill McClain wrote:

> On Thu, 2 Jun 2005 20:47:58 -0700
> "Chris Fortune" <cfortune at telus.net> wrote:
> 
> > What's the effect of discarding tokens that occur only once?  My
> > assumption is that this would make the wordlist much lighter without
> > impacting the classifications much.  What are your findings?  Reasons?
> 
> I researched this last year without coming to any definitive
> conclusions. My method was to watch the rate of hapax (tokens with count
> = 1) decay over time. That is: how often do single count tokens become
> registered at least one more time? But this says nothing about how often
> the token is being read. It may have been registered only once but still
> be providing useful information in the calculation.
> 
> Just looking at the update side of things I saw that even very old (=
> many months) hapaxes sometimes become popular again. There is a "secret
> life of spam" with patterns that come and go over time, of which we
> are largely unaware.
> 
> So I decided not to delete old or single-count tokens. What I may do
> instead is simply recreate the wordlist every two years or so from my
> recent ham and spam archives. 
> 
> -Bill
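
The hapax decay measurement Bill describes could be sketched roughly as follows. This is only an illustration, not his actual method: it assumes two snapshots of the wordlist taken at different times, each reduced to a plain token -> count mapping, and the function name `hapax_decay_rate` is made up for this example.

```python
def hapax_decay_rate(earlier, later):
    """Fraction of hapaxes in `earlier` that were registered again by `later`.

    Both arguments are hypothetical token -> count snapshots of a wordlist;
    this is not bogofilter's real storage format.
    """
    hapaxes = [tok for tok, count in earlier.items() if count == 1]
    if not hapaxes:
        return 0.0
    # A hapax has "decayed" if its count has grown past 1 in the later snapshot.
    promoted = sum(1 for tok in hapaxes if later.get(tok, 0) > 1)
    return promoted / len(hapaxes)


earlier = {"alpha": 1, "beta": 1, "gamma": 5, "delta": 1, "eps": 1}
later = {"alpha": 3, "beta": 1, "gamma": 9, "delta": 2, "eps": 1}
rate = hapax_decay_rate(earlier, later)
print(rate)
```

As Bill notes, a number like this only covers the update side: it says nothing about how often a single-count token is looked up during classification.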

Bill,

Good report.  Thanks.

Hapax importance depends (in part) on how registration is handled.  If
_every_ message goes into the wordlist and there's a hapax that is
(say) timestamped 18 months ago then you _know_ the token hasn't
appeared more recently.  (If it had appeared more recently, the
timestamp would be more recent.)  In that case, discarding tokens with
count==1 and date<today-18m has a well-defined meaning: each one was
seen exactly once and never again in 18 months.  If only some messages
get registered, the timestamp gives no such information about the
hapax.
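
A minimal sketch of the count==1 and date<today-18m rule, assuming every message is registered so the timestamp is trustworthy. The token -> (count, last_seen) layout, the single combined count, and the name `prune_stale_hapaxes` are assumptions made for illustration; none of this is bogofilter's actual format or API.

```python
from datetime import date, timedelta


def prune_stale_hapaxes(wordlist, today, max_age_days=548):
    """Return a copy of `wordlist` without hapaxes older than the cutoff.

    `wordlist` maps token -> (count, last_seen_date). This layout is a
    hypothetical stand-in. 548 days approximates 18 months.
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in wordlist.items()
        # Drop only tokens seen exactly once AND not seen since the cutoff.
        if not (count == 1 and last_seen < cutoff)
    }


wordlist = {
    "viagra": (40, date(2005, 5, 30)),   # frequent token: kept
    "xqzzy": (1, date(2003, 11, 1)),     # old hapax: dropped
    "newword": (1, date(2005, 6, 1)),    # recent hapax: kept
}
pruned = prune_stale_hapaxes(wordlist, today=date(2005, 6, 3))
print(sorted(pruned))
```

If only a subset of messages is registered, the same filter still runs, but an old timestamp no longer shows that the token has been absent from recent mail.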

Regards,

David



