what happens if I discard tokens that occur only once?

Bill McClain wmcclain at salamander.com
Fri Jun 3 14:19:48 CEST 2005


On Thu, 2 Jun 2005 20:47:58 -0700
"Chris Fortune" <cfortune at telus.net> wrote:

> What's the effect of discarding tokens that occur only once?  My
> assumption is that this would make the wordlist much lighter without
> impacting the classifications much.  What are your findings?  Reasons?

I researched this last year without coming to any definitive
conclusions. My method was to watch the rate at which hapaxes (tokens
with count = 1) decay over time. That is: how often do single-count
tokens get registered at least one more time? But this says nothing
about how often the token is being read. It may have been registered
only once but still be providing useful information in the calculation.
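
To make the measurement concrete, here is a minimal sketch of the idea,
not the script I actually used. It assumes you have periodic snapshots
of the wordlist, each reduced to a plain token -> count mapping; the
function name hapax_decay and the snapshot format are hypothetical.

```python
from typing import Dict, List


def hapax_decay(snapshots: List[Dict[str, int]]) -> List[float]:
    """For each snapshot, return the fraction of the FIRST snapshot's
    hapaxes (tokens with count == 1) that still have count == 1."""
    first = snapshots[0]
    hapaxes = {tok for tok, n in first.items() if n == 1}
    if not hapaxes:
        # Nothing to track; report zero for every snapshot.
        return [0.0 for _ in snapshots]
    fractions = []
    for snap in snapshots:
        # A token missing from a later snapshot is not counted as a hapax.
        still_single = sum(1 for tok in hapaxes if snap.get(tok, 0) == 1)
        fractions.append(still_single / len(hapaxes))
    return fractions


# Toy data: three snapshots taken at increasing ages.
snapshots = [
    {"a": 1, "b": 1, "c": 5, "d": 1, "e": 1},
    {"a": 2, "b": 1, "c": 6, "d": 1, "e": 1},
    {"a": 3, "b": 1, "c": 6, "d": 4, "e": 1},
]
print(hapax_decay(snapshots))
```

A curve that keeps falling even for old snapshots means old hapaxes are
still being re-registered. As noted, it only covers the update side and
says nothing about how often those tokens are read during scoring.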

Just looking at the update side of things, I saw that even very old (=
many months) hapaxes sometimes become popular again. There is a "secret
life of spam" with patterns that come and go over time, of which we
are largely unaware.

So I decided not to delete old or single-count tokens. What I may do
instead is simply recreate the wordlist every two years or so from my
recent ham and spam archives.
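
As a rough illustration of that rebuild, the sketch below recounts
tokens from archived messages newer than a cutoff date. It is only a
model of the idea: the whitespace tokenizer, the once-per-message
counting, the (date, text) message shape and the name rebuild_wordlist
are all assumptions, and none of it is bogofilter's own code or lexer.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical message shape: (date received, message text).
Message = Tuple[date, str]


def rebuild_wordlist(
    ham: Iterable[Message], spam: Iterable[Message], cutoff: date
) -> Dict[str, Counter]:
    """Recount tokens per class, ignoring messages older than cutoff."""
    counts = {"ham": Counter(), "spam": Counter()}
    for label, messages in (("ham", ham), ("spam", spam)):
        for received, text in messages:
            if received < cutoff:
                continue  # too old: left out of the new wordlist
            # Naive tokenizer (assumption): lowercase, split on whitespace.
            # set() counts each token at most once per message.
            counts[label].update(set(text.lower().split()))
    return counts


ham = [
    (date(2005, 1, 1), "hello world hello"),
    (date(2002, 1, 1), "old stuff"),
]
spam = [(date(2005, 2, 1), "buy now world")]
wordlist = rebuild_wordlist(ham, spam, cutoff=date(2003, 6, 1))
print(wordlist["ham"]["hello"], wordlist["spam"]["world"])
```

Tokens that appear only in messages older than the cutoff simply never
enter the new list, so stale entries age out without any explicit
pruning of hapaxes.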

-Bill
-- 
Sattre Press                                    In the Quarter
http://sattre-press.com/                 by Robert W. Chambers
info at sattre-press.com         http://sattre-press.com/itq.html


