what happens if I discard tokens that occur only once?
David Relson
relson at osagesoftware.com
Fri Jun 3 23:47:04 CEST 2005
On Fri, 3 Jun 2005 07:19:48 -0500
Bill McClain wrote:
> On Thu, 2 Jun 2005 20:47:58 -0700
> "Chris Fortune" <cfortune at telus.net> wrote:
>
> > What's the effect of discarding tokens that occur only once? My
> > assumption is that this would make the wordlist much lighter without
> > impacting the classifications much. What are your findings? Reasons?
>
> I researched this last year without coming to any definitive
> conclusions. My method was to watch the rate of hapax (tokens with count
> = 1) decay over time. That is: how often do single count tokens become
> registered at least one more time? But this says nothing about how often
> the token is being read. It may have been registered only once but still
> be providing useful information in the calculation.
>
> Just looking at the update side of things I saw that even very old (=
> many months) hapaxes sometimes become popular again. There is a "secret
> life of spam" with patterns that come and go over time, of which we
> are largely unaware.
>
> So I decided not to delete old or single-count tokens. What I may do
> instead is simply recreate the wordlist every two years or so from my
> recent ham and spam archives.
>
> -Bill
Bill,
Good report. Thanks.
Hapax importance depends (in part) on how registration is handled. If
_every_ message goes into the wordlist and there's a hapax that is
(say) timestamped 18 months ago, then you _know_ the token hasn't
appeared since. (If it had, the timestamp would be more recent.) In
that case, the meaning of discarding tokens with count==1 and
date<today-18m is clear: each one was seen exactly once and not at all
in the last 18 months. If only some messages get registered, then the
timestamp gives no additional info about the hapax, since the token
may have shown up in messages that never reached the wordlist.
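
To make the count==1 and date<today-18m rule concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory wordlist mapping each token to a (count, last-seen date) pair; this is not bogofilter's on-disk format or API, and the names prune_stale_hapaxes and max_age_days are made up for illustration. 548 days stands in for "18 months".

```python
from datetime import date, timedelta

# Hypothetical in-memory wordlist: token -> (count, last_seen).
# An illustration only, not bogofilter's actual storage format.

def prune_stale_hapaxes(wordlist, today, max_age_days):
    """Return a copy of wordlist without hapaxes last seen before the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in wordlist.items()
        # Drop only tokens that are BOTH single-count and older than the cutoff.
        if not (count == 1 and last_seen < cutoff)
    }

wordlist = {
    "viagra": (42, date(2005, 5, 30)),
    "xyzzy": (1, date(2003, 11, 2)),      # old hapax: dropped
    "osage": (1, date(2005, 6, 1)),       # recent hapax: kept
    "mortgage": (7, date(2003, 1, 15)),   # old, but count > 1: kept
}

pruned = prune_stale_hapaxes(wordlist, today=date(2005, 6, 3), max_age_days=548)
```

Note that the rule is only as trustworthy as the timestamps: it carries the meaning described above only when every message is registered.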
Regards,
David