Hapax decay (was: Re: the importance of robx)
Bill McClain
wmcclain at salamander.com
Sun Feb 29 15:26:39 CET 2004
On Sat, 28 Feb 2004 18:51:23 -0500
David Relson <relson at osagesoftware.com> wrote:
> Given that hapaxes
> are so numerous, one can conclude that many words never get beyond
> their hapax/robx value.
By the way, I've started collecting data on the rate of hapax decay.
(Sounds like particle physics, doesn't it?) This is with full (spam)
training.
It's probably useless knowledge, but someone here may find it
interesting. I was considering periodically purging my wordlist of
tokens older than D days which also have counts less than or equal to N,
say D = 90 days and N = 1. That raised two questions: how useful are
hapaxes, and are older ones less useful than newer ones?
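
The purge rule described above can be sketched as follows. This is only an
illustration under stated assumptions: the wordlist is modeled as a plain
in-memory dict mapping each token to a (count, last_seen) pair, and the
function and parameter names are hypothetical. It is not Bogofilter's own
storage format or maintenance code.

```python
from datetime import date, timedelta


def purge_stale_tokens(wordlist, today, max_age_days=90, max_count=1):
    """Return a copy of wordlist without old, low-count tokens.

    wordlist maps token -> (count, last_seen). This layout is a
    hypothetical stand-in for a real wordlist database.
    A token is dropped only if it is BOTH older than max_age_days (D)
    AND has a count less than or equal to max_count (N).
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in wordlist.items()
        if not (last_seen < cutoff and count <= max_count)
    }


today = date(2004, 2, 29)
wordlist = {
    "frequent": (57, date(2003, 10, 1)),   # old but common: kept
    "oldhapax": (1, date(2003, 10, 1)),    # old and count <= N: purged
    "newhapax": (1, date(2004, 2, 20)),    # recent hapax: kept
}
kept = purge_stale_tokens(wordlist, today)
print(sorted(kept))
```

With D = 90 and N = 1, only hapaxes that have sat untouched for more than
90 days are removed; recent hapaxes still get a chance to be seen again.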
It will be months before I have data to report. So far it looks like
recent hapaxes decay at about 0.6% per day, the rate diminishing as they
age.
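
For a rough sense of scale, the arithmetic below shows what a 0.6% daily
rate would imply over the 90-day window mentioned earlier. It assumes two
things not stated in the figures above: that "decay" means a token stops
being a hapax, and that the rate stays constant. Since the rate is
reported to diminish with age, the constant-rate result is best read as a
lower bound on how many hapaxes survive.

```python
def surviving_fraction(daily_decay, days):
    """Fraction of hapaxes still hapaxes after `days` days.

    Assumes a constant daily decay rate, which is a simplification:
    the observed rate is said to diminish as the hapaxes age.
    """
    return (1.0 - daily_decay) ** days


fraction_90 = surviving_fraction(0.006, 90)
print(f"{fraction_90:.1%} of hapaxes would remain after 90 days")
```

Under that simplification, somewhat more than half of the hapaxes would
still have a count of 1 at the point where a D = 90 purge would remove
them.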
-Bill
--
Sattre Press                   History of Astronomy
http://sattre-press.com/       During the 19th Century
info at sattre-press.com       by Agnes M. Clerke
                               http://sattre-press.com/han.html