Hapax decay (was: Re: the importance of robx)

Bill McClain wmcclain at salamander.com
Sun Feb 29 15:26:39 CET 2004


On Sat, 28 Feb 2004 18:51:23 -0500
David Relson <relson at osagesoftware.com> wrote:

> Given that hapaxes
> are so numerous, one can conclude that many words never get beyond
> their hapax/robx value.

By the way, I've started collecting data on the rate of hapax decay.
(Sounds like particle physics, doesn't it?) This is with full (spam)
training.

It's probably useless knowledge, but someone here may find it
interesting. I was considering periodically purging my wordlist of
tokens that are older than D days and have counts less than or equal
to N, say D = 90 days and N = 1. That raised two questions: how useful
are hapaxes, and are older ones less useful than newer ones?
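
To make the purge rule concrete, here is a minimal sketch in Python. It
is not bogofilter's own code or wordlist format: the in-memory dict, the
function name purge_old_rare_tokens, and the sample tokens are all
hypothetical, chosen only to illustrate "older than D days and count
less than or equal to N".

```python
from datetime import date, timedelta


def purge_old_rare_tokens(wordlist, today, max_age_days=90, max_count=1):
    """Return a copy of wordlist without tokens that are both old and rare.

    wordlist is a hypothetical mapping: token -> (count, last_seen date).
    A token is dropped only if it was last seen more than max_age_days
    ago (D) AND its count is <= max_count (N).
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in wordlist.items()
        if not (last_seen < cutoff and count <= max_count)
    }


# Made-up sample data, for illustration only.
sample = {
    "frequent-old": (57, date(2003, 6, 1)),   # old but common: kept
    "hapax-old": (1, date(2003, 10, 1)),      # old hapax: purged
    "hapax-new": (1, date(2004, 2, 20)),      # recent hapax: kept
}
kept = purge_old_rare_tokens(sample, today=date(2004, 2, 29))
print(sorted(kept))
```

With D = 90 and N = 1 only the old hapax is removed; a recent hapax
survives because it may still turn into a useful token.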

It will be months before I have data to report. So far it looks like
recent hapaxes decay at about 0.6% per day, the rate diminishing as they
age.
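
For a sense of scale, here is a rough back-of-the-envelope sketch. It
assumes the 0.6% daily rate stayed constant for 90 days, which the
observation above says it does not (the rate diminishes with age), so
the result is only a lower bound on how many hapaxes would remain. It
also assumes that "decay" means a token stops being a hapax. The
function name surviving_fraction is hypothetical.

```python
def surviving_fraction(daily_rate, days):
    """Fraction of hapaxes still hapaxes after `days`, assuming a
    constant daily decay rate (a simplifying assumption)."""
    return (1.0 - daily_rate) ** days


# 0.6% per day over the 90-day purge window considered above.
fraction_left = surviving_fraction(0.006, 90)
print(f"{fraction_left:.1%} of hapaxes would remain after 90 days")
```

Under that constant-rate assumption roughly 58% of hapaxes would still
be hapaxes at the 90-day mark; with a diminishing rate the real figure
would be higher.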

-Bill
-- 
Sattre Press                              History of Astronomy 
http://sattre-press.com/               During the 19th Century
info at sattre-press.com                       by Agnes M. Clerke
                              http://sattre-press.com/han.html



