Hapax survival over time

Tom Anderson tanderso at oac-design.com
Wed Mar 24 05:09:39 CET 2004


On Tue, 2004-03-23 at 10:05, Bill McClain wrote:
> None yet. I'm trying to determine how valuable old hapaxes are, to see
> if they could be purged without harm. It will be interesting to see if
> the decline levels off over a longer period of time.

I'm not sure a longer period of time is really necessary.  Clearly, if a
token has been seen only once in 20-30 days, it does not play a very
large role in classifying the vast bulk of your messages.  Therefore, it
could not possibly hurt to delete it and then score it at robx, like any
other unseen token, if it shows up again on day 31+.  How strong an
indicator could it be if it is seen so infrequently?

This leads me to propose a different study... how many of those hapaxes
are outside of your min_dev range?  How many further registrations does
it take to move them into an influential scoring range?
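
A rough sketch of how that study could be set up, assuming the usual
Robinson smoothing f(w) = (robs*robx + n*p) / (robs + n) and treating a
token as influential when |f(w) - 0.5| >= min_dev.  The function names
and the ROBX, ROBS and MIN_DEV values below are illustrative assumptions,
not taken from anyone's actual configuration:

```python
# Sketch only: Robinson-style token score and a min_dev check.
# ROBX, ROBS and MIN_DEV are assumed example values; substitute your own.
ROBX = 0.52
ROBS = 0.0178
MIN_DEV = 0.375


def token_score(spam_count, ham_count, spam_msgs, ham_msgs,
                robx=ROBX, robs=ROBS):
    """Return f(w) = (robs*robx + n*p) / (robs + n) for one token."""
    n = spam_count + ham_count
    if n == 0:
        # Unseen (or purged) token: it simply scores robx.
        return robx
    spam_freq = spam_count / spam_msgs
    ham_freq = ham_count / ham_msgs
    p = spam_freq / (spam_freq + ham_freq)
    return (robs * robx + n * p) / (robs + n)


def is_influential(score, min_dev=MIN_DEV):
    """True if the score is far enough from 0.5 to be used in scoring."""
    return abs(score - 0.5) >= min_dev


def registrations_needed(spam_msgs, ham_msgs, robx=ROBX, robs=ROBS,
                         min_dev=MIN_DEV, limit=1000):
    """Smallest spam-only count at which a token becomes influential.

    Returns None if that does not happen within `limit` registrations.
    """
    for n in range(1, limit + 1):
        score = token_score(n, 0, spam_msgs, ham_msgs, robx, robs)
        if is_influential(score, min_dev):
            return n
    return None


if __name__ == "__main__":
    # A spam-only hapax under the assumed values above.
    hapax = token_score(1, 0, 1000, 1000)
    print(hapax, is_influential(hapax))
    # The same question with a much heavier prior weight (robs = 1.0).
    print(registrations_needed(1000, 1000, robs=1.0))
```

Under this formula the answer depends heavily on robs: with a very small
robs a single registration already pushes a token well away from 0.5,
while a larger robs keeps a hapax close to robx until it has been
registered a few more times.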

Tom


