Hapax survival over time

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Mar 24 08:17:32 CET 2004


Tom Anderson <tanderso at oac-design.com> wrote:

>I'm not sure a longer period of time is really necessary.  Clearly if a
>token has been seen only once in 20-30 days, it does not play a very
>large role in classifying the vast bulk of your messages.  Therefore, it
>could not possibly hurt to delete it and then score it at robx on day
>31+.  

And then, the next time it is registered, it goes back into the
database as if it had never been seen before. That certainly gives
a wrong value.
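
A minimal sketch of the effect, assuming a simple expiry rule (a token
with count 1 that has not been seen for more than expiry_days is deleted
before the next registration). The function name, the rule and the day
numbers are hypothetical illustrations, not bogofilter's actual
maintenance code:

```python
def final_count(sighting_days, expiry_days=None):
    """Return the stored count for one token seen on the given days.

    Assumption: if expiry_days is set, a token whose count is still 1
    and whose last sighting is more than expiry_days old is deleted
    before the new sighting is registered.
    """
    count = 0
    last_seen = None
    for day in sorted(sighting_days):
        expired = (
            expiry_days is not None
            and count == 1
            and day - last_seen > expiry_days
        )
        if expired:
            count = 0  # the hapax was purged; its history is lost
        count += 1     # register the current sighting
        last_seen = day
    return count


# A token that shows up once a year, e.g. in a seasonal greeting.
yearly = [0, 365, 730]
print(final_count(yearly))                  # no expiry
print(final_count(yearly, expiry_days=30))  # 30-day hapax expiry
```

With these inputs the token ends up with a count of 3 without expiry,
but stays at 1 forever with the 30-day rule, because every sighting
finds an empty slot.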

>How strong of an indicator could it be if it is seen so
>infrequently?

A pretty strong one. Some typical mails simply don't show up
most of the time. Examples are regular mailing list reminders,
Easter greetings, etc.

>This leads me to propose a different study... how many of those hapaxes
>are outside of your min_dev range?  How many further registrations does
>it take to move them into an influential scoring range?

That is the key question with the above problem.
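
One rough way to look at it, assuming the usual Robinson smoothing
f(w) = (robs * robx + n * p) / (robs + n) and assuming a token is
ignored while |f(w) - 0.5| < min_dev. The parameter values below are
illustrative only, not measured and not necessarily anyone's
configuration, and registrations_needed is a hypothetical helper:

```python
def robinson_f(n, p, robs, robx):
    """Smoothed token score: n sightings with raw spam probability p."""
    return (robs * robx + n * p) / (robs + n)


def registrations_needed(p, robs, robx, min_dev, limit=1000):
    """Smallest n at which the token's score leaves the min_dev band.

    Returns None if the score never deviates enough within `limit`
    registrations (e.g. when p itself is too close to 0.5).
    """
    for n in range(1, limit + 1):
        if abs(robinson_f(n, p, robs, robx) - 0.5) >= min_dev:
            return n
    return None


# Token seen only in spam (p = 1.0), two illustrative values of robs.
print(registrations_needed(p=1.0, robs=1.0, robx=0.5, min_dev=0.3))
print(registrations_needed(p=1.0, robs=0.01, robx=0.5, min_dev=0.3))
```

Under these assumptions a strong prior (robs = 1.0) needs two
registrations before the token counts, while a weak prior (robs = 0.01)
makes even a hapax influential right away. So the answer depends
heavily on robs and min_dev.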

pi



