Hapax survival over time
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Mar 24 08:17:32 CET 2004
Tom Anderson <tanderso at oac-design.com> wrote:
>I'm not sure a longer period of time is really necessary. Clearly if a
>token has been seen only once in 20-30 days, it does not play a very
>large role in classifying the vast bulk of your messages. Therefore, it
>could not possibly hurt to delete it and then score it at robx on day
>31+.
And then it is put into the database again. This certainly gives
a wrong value.
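
To illustrate the point, here is a minimal sketch of how pruning loses
history. It is not bogofilter code: the function name final_count and the
pruning rule (drop a token whose count is still 1 once more than
prune_after days have passed since it was seen) are assumptions made only
for this example.

```python
def final_count(sightings, prune_after=None):
    """Replay one token's sightings and return its stored count.

    sightings:   sorted day numbers on which the token was registered.
    prune_after: hypothetical pruning window in days; a token that is
                 still a hapax (count == 1) is deleted once more than
                 this many days have passed since it was last seen.
                 None means no pruning at all.
    """
    count = 0
    last_seen = None
    for day in sightings:
        # The hapax was deleted before this sighting arrived, so the
        # token starts over from zero.
        if (prune_after is not None and count == 1
                and day - last_seen > prune_after):
            count = 0
        count += 1
        last_seen = day
    return count


# A token that shows up every 31 days, e.g. a monthly reminder.
days = [0, 31, 62]
print(final_count(days))                  # 3
print(final_count(days, prune_after=30))  # 1
```

With a 30-day window the token is re-registered as a fresh hapax every
time, so the database never learns that it has been seen three times.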
>How strong of an indicator could it be if it is seen so
>infrequently?
A pretty strong one. There are typical mails which show up only
rarely. Examples: regular mailing list reminders, Easter
greetings, etc.
>This leads me to propose a different study... how many of those hapaxes
>are outside of your min_dev range? How many further registrations does
>it take to move them into an influential scoring range?
That is the key question with the above problem.
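
A rough way to explore that question, assuming Robinson-style smoothing
of the form f(w) = (s*x + n*p) / (s + n), where x is robx, n is the number
of registrations and p is the raw spam probability of the token. The
strength parameter s (called robs below) and all numeric values are
assumptions chosen for illustration, not necessarily anyone's actual
settings.

```python
def smoothed_score(n, p, robx, robs):
    """Robinson-style smoothed score for a token registered n times.

    p is the token's raw spam probability, robx the score assumed for
    unknown tokens, robs the weight given to that assumption.
    """
    return (robs * robx + n * p) / (robs + n)


def registrations_needed(p, robx, robs, min_dev, limit=1000):
    """Smallest n at which the score is at least min_dev away from 0.5.

    Returns None if that never happens within `limit` registrations.
    """
    for n in range(1, limit + 1):
        if abs(smoothed_score(n, p, robx, robs) - 0.5) >= min_dev:
            return n
    return None


# Token seen only in spam (p = 1.0); illustrative parameter values.
print(registrations_needed(1.0, robx=0.52, robs=1.0, min_dev=0.375))     # 3
print(registrations_needed(1.0, robx=0.52, robs=0.0178, min_dev=0.375))  # 1
```

Under these assumptions the answer depends heavily on s: with a small s
a hapax is already outside the min_dev range after a single registration,
while with s = 1 it takes three.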
pi