mailing lists and hapaxes

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Sep 25 08:46:29 CEST 2003


michael at optusnet.com.au wrote:

>We could add 'time last written' for each token (a 4-byte-per-token
>increase!) and delete hapaxes where the create time is older
>than X (1 month?).

Actually, 'time last read' instead of 'time last written' would be
more useful, but really expensive: every lookup would become a write.

>My thinking here is that randomly deleting hapaxes is dangerous, because
>you don't know if they're about to turn into real tokens. But if
>they've remained a hapax for a month, it's pretty unlikely you'll see
>another one of them, so you can fairly safely kill it.

So if you don't train with this token again because the result was
already good enough, the write time never changes and the token gets
removed, even though it is still being used. Not so good.
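
To make the difference concrete, here is a rough sketch of both
expiry policies side by side. It is only an illustration, not
bogofilter's actual database layout or code: the Token record, the
last_written / last_read fields, and expire_hapaxes() are all
hypothetical names, and the one-month cutoff is just the X suggested
above.

```python
from dataclasses import dataclass

DAY = 86400  # seconds per day


@dataclass
class Token:
    # Hypothetical record; not bogofilter's real on-disk format.
    count: int         # number of times the token was seen in training
    last_written: int  # Unix time of the last training update
    last_read: int     # Unix time of the last lookup while scoring


def expire_hapaxes(tokens, now, max_age, field="last_written"):
    """Return a copy of tokens without hapaxes whose `field` timestamp
    is more than max_age seconds in the past."""
    kept = {}
    for word, tok in tokens.items():
        is_hapax = tok.count == 1
        age = now - getattr(tok, field)
        if is_hapax and age > max_age:
            continue  # old hapax: drop it
        kept[word] = tok
    return kept


now = 100 * DAY
month = 30 * DAY
tokens = {
    # hapax, neither trained nor looked up for two months
    "stale": Token(1, now - 60 * DAY, now - 60 * DAY),
    # hapax, not retrained for two months but looked up yesterday
    "useful": Token(1, now - 60 * DAY, now - 1 * DAY),
    # hapax created two days ago
    "fresh": Token(1, now - 2 * DAY, now - 2 * DAY),
    # not a hapax, so never expired by this rule
    "common": Token(25, now - 90 * DAY, now - 90 * DAY),
}

by_write = expire_hapaxes(tokens, now, month, "last_written")
by_read = expire_hapaxes(tokens, now, month, "last_read")
```

With the write-time rule the "useful" token is deleted although it
was consulted the day before; with the read-time rule it survives.
The price is that scoring a message would have to update last_read
for every token it looks up.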

pi




More information about the Bogofilter mailing list