mailing lists and hapaxes
David Relson
relson at osagesoftware.com
Thu Sep 25 11:43:22 CEST 2003
On Thu, 25 Sep 2003 08:46:30 +0100
"Peter Bishop" <pgb at adelard.com> wrote:
> On 25 Sep 2003 at 8:46, Boris 'pi' Piwinger wrote:
>
> > >My thinking here is that randomly deleting hapaxes is dangerous,
> > >because you don't know if they're about to turn into real tokens.
> > >But if they've remained an hapax for a month, it's pretty unlikely
> > >you'll see another one of them, so you can fairly safely kill it.
> >
> > So if you don't train with this token, because it was good
> > enough, this would get the token removed. Not so good.
> >
> Yes indeed,
> Maintenance based on date or count assumes that
> *all* messages will be added to the database.
>
> For those of us who build minimal databases
> (e.g. via train-on-error, bogominitrain or whatever)
> this is not the case.
>
> An old, singleton token could be doing a fine job
> - but there is no easy way of finding out
It's up to the site administrator to determine the policy for
bogofilter. Using '-u' for auto-updating is one policy. Train-on-error
is another policy. A maintenance policy of discarding singletons after
N days may be appropriate for the former but not the latter. 'Tis up
to the site administrator to determine what works for his/her site!
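
To make the "discard singletons after N days" policy concrete, here is a
minimal sketch in Python. It is not bogofilter's code or database format:
the in-memory table mapping each token to (count, last_seen) and the
function name prune_hapaxes are assumptions made purely for illustration.

```python
from datetime import date, timedelta

def prune_hapaxes(tokens, today, max_age_days):
    """Return a copy of `tokens` without stale singletons.

    `tokens` is a hypothetical table: word -> (count, last_seen date).
    A token is dropped only if its count is 1 AND it was last seen
    more than `max_age_days` days before `today`.
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        word: (count, last_seen)
        for word, (count, last_seen) in tokens.items()
        if count > 1 or last_seen >= cutoff
    }

# Illustrative data only; these are not real bogofilter records.
tokens = {
    "viagra": (12, date(2003, 8, 1)),         # frequent token, always kept
    "zyxwv": (1, date(2003, 8, 1)),           # old singleton, discarded
    "bogominitrain": (1, date(2003, 9, 20)),  # recent singleton, kept
}

pruned = prune_hapaxes(tokens, today=date(2003, 9, 25), max_age_days=30)
print(sorted(pruned))
```

In this sketch last_seen only changes when a message is actually
registered. With '-u' every scored message updates it, so an old date
really does mean the token has not recurred. With train-on-error it does
not, which is exactly the objection raised above: the discarded singleton
might still have been doing useful work.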