mailing lists and hapaxes

David Relson relson at osagesoftware.com
Thu Sep 25 11:43:22 CEST 2003


On Thu, 25 Sep 2003 08:46:30 +0100
"Peter Bishop" <pgb at adelard.com> wrote:

> On 25 Sep 2003 at 8:46, Boris 'pi' Piwinger wrote:
> 
> > >My thinking here is that randomly deleting hapaxes is dangerous,
> > >because you don't know if they're about to turn into real tokens.
> > >But if they've remained a hapax for a month, it's pretty unlikely
> > >you'll see another one of them, so you can fairly safely kill it.
> > 
> > So if you don't train with this token, because it was good
> > enough, this would get the token removed. Not so good.
> > 
> Yes indeed,
> Maintenance based on date or count assumes that
> *all* messages will be added to the database.
> 
> For those of us who build minimal databases
> (e.g. via train-on-error, bogominitrain or whatever)
> this is not the case.
> 
> An old, singleton token could be doing a fine job 
> - but there is no easy way of finding out

It's up to the site administrator to determine the policy for
bogofilter.  Using '-u' for auto-updating is one policy.  Train-on-error
is another policy.  A maintenance policy of discarding singletons after
N days may be appropriate for the former but not the
latter.  'Tis up to the site administrator to determine what works for
his/her site!
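
To make the "discard singletons after N days" policy concrete, here is a
minimal Python sketch.  It is only an illustration, not bogofilter's own
code: the in-memory dictionary, the (count, last_seen) layout and the
function name prune_stale_hapaxes are all assumptions made up for this
example.

```python
from datetime import date, timedelta


def prune_stale_hapaxes(tokens, today, max_age_days):
    """Return a copy of `tokens` without singletons older than max_age_days.

    `tokens` maps token -> (count, last_seen_date).  This layout is
    hypothetical and does not describe bogofilter's wordlist format.
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        tok: (count, last_seen)
        for tok, (count, last_seen) in tokens.items()
        # Keep anything seen more than once, and any singleton seen recently.
        if count > 1 or last_seen >= cutoff
    }


# Made-up sample data, purely for illustration.
wordlist = {
    "viagra": (12, date(2003, 6, 1)),   # old, but not a singleton
    "xyzzy": (1, date(2003, 8, 1)),     # singleton older than 30 days
    "hapax": (1, date(2003, 9, 20)),    # singleton, recently seen
}
pruned = prune_stale_hapaxes(wordlist, today=date(2003, 9, 25), max_age_days=30)
```

With these sample values only "xyzzy" is dropped.  Note that under
train-on-error the last-seen date only moves when a message is actually
trained, so a sketch like this cannot tell a useless old singleton from one
that is still scoring messages correctly, which is exactly Peter's objection.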




