mailing lists and hapaxes
Peter Bishop
pgb at adelard.com
Thu Sep 25 09:46:30 CEST 2003
On 25 Sep 2003 at 8:46, Boris 'pi' Piwinger wrote:
> >My thinking here is that randomly deleting hapaxes is dangerous, because
> >you don't know if they're about to turn into real tokens. But if
> >they've remained an hapax for a month, it's pretty unlikely you'll see
> >another one of them, so you can fairly safely kill it.
>
> So if you don't train with this token, because it was good
> enough, this would get the token removed. Not so good.
>
Yes indeed,
Maintenance based on date or count assumes that
*all* messages will be added to the database.
For those of us who build minimal databases
(e.g. via train-on-error, bogominitrain or whatever)
this is not the case.
An old, singleton token could be doing a fine job
- but there is no easy way of finding out.
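
To illustrate the point, here is a minimal sketch in Python. It is a toy
model, not bogofilter's actual storage format or API: the Entry record,
the train() and prune_old_hapaxes() helpers, the 30-day threshold and the
token name are all assumptions made up for the example. It shows a
singleton token that helps classify mail every day, yet is removed by
age-based pruning because train-on-error never writes it to the database
again.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    count: int          # how many trained messages contained the token
    last_trained: int   # day number of the most recent training update


def train(db, tokens, day):
    """Register tokens from one message that was actually trained on."""
    for tok in tokens:
        entry = db.get(tok)
        if entry is None:
            db[tok] = Entry(count=1, last_trained=day)
        else:
            entry.count += 1
            entry.last_trained = day


def prune_old_hapaxes(db, today, max_age_days=30):
    """Delete tokens seen once whose last training is older than max_age_days."""
    stale = [tok for tok, e in db.items()
             if e.count == 1 and today - e.last_trained > max_age_days]
    for tok in stale:
        del db[tok]
    return stale


db = {}
train(db, ["rolex-offer"], day=0)   # one misclassified spam trained on day 0

# Days 1..40: similar spam arrives and is classified correctly thanks to
# the token, so train-on-error never trains on it again. The lookups below
# read the database but leave count and last_trained untouched.
hits = 0
for day in range(1, 41):
    if "rolex-offer" in db:
        hits += 1

removed = prune_old_hapaxes(db, today=40)
print(hits, removed)
```

In this sketch the database records only training events, so the pruning
rule cannot tell a token that is consulted daily from one that is truly
dead. Telling them apart would need some extra record of lookups, which
this toy model does not keep.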
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk