mailing lists and hapaxes

Peter Bishop pgb at adelard.com
Thu Sep 25 19:35:39 CEST 2003


On 25 Sep 2003 at 5:43, David Relson wrote:

> > An old, singleton token could be doing a fine job 
> > - but there is no easy way of finding out
> 
> It's up to the site administrator to determine the policy for
> bogofilter.  Using '-u' for auto-updating is one policy.  Train-on-error
> is another policy.  A maintenance policy for discarding singletons after
> N days may be appropriate for the former but not the
> latter.  'Tis up to the site administrator to determine what works for
> his/her site!

Well, I have thought of a hard way to do it.

- modify procmail so that tokens from all emails are also stored in a second
  database that counts "token hits"
- the main database is still used to decide the classification of the email

Then periodically scan the second database and delete from the main database
any tokens that do not appear in the "recent hit database".
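
To make the scheme concrete, here is a rough sketch in Python. It is only an
illustration: the main database and the "recent hit database" are modelled as
an ordinary dict and set, not as bogofilter's real on-disk databases, and the
function names record_hits and prune_main are made up for this example.

```python
# Illustrative stand-in only: a dict and a set replace the real databases.

def record_hits(hit_db, tokens):
    """Add every token of an incoming email to the recent-hit set."""
    hit_db.update(tokens)


def prune_main(main_db, hit_db):
    """Delete tokens from main_db that are absent from hit_db.

    Returns the list of removed tokens, in main_db order.
    """
    stale = [tok for tok in main_db if tok not in hit_db]
    for tok in stale:
        del main_db[tok]
    return stale


# Main database: token -> count (hypothetical contents).
main_db = {"viagra": 12, "meeting": 7, "xyzzy": 1}
hit_db = set()

# Every email's tokens go into the hit database, known to main_db or not.
record_hits(hit_db, ["meeting", "viagra", "lunch"])

# Periodic scan: anything not seen recently is dropped from the main database.
removed = prune_main(main_db, hit_db)
```

Note that "lunch" ends up in the hit set even though the main database has
never stored it, which is where the extra space goes.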

The problem with this is that the "recent hit database" could get bigger 
than the main database, and disk space was one of the reasons for having a 
minimal database in the first place.

I would like an easier way to do it that does not involve second databases,
extra space, etc.

For example, if you could submit an email in "Hit" mode, it would simply
change the date of a token in the database if:
1) the token is in the new email
2) AND the token is already present in the database

Then I could use the existing bogoutil tool
to clean out tokens using the count and date selection features.
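
A rough Python sketch of what I mean by "Hit" mode, followed by a cleanup pass
based on count and date. This is an assumption-laden illustration, not
bogofilter or bogoutil code: the database is a plain dict of
token -> (count, date), and hit() and prune() are hypothetical names.

```python
from datetime import date

# Illustrative stand-in only: db maps token -> (count, last_date).

def hit(db, tokens, today):
    """Refresh the date of tokens that are in the email AND already in db.

    Unknown tokens are ignored, so the database never grows.
    Returns the number of tokens whose date was refreshed.
    """
    touched = 0
    for tok in set(tokens):
        if tok in db:
            count, _old_date = db[tok]
            db[tok] = (count, today)
            touched += 1
    return touched


def prune(db, max_count, cutoff):
    """Drop tokens with count <= max_count whose date is older than cutoff.

    Returns the removed tokens in sorted order.
    """
    stale = [t for t, (c, d) in db.items() if c <= max_count and d < cutoff]
    for tok in stale:
        del db[tok]
    return sorted(stale)


db = {
    "xyzzy": (1, date(2003, 6, 1)),    # old singleton that still gets hits
    "plugh": (1, date(2003, 6, 1)),    # old singleton that is never seen
    "meeting": (9, date(2003, 6, 1)),  # old, but a high count protects it
}

touched = hit(db, ["xyzzy", "lunch"], date(2003, 9, 25))
removed = prune(db, max_count=1, cutoff=date(2003, 8, 1))
```

Here "xyzzy" survives because its date was refreshed, "plugh" is cleaned out,
and "lunch" is never added, so no second database is needed.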


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





