mailing lists and hapaxes

David Relson relson at osagesoftware.com
Thu Sep 25 20:06:33 CEST 2003


On Thu, 25 Sep 2003 18:35:39 +0100
"Peter Bishop" <pgb at adelard.com> wrote:

> On 25 Sep 2003 at 5:43, David Relson wrote:
> 
> > > An old, singleton token could be doing a fine job 
> > > - but there is no easy way of finding out
> > 
> > It's up to the site administrator to determine the policy for
> > bogofilter.  Using '-u' for auto-updating is one policy. 
> > Train-on-error is another policy.  A maintenance policy for
> > discarding singletons after N days may be appropriate for the
> > former but not the latter.  'Tis up to the site administrator to
> > determine what works for his/her site!
> 
> Well I have thought of a hard way to do it.
> 
> - modify procmail so that tokens from all emails are stored in a
> second database to count "token hits"
> - the main database is still used to decide the classification of the
> email
> 
> Then periodically scan the second database and delete tokens from the
> main that do not appear in the "recent hit database" 
> 
> The problem with this is that the "recent hit database" could get
> bigger than the main database - and disk space was one of the reasons
> for having a minimal database in the first place.
> 
> I would like an easier way to do it that does not involve a second
> database, extra space, etc.
> 
> For example if you could submit an email in "Hit" mode it would simply
> change the date of a token in the database if:
> 1) the token is in the new email
> 2) AND the token is already present in the database
> 
> Then I could use the existing bogoutil tool
> to clean out tokens using the count and date selection features.

Peter,

Some ideas ...

Bogofilter's default YYYYMMDD timestamps take up 4 bytes per token. 
Since a database entry already holds the token, its ham and spam counts,
plus standard DB overhead, these 4 bytes are a minor part of the space
used.  The timestamp is updated whenever a token's ham or spam count is
updated.

The easiest way of keeping track of all tokens is to let bogofilter
register all messages.  Alternatively, you could create a "recent_hits"
directory and just have bogofilter register all incoming tokens in
_that_ wordlist.  Then you'd need some sort of script to dump the
working wordlist and trim it according to the content of the
recent_hits list. 
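
A rough sketch of what such a trimming script might look like, in
Python.  It works on two text dumps (main wordlist and recent_hits)
and assumes each dump line reads "token spamcount hamcount date" with
no whitespace inside the token.  The function names, the max_count
threshold, and the rule of always keeping tokens that start with "."
(on the assumption that they are bookkeeping entries such as
.MSG_COUNT) are illustrative choices, not anything bogofilter itself
provides.

```python
from typing import Iterable, Iterator, Set


def tokens_in_dump(lines: Iterable[str]) -> Set[str]:
    """Collect the token (first field) of every non-empty dump line."""
    return {line.split()[0] for line in lines if line.strip()}


def trim_dump(main_lines: Iterable[str],
              recent_tokens: Set[str],
              max_count: int = 1) -> Iterator[str]:
    """Yield the main-dump lines worth keeping.

    A line is dropped only when its token is rare (spam + ham count
    <= max_count) AND the token is absent from the recent_hits dump.
    """
    for line in main_lines:
        fields = line.split()
        if len(fields) < 3:
            # Unrecognized layout: keep the line rather than guess.
            yield line
            continue
        token = fields[0]
        total = int(fields[1]) + int(fields[2])
        if token.startswith("."):
            # Assumed bookkeeping entry (e.g. .MSG_COUNT): always keep.
            yield line
        elif total > max_count or token in recent_tokens:
            yield line
```

The inputs could come from "bogoutil -d" run against each wordlist,
and the trimmed output could be loaded into a fresh database with
"bogoutil -l"; do check the dump format of your bogofilter version
before trusting the field positions used here.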

Enjoy!

David



