mailing lists and hapaxes
David Relson
relson at osagesoftware.com
Thu Sep 25 20:06:33 CEST 2003
On Thu, 25 Sep 2003 18:35:39 +0100
"Peter Bishop" <pgb at adelard.com> wrote:
> On 25 Sep 2003 at 5:43, David Relson wrote:
>
> > > An old, singleton token could be doing a fine job
> > > - but there is no easy way of finding out
> >
> > It's up to the site administrator to determine the policy for
> > bogofilter. Using '-u' for auto-updating is one policy.
> > Train-on-error is another policy. A maintenance policy for
> > discarding singletons after N days may be appropriate for the
> > former but not the latter. 'Tis up to the site
> > administrator to determine what works for his/her site!
>
> Well I have thought of a hard way to do it.
>
> - modify procmail so that tokens from all emails are stored in a
> second database to count "token hits"
> - the main database is still used to decide the classification of the
> email
>
> Then periodically scan the second database and delete tokens from the
> main that do not appear in the "recent hit database"
>
> The problem with this is that the "recent hit database" could get
> bigger than the main database - and disk space was one of the reasons
> for having a minimal database in the first place.
>
> I would like an easier way to do it that does not involve second
> databases, extra space, etc.
>
> For example if you could submit an email in "Hit" mode it would simply
> change the date of a token in the database if:
> 1) the token is in the new email
> 2) AND the token is already present in the database
>
> Then I could use the existing bogoutil tool
> to clean out tokens using the count and date selection features.
Peter,

Some ideas ...

Bogofilter's default YYYYMMDD timestamps take up 4 bytes per token.
Since a database entry already has the token, its ham and spam counts,
plus standard DB overhead, these 4 bytes are a minor part of the space
used. The timestamp is updated whenever a token's ham or spam count is
updated. The easiest way of keeping track of all tokens is to let
bogofilter register all messages. Alternatively, you could create a
"recent_hits" directory and just have bogofilter register all incoming
tokens in _that_ wordlist. Then you'd need some sort of script to dump
the working wordlist and trim according to the content of the
recent_hits list.
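
A rough sketch of such a trimming script, in Python. It works on the
text that "bogoutil -d" prints and assumes each dumped line reads
"token spam_count ham_count YYYYMMDD"; check what your bogofilter
version actually emits before relying on that. The function names and
the policy (drop low-count tokens that are missing from recent_hits)
are only illustrative, not anything bogofilter provides.

```python
def parse_dump(lines):
    """Parse dump lines, ASSUMED to look like 'token spam ham YYYYMMDD'."""
    entries = {}
    for line in lines:
        # Split from the right so only the last three fields are numeric.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        token, spam, ham, date = parts
        entries[token] = (int(spam), int(ham), int(date))
    return entries


def trim_wordlist(working, recent, max_count=1):
    """Drop tokens whose total count is <= max_count and that are
    absent from the recent_hits list (a hypothetical policy)."""
    kept = {}
    for token, (spam, ham, date) in working.items():
        if spam + ham <= max_count and token not in recent:
            continue  # low-count token that has not been seen recently
        kept[token] = (spam, ham, date)
    return kept


def format_dump(entries):
    """Write entries back in the same assumed dump format."""
    return [f"{token} {spam} {ham} {date}"
            for token, (spam, ham, date) in sorted(entries.items())]
```

You would feed it the dumps of the working wordlist and the
recent_hits wordlist, then load the formatted output into a fresh
wordlist (for example with "bogoutil -l"), rather than editing the
live database in place.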
Enjoy!
David