bogofilter Digest 25 Sep 2003 21:41:21 -0000 Issue 187

David Relson relson at osagesoftware.com
Fri Sep 26 19:03:39 CEST 2003


On 26 Sep 2003 12:53:25 -0400
Tom Anderson <tanderso at oac-design.com> wrote:

> > It's up to the site administrator to determine the policy for
> > bogofilter.  Using '-u' for auto-updating is one policy. 
> > Train-on-error is another policy.  A maintenance policy for
> > discarding singletons after N days may be appropriate for the
> > former but not the latter.  'Tis up to the site
> > administrator to determine what works for his/her site!
> 
> I have two possible methods:
> 
> 1) Use a timestamp for last-read consisting of 30 epoch days in a
> bitwise format... that would require only 5 bits per token (assuming
> the other format is turned off).
> 
> 2-a) Store a single "time_since_last_purged" and simply purge ALL
> hapaxes after some arbitrary number of days.  The ones that happen to
> be added on the day of purging, if very important in identifying an
> email as spam or not, will appear more than once in the subsequent
> non-purge period (assuming -u).  Since purging could possibly be an
> expensive operation, bogofilter could fork a copy of itself to the
> background under low priority in order to do this.
> 
> 2-b) Same as "2-a", but don't store any date at all, and simply purge
> on the 1st of every month (or the 1st and 15th, etc.).
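
To make method 1 concrete, here is a rough sketch of a 5-bit day stamp.
It is only an illustration, not bogofilter code: the names day_stamp,
age_in_days and purge_hapaxes are made up, the wordlist is modelled as a
plain dict of word -> (count, stamp), and I assume the stamp is the epoch
day number reduced modulo 32.

```python
DAY_BITS = 5
DAY_MOD = 1 << DAY_BITS  # 32 distinct day stamps fit in 5 bits
SECONDS_PER_DAY = 86400


def day_stamp(epoch_seconds):
    """Days since the Unix epoch, reduced to a 5-bit value (assumed layout)."""
    return (int(epoch_seconds) // SECONDS_PER_DAY) % DAY_MOD


def age_in_days(stamp, today_stamp):
    """Days since the token was last touched.

    Only meaningful while the true age is under 32 days; older
    stamps wrap around and look young again.
    """
    return (today_stamp - stamp) % DAY_MOD


def purge_hapaxes(tokens, today_stamp, max_age):
    """Return a copy of tokens without hapaxes that are max_age days old or more.

    tokens maps word -> (count, stamp); this layout is hypothetical.
    """
    return {
        word: (count, stamp)
        for word, (count, stamp) in tokens.items()
        if not (count == 1 and age_in_days(stamp, today_stamp) >= max_age)
    }
```

Because the stamp wraps every 32 days, a purge along these lines would
have to run at least that often, or stale tokens start to look fresh.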

Alternatively, use a single "modified" bit.  Start by clearing it for
all tokens and set it whenever a token's count is modified.  Then
periodically delete all tokens for which it is still clear, clear it
again for the rest, and repeat.
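
A rough sketch of the "modified" bit idea, assuming the intent is to
drop tokens that were NOT modified during the last period.  This is not
bogofilter code: the Token class, touch() and sweep() are hypothetical
names, and the wordlist is modelled as a plain dict.

```python
class Token:
    """One wordlist entry: a count plus the single "modified" bit."""

    def __init__(self, count=0):
        self.count = count
        self.modified = False


def touch(tokens, word):
    """Register one occurrence of word and set its modified bit."""
    tok = tokens.get(word)
    if tok is None:
        tok = tokens[word] = Token()
    tok.count += 1
    tok.modified = True


def sweep(tokens, hapaxes_only=True):
    """Periodic maintenance pass.

    Deletes tokens whose bit is still clear (optionally only those
    with a count of 1) and clears the bit on every surviving token.
    """
    for word in list(tokens):  # copy the keys, since we delete while looping
        tok = tokens[word]
        unused = not tok.modified
        if unused and (tok.count == 1 or not hapaxes_only):
            del tokens[word]
        else:
            tok.modified = False
```

With this scheme a new hapax always survives the first sweep after it is
added, and is removed by the following one only if it was not seen again
in between.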

> BTW, for whoever asked how "hapaxes" came about, I know that "hap"
> means one, as in "haploid cells"... the suffix I don't know.

Check the Merriam-Webster site (www.m-w.com).  It's from a Greek phrase
("hapax legomenon", roughly "said once") referring to singleton words
in a corpus.



