bogofilter Digest 25 Sep 2003 21:41:21 -0000 Issue 187
David Relson
relson at osagesoftware.com
Fri Sep 26 19:03:39 CEST 2003
On 26 Sep 2003 12:53:25 -0400
Tom Anderson <tanderso at oac-design.com> wrote:
> > It's up to the site administrator to determine the policy for
> > bogofilter. Using '-u' for auto-updating is one policy.
> > Train-on-error is another policy. A maintenance policy for
> > discarding singletons after N days may be appropriate for the
> > former but not the latter. 'Tis up to the site
> > administrator to determine what works for his/her site!
>
> I have two possible methods:
>
> 1) Use a timestamp for last-read consisting of 30 epoch days in a
> bitwise format... that would require only 5 bits per token (assuming
> the other format is turned off).
>
> 2-a) Store a single "time_since_last_purged" and simply purge ALL
> hapaxes after some arbitrary number of days. The ones that happened to
> be added on the day of purging, if very important in identifying an
> email as spam or not, will appear more than once in the subsequent
> non-purge period (assuming -u). Since purging could possibly be an
> expensive operation, bogofilter could fork a copy of itself to the
> background under low priority in order to do this.
>
> 2-b) Same as "2-a", but don't store any date at all, and simply purge
> on the 1st of every month (or the 1st and 15th, etc.).
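
To make method 1 concrete, here is a rough sketch of a 5-bit "day of
cycle" stamp. It is only an illustration, not bogofilter code: the names
day_stamp, age_in_days, pack and unpack, and the idea of packing the stamp
into the low bits of the count, are all assumptions made for this example.

```python
CYCLE_DAYS = 30           # stamps run 0..29, which fits in 5 bits (2**5 = 32)
SECONDS_PER_DAY = 86400
STAMP_BITS = 5
STAMP_MASK = (1 << STAMP_BITS) - 1   # 0x1F


def day_stamp(epoch_seconds):
    """Return the day-of-cycle (0..29) for a Unix timestamp."""
    return (int(epoch_seconds) // SECONDS_PER_DAY) % CYCLE_DAYS


def age_in_days(stamp, today_stamp):
    """Days since `stamp`, modulo the cycle (ages of 30+ days wrap around)."""
    return (today_stamp - stamp) % CYCLE_DAYS


def pack(count, stamp):
    """Store the stamp in the low 5 bits of one integer (hypothetical layout)."""
    return (count << STAMP_BITS) | (stamp & STAMP_MASK)


def unpack(value):
    """Inverse of pack(): return (count, stamp)."""
    return value >> STAMP_BITS, value & STAMP_MASK
```

Because the stamp wraps every 30 days, a token untouched for 31 days looks
one day old, so a purge built on this would have to run more often than
once per cycle.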
Alternately, use a single "modified" bit. Start by clearing it for all
tokens and set it when a count is modified, then periodically delete all
tokens with it still clear, then clear it for all, ...
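
A minimal sketch of the "modified" bit idea, assuming the intent is to
delete tokens whose bit was never set during the period and then clear
every bit for the next one. The Token class, bump() and sweep() are
hypothetical names for this example, not part of bogofilter, and limiting
deletion to tokens with count <= max_count (hapaxes by default) is an
added assumption taken from the rest of this thread.

```python
from dataclasses import dataclass


@dataclass
class Token:
    count: int = 0
    modified: bool = False


def bump(wordlist, word):
    """Register one occurrence of `word` and set its modified bit."""
    tok = wordlist.setdefault(word, Token())
    tok.count += 1
    tok.modified = True


def sweep(wordlist, max_count=1):
    """Delete unmodified tokens with count <= max_count, then clear all bits.

    Returns the list of deleted words.
    """
    stale = [w for w, t in wordlist.items()
             if not t.modified and t.count <= max_count]
    for w in stale:
        del wordlist[w]
    for t in wordlist.values():
        t.modified = False
    return stale
```

With this scheme a new token always survives the first sweep after it is
added (its bit is set), and is only dropped at the following sweep if
nothing touched it in between.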
> BTW, for whoever asked how "hapaxes" came about, I know that "hap"
> means one, as in "haploid cells"... the suffix I don't know.
Check the Merriam-Webster site (www.m-w.com). It's from the Greek phrase
"hapax legomenon" ("said once"), referring to singleton words in a corpus.