what happens if I discard tokens that occur only once?
relson at osagesoftware.com
Fri Jun 3 20:03:47 EDT 2005
On Fri, 3 Jun 2005 15:51:01 -0700
Chris Fortune wrote:
> > = 1) decay over time. That is: how often do single count tokens become
> > registered at least one more time? But this says nothing about how often
> > the token is being read. It may have been registered only once but still
> > be providing useful information in the calculation.
> It would be good to know the last time a token was read; that way
> "useless" tokens could be timed out automatically. There is also the
> theoretical possibility of optimizing the database to query 'most
> read' tokens first. Of course the processing/disc overhead associated
> with this type of house-keeping would have to be weighed against the
> benefits.
As has been pointed out, the timestamp indicates "last modified". When
scoring messages, bogofilter presently only needs to open the wordlist
for reading, which is a low-overhead operation. Just reading also
allows the wordlist to be opened read-only, which might be important
in some environments.
Having it be "last used" would instead require a wordlist update for
each token used. That requires read-write access to the wordlist
(which may lead to permission problems). Reading and writing is also
more disk intensive, hence slower.
It could be done, but is it worth it? Honestly, I don't know.
As to optimizing for "most read", Berkeley DB stores tokens
alphabetically. For best performance, bogofilter does a similar sort
(after parsing) so that disk access is minimized. AFAIK, Berkeley DB
only has the one sort order, so optimizing for "most read" isn't a
possibility. Sorry :-<