db maintenance "delete oldest least used tokens, but maintain
count of x"
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Mar 17 08:09:24 EST 2004
David Relson wrote:
>> The danger seems to be that once the tokens return, their
>> values will be horribly wrong.
> Rather than "horribly wrong", how about "severely biased". Remember
> that the effect depends on how the wordlist is built/maintained.
Certainly it does. But imagine one of those tokens was
hammish and now suddenly becomes spammish (or vice versa).
Normally it would just become insignificant; this way it is
severely biased.
> Using auto-update, I train on everything. So I have many tokens and
> lots of hapaxes (tokens only seen once). The old hapaxes are tokens
> that have not reappeared (else they'd have counts greater than 1). If I
> decided to trim my wordlist, I'd probably choose that category, i.e. old
> hapaxes, with the expectation of minimal side effects.
With huge wordlists it won't matter, right.
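The pruning policy being discussed (drop old hapaxes, i.e. tokens seen
only once and not seen again for a long time) can be sketched roughly as
below. This is only an illustration of the policy, not bogofilter's actual
implementation: real wordlists live in a Berkeley DB file, and the
dict layout and function name here are invented for the example.

```python
from datetime import date, timedelta

# Hypothetical in-memory wordlist: token -> (count, last_seen date).
# Bogofilter's real wordlist is a Berkeley DB database; this sketch
# only shows the "remove old hapaxes" policy from the discussion.
def prune_old_hapaxes(wordlist, max_age_days, today=None):
    """Drop tokens seen exactly once whose last sighting is older
    than max_age_days; everything else is kept untouched."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return {
        tok: (count, last_seen)
        for tok, (count, last_seen) in wordlist.items()
        if not (count == 1 and last_seen < cutoff)
    }
```

The expectation of "minimal side effects" comes from the fact that a
token with count 1 contributes almost nothing to scoring anyway, so
removing it mostly saves space.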
> Using train-on-error, the wordlists are much smaller and token counts
> are much lower. Removing any tokens is more likely to affect scoring (I
After all, there is no reason to remove tokens. Once in a
while one might want to rebuild the list (as I will with
0.17.3 due to the lexer change).
> It's a good thing bogofilter has so many ways to be used. It leads to
> interesting discussions.
And lots of confusion for those who don't really understand
the key concepts of the calculations ;->