hash table vs Judy array; word ordering
eds at reric.net
Wed Sep 18 18:49:53 CEST 2002
On Wed, Sep 18, 2002 at 10:57:34AM -0400, David Relson wrote:
> The problem here is that the contents of the extrema array are dependent on
> the order in which tokens are processed. The Judy array maintains the
> order of words in the message, while the hash list does not. Given that
> 0.01 words are the same as 0.99 words for placement in the extrema array,
> it is possible for one word order to select only 0.01 words, while a
> different word order selects only 0.99 words.
> Given this situation, it would be nice to have select_indicators() be
> independent of word order. I have an idea of how to do this. I'm not sure
> it's a great idea, but I'm going to throw it out for comment ...
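To make the order dependence concrete, here is a minimal sketch of one way to pick indicators independently of token order: sort by distance from 0.5 and break ties on the token text. This is not necessarily the idea David goes on to propose, and it is not bogofilter's code; indicator_t, compare_indicators() and select_sorted() are hypothetical names made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for illustration; not bogofilter's real structure. */
typedef struct {
    const char *token;  /* the word itself */
    double prob;        /* per-word spamicity in [0, 1] */
} indicator_t;

/* Distance of a spamicity from the neutral value 0.5. */
static double extremeness(double prob)
{
    double d = prob - 0.5;
    return d < 0.0 ? -d : d;
}

/* Most extreme words first; equal distances are ordered by token text,
 * so the outcome does not depend on the order the tokens arrived in. */
static int compare_indicators(const void *a, const void *b)
{
    const indicator_t *x = a;
    const indicator_t *y = b;
    double dx = extremeness(x->prob);
    double dy = extremeness(y->prob);

    if (dx > dy) return -1;
    if (dx < dy) return 1;
    return strcmp(x->token, y->token);
}

/* Sort words[] in place and return how many leading entries to keep,
 * i.e. the smaller of count and keepers. */
size_t select_sorted(indicator_t *words, size_t count, size_t keepers)
{
    qsort(words, count, sizeof *words, compare_indicators);
    return count < keepers ? count : keepers;
}
```

Assuming each token appears once per message, the same set of words gives the same selection whether it comes out of the Judy array or the hash list. The alphabetical tie-break is arbitrary: it makes the result repeatable, not better.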
I think if we have 30 words with spamicity either 0.99 or 0.01 we're
already in big trouble. I think we should start thinking of our algorithm
as broken in that case.
- adjust KEEPERS per message. Since long messages are more likely to have
larger numbers of interesting words, scale KEEPERS to keep up.
- work on the per-word spamicity algorithm. Make it less eager to return
max-values. Of those words now getting 0.99 or 0.01 we can probably find
some that are more interesting than others. Is there a curve we can use
to weight these?
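A rough sketch of the first idea, scaling KEEPERS with message size. The constants and the linear rule are assumptions picked only for illustration, not tuned values, and scaled_keepers() is a hypothetical helper, not an existing bogofilter function.

```c
#include <stddef.h>

/* Hypothetical constants, chosen only for illustration. */
#define BASE_KEEPERS      15  /* short messages keep at least this many words */
#define MAX_KEEPERS       60  /* upper bound so very long messages stay bounded */
#define TOKENS_PER_KEEPER 20  /* one extra keeper per this many distinct tokens */

/* Return how many indicator words to keep for a message containing
 * distinct_tokens distinct tokens. The count grows linearly with the
 * message and is clamped to the range [BASE_KEEPERS, MAX_KEEPERS]. */
size_t scaled_keepers(size_t distinct_tokens)
{
    size_t keepers = BASE_KEEPERS + distinct_tokens / TOKENS_PER_KEEPER;

    if (keepers > MAX_KEEPERS)
        keepers = MAX_KEEPERS;
    return keepers;
}
```

With these made-up numbers a 100-token message keeps 20 words instead of 15. Whether linear growth is the right shape, or something flatter such as a square root, would have to be checked against real mail.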
... more later, gotta go to a meeting.