hash table vs Judy array; word ordering

Wed Sep 18 18:49:53 CEST 2002

On Wed, Sep 18, 2002 at 10:57:34AM -0400, David Relson wrote:
> 
> The problem here is that the contents of the extrema array are dependent on 
> the order in which tokens are processed.  The Judy array maintains the 
> order of words in the message, while the hash list does not.  Given that 
> 0.01 words are the same as 0.99 words for placement in the extrema array, 
> it is possible for one word order to select only 0.01 words, while a 
> different word order selects only 0.99 words.
> 
> Given this situation, it would be nice to have select_indicators() be 
> independent of word order.  I have an idea of how to do this.  I'm not sure 
> it's a great idea, but I'm going to throw it out for comment ...
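
To make the order dependence described above concrete, here is a minimal
Python sketch. It is an illustration only, not bogofilter's actual
select_indicators(): the name select_extrema, the tiny KEEPERS value, and
the tie rule (a later token replaces a kept one only if it is strictly
farther from 0.5) are all assumptions.

```python
# Hypothetical sketch of an order-dependent "extrema" selection.
# Not bogofilter's real code; names and the tie rule are assumptions.

KEEPERS = 3  # assumed, kept small so the effect is easy to see


def deviation(prob):
    """Distance of a spamicity from the neutral value 0.5.

    Rounded so that 0.01 and 0.99 compare as exactly equal.
    """
    return round(abs(prob - 0.5), 6)


def select_extrema(tokens, keepers=KEEPERS):
    """Keep the `keepers` tokens farthest from 0.5.

    `tokens` is an iterable of (word, spamicity) pairs in processing order.
    On a tie the earlier token stays, which is what makes the result
    depend on the order in which tokens arrive.
    """
    kept = []
    for word, prob in tokens:
        if len(kept) < keepers:
            kept.append((word, prob))
            continue
        # Index of the kept token that is closest to 0.5.
        weakest = min(range(len(kept)), key=lambda i: deviation(kept[i][1]))
        # Strict '>' means an equally extreme token never displaces it.
        if deviation(prob) > deviation(kept[weakest][1]):
            kept[weakest] = (word, prob)
    return kept


# Example message: three 0.99 words followed by three 0.01 words.
message = [
    ("viagra", 0.99), ("free", 0.99), ("click", 0.99),
    ("meeting", 0.01), ("patch", 0.01), ("judy", 0.01),
]
in_order = select_extrema(message)
reversed_order = select_extrema(list(reversed(message)))
```

With this input, in_order holds only 0.99 words and reversed_order holds
only 0.01 words, even though both calls see the same set of tokens.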

I think if we have 30 words with spamicity either 0.99 or 0.01 we're 
already in big trouble.  I think we should start thinking of our algorithm 
as broken in that case.

Possible fixes:

- adjust KEEPERS per message.  Since long messages are more likely to have 
larger numbers of interesting words, scale KEEPERS to keep up.

- work on the per-word spamicity algorithm.  Make it less eager to return 
max-values.  Of those words now getting 0.99 or 0.01 we can probably find 
some that are more interesting than others.  Is there a curve we can use 
to weight these?
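
A rough Python sketch of what the two fixes above might look like. Every
name and constant here is an assumption made for illustration: the base
value of 15, the square-root growth, the cap, and the shrink-toward-0.5
curve are not taken from bogofilter.

```python
import math


def scaled_keepers(n_unique_tokens, base=15, ref_tokens=100, cap=150):
    """Hypothetical per-message KEEPERS.

    Messages with up to `ref_tokens` distinct tokens use `base`; longer
    messages grow with the square root of their size, up to `cap`.
    """
    if n_unique_tokens <= ref_tokens:
        return base
    grown = base * math.sqrt(n_unique_tokens / ref_tokens)
    return min(cap, round(grown))


def damped_spamicity(raw, count, strength=1.0, neutral=0.5):
    """Hypothetical weighting curve for per-word spamicity.

    Pulls `raw` toward `neutral` when the word has been seen only a few
    times (`count`), so a word seen once at 0.99 ranks below a word seen
    a hundred times at 0.99.
    """
    return (strength * neutral + count * raw) / (strength + count)


# Example: a 400-token message, and the same raw score at two counts.
keepers_for_long_message = scaled_keepers(400)
rare_word = damped_spamicity(0.99, count=1)
common_word = damped_spamicity(0.99, count=100)
```

With a curve like this, words that currently all sit at 0.99 or 0.01 get
distinct scores, so ties in the extrema array become much less common.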

... more later, gotta go to a meeting.

For summary digest subscription: bogofilter-digest-subscribe at aotto.com