hash table vs Judy array; word ordering

Michael Elkins me at sigpipe.org
Tue Sep 24 08:44:28 CEST 2002


Gyepi SAM wrote:
> So where does this stand now?
> It seems that without some kind of resolution, we can't use hash tables
> or other tuned data structure. Even without replacing Judy, this seems like a good way
> to improve spamicity calculations.

I'm not sure if my previous suggestion on this matter got lost in the
noise, so I'll repost it here.  You can solve the weighting of all the
tokens that get the .01/.99 values by instead assigning the probability
for a token which appears in only one corpus or the other as follows:

1) for nonspam messages, p(w) = 0.5 / n, where n is the number of times
you've seen the word.
2) for spam messages, p(w) = 1.0 - 0.5 / n, where n is the number of
times you've seen the word.

This forces the old algorithm to rank a word which has appeared 100
times higher than a word that has only appeared 10 times.  And you don't
have to do any special weighting--it's built into the probability
already.
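
A minimal sketch of the two rules above, assuming the "old algorithm"
picks its most interesting tokens by distance from the neutral value
0.5.  The function names, the sample words, and their counts are
hypothetical and only illustrate the arithmetic; this is not
bogofilter's actual code.

```python
def one_sided_probability(n, in_spam):
    """Probability for a token seen n times in only one corpus.

    nonspam-only: p(w) = 0.5 / n
    spam-only:    p(w) = 1.0 - 0.5 / n
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    p = 0.5 / n
    return 1.0 - p if in_spam else p


def interest(p):
    """Distance from the neutral probability 0.5 (assumed ranking key)."""
    return abs(p - 0.5)


# Hypothetical tokens and counts, for illustration only.
tokens = {
    "meeting": one_sided_probability(100, in_spam=False),  # 0.5 / 100
    "agenda": one_sided_probability(10, in_spam=False),    # 0.5 / 10
    "viagra": one_sided_probability(50, in_spam=True),     # 1 - 0.5 / 50
    "once": one_sided_probability(1, in_spam=True),        # 1 - 0.5 / 1
}

# Most interesting tokens first: the more often a one-sided word has
# been seen, the further its probability sits from 0.5.
ranked = sorted(tokens, key=lambda w: interest(tokens[w]), reverse=True)

for word in ranked:
    print(f"{word:8s} p={tokens[word]:.3f}")
```

Under these rules a word seen only once gets p(w) = 0.5 either way, so
it carries no weight until it has been seen again.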

I've tested this and it works very well.



