hash table vs Judy array; word ordering
me at sigpipe.org
Tue Sep 24 02:44:28 EDT 2002
Gyepi SAM wrote:
> So where does this stand now?
> It seems that without some kind of resolution, we can't use hash tables
> or other tuned data structure. Even without replacing Judy, this seems like a good way
> to improve spamicity calculations.
I'm not sure if my previous suggestion on this matter got lost in the
noise, so I'll repost it here. You can solve the weighting of all the
tokens with the .01/.99 values by instead assigning the probability for a
token which appears only in one or the other as follows:
1) for nonspam messages, p(w) = 0.5 / n, where n is the number of times
you've seen the word.
2) for spam messages, p(w) = 1.0 - 0.5 / n, where n is the number of times
you've seen the word.
This forces the old algorithm to sort a word which has appeared 100
times higher than a word that has only appeared 10 times. And you don't
have to do any special weighting: it's built into the probability.
I've tested this and it works very well.
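Here is a rough sketch of the two formulas above, in Python rather than bogofilter's own C, and not taken from its source. The function names are made up for illustration, and I'm assuming the old algorithm ranks tokens by how far p(w) is from 0.5, as in Graham's original scheme.

```python
def one_sided_probability(n, in_spam):
    """p(w) for a token seen n times in only one of the two corpora.

    Hypothetical helper: n is the number of times the word has been seen,
    in_spam says which corpus it appeared in.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    p_nonspam = 0.5 / n                 # rule 1: nonspam-only token
    if in_spam:
        return 1.0 - p_nonspam          # rule 2: spam-only token
    return p_nonspam


def distance_from_neutral(p):
    """Assumed ranking key: distance of p(w) from the neutral value 0.5."""
    return abs(p - 0.5)


if __name__ == "__main__":
    for n in (1, 10, 100):
        for in_spam in (False, True):
            p = one_sided_probability(n, in_spam)
            label = "spam" if in_spam else "nonspam"
            print(f"n={n:3d} {label:8s} p(w)={p:.4f} "
                  f"distance={distance_from_neutral(p):.4f}")
```

A nonspam-only word seen 10 times gets p(w) = 0.05 and one seen 100 times gets 0.005, so the more frequent word lands further from 0.5 and sorts higher. Note that with n = 1 both formulas give exactly 0.5, so a word seen only once carries no weight under this scheme.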