hash table vs Judy array; word ordering

Ben Rosengart br at panix.com
Wed Sep 18 20:35:26 CEST 2002


On Wed, Sep 18, 2002 at 11:49:53AM -0500, Eric Seppanen wrote:
> 
> - work on the per-word spamicity algorithm.  Make it less eager to return 
> max-values.  Of those words now getting 0.99 or 0.01 we can probably find 
> some that are more interesting than others.  Is there a curve we can use 
> to weight these?

http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

Further Improvement 2
Another way to potentially improve the calculation, at a further cost
in complexity, is to scale the probability guesstimates (that is the
uncombined p1, p2,...,pn numbers above). This improvement is worth
trying, if you have the time, whether or not the guesstimates are
derived from Paul's counting approach or from the Bayesian approach
described above (that is, whether you want to use p(w) or f(w)).
However, f(w) is preferred.

Simply sort the words in your database (not the words in a particular
email) by their probability guesstimate from greatest to smallest
(i.e. if there are m words in your database, the highest p corresponds
to a rank of m, and the lowest to 1).

If there are ties, assign a number in the middle of the tied range.
I.e., if 3 words are tied for the ranks of 111, 112, and 113, assign
them all to 112.

Let r(w) be the rank of a particular word. Then our new probability
guesstimate for the word, g(w), is:


g(w) = r(w) / (m + 1)
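
As a rough illustration of the ranking scheme quoted above, here is a
minimal Python sketch. It is not taken from the article or from
bogofilter; the function name rank_scaled and the sample words and
probabilities are made up for the example. It assumes the word database
is a plain dict mapping each word to its guesstimate, and it resolves
ties by giving every tied word the middle of the tied rank range.

```python
def rank_scaled(probs):
    """Map each word to g(w) = r(w) / (m + 1), using middle ranks for ties.

    probs is a hypothetical dict of word -> probability guesstimate.
    """
    m = len(probs)
    # Sort ascending so the lowest guesstimate gets rank 1 and the highest rank m.
    ordered = sorted(probs, key=probs.get)
    g = {}
    i = 0
    while i < m:
        # Extend j to the end of the run of words tied with ordered[i].
        j = i
        while j + 1 < m and probs[ordered[j + 1]] == probs[ordered[i]]:
            j += 1
        # Positions i..j hold ranks i+1..j+1; tied words share the middle rank.
        midrank = ((i + 1) + (j + 1)) / 2
        for w in ordered[i:j + 1]:
            g[w] = midrank / (m + 1)
        i = j + 1
    return g


# Made-up sample data: three words tied at the maximum value.
sample = {"viagra": 0.99, "free": 0.99, "click": 0.99,
          "hello": 0.5, "meeting": 0.01}
scaled = rank_scaled(sample)
```

With this sample, m is 5, so "meeting" gets 1/6 and "hello" gets 2/6.
The three words tied at 0.99 occupy ranks 3, 4 and 5, so each is
assigned rank 4 and g = 4/6.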


-- 
Ben Rosengart     (212) 741-4400 x215

Microsoft has argued that open source is bad for business, but you
have to ask, "Whose business?  Theirs, or yours?"    --Tim O'Reilly

For summary digest subscription: bogofilter-digest-subscribe at aotto.com