hash table vs Judy array; word ordering
Michael Elkins
me at sigpipe.org
Wed Sep 18 17:49:47 CEST 2002
David Relson wrote:
> My idea is to make select_indicators() a two pass algorithm. On the first
> pass, the 15 weakest indicators (closest to 0) would go into a "good" array
> and the 15 strongest indicators (closest to 1) would go into a "spam"
> array. On the second pass, the extrema array would be filled,
> alternately using "good" array entries and "spam" array entries. This
> method would have two results. First, if all of the "good" words and all of
> the "spam" words are equidistant from 0.5, the resultant array would be
> evenly divided between good and spam. Second, if any of the words are
> closer to 0.5, the merge would allow it to be replaced by a word further
> from 0.5. (For best results, the number of entries in each of these lists
> should be an even number, not an odd number.)
I believe you can still have a single pass to do this.
select_indicators() can just have some code like:
    if (spamProbability >= 0.5)
    {
        // stuff into spam array. select n closest to 1
    }
    else
    {
        // stuff into good array. select n closest to 0
    }
Then you can just compute the spamicity using these 2n values to give a
more "fair" estimate of the content of the message.
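
A minimal sketch of the single-pass idea, for illustration only. The names indicators_t, keep_extreme(), indicators_add() and indicators_spamicity() are hypothetical and are not taken from the bogofilter source. N_EACH is set to 15 to match the quoted proposal, and the combining step assumes a Graham-style product formula, which may differ from what bogofilter actually computes.

```c
#include <assert.h>
#include <stddef.h>

#define N_EACH 15  /* entries kept per side; 15 comes from the quoted proposal */

/* Hypothetical container for the most extreme values on each side of 0.5. */
typedef struct {
    double spam[N_EACH];  /* sorted, largest (closest to 1) first */
    double good[N_EACH];  /* sorted, smallest (closest to 0) first */
    size_t n_spam;
    size_t n_good;
} indicators_t;

/* Insert p into a sorted array holding at most N_EACH entries.
 * dir > 0 keeps the largest values, dir < 0 keeps the smallest. */
static void keep_extreme(double *arr, size_t *count, double p, int dir)
{
    size_t i;

    if (*count == N_EACH) {
        /* Array is full: ignore p unless it beats the weakest kept entry. */
        double weakest = arr[N_EACH - 1];
        if (dir > 0 ? p <= weakest : p >= weakest)
            return;
        i = N_EACH - 1;  /* the weakest slot will be overwritten */
    } else {
        i = (*count)++;
    }
    /* Shift weaker entries down until the slot for p is found. */
    while (i > 0 && (dir > 0 ? arr[i - 1] < p : arr[i - 1] > p)) {
        arr[i] = arr[i - 1];
        i--;
    }
    arr[i] = p;
}

void indicators_init(indicators_t *ind)
{
    ind->n_spam = 0;
    ind->n_good = 0;
}

/* Called once per word while scanning the message: this is the single pass. */
void indicators_add(indicators_t *ind, double p)
{
    assert(p >= 0.0 && p <= 1.0);
    if (p >= 0.5)
        keep_extreme(ind->spam, &ind->n_spam, p, +1);
    else
        keep_extreme(ind->good, &ind->n_good, p, -1);
}

/* Assumed Graham-style combination over the (up to) 2n kept values. */
double indicators_spamicity(const indicators_t *ind)
{
    double prod = 1.0, inv = 1.0;
    size_t i;

    for (i = 0; i < ind->n_spam; i++) {
        prod *= ind->spam[i];
        inv *= 1.0 - ind->spam[i];
    }
    for (i = 0; i < ind->n_good; i++) {
        prod *= ind->good[i];
        inv *= 1.0 - ind->good[i];
    }
    if (prod + inv == 0.0)
        return 0.5;  /* degenerate case: both a 0.0 and a 1.0 were kept */
    return prod / (prod + inv);
}
```

A caller would run indicators_init() once, call indicators_add() for each word's probability as the message is scanned, and then call indicators_spamicity(). Each insertion costs at most N_EACH shifts, so no second pass over the word list is needed.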
me
For summary digest subscription: bogofilter-digest-subscribe at aotto.com