hash table vs Judy array; word ordering

Michael Elkins me at sigpipe.org
Wed Sep 18 17:49:47 CEST 2002


David Relson wrote:
> My idea is to make select_indicators() a two pass algorithm.  On the first 
> pass, the 15 weakest indicators (closest to 0) would go into a "good" array 
> and the 15 strongest indicators (closest to 1) would go into a "spam" 
> array.  On the second pass, the extrema array would be filled, 
> alternately using "good" array entries and "spam" array entries.  This 
> method would have two results.  First, if all of the "good" words and all of 
> the "spam" words are equidistant from 0.5, the resultant array would be 
> evenly divided between good and spam.  Second, if any of the words is 
> closer to 0.5, the merge would allow it to be replaced by a word further 
> from 0.5.  (For best results, the number of entries in each of these lists 
> should be an even number, not an odd number.)

I believe you can still do this in a single pass.
select_indicators() can just have some code like:

	if (spamProbability >= 0.5)
	{
		// stuff into spam array.  select n closest to 1
	}
	else
	{
		// stuff into good array.  select n closest to 0
	}

Then you can just compute the spamicity using these 2n values to give a
more "fair" estimate of the content of the message.
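
A minimal sketch of that single-pass selection, under stated assumptions.
The names indicators, indicators_init(), keep_extreme(), select_indicator()
and spamicity() are hypothetical and are not taken from the bogofilter
source.  The "spam"/"good" naming follows the quoted proposal (spam holds
values closest to 1, good holds values closest to 0).  The message does not
say which combining formula is meant; the sketch assumes the Graham-style
product prod(p) / (prod(p) + prod(1 - p)) and assumes the per-word
probabilities have already been clamped strictly inside (0, 1).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEEP 15  /* per-side capacity; 15 comes from the quoted proposal */

typedef struct {
    double spam[MAX_KEEP];  /* probabilities >= 0.5, the n closest to 1 */
    double good[MAX_KEEP];  /* probabilities <  0.5, the n closest to 0 */
    size_t n_spam, n_good;  /* entries currently held on each side */
    size_t keep;            /* n: maximum entries kept per side */
} indicators;

/* Distance of a probability from the neutral value 0.5. */
static double deviation(double p)
{
    return p >= 0.5 ? p - 0.5 : 0.5 - p;
}

static void indicators_init(indicators *ind, size_t keep)
{
    assert(keep > 0 && keep <= MAX_KEEP);
    ind->n_spam = ind->n_good = 0;
    ind->keep = keep;
}

/* Store p if there is room; otherwise replace the weakest entry
 * (the one closest to 0.5) when p is more extreme than it. */
static void keep_extreme(double *arr, size_t *count, size_t keep, double p)
{
    size_t i, weakest = 0;

    if (*count < keep) {
        arr[(*count)++] = p;
        return;
    }
    for (i = 1; i < *count; i++)
        if (deviation(arr[i]) < deviation(arr[weakest]))
            weakest = i;
    if (deviation(p) > deviation(arr[weakest]))
        arr[weakest] = p;
}

/* Called once per word: this is the single pass. */
static void select_indicator(indicators *ind, double p)
{
    if (p >= 0.5)
        keep_extreme(ind->spam, &ind->n_spam, ind->keep, p);
    else
        keep_extreme(ind->good, &ind->n_good, ind->keep, p);
}

/* Assumed Graham-style combination over the (up to) 2n kept values.
 * Returns 0.5 when nothing was kept. */
static double spamicity(const indicators *ind)
{
    double prod = 1.0, inv = 1.0;
    size_t i;

    for (i = 0; i < ind->n_spam; i++) {
        prod *= ind->spam[i];
        inv  *= 1.0 - ind->spam[i];
    }
    for (i = 0; i < ind->n_good; i++) {
        prod *= ind->good[i];
        inv  *= 1.0 - ind->good[i];
    }
    return prod / (prod + inv);
}
```

The linear scan for the weakest entry costs O(n) per word, which seems
acceptable for n around 15; a small heap would be the usual alternative for
larger n.  Call indicators_init() once, select_indicator() for each word's
probability, and spamicity() at the end.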

me
