flaw in spamicity calculation

Michael Elkins me at sigpipe.org
Thu Sep 19 23:25:23 CEST 2002


Doug Beardsley wrote:
> It seems to me that the main source of this problem is the presence of
> the values .99 and .01.  These values are assigned because the word
> occurred only in one of the lists.  I think the problem is making these
> values more descriptive.  I don't think it's necessarily a problem with
> combining the probabilities into a spamicity, but more of a problem with
> generating the original probabilities in the first place.
> 
> The second option is to do something to generate a more accurate
> probability in the absence of information from both corpuses.

I was thinking along the lines of assigning a value like:

for nonspam words:

	p = 0.5 / ngood

for spam words:

	p = 1.0 - 0.5 / nbad
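
As a rough C sketch (hypothetical names, not actual bogofilter code;
I'm reading ngood/nbad here as the number of times the word was seen
in the one corpus it appears in, as the examples below suggest):

	/* Hypothetical fallback probability for a word seen `count'
	 * times in only one corpus.  Nonspam-only words slide toward
	 * 0.0, spam-only words toward 1.0, and more sightings push
	 * the value further from the neutral 0.5. */
	double single_corpus_prob(int count, int in_spam)
	{
		if (in_spam)
			return 1.0 - 0.5 / count;
		return 0.5 / count;
	}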

That way, words that occur in only one corpus end up weighted with
respect to each other by how often you have seen them.  So for words
that are only seen in nonspam, you assign probabilities like:

	p(word I've seen five times) = 0.1
	p(word I've seen six times) = 0.08
	p(word I've seen seven times) = 0.07
	p(word I've seen ten times) = 0.05
	p(word I've seen a hundred times) = 0.005
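
For instance, running those counts through the sketch above (plus
#include <stdio.h>) reproduces the nonspam values:

	int main(void)
	{
		int counts[] = { 5, 6, 7, 10, 100 };
		for (int i = 0; i < 5; i++)
			printf("seen %3d times -> p = %.4f\n",
			       counts[i],
			       single_corpus_prob(counts[i], 0));
		return 0;
	}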

The actual function doesn't matter so much, as long as it gives
different weight to words seen a different number of times.  I want to
capture that while the word "foo" has appeared in my nonspam corpus 10
times, I have more trust in a word "bar" that has occurred only in my
spam corpus 100 times.  Under the scheme above, "foo" gets
p = 0.5/10 = 0.05 and "bar" gets p = 1.0 - 0.5/100 = 0.995; "bar"
lies further from the neutral 0.5, so it carries more weight.
