flaw in spamicity calculation

Wed Sep 18 22:01:12 CEST 2002

It seems to me that the main source of this problem is the presence of
the values .99 and .01.  These values are assigned when a word has
occurred in only one of the lists.  I think the real task is to make
these values more descriptive.  I don't think it's necessarily a problem
with combining the probabilities into a spamicity, but more of a problem
with generating the original probabilities in the first place.
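
For reference, the size dependence David describes can be reproduced
outside bogofilter.  The following is a minimal sketch, not bogofilter
code: combine() is a hypothetical helper that applies the same
product/invproduct update as the quoted compute_spamicity() loop to a
given number of 0.01 words and 0.99 words.

```c
/* Combine n_good words at 0.01 and n_spam words at 0.99 using the
 * same running products as the quoted compute_spamicity() loop.
 * The helper name and its parameters are made up for illustration. */
double combine(int n_good, int n_spam)
{
    double product = 1.0, invproduct = 1.0;
    int i;

    for (i = 0; i < n_good; i++) {      /* "extremely good" words */
        product *= 0.01;
        invproduct *= 1.0 - 0.01;
    }
    for (i = 0; i < n_spam; i++) {      /* "extremely spam" words */
        product *= 0.99;
        invproduct *= 1.0 - 0.99;
    }
    return product / (product + invproduct);
}
```

With 6 words of each kind (the 12-entry listing below) the terms cancel
and combine(6, 6) is 0.5.  With 9 and 7 (the 16-entry listing)
combine(9, 7) is 1/(1 + 99^2), about 0.000102.  Only the difference
between the two counts matters.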

As I see it, two possibilities present themselves.  The first is to
ignore words that don't appear in both the spam and non-spam lists.
This obviously loses a wealth of information.  It may be possible to
compute a good spamicity without this information, but there must be
some other option.

The second option is to do something to generate a more accurate
probability in the absence of information from both corpora.

Let's take the words "houses" and "mentoring" from the example below.
Neither word has ever been seen in spam email.  However, the word
"houses" probably occurs much more frequently in non-spam than
"mentoring".  It would seem logical to assume that "houses" is a
stronger indicator of non-spam than "mentoring".  Something should be
done in the probability computations of these words to account for this.
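
To make the point concrete, here is a minimal sketch of a clamped,
Graham-style per-word estimate.  It is an assumption that this matches
what bogofilter currently does; the function name, the doubling of the
good count, and every count used with it are illustrative only.

```c
/* Graham-style per-word estimate, clamped to [0.01, 0.99].
 *   good, bad   = times the word was seen in each list
 *   ngood, nbad = number of messages in each list
 * Precondition: good + bad > 0 (otherwise the ratio is undefined).
 * Name and parameters are hypothetical, not taken from bogofilter. */
double clamped_prob(double good, double bad, double ngood, double nbad)
{
    double g = 2.0 * good / ngood;  /* good count weighted double */
    double b = bad / nbad;
    double p;

    if (g > 1.0) g = 1.0;
    if (b > 1.0) b = 1.0;

    p = b / (g + b);
    if (p < 0.01) p = 0.01;         /* floor for never-in-spam words   */
    if (p > 0.99) p = 0.99;         /* ceiling for never-in-good words */
    return p;
}
```

Suppose "houses" had been seen 500 times in 1000 good messages and
"mentoring" only 3 times, with neither seen in spam (invented numbers).
Then clamped_prob(500, 0, 1000, 1000) and clamped_prob(3, 0, 1000, 1000)
both come out as 0.01: the clamp throws away the difference in evidence.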

At this point, I'm not prepared to recommend a solution, but I'm sure
some of the statisticians out there can help with this.  Maybe the
answer is to use Gary Robinson's "Further Improvement 1" to calculate
the individual probabilities.  It's worth some study.
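
As a starting point for that study, here is a sketch of a
count-weighted estimate that pulls a word's raw probability toward a
neutral prior, less strongly the more often the word has been seen.  I
believe this is the general shape of the adjustment Robinson describes,
but treat that as an assumption and check his write-up.  The name
shrunk_prob and the values chosen for s and x are hypothetical.

```c
/* Count-weighted estimate: blend the raw probability p with a prior x.
 *   p = raw spam probability for the word (0.0 if never seen in spam)
 *   n = number of messages the word has been seen in
 *   s = weight given to the prior (hypothetical tuning value, s > 0)
 *   x = probability assumed for a word with no data (hypothetical)
 * With n == 0 the result is x; as n grows the result approaches p. */
double shrunk_prob(double p, double n, double s, double x)
{
    return (s * x + n * p) / (s + n);
}
```

With s = 1 and x = 0.5, a word seen 3 times and never in spam gets
shrunk_prob(0.0, 3, 1, 0.5) = 0.125, while one seen 500 times gets
0.5/501, roughly 0.001.  So "houses" would count as a much stronger
non-spam indicator than "mentoring", instead of both sitting at .01.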

Doug Beardsley

On Wed, Sep 18, 2002 at 03:37:23PM -0400, David Relson wrote:
> Greetings,
> 
> At the moment, I have a sort/merge implemented so that filling the extrema 
> array no longer depends on word order.  Testing has shown some results that 
> I didn't expect.
> 
> When there are significant numbers of "extremely good" and "extremely spam" 
> words, i.e. those with spam/non-spam probabilities of 0.01 and 0.99, the 
> extrema array will be evenly split between the two values.  The result of 
> the spamicity calculation is then highly dependent on the size of the 
> extrema array.
> 
> The guts of compute_spamicity() currently look like:
> 
>     product = invproduct = spamicity = 1.0f;
>     for (idx = 0; idx < sizeof(stats->extrema)/sizeof(*stats->extrema); idx++)
>     {
> 	discrim_t *pp = &stats->extrema[idx];
> 	product *= pp->prob;
> 	invproduct *= (1 - pp->prob);
> 	spamicity = product / (product + invproduct);
> 	if (verbose)
> 	    printf("# %2d:  %f  %f  %f  %15.12f  %s\n", idx, pp->prob, 
> 	    product, invproduct, spamicity, pp->key);
>     }
>     if (verbose)
> 	printf("# Spamicity of %f\n", spamicity);
> 
> Below are the results with 12 and 16 array entries with a roughly even 
> 0.01/0.99 split.  Notice how the final spamicity depends on the word count:
> 
> #  0:  0.010000  0.010000  0.990000   0.010000000000  anxious
> #  1:  0.010000  0.000100  0.980100   0.000102019996  mentoring
> #  2:  0.010000  0.000001  0.970299   0.000001030609  mso-outline-level
> #  3:  0.010000  0.000000  0.960596   0.000000010410  smarttags
> #  4:  0.010000  0.000000  0.950990   0.000000000105  smarttagtype
> #  5:  0.010000  0.000000  0.941480   0.000000000001  tailored
> #  6:  0.990000  0.000000  0.009415   0.000000000105  edit-time-data
> #  7:  0.990000  0.000000  0.000094   0.000000010410  i1026
> #  8:  0.990000  0.000000  0.000001   0.000001030609  i1028
> #  9:  0.990000  0.000000  0.000000   0.000102019996  line-height
> # 10:  0.990000  0.000000  0.000000   0.010000000000  mso-line-height-alt
> # 11:  0.990000  0.000000  0.000000   0.500000000000  reoccurring
> # Spamicity of 0.500000
> 
> #  0:  0.010000  0.010000  0.990000   0.010000000000  anxious
> #  1:  0.010000  0.000100  0.980100   0.000102019996  doesn
> #  2:  0.010000  0.000001  0.970299   0.000001030609  houses
> #  3:  0.010000  0.000000  0.960596   0.000000010410  imaginary
> #  4:  0.010000  0.000000  0.950990   0.000000000105  mentoring
> #  5:  0.010000  0.000000  0.941480   0.000000000001  smarttags
> #  6:  0.010000  0.000000  0.932065   0.000000000000  smarttagtype
> #  7:  0.010000  0.000000  0.922745   0.000000000000  st1
> #  8:  0.010000  0.000000  0.913517   0.000000000000  upstate
> #  9:  0.990000  0.000000  0.009135   0.000000000000  edit-time-data
> # 10:  0.990000  0.000000  0.000091   0.000000000000  editdata.mso
> # 11:  0.990000  0.000000  0.000001   0.000000000001  i1026
> # 12:  0.990000  0.000000  0.000000   0.000000000105  i1027
> # 13:  0.990000  0.000000  0.000000   0.000000010410  i1028
> # 14:  0.990000  0.000000  0.000000   0.000001030609  mso-line-height-alt
> # 15:  0.990000  0.000000  0.000000   0.000102019996  opm
> # Spamicity of 0.000102
> 
> Methinks it's time to take a serious look at some alternate ideas for 
> converting individual word probabilities into a message's spamicity.
> 
> David
> --------------------------------------------------------
> David Relson                   Osage Software Systems, Inc.
> relson at osagesoftware.com       Ann Arbor, MI 48103
> www.osagesoftware.com          tel:  734.821.8800
> 
> 
> ---------------------------------------------------------------------
> FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
> To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
> For summary digest subscription: bogofilter-digest-subscribe at aotto.com
> For more commands, e-mail: bogofilter-help at aotto.com
