flaw in spamicity calculation
Doug Beardsley
dgbeards at southern.edu
Wed Sep 18 22:01:12 CEST 2002
It seems to me that the main source of this problem is the presence of
the values .99 and .01. These values are assigned because the word
occurred only in one of the lists. I think the problem is making these
values more descriptive. I don't think it's necessarily a problem with
combining the probabilities into a spamicity, but more of a problem with
generating the original probabilities in the first place.
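To illustrate why the fixed .01 and .99 values dominate the result, here is a minimal sketch of the same running product used in the quoted compute_spamicity() loop. The helper name spamicity_for_split() and the 64-entry buffer are assumptions made for illustration, not bogofilter code. With k words at 0.01 and m words at 0.99, invproduct/product works out to 99^(k-m), so the result is 1/(1 + 99^(k-m)) and depends only on the difference between the two counts.

```c
#include <stddef.h>

/* Same combining rule as the quoted loop, reduced to its final value. */
double combine(const double *probs, size_t n)
{
    double product = 1.0, invproduct = 1.0;
    for (size_t i = 0; i < n; i++) {
        product *= probs[i];
        invproduct *= 1.0 - probs[i];
    }
    return product / (product + invproduct);
}

/* Hypothetical helper: spamicity for `good` words fixed at 0.01 and
 * `spam` words fixed at 0.99 (capped at 64 entries for this sketch). */
double spamicity_for_split(size_t good, size_t spam)
{
    double probs[64];
    size_t n = 0;
    for (size_t i = 0; i < good && n < 64; i++)
        probs[n++] = 0.01;
    for (size_t i = 0; i < spam && n < 64; i++)
        probs[n++] = 0.99;
    return combine(probs, n);
}
```

Calling spamicity_for_split(6, 6) and spamicity_for_split(9, 7) corresponds to the two runs in the quoted output: an even split lands on 0.5, while two extra "good" words pull the result down to about 1/9802.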
As I see it, two possibilities present themselves. The first is to
ignore words that don't appear in both the spam and non-spam word
lists. This obviously loses a wealth of information. It may be possible
to compute a good spamicity without this information, but there must be
some other option.
The second option is to generate a more accurate probability when
information from both corpora is not available.
Take the words "houses" and "mentoring" from the example below. Neither
word has ever been seen in spam email. However, the word "houses"
probably occurs much more frequently in non-spam than "mentoring". It
would seem logical to assume that "houses" is a stronger indicator of
non-spam than "mentoring". Something should be done in the probability
computations of these words to account for this.
At this point, I'm not prepared to recommend a solution, but I'm sure
some of the statisticians out there can help with this. Maybe the
answer is to use Gary Robinson's "Further Improvement 1" to calculate
the individual probabilities. It's worth some study.
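As a rough sketch of what such a calculation could look like, the function below shrinks a word's raw spam ratio toward a neutral prior, in the spirit of Robinson's f(w) = (s*x + n*p(w)) / (s + n). The function name, the use of raw counts for p(w) (which assumes the two corpora are about the same size), and the values s = 1 and x = 0.5 are all assumptions made for illustration, and the counts in the usage note are invented.

```c
/* Hypothetical helper: estimate a word's spam probability from counts.
 *   spam_count, good_count: times the word was seen in each corpus
 *   s: weight given to the prior (assumed > 0)
 *   x: prior probability assumed for a word with no data
 */
double smoothed_prob(unsigned spam_count, unsigned good_count,
                     double s, double x)
{
    unsigned n = spam_count + good_count;
    /* Raw ratio; falls back to the prior when the word is unseen. */
    double p = (n > 0) ? (double)spam_count / n : x;
    /* The more data we have (larger n), the less the prior matters. */
    return (s * x + n * p) / (s + n);
}
```

With s = 1 and x = 0.5, a word seen 50 times in non-spam and never in spam gets 0.5/51 (about 0.0098), while one seen only twice in non-spam gets 0.5/3 (about 0.167). A frequently seen word like "houses" would therefore count as a stronger non-spam indicator than a rarely seen one like "mentoring", instead of both being pinned at .01.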
Doug Beardsley
On Wed, Sep 18, 2002 at 03:37:23PM -0400, David Relson wrote:
> Greetings,
>
> At the moment, I have a sort/merge implemented so that filling the extrema
> array no longer depends on word order. Testing has shown some results that
> I didn't expect.
>
> When there are significant numbers of "extremely good" and "extremely spam"
> words, i.e. those with spam/non-spam probabilities of 0.01 and 0.99, the
> extrema array will be evenly split between the two values. The result of
> the spamicity calculation is then highly dependent on the size of the
> extrema array.
>
> The guts of compute_spamicity() currently look like:
>
> product = invproduct = spamicity = 1.0f;
> for (idx = 0; idx < sizeof(stats->extrema)/sizeof(*stats->extrema); idx++)
> {
>     discrim_t *pp = &stats->extrema[idx];
>     product *= pp->prob;
>     invproduct *= (1 - pp->prob);
>     spamicity = product / (product + invproduct);
>     if (verbose)
>         printf("# %2d: %f %f %f %15.12f %s\n", idx, pp->prob,
>                product, invproduct, spamicity, pp->key);
> }
> if (verbose)
>     printf("# Spamicity of %f\n", spamicity);
>
> Below are the results with 12 and 16 array entries (a 6/6 and a 9/7
> split between 0.01 and 0.99). Notice how the final spamicity depends on
> the word count:
>
> # 0: 0.010000 0.010000 0.990000 0.010000000000 anxious
> # 1: 0.010000 0.000100 0.980100 0.000102019996 mentoring
> # 2: 0.010000 0.000001 0.970299 0.000001030609 mso-outline-level
> # 3: 0.010000 0.000000 0.960596 0.000000010410 smarttags
> # 4: 0.010000 0.000000 0.950990 0.000000000105 smarttagtype
> # 5: 0.010000 0.000000 0.941480 0.000000000001 tailored
> # 6: 0.990000 0.000000 0.009415 0.000000000105 edit-time-data
> # 7: 0.990000 0.000000 0.000094 0.000000010410 i1026
> # 8: 0.990000 0.000000 0.000001 0.000001030609 i1028
> # 9: 0.990000 0.000000 0.000000 0.000102019996 line-height
> # 10: 0.990000 0.000000 0.000000 0.010000000000 mso-line-height-alt
> # 11: 0.990000 0.000000 0.000000 0.500000000000 reoccurring
> # Spamicity of 0.500000
>
> # 0: 0.010000 0.010000 0.990000 0.010000000000 anxious
> # 1: 0.010000 0.000100 0.980100 0.000102019996 doesn
> # 2: 0.010000 0.000001 0.970299 0.000001030609 houses
> # 3: 0.010000 0.000000 0.960596 0.000000010410 imaginary
> # 4: 0.010000 0.000000 0.950990 0.000000000105 mentoring
> # 5: 0.010000 0.000000 0.941480 0.000000000001 smarttags
> # 6: 0.010000 0.000000 0.932065 0.000000000000 smarttagtype
> # 7: 0.010000 0.000000 0.922745 0.000000000000 st1
> # 8: 0.010000 0.000000 0.913517 0.000000000000 upstate
> # 9: 0.990000 0.000000 0.009135 0.000000000000 edit-time-data
> # 10: 0.990000 0.000000 0.000091 0.000000000000 editdata.mso
> # 11: 0.990000 0.000000 0.000001 0.000000000001 i1026
> # 12: 0.990000 0.000000 0.000000 0.000000000105 i1027
> # 13: 0.990000 0.000000 0.000000 0.000000010410 i1028
> # 14: 0.990000 0.000000 0.000000 0.000001030609 mso-line-height-alt
> # 15: 0.990000 0.000000 0.000000 0.000102019996 opm
> # Spamicity of 0.000102
>
> Methinks it's time to take a serious look at some alternate ideas for
> converting individual word probabilities into a message's spamicity.
>
> David
> --------------------------------------------------------
> David Relson Osage Software Systems, Inc.
> relson at osagesoftware.com Ann Arbor, MI 48103
> www.osagesoftware.com tel: 734.821.8800
>
>
> ---------------------------------------------------------------------
> FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
> To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
> For summary digest subscription: bogofilter-digest-subscribe at aotto.com
> For more commands, e-mail: bogofilter-help at aotto.com