flaw in spamicity calculation

Doug Beardsley dgbeards at southern.edu
Wed Sep 18 22:48:04 CEST 2002


I coded a quick patch that should make solving our spamicity calculation
problems easier.  I created a function calc_word_probability(char *)
which calculates the probability value for the given word.  Then I took
the part of select_indicators() that did this job and put it in my new
function.  Now I just call calc_word_probability() from
select_indicators().  This makes it really easy for us to try different
methods of calculating the probabilities.
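
To illustrate why the split helps, here is a minimal sketch of swapping calculation methods behind calc_word_probability(). It is not the patch itself: the tiny word tables, lookup(), the prob_fn type, ratio_prob(), clamped_prob(), and current_method are all hypothetical stand-ins, and the two formulas are placeholders rather than bogofilter's actual calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for bogofilter's word lists and getcount(). */
typedef struct { const char *word; double count; } entry_t;

static const entry_t spam_words[] = { {"viagra", 40.0}, {"meeting", 1.0} };
static const entry_t ham_words[]  = { {"viagra", 0.0},  {"meeting", 30.0} };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Return the stored count for word, or 0 if the word is not in the list. */
double lookup(const char *word, const entry_t *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i].word, word) == 0)
            return list[i].count;
    return 0.0;
}

/* Every candidate method maps the two counts to a probability. */
typedef double (*prob_fn)(double hamness, double spamness);

/* Placeholder method 1: share of occurrences seen in spam, 0.5 if unseen. */
double ratio_prob(double hamness, double spamness)
{
    double total = hamness + spamness;
    return total > 0.0 ? spamness / total : 0.5;
}

/* Placeholder method 2: the same ratio, clamped to [0.01, 0.99]. */
double clamped_prob(double hamness, double spamness)
{
    double p = ratio_prob(hamness, spamness);
    if (p < 0.01) return 0.01;
    if (p > 0.99) return 0.99;
    return p;
}

/* Change this one pointer to try a different method. */
prob_fn current_method = ratio_prob;

/* The caller (select_indicators() in the patch) only ever sees this. */
double calc_word_probability(const char *word)
{
    assert(word != NULL);
    double hamness  = lookup(word, ham_words,  COUNT(ham_words));
    double spamness = lookup(word, spam_words, COUNT(spam_words));
    return current_method(hamness, spamness);
}
```

With this layout, trying a new formula means writing one more function with the prob_fn signature and pointing current_method at it; the loop in select_indicators() stays untouched.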

I recommend that we implement Gary Robinson's equation
f(w) = (a + s) / ((a*b) + c).

While we're on this subject, I have one other idea that I would like to
propose.  I think it would be useful to be able to calculate spam
probabilities without the presence of a non-spam corpus.  The idea here
is that the probability tells you how close a given message is to
matching the characteristics of a specific set.  This is useful for
generating a general characteristic of spam.  When non-spam is also
included in the calculation, the job becomes more customized to a
per-user basis and not as useful for general implementation by an email
provider.  Eventually, I think it would be good for bogofilter to be
used by email providers to filter spam.  If the provider determines the
message to be spam, then it would not send the message to the intended
recipient, but would bounce a response to the sender saying that the
message was classified as spam and was not delivered.  This will be much
more effective at putting the spammers out of business.
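
One possible reading of the spam-only idea, sketched under heavy assumptions: score a message by how typical its words are of the spam set alone. The table spam_corpus, the total spam_msg_total, and the functions spam_word_frequency() and spam_similarity() are hypothetical names invented for this illustration, and averaging per-word frequencies is only one of many ways such a score could be defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical spam-only statistics: for each word, the number of spam
 * messages that contained it. No non-spam data is consulted anywhere. */
typedef struct { const char *word; unsigned msgs_containing; } spam_entry_t;

static const spam_entry_t spam_corpus[] = {
    {"free", 80}, {"offer", 60}, {"click", 50},
};
static const unsigned spam_msg_total = 100;

/* Fraction of spam messages that contain word (0 if never seen). */
double spam_word_frequency(const char *word)
{
    assert(word != NULL);
    size_t n = sizeof(spam_corpus) / sizeof(spam_corpus[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(spam_corpus[i].word, word) == 0)
            return (double)spam_corpus[i].msgs_containing / spam_msg_total;
    return 0.0;
}

/* Average per-word frequency over the message's tokens: a rough measure
 * of how closely the message matches the spam set. */
double spam_similarity(const char *const tokens[], size_t ntokens)
{
    if (ntokens == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < ntokens; i++)
        sum += spam_word_frequency(tokens[i]);
    return sum / (double)ntokens;
}
```

A score like this has an obvious weakness: words that are common in all mail would also be common in the spam set, so some form of common-word handling would be needed before it could be trusted.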

Anyway, I hope I have provoked some thought on this subject.

Doug Beardsley

P.S. Here's the diff file for my trivial patch.


475,477c475
< bogostat_t *select_indicators(void  *PArray)
< // selects the best spam/nonspam indicators and
< // populates the stats structure.
---
> double calc_word_probability(char *word)
479,484d476
<     void	**loc;
<     char	tokenbuffer[BUFSIZ];
< 
<     discrim_t *pp, *hit;
<     static bogostat_t stats;
<     
490,500c482
<     for (pp = stats.extrema; pp < stats.extrema+sizeof(stats.extrema)/sizeof(*stats.extrema); pp++)
<     {
<  	pp->prob = 0.5f;
<  	pp->key[0] = '\0';
<     }
<  
<     yytext = tokenbuffer;
<     for (loc  = JudySLFirst(PArray, tokenbuffer, 0);
< 	 loc != (void *) NULL;
< 	 loc  = JudySLNext(PArray, tokenbuffer, 0))
<     {
---
> 	double hamness, spamness;
502,503d483
< 	double dev;
< 	double hamness, spamness, slotdev, hitdev;
505,506c485,486
< 	hamness = getcount(yytext, &ham_list);
< 	spamness  = getcount(yytext, &spam_list);
---
> 	hamness = getcount(word, &ham_list);
> 	spamness  = getcount(word, &spam_list);
540a521,549
> 	return prob;
> }
> 
> bogostat_t *select_indicators(void  *PArray)
> // selects the best spam/nonspam indicators and
> // populates the stats structure.
> {
>     void	**loc;
>     char	tokenbuffer[BUFSIZ];
> 
>     discrim_t *pp, *hit;
>     static bogostat_t stats;
>     
>     for (pp = stats.extrema; pp < stats.extrema+sizeof(stats.extrema)/sizeof(*stats.extrema); pp++)
>     {
>  	pp->prob = 0.5f;
>  	pp->key[0] = '\0';
>     }
>  
>     yytext = tokenbuffer;
>     for (loc  = JudySLFirst(PArray, tokenbuffer, 0);
> 	 loc != (void *) NULL;
> 	 loc  = JudySLNext(PArray, tokenbuffer, 0))
>     {
> 	double prob;
> 	double dev;
> 	double slotdev, hitdev;
> 
> 	prob = calc_word_probability(yytext);


For summary digest subscription: bogofilter-digest-subscribe at aotto.com


