[PATCH] Modify 0.7.5beta to do Robinson calculations

Sat Oct 19 17:23:14 CEST 2002

Although ymmv, a version of bogofilter 0.7.4 modified to calculate f(w)
and S as suggested by Gary Robinson
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
has proven itself much superior to unmodified bogofilter for my
incoming email, yielding fewer false negatives and essentially no false
positives.  I submitted that patch to this list some time ago.

This morning I redid the patch for 0.7.5beta (the MD5 sum of the
original tarball is 589cb669eb4976bd0767b3c5ce29737d) and here it is in
case anyone else is interested.  It hasn't been extensively tested yet
but it does compile and run; for reasons not yet known to me, it tends
to produce slightly lower spamicity values than 0.7.4 for the same
emails.  I rebuilt my training db from scratch, and some nonspam emails
had been lost since incorporation into the 0.7.4 database, which may
have contributed to this effect.

Note that this version doesn't need to save extrema.  There is a macro
MIN_DEV that can be used to make the program ignore uncharacteristic
words (words with f(w) close to 0.5), but I get best results if I just
leave it set at 0 so all words are taken into account.

Note also that I've reduced MAX_REPEATS to 1; again, that works better
for me (and seems statistically justified; see Robinson's paper and the
related discussions on the spambayes list for details).

--- bogofilter.1.orig	2002-10-19 09:46:44.000000000 -0400
+++ bogofilter.1	2002-10-19 09:46:44.000000000 -0400
@@ -35,7 +35,7 @@
 .SH "THEORY OF OPERATION"
 
 .PP
-Bogofilter treats its input as a bag of tokens. Each token is checked against "good" and "bad" wordlists, which maintain counts of the numbers of times it has occurred in non-spam and spam mails. These numbers are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furtherest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds 0.9, bogofilter returns 0, otherwise 1.
+Bogofilter treats its input as a bag of tokens. Each token is checked against "good" and "bad" wordlists, which maintain counts of the numbers of times it has occurred in non-spam and spam mails. These numbers are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furtherest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds 0.52, bogofilter returns 0, otherwise 1.
 
 .PP
 While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper A Plan For Spam: \fIhttp://www.paulgraham.com/spam.html\fR is recommended reading.
@@ -44,6 +44,9 @@
 This program substantially improves on Paul's proposal by doing smarter lexical analysis. In particular, hostames and IP addresses are retained as regognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are discarded so as not to bloat the word lists. Lex's Swiss-army-knife nature rises again.
 
 .PP
+Another seeming improvement is that this program implements Gary Robinson's suggested modifications (S and f(w) but not g(w)) to the calculations.
+
+.PP
 The input may be one message or many. Messages are broken up on From_ lines. The algorithm is relatively insensitive to message miscounts.
 
 .PP
--- bogofilter.c.orig	2002-10-19 09:19:35.000000000 -0400
+++ bogofilter.c	2002-10-19 10:42:23.000000000 -0400
@@ -14,6 +14,16 @@
 
 I do the lexical analysis slightly differently, however.
 
+MOD: (Greg Louis <glouis at dynamicro.on.ca>) This version implements Gary
+    Robinson's proposed modifications to the "spamicity" calculation and
+    uses his f(w) individual probability calculation.  See
+
+    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
+    
+    In addition, this version does not store "extrema."  Instead it accumulates
+    Robinson's P and Q using all words deemed "characteristic," i.e. having
+    a deviation (fabs (0.5f - prob)) >= MIN_DEV, currently set to 0.0.
+
 ******************************************************************************/
 #include <stdio.h>
 #include <math.h>
@@ -27,17 +37,18 @@
 #include "wordhash.h"
 
 // constants for the Graham formula 
-#define KEEPERS		15		// how many extrema to keep
-#define MINIMUM_FREQ	5		// minimum freq
-#define UNKNOWN_WORD	0.4f		// odds that unknown word is spammish
-#define SPAM_CUTOFF	0.9f		// if it's spammier than this...
-#define MAX_REPEATS	4		// cap on word frequency per message
-
-#define MAX_PROB	0.99f		// max probability value used
-#define MIN_PROB	0.01f		// min probability value used
-#define EVEN_ODDS	0.5f		// used for words we want to ignore
-#define DEVIATION(n)	fabs((n) - EVEN_ODDS)		// deviation from average
-
+#define SPAM_CUTOFF	0.52f		// if it's spammier than this...
+#define EVEN_ODDS	0.5f
+#define MIN_DEV		0.0f		// if nonzero, use characteristic words
+#define MAX_REPEATS	1		// cap on word frequency per message
+
+#define ROBS 0.001f                    // Robinson's s
+#define ROBX 0.415f                    // Robinson's x
+
+#define max(x, y)      (((x) > (y)) ? (x) : (y))
+#define min(x, y)      (((x) < (y)) ? (x) : (y))
+#define        PLURAL(count) ((count == 1) ? "" : "s")
+   
 extern char msg_register[];
 
 static void wordprop_init(void *vwordprop){
@@ -122,7 +133,8 @@
   sprintf(msg_register, "register-%c, %d words, %d messages\n", ch, wordcount, msgcount);
 
   if (verbose)
-    fprintf(stderr, "# %d words, %d messages\n", wordcount, msgcount);
+    fprintf(stderr, "# %d word%s, %d message%s\n", 
+    wordcount, PLURAL(wordcount), msgcount, PLURAL(msgcount));
 
   good_list.active = spam_list.active = FALSE;
 
@@ -199,92 +211,6 @@
   wordhash_free(h);
 }
 
-typedef struct 
-{
-    char        key[MAXTOKENLEN+1];
-    double      prob;
-}
-discrim_t;
-
-typedef struct
-{
-    discrim_t extrema[KEEPERS];
-}
-bogostat_t;
-
-#define SIZEOF(array)	((size_t)(sizeof(array)/sizeof(array[0])))
-
-static bogostat_t bogostats;
-
-int compare_extrema(const void *id1, const void *id2)
-{ 
-    const discrim_t *d1 = id1;
-    const discrim_t *d2 = id2;
-    return ( (d1->prob > d2->prob) ||
-	     ((d1->prob == d2->prob) && (strcmp(d1->key, d2->key) > 0)));
-}
-
-void init_bogostats(bogostat_t *bogostats)
-{
-    int idx;
-
-    for (idx = 0; idx < SIZEOF(bogostats->extrema); idx++)
-    {
-	discrim_t *pp = &bogostats->extrema[idx];
-	pp->prob = EVEN_ODDS;
-	pp->key[0] = '\0';
-    }
-}
-
-void populate_bogostats( bogostat_t *bogostats, char *text, double prob, int count )
-// if  the new word,prob pair is a better indicator.
-// add them to the bogostats structure, 
-{
-    size_t idx;
-    double dev;
-    double slotdev, hitdev;
-    discrim_t *pp, *hit;
-
-    // update the list of tokens with maximum deviation
-    dev = DEVIATION(prob);
-    hit = NULL;
-    hitdev=1;
-
-    for (idx = 0; idx < SIZEOF(bogostats->extrema); idx++)
-    {
-	pp = &bogostats->extrema[idx];
-	if (pp->key[0] == '\0' )
-	{
-	    hit = pp;
-	    break;
-	}
-	else
-	{
-	    slotdev=DEVIATION(pp->prob);
-
-	    if (dev>slotdev && hitdev>slotdev)
-	    {
-		hit=pp;
-		hitdev=slotdev;
-	    }
-	}
-    }
-    if (hit) 
-    { 
-	hit->prob = prob;
-	strncpy(hit->key, text, MAXTOKENLEN);
-    }
-}
-
-void print_bogostats( FILE *fp )
-{
-    discrim_t *pp;
-    for (pp = bogostats.extrema; pp < bogostats.extrema+SIZEOF(bogostats.extrema); pp++)
-    {
-	fprintf(fp, "#  %f  %s\n", pp->prob, pp->key);
-    }
-}
-
 typedef struct {
     double good;
     double bad;
@@ -305,7 +231,8 @@
 
 double wordprob_result(wordprob_t* wordstats)
 {
-    double prob = wordstats->bad/(wordstats->good + wordstats->bad);
+    double prob = (ROBS * ROBX + wordstats->bad) /
+	(ROBS + wordstats->good + wordstats->bad);
     return (prob);
 }
 
@@ -348,62 +275,46 @@
 	    wordprob_add(&wordstats, prob, list->bad);
 	}
     }
-    if (totalcount < MINIMUM_FREQ)
-	prob=UNKNOWN_WORD;
-    else {
-	prob=wordprob_result(&wordstats);
-	prob = min(MAX_PROB, prob);
-	prob = max(MIN_PROB, prob);
-    }
+    prob=wordprob_result(&wordstats);
     return prob;
 }
 
-bogostat_t *select_indicators(wordhash_t *wordhash)
+double compute_spamicity(wordhash_t *wordhash)
 // selects the best spam/nonspam indicators and
-// populates the bogostats structure.
+// calculates Robinson's S
 {
     hashnode_t *node;
-
-    init_bogostats(&bogostats);
+    double invproduct = 0.0;   // Robinson's P
+    double product = 0.0;      // Robinson's Q
+    double spamicity, invn;
+    int robn = 0;
 
     for(node = wordhash_first(wordhash); node != NULL; node = wordhash_next(wordhash))
     {
 	char *token = node->key;
 	double prob = compute_probability( token );
 
-	populate_bogostats( &bogostats, token, prob, 1 );
-    }
-
-    return (&bogostats);
-}
-
-double compute_spamicity(bogostat_t *bogostats)
-// computes the spamicity of the words in the bogostat structure
-// returns:  the spamicity
-{
-    double product, invproduct;
-    double spamicity = 0.0;
-
-    discrim_t *pp;
-
+        // Robinson's P and Q; accumulation step
+        // P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)     [spamminess]
+        // Q = 1 - (p1*p2*...*pn)^(1/n)                 [non-spamminess]
+        if (fabs(0.5 - prob) >= MIN_DEV) {
+            invproduct += log(1.0 - prob);
+            product += log(prob);
+            robn ++;
+        }
+    }
+
+    // Robinson's P, Q and S
+    // S = (P - Q) / (P + Q)                        [combined indicator]
+    if (robn) {
+        invn = (double)robn;
+        invproduct = 1.0 - exp(invproduct / invn);
+        product = 1.0 - exp(product / invn);
+        spamicity =
+            (1.0 + (invproduct - product) / (invproduct + product)) / 2.0;
+    } else spamicity = ROBX;
     if (verbose)
-    {
-	// put the bogostats in ascending order by probability and alphabet
-	qsort(bogostats->extrema, SIZEOF(bogostats->extrema), sizeof(discrim_t), compare_extrema);
-    }
-
-    // Bayes' theorem.
-    // For discussion, see <http://www.mathpages.com/home/kmath267.htm>.
-    product = invproduct = 1.0f;
-    for (pp = bogostats->extrema; pp < bogostats->extrema+SIZEOF(bogostats->extrema); pp++)
-	if (pp->prob != 0)
-	{
-	    product *= pp->prob;
-	    invproduct *= (1 - pp->prob);
-	    spamicity = product / (product + invproduct);
-	    if (verbose>1)
-		fprintf(stderr, "#  %f  %f  %s\n", pp->prob, spamicity, pp->key);
-	}
+	fprintf(stderr, "# %f\n", spamicity);
 
     return spamicity;
 }
@@ -414,7 +325,6 @@
     rc_t	status;
     double 	spamicity;
     wordhash_t  *wordhash;
-    bogostat_t	*bogostats;
     int		wordcount, msgcount;
 
 //  tokenize input text and save words in a wordhash.
@@ -427,21 +337,19 @@
     good_list.msgcount = db_getcount(good_list.dbh);
     spam_list.msgcount = db_getcount(spam_list.dbh);
 
-//  select the best spam/nonspam indicators.
-    bogostats = select_indicators(wordhash);
-
-    db_lock_release_list(word_lists);
 
 //  computes the spamicity of the spam/nonspam indicators.
-    spamicity = compute_spamicity(bogostats);
+    spamicity = compute_spamicity(wordhash);
 
+    db_lock_release_list(word_lists);
     status = (spamicity > SPAM_CUTOFF) ? RC_SPAM : RC_NONSPAM;
 
     if (xss != NULL)
         *xss = spamicity;
     
     if (run_type == RUN_UPDATE)
-      register_words((status==RC_SPAM) ? REG_SPAM : REG_GOOD, wordhash, msgcount, wordcount);
+	register_words((status==RC_SPAM) ? REG_SPAM : REG_GOOD,
+	    wordhash, msgcount, wordcount);
 
     wordhash_free(wordhash);
 
--- bogofilter.h.orig	2002-10-19 09:21:27.000000000 -0400
+++ bogofilter.h	2002-10-19 09:21:27.000000000 -0400
@@ -7,7 +7,7 @@
 #include <lexer.h>
 #include <wordlists.h>
 
-#define GOOD_BIAS	2		// give good words more weight
+#define GOOD_BIAS	1		// don't give good words more weight
 
 typedef enum rc_e {RC_SPAM=0, RC_NONSPAM=1}  rc_t;
 


-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |