what happens if I discard tokens that occur only once?

Sat Jun 4 16:05:08 CEST 2005

On Sat, 4 Jun 2005 08:25:05 -0500
Bill McClain wrote:

> On Fri, 3 Jun 2005 17:47:04 -0400
> David Relson <relson at osagesoftware.com> wrote:
> 
> > If only some
> > messages get registered, then one has no additional info about the
> > hapax.
> 
> Right, I'm using thresh_update, so only about 10% of recognized spam
> is registered.
> 
> I have an example of the value of hapaxes. In March I wrote that I
> thought replace-nonascii-characters had stopped working. I was
> mistaken; I was for the first time seeing 8-bit chars in my wordlist,
> but this was because a previously unseen type of Cyrillic spam had
> started arriving. 
> 
> Since then I have seen hundreds of these spams, but all have been
> properly classified and the wordlist has 469 8-bit tokens which I
> believe came from 4 messages. Now, the interesting bit: 9 of these
> tokens have count=2, the other 460 are all hapaxes. I can't say for sure
> which are being used, but somehow this set of tokens is 100% effective
> in detecting the Cyrillic spam.
> 
> This is an extreme example because of the exotic nature of the tokens
> -- in my case; I don't get any legitimate mail that would include them.
> But a large number of spam tokens are in some way "exotic" and the
> Bayesian method makes good use of them. No matter how old my Cyrillic
> hapaxes become, it would be a mistake to purge them. (Well, I'd just
> have to register new copies).
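
Counts like the ones quoted above (469 tokens, 460 of them hapaxes) can be gathered mechanically from a text dump of the wordlist. Here is a minimal sketch in C. It assumes each dump line looks like "token spamcount goodcount date" and that names beginning with '.' are bookkeeping entries rather than real tokens; both are assumptions about the dump layout, and the names tally_line and tally_stream are made up for illustration.

```c
#include <stdio.h>

/* Running totals for a wordlist dump. */
struct tally {
    unsigned long tokens;   /* real tokens seen */
    unsigned long hapaxes;  /* tokens whose total count is exactly 1 */
};

/*
 * Parse one dump line of the ASSUMED form "token spamcount goodcount date".
 * Returns 1 if the line was counted, 0 if it was skipped (unparseable, or a
 * name starting with '.', which is assumed to be metadata).
 */
int tally_line(const char *line, struct tally *t)
{
    char token[256];
    unsigned long spam, good;

    if (sscanf(line, "%255s %lu %lu", token, &spam, &good) != 3)
        return 0;
    if (token[0] == '.')
        return 0;

    t->tokens++;
    if (spam + good == 1)
        t->hapaxes++;
    return 1;
}

/* Tally every line of a stream, e.g. stdin fed from a wordlist dump. */
void tally_stream(FILE *in, struct tally *t)
{
    char line[1024];

    while (fgets(line, sizeof line, in) != NULL)
        tally_line(line, t);
}
```

To use it, zero a struct tally, pass stdin to tally_stream, and print the two counters; the dump itself could come from bogoutil -d, if your version prints that layout.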

Interesting!  Glad to hear it's working.

Hapaxes can be dangerous as well.  We all know about spam with
random collections of words.  A long while back I had
one with "dartmouth" in it.  Months later this caused a false positive.
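
To see why a single sighting can carry so much weight, here is a rough sketch in C of Robinson-style smoothing, f(w) = (s*x + n*p(w)) / (s + n).  The function name robinson_fw and the values used for s and x below are illustrative assumptions, not taken from any particular bogofilter configuration.

```c
/*
 * Robinson-style smoothed spam probability for one token (a sketch).
 *
 *   spam_count, good_count : how often the token was seen in each class
 *   spam_msgs, good_msgs   : number of registered messages in each class
 *   s                      : strength given to the prior (assumed value)
 *   x                      : score assumed for a never-seen token
 */
double robinson_fw(double spam_count, double good_count,
                   double spam_msgs, double good_msgs,
                   double s, double x)
{
    double n = spam_count + good_count;

    /* With no data at all, fall back to the prior. */
    if (n <= 0.0 || spam_msgs <= 0.0 || good_msgs <= 0.0)
        return x;

    /* Per-class frequencies, then the raw probability p(w). */
    double bad_rate = spam_count / spam_msgs;
    double good_rate = good_count / good_msgs;
    double p = bad_rate / (bad_rate + good_rate);

    /* Pull p(w) toward x; the pull weakens as n grows. */
    return (s * x + n * p) / (s + n);
}
```

With a small s such as 0.0178 and x = 0.52, a token seen once in spam and never in ham (for example robinson_fw(1, 0, 1000, 1000, 0.0178, 0.52)) comes out at roughly 0.99, so one stray word from a random-word spam is enough to mark it as strongly spammy.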

> With a touch more time and ambition I might patch bogofilter to report
> the wordlist entries it is reading, sending the data to a background
> process or, more simply, just logging it to a file for later analysis.
> Run that for a few weeks and see how much of the wordlist is actually
> used, what percentage of hapaxes are checked, etc.

You can accomplish this using bogofilter's debug capabilities.  In
token.c DEBUG_LEVEL(1) writes tokens to dbgout (which is normally
stderr).  Try adding "-x t -vv -q 2> dbgout" to your command line.  If
it's not exactly what you want, it's close!  (Note: you'll need 0.94.13,
which has the "-q" (quiet) option, or the attached patch, which
implements it.)

> As an aside, I find Bayesian classification fascinating because it is
> the first example of what might be called "statistical intelligence"
> that I have spent any time with and I would like to understand it
> better. (Non-statistically!)

"statistical intelligence"!  I like it.

Enjoy,

David
-------------- next part --------------
diff -u -r --exclude-from=diff.excl 09412/src/bogoconfig.c 09413/src/bogoconfig.c
--- 09412/src/bogoconfig.c	2005-05-30 12:51:31.000000000 -0400
+++ 09413/src/bogoconfig.c	2005-05-28 14:14:56.000000000 -0400
@@ -314,6 +310,7 @@
     "info options:\n",
     "  -t, --terse               - set terse output mode.\n",
     "  -T, --fixed-terse-format  - set invariant terse output mode.\n",
+    "  -q, --quiet               - suppress token statistics.\n",
     "  -U, --report-unsure       - print statistics if spamicity is 'unsure'.\n",
     "  -v, --verbosity           - set debug verbosity level.\n",
     "  -y, --timestamp-date      - set date for token timestamps.\n",
@@ -547,6 +544,10 @@
 	passthrough = true;
 	break;
 
+    case 'q':
+	quiet = true;
+	break;
+
     case 'Q':
 	if (pass == PASS_1_CLI)
 	    query += 1;
diff -u -r --exclude-from=diff.excl 09412/src/score.c 09413/src/score.c
--- 09412/src/score.c	2005-03-15 19:25:29.000000000 -0500
+++ 09413/src/score.c	2005-05-26 18:38:10.000000000 -0400
@@ -73,6 +73,9 @@
 
     (void)fp;
 
+    if (quiet)
+	return;
+
     if (Rtable || unsure || verbose >= 2)
 	rstats_print(unsure);
 }