what happens if I discard tokens that occur only once?
David Relson
relson at osagesoftware.com
Sat Jun 4 16:05:08 CEST 2005
On Sat, 4 Jun 2005 08:25:05 -0500
Bill McClain wrote:
> On Fri, 3 Jun 2005 17:47:04 -0400
> David Relson <relson at osagesoftware.com> wrote:
>
> > If only some
> > messages get registered, then one has no additional info about the
> > hapax.
>
> Right, I'm using thresh_update, so only about 10% of recognized spam
> is registered.
>
> I have an example of the value of hapaxes. In March I wrote that I
> thought replace-nonascii-characters had stopped working. I was
> mistaken; I was for the first time seeing 8-bit chars in my wordlist,
> but this was because a previously unseen type of cyrillic spam had
> started arriving.
>
> Since then I have seen hundreds of these spams, but all have been
> properly classified and the wordlist has 469 8-bit tokens which I
> believe came from 4 messages. Now, the interesting bit: 9 of these
> tokens have count=2, the other 460 are all hapaxes. I can't say for sure
> which are being used, but somehow this set of tokens is 100% effective
> in detecting the cyrillic spam.
>
> This is an extreme example because of the exotic nature of the tokens
> -- in my case; I don't get any legitimate mail that would include them.
> But a large number of spam tokens are in some way "exotic" and the
> bayesian method makes good use of them. No matter how old my cyrillic
> hapaxes become, it would be a mistake to purge them. (Well, I'd just
> have to register new copies).
Interesting! Glad to hear it's working.
Hapaxes can be dangerous as well. We all know about spam padded with
random collections of words. A long while back I had one with
"dartmouth" in it. Months later this caused a false positive.
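
To see why a single stray hapax can carry that much weight, here is a rough sketch of Robinson's smoothed token probability f(w), the kind of calculation bogofilter's documentation describes. The function name, the message totals, and the s and x values are illustrative assumptions, not numbers taken from this thread or from any particular bogofilter configuration.

```python
def robinson_f(spam_count, ham_count, spam_msgs, ham_msgs, s=0.0178, x=0.52):
    """Robinson's smoothed spam probability f(w) for one token.

    s (strength of the prior) and x (score assumed for an unknown token)
    are assumed values chosen for illustration.
    """
    n = spam_count + ham_count
    if n == 0:
        # Token never seen: fall back to the prior.
        return x
    bad = spam_count / spam_msgs    # fraction of spam messages with the token
    good = ham_count / ham_msgs     # fraction of ham messages with the token
    p = bad / (bad + good)          # raw estimate p(w)
    return (s * x + n * p) / (s + n)


# A token seen once, in spam only, versus a token seen equally in both.
hapax = robinson_f(1, 0, spam_msgs=1000, ham_msgs=1000)
common = robinson_f(50, 50, spam_msgs=1000, ham_msgs=1000)
print(f"spam-only hapax: {hapax:.3f}, evenly split token: {common:.3f}")
```

With a small s, one sighting in spam is enough to push the token's score close to 1, which is what makes hapaxes both useful (the cyrillic tokens) and risky (the "dartmouth" case).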
> With a touch more time and ambition I might patch bogofilter to report
> the wordlist entries it is reading, sending the data to a background
> process or, more simply, just logging it to a file for later analysis.
> Run that for a few weeks and see how much of the wordlist is actually
> used, what percentage of hapaxes are checked, etc.
You can accomplish this using bogofilter's debug capabilities. In
token.c, DEBUG_LEVEL(1) writes tokens to dbgout (which is normally
stderr). Try adding "-x t -vv -q 2> dbgout" to your command line. If
it's not exactly what you want, it's close! (Note: you'll need 0.94.13,
which has the "-q" (quiet) option, or the attached patch, which
implements the option.)
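
Once a few weeks of token output have been collected, the analysis could look something like the sketch below. It assumes two inputs that are not specified in this thread: a text dump of the wordlist with lines shaped like "token spamcount hamcount date", and a list of tokens already extracted from the debug log (the exact log format is not shown here, so that parsing step is left out). The function names are hypothetical.

```python
def load_counts(dump_lines):
    """Map token -> total count from lines shaped like 'token spam ham date'.

    The line layout is an assumption about the wordlist dump format.
    """
    counts = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        token, spam, ham = fields[0], int(fields[1]), int(fields[2])
        counts[token] = spam + ham
    return counts


def hapax_usage(counts, seen_tokens):
    """Summarize how much of the wordlist, and how many hapaxes, were looked up."""
    hapaxes = {t for t, c in counts.items() if c == 1}
    used = set(seen_tokens).intersection(counts)  # seen tokens present in the wordlist
    return {
        "wordlist_size": len(counts),
        "hapaxes": len(hapaxes),
        "used": len(used),
        "used_hapaxes": len(used & hapaxes),
    }


# Tiny made-up sample standing in for a real dump and a real token log.
dump = [
    "dartmouth 1 0 20050301",
    "viagra 40 0 20050520",
    "meeting 0 25 20050601",
    "xyzzy 1 0 20050102",
]
seen = ["dartmouth", "meeting", "unknownword"]
stats = hapax_usage(load_counts(dump), seen)
print(stats)
```

Dividing "used_hapaxes" by "hapaxes" gives the percentage of hapaxes that were actually checked over the logging period.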
> As an aside, I find bayesian classification fascinating because it is
> the first example of what might be called "statistical intelligence"
> that I have spent any time with and I would like to understand it
> better. (Non-statistically!)
"statistical intelligence"! I like it.
Enjoy,
David
-------------- next part --------------
diff -u -r --exclude-from=diff.excl 09412/src/bogoconfig.c 09413/src/bogoconfig.c
--- 09412/src/bogoconfig.c 2005-05-30 12:51:31.000000000 -0400
+++ 09413/src/bogoconfig.c 2005-05-28 14:14:56.000000000 -0400
@@ -314,6 +310,7 @@
"info options:\n",
" -t, --terse - set terse output mode.\n",
" -T, --fixed-terse-format - set invariant terse output mode.\n",
+ " -q, --quiet - suppress token statistics.\n",
" -U, --report-unsure - print statistics if spamicity is 'unsure'.\n",
" -v, --verbosity - set debug verbosity level.\n",
" -y, --timestamp-date - set date for token timestamps.\n",
@@ -547,6 +544,10 @@
passthrough = true;
break;
+ case 'q':
+ quiet = true;
+ break;
+
case 'Q':
if (pass == PASS_1_CLI)
query += 1;
diff -u -r --exclude-from=diff.excl 09412/src/score.c 09413/src/score.c
--- 09412/src/score.c 2005-03-15 19:25:29.000000000 -0500
+++ 09413/src/score.c 2005-05-26 18:38:10.000000000 -0400
@@ -73,6 +73,9 @@
(void)fp;
+ if (quiet)
+ return;
+
if (Rtable || unsure || verbose >= 2)
rstats_print(unsure);
}