what happens if I discard tokens that occur only once?

David Relson relson at osagesoftware.com
Sat Jun 4 16:05:08 CEST 2005


On Sat, 4 Jun 2005 08:25:05 -0500
Bill McClain wrote:

> On Fri, 3 Jun 2005 17:47:04 -0400
> David Relson <relson at osagesoftware.com> wrote:
> 
> > If only some
> > messages get registered, then one has no additional info about the
> > hapax.
> 
> Right, I'm using thresh_update, so only about 10% of recognized spam
> is registered.
> 
> I have an example of the value of hapaxes. In March I wrote that I
> thought replace-nonascii-characters had stopped working. I was
> mistaken; I was for the first time seeing 8-bit chars in my wordlist,
> but this was because a previously unseen type of Cyrillic spam had
> started arriving.
> 
> Since then I have seen hundreds of these spams, but all have been
> properly classified and the wordlist has 469 8-bit tokens which I
> believe came from 4 messages. Now, the interesting bit: 9 of these
> tokens have count=2, the other 460 are all hapaxes. I can't say for sure
> which are being used, but somehow this set of tokens is 100% effective
> in detecting the Cyrillic spam.
> 
> This is an extreme example because of the exotic nature of the tokens:
> in my case, I don't get any legitimate mail that would include them.
> But a large number of spam tokens are in some way "exotic" and the
> Bayesian method makes good use of them. No matter how old my Cyrillic
> hapaxes become, it would be a mistake to purge them. (Well, I'd just
> have to register new copies.)

Interesting!  Glad to hear it's working.

Hapaxes can be dangerous as well.  We all know about spam containing
random collections of words.  A long while back I had one with
"dartmouth" in it.  Months later this caused a false positive.
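
To see why a single registration can carry this much weight, here is a
minimal sketch of Robinson's smoothed token probability,
f(w) = (s*x + n*p(w)) / (s + n).  The function name robinson_f and the
values chosen for s and x are assumptions made for illustration; check
your own configuration before relying on the numbers.

```python
def robinson_f(spam_count, good_count, n_spam_msgs, n_good_msgs,
               s=0.0178, x=0.52):
    """Robinson's smoothed spam probability for one token.

    s (strength of the prior) and x (score assumed for an unseen token)
    are illustrative values, not read from any real configuration.
    """
    n = spam_count + good_count          # how often the token was registered
    if n == 0:
        return x                         # unseen token: fall back to the prior
    spam_rate = spam_count / n_spam_msgs
    good_rate = good_count / n_good_msgs
    p = spam_rate / (spam_rate + good_rate)   # raw, unsmoothed estimate
    return (s * x + n * p) / (s + n)


# A hapax registered once as spam, with 1000 messages of each kind on file.
print(robinson_f(1, 0, 1000, 1000))          # small s: score close to 1
print(robinson_f(1, 0, 1000, 1000, s=1.0))   # larger s: pulled toward x
```

With a small s, one spam registration is enough to push a token close to
1.0, which is how an ordinary word like "dartmouth" can later tip a
legitimate message over the line.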

> With a touch more time and ambition I might patch bogofilter to report
> the wordlist entries it is reading, sending the data to a background
> process or, more simply, just logging it to a file for later analysis.
> Run that for a few weeks and see how much of the wordlist is actually
> used, what percentage of hapaxes are checked, etc.

You can accomplish this using bogofilter's debug capabilities.  In
token.c DEBUG_LEVEL(1) writes tokens to dbgout (which is normally
stderr).  Try adding "-x t -vv -q 2> dbgout" to your command line.  If
it's not exactly what you want, it's close!  (Note: you'll need 0.94.13
which has the "-q (quiet)" option, or the attached patch (to implement
the option)).
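
Once the looked-up tokens have been collected, the analysis Bill
describes could be sketched roughly as below.  This is only a sketch
under assumptions: it supposes the debug output has already been reduced
to one token per entry, and that the wordlist has been dumped as text
lines of the form "token spamcount goodcount [date]".  The function
names parse_wordlist_dump and usage_report are hypothetical and not part
of bogofilter.

```python
from collections import Counter


def parse_wordlist_dump(lines):
    """Parse lines assumed to look like 'token spamcount goodcount [date]'."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue                     # skip blank or malformed lines
        counts[parts[0]] = (int(parts[1]), int(parts[2]))
    return counts


def usage_report(wordlist, looked_up_tokens):
    """Summarize how much of the wordlist was consulted during scoring."""
    # Only tokens actually present in the wordlist count as "used".
    seen = Counter(t for t in looked_up_tokens if t in wordlist)
    # A hapax here is a token registered exactly once in total.
    hapaxes = {t for t, (spam, good) in wordlist.items() if spam + good == 1}
    return {
        "wordlist_size": len(wordlist),
        "tokens_used": len(seen),
        "hapaxes": len(hapaxes),
        "hapaxes_used": len(hapaxes & seen.keys()),
    }


# Made-up sample data, for illustration only.
wordlist = parse_wordlist_dump([
    "viagra 5 0 20050601",
    "dartmouth 1 0 20050101",
    "hello 0 7 20050520",
    "rare 0 1 20050301",
])
print(usage_report(wordlist, ["viagra", "dartmouth", "viagra", "unknown"]))
```

Run over a few weeks of logged tokens, the ratio of hapaxes_used to
hapaxes gives the "percentage of hapaxes checked" figure.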

> As an aside, I find Bayesian classification fascinating because it is
> the first example of what might be called "statistical intelligence"
> that I have spent any time with and I would like to understand it
> better. (Non-statistically!)

"statistical intelligence"!  I like it.

Enjoy,

David
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0604.bogoconfig.quiet
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20050604/1ba2afc1/attachment.ksh>

