what happens if I discard tokens that occur only once?

Bill McClain wmcclain at salamander.com
Sat Jun 4 15:25:05 CEST 2005


On Fri, 3 Jun 2005 17:47:04 -0400
David Relson <relson at osagesoftware.com> wrote:

> If only some
> messages get registered, then one has no additional info about the
> hapax.

Right, I'm using thresh_update, so only about 10% of recognized spam
is registered.

I have an example of the value of hapaxes. In March I wrote that I
thought replace-nonascii-characters had stopped working. I was
mistaken: I was seeing 8-bit characters in my wordlist for the first
time, but only because a previously unseen type of Cyrillic spam had
started arriving.

Since then I have seen hundreds of these spams, but all have been
properly classified and the wordlist has 469 8-bit tokens which I
believe came from 4 messages. Now, the interesting bit: 9 of these
tokens have count=2, the other 460 are all hapaxes. I can't say for sure
which are being used, but somehow this set of tokens is 100% effective
in detecting the Cyrillic spam.
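
For anyone who wants to make the same tally on their own wordlist, here
is a minimal sketch. It assumes a text dump of the wordlist (for example
from bogoutil -d) with one line per token in the form
"token spam_count ham_count date", fields separated by single spaces.
That layout, the function names and the sample tokens are my assumptions
for illustration, so check them against your own dump first.

```python
from collections import Counter

def is_8bit(token):
    """True if the token contains any character outside 7-bit ASCII."""
    return any(ord(ch) > 127 for ch in token)

def count_8bit_tokens(lines):
    """Tally 8-bit tokens by their total (spam + ham) count.

    Assumed line layout: "token spam_count ham_count [date]".
    """
    tally = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        token, spam, ham = fields[0], int(fields[1]), int(fields[2])
        if is_8bit(token):
            tally[spam + ham] += 1
    return tally

# Made-up sample lines: two 8-bit tokens and one plain ASCII token.
sample = [
    "\xd0\xe0\xe1\xee\xf2\xe0 1 0 20050315",
    "\xf6\xe5\xed\xe0 2 0 20050402",
    "hello 5 40 20050601",
]
tally = count_8bit_tokens(sample)
hapaxes = tally[1]  # 8-bit tokens seen exactly once
```

On a real dump, tally[1] would be the number of 8-bit hapaxes and
tally[2] the number with count=2.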

This is an extreme example because of the exotic nature of the tokens:
in my case, I don't get any legitimate mail that would include them.
But a large number of spam tokens are in some way "exotic" and the
Bayesian method makes good use of them. No matter how old my Cyrillic
hapaxes become, it would be a mistake to purge them. (Well, I'd just
have to register new copies.)
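
To see why a single sighting can carry so much weight, here is a rough
sketch of Robinson's smoothed per-token estimate f(w), which as I
understand it is what bogofilter's scoring builds on. The robs and robx
values below are what I believe the defaults to be, and the message
totals are invented, so treat all the numbers as assumptions.

```python
def robinson_fw(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
                robs=0.0178, robx=0.52):
    """Robinson's f(w) = (s*x + n*p(w)) / (s + n).

    robs (s) is the weight given to the prior and robx (x) is the
    prior spamicity for unknown tokens. Both values are assumed here.
    """
    bad = spam_count / n_spam_msgs
    good = ham_count / n_ham_msgs
    if bad + good == 0:
        return robx  # token never seen: fall back to the prior
    pw = bad / (bad + good)
    n = spam_count + ham_count
    return (robs * robx + n * pw) / (robs + n)

# A hapax seen once in spam and never in ham (invented message totals).
hapax_score = robinson_fw(1, 0, 1000, 1000)

# The same hapax with a much heavier prior weight (s = 1, x = 0.5).
damped_score = robinson_fw(1, 0, 1000, 1000, robs=1.0, robx=0.5)
```

With a small robs the hapax scores above 0.99, so it counts as a strong
spam indicator after just one registration. With robs=1.0 the same token
drops to 0.75, which is why the choice of robs matters so much for how
useful hapaxes are.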

With a touch more time and ambition I might patch bogofilter to report
the wordlist entries it is reading, sending the data to a background
process or, more simply, just logging it to a file for later analysis.
Run that for a few weeks and see how much of the wordlist is actually
used, what percentage of hapaxes are checked, etc.
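
The analysis side of that could be as simple as the following sketch.
It assumes the (hypothetical) patch has logged one token per wordlist
lookup, and that the wordlist is available as a mapping from token to
(spam_count, ham_count). Neither the log nor these names exist in
bogofilter today; they are placeholders for illustration.

```python
def usage_stats(wordlist, lookups):
    """Summarize how much of the wordlist was actually consulted.

    wordlist: dict mapping token -> (spam_count, ham_count)
    lookups:  iterable of tokens read during scoring (hypothetical log)
    """
    used = {t for t in lookups if t in wordlist}
    hapaxes = {t for t, (s, h) in wordlist.items() if s + h == 1}
    used_hapaxes = used & hapaxes
    return {
        "wordlist_used": len(used) / len(wordlist) if wordlist else 0.0,
        "hapaxes_checked": len(used_hapaxes) / len(hapaxes) if hapaxes else 0.0,
    }

# Toy data: four wordlist entries, two of them hapaxes.
wordlist = {"a": (1, 0), "b": (2, 0), "c": (0, 1), "d": (5, 5)}
lookups = ["a", "d", "d", "zzz"]  # "zzz" is not in the wordlist
stats = usage_stats(wordlist, lookups)
```

Here half of the wordlist was touched and half of the hapaxes were
checked. Repeated lookups of the same token count once, since the
question is which entries are used at all, not how often.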

As an aside, I find Bayesian classification fascinating because it is
the first example of what might be called "statistical intelligence"
that I have spent any time with and I would like to understand it
better. (Non-statistically!)

-Bill
-- 
Sattre Press                              History of Astronomy 
http://sattre-press.com/               During the 19th Century
info at sattre-press.com                       by Agnes M. Clerke
                              http://sattre-press.com/han.html


