Robinson algorithm

Mon Nov 11 13:37:33 CET 2002

On 20021111 (Mon) at 0709:29 -0500, David Relson wrote:
> At 05:06 AM 11/11/02, Alexander Wasmuth wrote:

> >is there any difference between "bogofilter -s/-n" and "bogofilter -s/-n
> >-r"?

> There is a small difference between the Graham and Robinson algorithm
> in how messages are scanned and entered into the word lists.  Using
> the Graham algorithm when the message is parsed and words are
> counted, each word is allowed a maximum count of 4.

Historical note:  Graham's method called for a maximum count of
infinity.  The limit of 4 was something added (by esr?) on empirical
grounds.

Theoretical note:  Robinson recommended a limit of one based on the
following argument:  A word that's already appeared in a message is
more likely than any random word to appear again in the remaining text
of that message; therefore, if your limit is anything other than one,
you're in violation of the assumption (on which the method relies) that
the occurrences of tokens are independent.  While this is true,
Robinson also notes that the violation may not have very severe
consequences; he recommends that we choose whatever limit performs
best.  (A word that's appeared in a message is more likely than any
random word to appear again in another message from the same source, so
the independence assumption is violated by reality in any event; just
less so if you set the maximum count per message to one.)

So what does work best in practice?  I'm using 1 with Robinson's
method; it might even make a slight positive difference to use 1 with
Graham's.  My testing to date has not been extensive enough to support
a conclusion.  To put that another way, if there's any difference it
may be small, and David's probably right in suggesting it doesn't
matter much.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |