question about robinson algorithm

David Relson relson at osagesoftware.com
Thu Oct 31 05:12:28 CET 2002


At 10:11 PM 10/30/02, Graham Wilson wrote:

>does the number of messages in each corpus affect the robinson
>algorithm?
>
>--
>gram

Gram,

I'm CC'ing Greg Louis on this because he's the Robinson expert and 
coder.  He will correct anything incorrect in what I say below.

What I recall from my emails with him is that Robinson's equations use a 
few special constants.  A precise value for the "X" factor can be 
calculated by computing the probability associated with each word in the 
spam list and then applying some summing and averaging operations.
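
To make that concrete, here is a rough Python sketch of how I understand 
the calculation.  It is not bogofilter's code.  The function names, the 
toy word counts, and the formulas p(w) = b / (b + g) and 
f(w) = (s*x + n*p(w)) / (s + n) are my reading of Robinson's write-up, so 
treat them as assumptions.

```python
# Sketch only: names and toy data are hypothetical, not bogofilter's code.

def word_prob(spam_hits, ham_hits, n_spam, n_ham):
    """p(w): the word's spam frequency relative to its total frequency."""
    b = spam_hits / n_spam   # fraction of spam messages containing the word
    g = ham_hits / n_ham     # fraction of non-spam messages containing it
    return b / (b + g)

def estimate_x(counts, n_spam, n_ham):
    """Estimate "X" as the plain average of p(w) over all seen words."""
    probs = [word_prob(s, h, n_spam, n_ham)
             for s, h in counts.values() if s + h > 0]
    return sum(probs) / len(probs)

def robinson_f(spam_hits, ham_hits, n_spam, n_ham, s, x):
    """f(w): p(w) pulled toward x, with strength s, when data is scarce."""
    n = spam_hits + ham_hits
    if n == 0:
        return x             # never-seen word: fall back on the prior x
    p = word_prob(spam_hits, ham_hits, n_spam, n_ham)
    return (s * x + n * p) / (s + n)

# Toy counts: word -> (spam messages with word, non-spam messages with word)
counts = {"viagra": (40, 0), "meeting": (0, 40), "free": (20, 20)}
x = estimate_x(counts, n_spam=100, n_ham=100)
print(x, robinson_f(40, 0, 100, 100, s=1.0, x=x))
```

The "S" factor here is just the weight given to "X" relative to the 
observed counts; s=1.0 is an arbitrary value picked for the example.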

The absolute size of each corpus isn't too important.  Rather, it's the 
relative sizes (i.e. the message counts) of the two corpuses that matter 
more.  If both corpuses double in size, the factors won't change 
significantly.
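
A small Python sketch of the point, under my own assumptions: the formula 
is my reading of Robinson's f(w), the counts are made up, and s=1.0 and 
x=0.5 are arbitrary illustrative values.  It compares one word's score 
before and after both corpuses double, and then shows what happens if the 
message counts are ignored (both corpuses treated as the same size).

```python
# Sketch only: hypothetical counts, not measurements from a real corpus.

def robinson_f(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style word score, normalized by corpus message counts."""
    n = spam_hits + ham_hits
    if n == 0:
        return x                      # unseen word: use the prior
    b = spam_hits / n_spam            # share of spam messages with the word
    g = ham_hits / n_ham              # share of non-spam messages with it
    p = b / (b + g)
    return (s * x + n * p) / (s + n)

# Same word, before and after both corpuses (and its counts) double.
before = robinson_f(30, 5, n_spam=1000, n_ham=2000)
after = robinson_f(60, 10, n_spam=2000, n_ham=4000)

# Same counts, but pretending the two corpuses are equal in size.
skewed = robinson_f(30, 5, n_spam=1000, n_ham=1000)

print("doubling both corpuses shifts the score by", abs(after - before))
print("ignoring the size ratio shifts the score by", abs(skewed - before))
```

With these toy numbers the shift from doubling is much smaller than the 
shift from getting the ratio wrong, which is the behaviour described 
above.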

Also, he and I have been doing some tests and it seems that the values 
chosen for the Robinson "S" and "X" factors are not that critical.  Within 
limits, changing the values used doesn't have much effect on the 
classification accuracy.

I hope this helps.  If you're interested, you should read the Graham and 
Robinson papers.

David





More information about the Bogofilter mailing list