question about robinson algorithm
David Relson
relson at osagesoftware.com
Thu Oct 31 05:12:28 CET 2002
At 10:11 PM 10/30/02, Graham Wilson wrote:
>does the number of messages in each corpus affect the robinson
>algorithm?
>
>--
>gram
Gram,
I'm CC'ing Greg Louis on this because he's the Robinson expert and
coder. He will correct anything I get wrong below.
What I recall from my emails with him is that Robinson's equations use a
couple of tunable parameters, the "S" and "X" factors. A precise value for
the "X" factor can be calculated by computing the probability associated
with each word in the spam list and then summing and averaging those
probabilities.
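As a rough sketch of that averaging step (not bogofilter's actual code):
the function names and the word counts below are made up for
illustration, and p(w) is assumed to be built from per-message
frequencies in the two corpuses.

```python
def word_prob(bad, good, nbad, ngood):
    """p(w): the word's spam rate as a share of its spam + non-spam rates.

    bad/good are the counts of spam/non-spam messages containing the word;
    nbad/ngood are the message counts of the two corpuses.
    """
    bad_rate = bad / nbad
    good_rate = good / ngood
    return bad_rate / (bad_rate + good_rate)


def estimate_x(counts, nbad, ngood):
    """Average p(w) over every word in the list (assumed way to get X)."""
    probs = [word_prob(b, g, nbad, ngood) for b, g in counts.values()]
    return sum(probs) / len(probs)


# Hypothetical word list: word -> (spam count, non-spam count)
counts = {"viagra": (40, 1), "meeting": (2, 30), "free": (25, 10)}
x = estimate_x(counts, nbad=100, ngood=100)
print(f"estimated X = {x:.3f}")
```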
The absolute size of the corpus isn't too important. Rather, it's the
relative sizes (i.e. the message counts) of the two corpuses that matter
more. If both corpuses double in size, the factors won't change
significantly.
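A small illustration of why, assuming the per-word probability is
computed from per-message frequencies and that word counts grow in
proportion to the corpus (the function name and all numbers here are
hypothetical):

```python
def word_prob(bad, good, nbad, ngood):
    """p(w) from per-message frequencies in the spam and non-spam corpuses."""
    bad_rate = bad / nbad
    good_rate = good / ngood
    return bad_rate / (bad_rate + good_rate)


# Baseline: 100 spam and 100 non-spam messages.
base = word_prob(bad=25, good=10, nbad=100, ngood=100)

# Both corpuses double, and the word counts double along with them.
doubled = word_prob(bad=50, good=20, nbad=200, ngood=200)

# Only the spam message count doubles: the relative sizes change.
skewed = word_prob(bad=25, good=10, nbad=200, ngood=100)

print(f"base={base:.3f} doubled={doubled:.3f} skewed={skewed:.3f}")
```

The first two values agree because the message counts cancel when both
scale together; the third differs because the ratio between the two
corpuses has changed.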
Also, he and I have been doing some tests and it seems that the values
chosen for the Robinson "S" and "X" factors are not that critical. Within
limits, changing the values used doesn't have much effect on the
classification accuracy.
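For reference, Robinson's adjusted word probability is
f(w) = (s*x + n*p(w)) / (s + n), where n is the number of messages
containing the word. The sketch below uses made-up values for p(w), n,
s and x; it only shows how little f(w) moves for a frequently seen
word and says nothing by itself about overall accuracy.

```python
def robinson_f(p, n, s, x):
    """Robinson's f(w): blend the prior x (weight s) with p(w) (weight n)."""
    return (s * x + n * p) / (s + n)


# Hypothetical word seen in 50 messages with p(w) = 0.9.
p, n = 0.9, 50
results = {}
for s in (0.1, 1.0):
    for x in (0.4, 0.5, 0.6):
        results[(s, x)] = robinson_f(p, n, s, x)
        print(f"s={s} x={x} f(w)={results[(s, x)]:.4f}")

# A word never seen before (n = 0) simply gets the prior x.
unseen = robinson_f(p=0.0, n=0, s=1.0, x=0.5)
```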
I hope this helps. If you're interested, you should read the Graham and
Robinson papers.
David