question about robinson algorithm
Greg Louis
glouis at dynamicro.on.ca
Thu Oct 31 13:07:34 CET 2002
On 20021030 (Wed) at 2111:41 -0600, Graham Wilson wrote:
> does the number of messages in each corpus affect the robinson
> algorithm?
Well, let's take a look at the calculation method.
If

  cb is the count of a given token in the spamlist,

  cg is the count of that token in the goodlist, multiplied by the
     number of messages in the spamlist and divided by the number
     of messages in the goodlist,

  x  is the f(w) value we assign to a token that's in neither
     training list, and

  s  is a factor indicating how strongly we wish to trust x when
     there are few instances of a token in the training list,

then for each token w of a message containing n tokens:

  f(w) = (s * x + cb) / (s + cg + cb)

Let

  P = product(1 - f(w)) for w in 1..n
  Q = product(f(w))     for w in 1..n
  S = (P - Q) / (P + Q)

  spamicity = (1 + S) / 2   (so it ranges from 0 to 1 instead of
                             -1 to 1)
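Concretely, the combination above can be sketched in Python. The token counts and function names here are made up for illustration; this is not bogofilter's actual code:

```python
from math import prod  # Python 3.8+

def f_w(cb, cg_scaled, x=0.5, s=1.0):
    """Robinson's f(w) for one token.
    cb: count of the token in the spamlist.
    cg_scaled: goodlist count, already scaled by (spam msgs / good msgs).
    x: assumed value for tokens in neither list; s: strength of that assumption."""
    return (s * x + cb) / (s + cg_scaled + cb)

def spamicity(fws):
    """Combine the per-token f(w) values into a 0..1 spamicity score."""
    P = prod(1 - f for f in fws)   # product of (1 - f(w))
    Q = prod(fws)                  # product of f(w)
    S = (P - Q) / (P + Q)          # ranges from -1 to 1
    return (1 + S) / 2             # rescaled to 0..1

# Hypothetical (cb, scaled cg) pairs for a three-token message:
tokens = [(5, 1), (0, 4), (2, 2)]
fws = [f_w(cb, cg) for cb, cg in tokens]
print(round(spamicity(fws), 3))  # prints 0.711
```

Note how the token seen mostly in spam (5, 1) pushes the score up, the one seen mostly in good mail (0, 4) pushes it down, and the balanced one (2, 2) sits at exactly 0.5.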
So where does the number of messages in each corpus have an impact?
1. If there are very few messages in either or both corpora, the
discriminatory power of the algorithm will be small. In
particular, when large numbers of tokens go unrecognized, or have
low counts in one list and zero in the other, careful tuning of the
value of s is needed to optimize discrimination, and the result may
still be unsatisfactory. (Not a surprising observation, this: all
it says is, you need training to make bogofilter work well.)
2. The fact that cg is scaled to match the size of the spamlist
reduces the effect of a lopsided training set (one where there are
many more messages in the normal corpus than in the spam corpus, or
vice versa). Nevertheless, extreme lopsidedness should probably
still be avoided. I can't prove this, but my feeling from
experience to date is that it helps to keep the ratio of spams to
nonspams between about 0.75 and 1.5, though anything from 0.1 to
10 is probably ok as long as the smaller list still has plenty of
messages in it. What I suspect is that a law of diminishing
returns sets in as the sizes diverge: you have a bigger database
to search, and you don't get as much improvement in discriminatory
power.
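The effect of scaling cg can be illustrated with made-up numbers: a token that occurs at the same per-message rate in both corpora ought to look neutral, and the scaling is what makes it so (again, illustrative code, not bogofilter's):

```python
def scaled_cg(cg_raw, spam_msgs, good_msgs):
    """Scale the raw goodlist count to what it would be if the
    goodlist held as many messages as the spamlist."""
    return cg_raw * spam_msgs / good_msgs

def f_w(cb, cg, x=0.5, s=1.0):
    return (s * x + cb) / (s + cg + cb)

# A token seen 3 times in 1000 spams and 30 times in 10000 good
# messages occurs at the same per-message rate in both corpora.
cb, cg_raw = 3, 30
unscaled = f_w(cb, cg_raw)                        # raw counts: 3.5/34, looks "good"
scaled = f_w(cb, scaled_cg(cg_raw, 1000, 10000))  # scaled cg = 3, f(w) = 0.5, neutral
print(round(unscaled, 3), round(scaled, 3))  # prints 0.103 0.5
```

Without the scaling, the larger goodlist would drag every shared token toward "good"; with it, only a genuine difference in per-message frequency moves f(w) away from neutral.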
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |