question about robinson algorithm

Thu Oct 31 13:07:34 CET 2002

On 20021030 (Wed) at 2111:41 -0600, Graham Wilson wrote:
> does the number of messages in each corpus affect the robinson
> algorithm?

Well, let's take a look at the calculation method.

If cb is the count of a given token in the spamlist, and
   cg is the count of a given token in the goodlist multiplied by the
      number of messages in the spam list divided by the number of
      messages in the goodlist, and
   x  is the f(w) value we assign to a token that's in neither training
      list, and
   s  is a factor indicating how strongly we wish to trust x when there
      are few instances of a token in the training list, then

For each token w of a message containing n tokens,
  f(w) = (s * x + cb) / (s + cg + cb)

Let P = product(1-f(w)) for w in 1..n
    Q = product(f(w)) for w in 1..n
    S = (P-Q) / (P+Q)
    spamicity = (1+S)/2  (so it ranges from 0 to 1 instead of -1 to 1)

We ask, where does the number of messages in each corpus have impact?

1.  If there are very few messages in either or both corpora, the
    discriminatory power of the algorithm will be small.  In
    particular, when large numbers of tokens go unrecognized, or have
    low counts in one list and zero in the other, careful tuning of the
    value of s is needed to optimize discrimination, and the result may
    still be unsatisfactory.  (Not a surprising observation, this: all
    it says is, you need training to make bogofilter work well.)

2.  The fact that cg is scaled to match the size of the spamlist
    reduces the effect of a lopsided training set (one where there are
    many more messages in the normal corpus than in the spam corpus, or
    vice versa).  Nevertheless, extreme lopsidedness should probably be
    avoided (I can't prove this, but my feeling from experience to date
    is that it helps to keep the ratio of spams to norms between 0.75
    and 1.5 or so, though anything from 0.1 to 10 is probably ok as
    long as the smaller list still has plenty of messages in it).  What
    I suspect is that a law of diminishing returns sets in as the sizes
    diverge: you have a bigger database to search, and you don't get as
    much improvement in discriminatory power.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |