flag settings

Sat Nov 2 13:26:45 CET 2002

> A good question.  One of the differences between Graham and Robinson is the 
> counts used in updating the word lists.  Graham has a max of 4 (per word 
> per message), while Robinson uses 1.  So, to be totally kosher, one should 
> build a database using the same algorithm as will be used when classifying 
> messages.

> I don't think there's been any research to measure what happens when you 
> mix algorithms.

In the big test of G vs R that we're getting ready to release, I did
all the training _with_ -r, even for the Graham training set.  The
choice of 1 vs 4 isn't, strictly speaking, G/R specific; you can run
Graham with 1 and Robinson with 4 and it doesn't invalidate either.
I've run Graham with both and, with my data, it makes very little
difference.  Gary said 1 might be better, but see below.
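
To make the 1-vs-4 point concrete, here is a minimal sketch of a per-message
cap applied while updating a word list.  It is an illustration only, not
bogofilter's actual code: the function name update_wordlist and the plain
dict used as the word list are assumptions, and max_repeats simply stands in
for the cap discussed above (4 for Graham's default, 1 for Robinson's).

```python
from collections import Counter

def update_wordlist(wordlist, tokens, max_repeats):
    """Add one message's tokens to wordlist (hypothetical helper).

    Each distinct word contributes at most max_repeats to its count,
    no matter how often it occurs in this one message.
    """
    per_message = Counter(tokens)  # occurrences of each word in this message
    for word, n in per_message.items():
        wordlist[word] = wordlist.get(word, 0) + min(n, max_repeats)
    return wordlist

# One made-up message: "free" six times, "money" twice, "hello" once.
message = ["free"] * 6 + ["money"] * 2 + ["hello"]

capped_at_4 = update_wordlist({}, message, max_repeats=4)
capped_at_1 = update_wordlist({}, message, max_repeats=1)
# capped_at_4 == {"free": 4, "money": 2, "hello": 1}
# capped_at_1 == {"free": 1, "money": 1, "hello": 1}
```

The cap is independent of how the counts are later turned into a score, which
is why either value can be paired with either algorithm.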

> FWIW, my word lists were built with Graham and updated using Graham for 
> several weeks until I switched to Robinson

Same here.  I eventually rebuilt them, but not for this reason.  The
mixed lists (which were, of course, all Graham when I first started
using Robinson's method) worked ok, and I didn't notice any significant
performance change after the rebuild.  I suspect it would be a bad
thing to allow a very large MAX_REPEATS, but I haven't tested that.

> I still have all the incoming email, so I could rebuild the 
> word lists, but I've been too lazy.

What I like about building lists is the _computer_ gets to do all the
work ;)

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |