glouis at dynamicro.on.ca
Sat Nov 2 07:26:45 EST 2002
> A good question. One of the differences between Graham and Robinson is the
> counts used in updating the word lists. Graham has a max of 4 (per word
> per message), while Robinson uses 1. So, to be totally kosher, one should
> build a database using the same algorithm as will be used when classifying.
> I don't think there's been any research to measure what happens when you
> mix algorithms.
In the big test of G vs R that we're getting ready to release, I did
all the training _with_ -r, even for the Graham training set. The
choice of 1 vs 4 isn't, strictly speaking, G/R specific; you can run
Graham with 1 and Robinson with 4 and it doesn't invalidate either.
I've run Graham with both and it makes (with my data) very little
difference. Gary said 1 might be better, but see below.
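To make the 1-vs-4 distinction concrete, here's a minimal sketch of per-message token counting with a repeat cap. The function name and structure are my own illustration, not bogofilter's actual code; the only assumption taken from the discussion is that each word's contribution per message is clamped to a maximum (4 for Graham-style counting, 1 for Robinson-style):

```python
from collections import Counter

def count_tokens(message_words, max_repeats):
    """Count tokens in one message, capping each word's contribution
    at max_repeats (Graham-style cap = 4, Robinson-style cap = 1).
    Illustrative only; bogofilter's real tokenizer differs."""
    counts = Counter(message_words)
    return {w: min(c, max_repeats) for w, c in counts.items()}

msg = ["free", "free", "free", "free", "free", "money"]
print(count_tokens(msg, 4))  # {'free': 4, 'money': 1}
print(count_tokens(msg, 1))  # {'free': 1, 'money': 1}
```

With a cap of 1 every word counts at most once per message, so the word list records message frequencies rather than raw occurrence counts; with a cap of 4 a heavily repeated word weighs more, but only up to the cap.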
> FWIW, my word lists were built with Graham and updated using Graham for
> several weeks until I switched to Robinson
Same here. I eventually rebuilt them, but not for this reason. The
mixed lists (which were, of course, all Graham when I first started
using Robinson's method) worked ok, and I didn't notice any significant
performance change after the rebuild. I suspect it would be a bad
thing to allow a very large MAX_REPEATS, but I haven't tested that.
> I still have all the incoming email, so I could rebuild the
> word lists, but I've been too lazy.
What I like about building lists is the _computer_ gets to do all the work.
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
More information about the Bogofilter mailing list