parameter experiment repeated with more data
Greg Louis
glouis at dynamicro.on.ca
Fri Apr 18 01:50:31 CEST 2003
On 20030417 (Thu) at 2116:12 +0100, Peter Bishop wrote:
> On 17 Apr 2003 at 13:28, Greg Louis wrote:
>
> > The latest of my attempts to characterize the effects of varying
> > Robinson's s and the minimum deviation parameter in bogofilter is a
> > repeat of the previous one, with many more data.
> Could you clarify whether the training corpora were different from the
> test corpora, i.e. did you split the spam and ham in half and use one half
> for training and the other for testing?
Half, 3:2, whatever. Yes, they were split.
> Using different sets for training and testing might be more realistic
> as new spam won't be identical to the old spam.
Indeed.
If you look at the Appendix, you'll see a short shell script called
distrib, which I use to "deal" (as in dealing cards) spam and nonspam
messages into training and test files. To answer your question
accurately, a large corpus of spam messages was split among test
and training files, and the same was done for nonspam; although there
may be a few messages that appear in both due to duplication, the vast
majority of training messages are _not_ found in the test files.
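
For anyone who doesn't want to dig through the Appendix, here is a rough
sketch of the "dealing" idea in Python. It is not the distrib script itself
(that is a shell script and is not reproduced here); the function name
deal, its parameters and the 3:2 default are assumptions made up for
illustration. It hands messages out in a repeating pattern so that each
message lands in exactly one pile.

```python
from itertools import cycle


def deal(messages, train_share=3, test_share=2):
    """Deal messages into training and test piles, like dealing cards.

    Hypothetical helper: with the default 3:2 split, each run of five
    messages sends the first three to training and the next two to test.
    """
    pattern = ["train"] * train_share + ["test"] * test_share
    piles = {"train": [], "test": []}
    # cycle() repeats the pattern for as long as there are messages left.
    for message, pile in zip(messages, cycle(pattern)):
        piles[pile].append(message)
    return piles["train"], piles["test"]


# Placeholder message names, not real corpus data.
messages = [f"msg{i}" for i in range(10)]
train, test = deal(messages)
```

Note that this only keeps the piles disjoint by position: if the corpus
contains two copies of the same message, one copy can still end up in each
pile, which is the duplication caveat mentioned above.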
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |