parameter experiment repeated with more data

Fri Apr 18 01:50:31 CEST 2003

On 20030417 (Thu) at 2116:12 +0100, Peter Bishop wrote:
> On 17 Apr 2003 at 13:28, Greg Louis wrote:
> 
> > The latest of my attempts to characterize the effects of varying
> > Robinson's s and the minimum deviation parameter in bogofilter is a
> > repeat of the previous one, with many more data.

> Could you clarify whether the training corpora were different from the
> test corpora, i.e. did you split the spam and ham in half and use one
> half for training and the other for testing?

Half, 3:2, whatever.  Yes, they were split.

> Using different sets for training and testing might be more realistic
> as new spam won't be identical to the old spam.

Indeed.

If you look at the Appendix, you'll see a short shell script called
distrib, which I use to "deal" (as in dealing cards) spam and nonspam
messages into training and test files.  To answer your question
accurately, a large corpus of spam messages was split among test
and training files, and the same was done for nonspam; although there
may be a few messages that appear in both due to duplication, the vast
majority of training messages are _not_ found in the test files.
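
For anyone who wants to picture the "dealing" without opening the
Appendix, here is a minimal sketch in Python.  It is not the distrib
script itself: the function name deal, the weights argument and the
3:2 ratio in the usage note are assumptions made purely for
illustration.

```python
from itertools import cycle


def deal(messages, weights):
    """Deal messages, card-style, into len(weights) piles.

    weights says how many consecutive messages each pile receives per
    round; (3, 2) sends 3 of every 5 messages to pile 0 and 2 to pile 1.
    Illustrative only; names and behaviour are assumptions, not distrib.
    """
    piles = [[] for _ in weights]
    # Repeating dealing order, e.g. [0, 0, 0, 1, 1] for weights (3, 2).
    order = [i for i, w in enumerate(weights) for _ in range(w)]
    for msg, pile in zip(messages, cycle(order)):
        piles[pile].append(msg)
    return piles


def overlap(train, test):
    """Return the messages present in both piles (duplicates in the corpus)."""
    return set(train) & set(test)
```

Calling deal(spam_messages, (3, 2)) would give a training pile and a
test pile in roughly 3:2 proportion; because each message is dealt
exactly once, overlap() comes back non-empty only when the corpus
itself holds duplicate messages.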

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |