parameter experiment repeated with more data
David Relson
relson at osagesoftware.com
Thu Apr 17 22:49:21 CEST 2003
At 04:16 PM 4/17/03, Peter Bishop wrote:
>On 17 Apr 2003 at 13:28, Greg Louis wrote:
>
> > The latest of my attempts to characterize the effects of varying
> > Robinson's s and the minimum deviation parameter in bogofilter is a
> > repeat of the previous one, with many more data. The writeup at
> > http://www.bgl.nu/bogofilter/smindev3.html has been updated
> > accordingly. It begins to appear as though it would be generally good
> > for bogofilter to ship with s set to 0.1 and the minimum deviation as
> > high as 0.44 -- though these settings may require a well-trained
> > database (several thousand each of spam and nonspam messages) to be
> > effective.
>
>Could you clarify whether the training corpora were different from the
>test corpora, i.e. did you split the spam and ham in half and use one half
>for training and the other for testing?
>Using different sets for training and testing might be more realistic,
>as new spam won't be identical to the old spam.
Peter,
Greg's scripts are available on his web site, though somewhat scattered
across the various articles. A "distrib" script divides the ham and spam
corpora into sixths, written to 4 files (t.xx, r0.xx, r1.xx, r2.xx), where
"xx" is "ns" for non-spam and "sp" for spam. The sixths are assigned by
counting messages during distribution: messages 1,3,5,7,9,11,13,15,17,...
(three of the six sixths) go into t.xx; messages 2,8,14,... into r0.xx;
4,10,16,... into r1.xx; and 6,12,18,... into r2.xx. Thus half the corpus is
used for training and half for testing. Interleaving this way mixes old and
new messages and negates the temporal effect.
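The interleaved assignment above can be sketched as follows. This is a
hypothetical reimplementation for illustration only, not Greg's actual
"distrib" script (which is a shell script on his site); the function name
and signature are assumptions.

```python
# Sketch of the interleaved split described above: messages are assigned
# round-robin by position, so each sixth is spread evenly through the corpus
# rather than taken as a contiguous (and therefore temporally biased) block.

def distribute(messages):
    """Split a message sequence into t (training half) and r0/r1/r2 (test sixths)."""
    t, r0, r1, r2 = [], [], [], []
    for i, msg in enumerate(messages, start=1):
        if i % 2 == 1:       # 1, 3, 5, ... -> half the corpus, for training
            t.append(msg)
        elif i % 6 == 2:     # 2, 8, 14, ...
            r0.append(msg)
        elif i % 6 == 4:     # 4, 10, 16, ...
            r1.append(msg)
        else:                # 6, 12, 18, ...
            r2.append(msg)
    return t, r0, r1, r2

t, r0, r1, r2 = distribute(range(1, 19))
print(t)   # [1, 3, 5, 7, 9, 11, 13, 15, 17]
print(r0)  # [2, 8, 14]
```

Because every sixth file samples the whole corpus at a fixed stride, old and
new messages end up mixed in both the training and test sets.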
David