parameter experiment repeated with more data
David Relson
relson at osagesoftware.com
Thu Apr 17 22:49:21 CEST 2003
At 04:16 PM 4/17/03, Peter Bishop wrote:
>On 17 Apr 2003 at 13:28, Greg Louis wrote:
>
> > The latest of my attempts to characterize the effects of varying
> > Robinson's s and the minimum deviation parameter in bogofilter is a
> > repeat of the previous one, with many more data. The writeup at
> > http://www.bgl.nu/bogofilter/smindev3.html has been updated
> > accordingly. It begins to appear as though it would be generally good
> > for bogofilter to ship with s set to 0.1 and the minimum deviation as
> > high as 0.44 -- though these settings may require a well-trained
> > database (several thousand each of spam and nonspam messages) to be
> > effective.
>
>Could you clarify whether the training corpora were different from the
>test corpora, i.e. did you split the spam and ham in half and use one half
>for training and the other for testing?
>Using different sets for training and testing might be more realistic,
>as new spam won't be identical to the old spam.
Peter,
Greg's scripts are available on his web site, though somewhat scattered
across the various articles. A "distrib" script divides the ham and spam
corpora into sixths, written to 4 files (t.xx, r0.xx, r1.xx, r2.xx), where
"xx" is "ns" for non-spam and "sp" for spam. The sixths are assigned by
counting messages during distribution: messages 1,3,5,7,9,11,13,15,17,...
(three of the six sixths) go into t.xx; messages 2,8,14,... into r0.xx;
4,10,16,... into r1.xx; and 6,12,18,... into r2.xx. Thus half the corpus is
used for training and half for testing. Interleaving this way mixes old and
new messages and negates the temporal effect.
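The interleaved assignment above can be sketched as follows. This is a
hypothetical reimplementation for illustration only, not Greg's actual
"distrib" script (which is a shell script on his site); the function name
and signature are assumptions.

```python
# Sketch of the interleaved split described above: messages are assigned
# round-robin by position, so each sixth is spread evenly through the corpus
# rather than taken as a contiguous (and therefore temporally biased) block.

def distribute(messages):
    """Split a message sequence into t (training half) and r0/r1/r2 (test sixths)."""
    t, r0, r1, r2 = [], [], [], []
    for i, msg in enumerate(messages, start=1):
        if i % 2 == 1:       # 1, 3, 5, ... -> half the corpus, for training
            t.append(msg)
        elif i % 6 == 2:     # 2, 8, 14, ...
            r0.append(msg)
        elif i % 6 == 4:     # 4, 10, 16, ...
            r1.append(msg)
        else:                # 6, 12, 18, ...
            r2.append(msg)
    return t, r0, r1, r2

t, r0, r1, r2 = distribute(range(1, 19))
print(t)   # [1, 3, 5, 7, 9, 11, 13, 15, 17]
print(r0)  # [2, 8, 14]
```

Because every sixth file samples the whole corpus at a fixed stride, old and
new messages end up mixed in both the training and test sets.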
David