[bogofilter] Test sets, accuracy and other things
Kevin D. Clark
kclark at CetaceanNetworks.com
Wed Sep 11 20:47:57 CEST 2002
Jonathan Buzzard <jonathan at buzzard.org.uk> writes:
> In a rather fortunate move I have historically kept a copy of all the email
> I receive, including all the spam. I have been using this over the last
> week or so to test bogofilter, to see how good it really is and what
> changes lead to an improvement of the classification.
...
> Clearly I will not be making my test sets available for downloading as
> they contain confidential and personal information.
I don't know if this is helpful to anybody, but I happen to have a
file with 6200+ email messages in it (mbox format); most reasonable
people would consider all of these to be spam. I collected these from
around July 2001 to July 2002.
I'd be willing to make this file available if this would help
facilitate testing or help advance the project. I think that
SourceForge would be an ideal place to host this file (I don't have a
great way to make this file generally available).
I realize that bogofilter is inherently Bayesian, and that one
person's important email is another person's spam, but perhaps a
sample set like this would help others examine the underlying
algorithms in more detail?
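
As a rough illustration of how a spam-only mbox file like this could be used for testing, here is a minimal Python sketch that counts how many messages a filter flags as spam. The function name spam_catch_rate and the is_spam callable are hypothetical names chosen for this example; is_spam stands in for whatever filter is under test and is not part of bogofilter.

```python
import mailbox


def spam_catch_rate(mbox_path, is_spam):
    """Return (caught, total) for an mbox file assumed to hold only spam.

    is_spam is a hypothetical callable: it receives the raw bytes of one
    message and returns True when the filter under test calls it spam.
    """
    # create=False makes a missing file an error instead of creating it.
    box = mailbox.mbox(mbox_path, create=False)
    caught = 0
    total = 0
    try:
        for msg in box:
            total += 1
            if is_spam(msg.as_bytes()):
                caught += 1
    finally:
        box.close()
    return caught, total
```

Because every message in such a file is spam, this only measures missed spam (false negatives). Measuring false positives would still require a separate set of legitimate mail.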
Regards,
--kevin
--
Kevin D. Clark / Cetacean Networks / Portsmouth, N.H. (USA)
cetaceannetworks.com!kclark (GnuPG ID: B280F24E)
alumni.unh.edu!kdc
For summary digest subscription: bogofilter-digest-subscribe at aotto.com