[bogofilter] Test sets, accuracy and other things

Kevin D. Clark kclark at CetaceanNetworks.com
Wed Sep 11 20:47:57 CEST 2002


Jonathan Buzzard <jonathan at buzzard.org.uk> writes:

> In a rather fortunate move I have historically kept a copy of all the email
> I receive, including all the spam. I have been using this over the last
> week or so to test bogofilter, to see how good it really is and what
> changes lead to an improvement of the classification.

...
> Clearly I will not be making my test sets available for downloading as
> they contain confidential and personal information.

I don't know if this is helpful to anybody, but I happen to have a
file with 6200+ email messages in it (mbox format); most reasonable
people would consider all of these to be spam.  I collected these from
around July 2001 to July 2002.
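
For anyone who wants to sanity-check an mbox corpus like this one before
feeding it to a filter, here is a minimal sketch of counting its messages.
It assumes Python's standard mailbox module; the function name
count_messages and the file name spam.mbox are hypothetical, not part of
bogofilter or of the file described above.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in the mbox file at `path`."""
    # create=False makes a missing file raise an error instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


if __name__ == "__main__":
    # "spam.mbox" is a placeholder name for illustration only.
    print(count_messages("spam.mbox"))
```

A count that differs from what the mail client reports may point to
unescaped "From " lines in message bodies, so it is worth checking before
drawing conclusions from a test run.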

I'd be willing to make this file available if this would help
facilitate testing or help advance the project.  I think that
SourceForge would be an ideal place to host this file (I don't have a
great way to make this file generally available).

I realize that bogofilter is inherently Bayesian, and that one
person's important email is another person's spam, but perhaps a
sample set like this would help others examine the underlying
algorithms in more detail?

Regards,

--kevin
-- 
Kevin D. Clark / Cetacean Networks / Portsmouth, N.H. (USA)
cetaceannetworks.com!kclark (GnuPG ID: B280F24E)
alumni.unh.edu!kdc



For summary digest subscription: bogofilter-digest-subscribe at aotto.com


