[bogofilter] Test sets, accuracy and other things

Eric Seppanen eds at reric.net
Wed Sep 11 21:28:20 CEST 2002


On Wed, Sep 11, 2002 at 02:47:57PM -0400, Kevin D. Clark wrote:
> 
> I don't know if this is helpful to anybody, but I happen to have a
> file with 6200+ email messages in it (mbox format); most reasonable
> people would consider all of these to be spam.  I collected these from
> around July 2001 to July 2002.
> 
> I'd be willing to make this file available if this would help
> facilitate testing or help advance the project.  I think that
> SourceForge would be an ideal place to host this file (I don't have a
> great way to make this file generally available).

I think a handful of big "spam archives" would be a great help, but you 
probably need to take precautions first.  Your spam will likely include 
headers (and perhaps even body text) that cause your name, email address, 
domain, and MX machines to look like spam to bogofilter.  If somebody uses 
your file to seed a production bogofilter installation, you'll effectively 
be blacklisted.

If you filter out any headers that match your name or mail address, as 
well as any that reference your machines or domain, and then edit by hand 
any message bodies that contain your name, email address, or machine or 
domain names, I think it'll be a great help.

Any scripts you write to do such trimming would be useful too.  Maybe 
others would contribute big spam archives if they had a straightforward 
way of trimming them.
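
To make that concrete, here is a minimal sketch of such a trimming script 
in Python, using only the standard email and mailbox modules.  The function 
names and the PRIVATE list are made up for illustration, and the matching 
is a plain case-insensitive substring test, so treat it as a starting point 
rather than a finished tool.  It drops every header line that mentions one 
of your identifiers and reports which messages still mention you in the 
body, so you know which ones to edit by hand.

```python
import email
import mailbox

# Hypothetical identifiers: replace with your own name, address,
# machine names, and domain.
PRIVATE = ["kevin", "example.org", "mx1.example.org"]


def mentions(text, identifiers):
    """Return True if any identifier occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(ident.lower() in lowered for ident in identifiers)


def scrub_headers(msg, identifiers):
    """Remove each header line whose name or value mentions an identifier."""
    kept = [(name, value) for name, value in msg.items()
            if not mentions(f"{name}: {value}", identifiers)]
    # Delete everything, then re-add the survivors in their original order,
    # so that one matching Received: line does not take the others with it.
    for name in set(msg.keys()):
        del msg[name]
    for name, value in kept:
        msg[name] = value
    return msg


def body_needs_review(msg, identifiers):
    """Return True if any non-multipart part of the body mentions an identifier."""
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        # latin-1 maps every byte to a character, so decoding never fails.
        if mentions(payload.decode("latin-1"), identifiers):
            return True
    return False


def scrub_mbox(src_path, dst_path, identifiers):
    """Copy src mbox to dst with scrubbed headers.

    Returns the keys (in dst) of messages whose bodies need hand editing.
    """
    src = mailbox.mbox(src_path)
    dst = mailbox.mbox(dst_path)
    flagged = []
    for _key, msg in src.iteritems():
        scrub_headers(msg, identifiers)
        new_key = dst.add(msg)
        if body_needs_review(msg, identifiers):
            flagged.append(new_key)
    dst.flush()
    dst.close()
    src.close()
    return flagged
```

Something like scrub_mbox("spam.mbox", "spam-clean.mbox", PRIVATE) would 
then write the cleaned archive and return the list of messages to look at 
by hand.  Note that this does not touch the mbox "From " separator lines, 
and that a short identifier can also match inside unrelated words, so it 
errs on the side of removing too much.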


For summary digest subscription: bogofilter-digest-subscribe at aotto.com