[bogofilter] Test sets, accuracy and other things
Kevin D. Clark
kclark at CetaceanNetworks.com
Wed Sep 11 15:50:58 EDT 2002
Eric Seppanen <eds at reric.net> writes:
> I think that a handful of big "spam archives" would be a great help,
> but you probably need to take precautions first, because your spam will
> likely include headers (and perhaps even body text) that will cause your
> name, email, domain, and mx machines to look like spam to bogofilter. If
> somebody uses your file to seed a production bogofilter installation,
> you'll be effectively blacklisted.
Oh, yes, I plan on changing a bunch of stuff like this. Thanks for
pointing this out.
Also, I plan on recording what all of my changes were (for example,
I'd change my name to "Joe Random User"). This way people could look
at the results and recognize why "Joe", "Random" and "User" appeared
in their wordlist a lot.
Hmm...let me think about this a little bit more...
I do have to point out that, IIRC, most (all?) of the email in this
particular file isn't even addressed to me directly (I'm sure
everybody on this list has seen this). I'll have to check to see who
else this email is addressed to -- I'd hate to have innocent people
get blocked. Hmm...
> If you filter out any headers that match your name or mail address, as
> well as any that reference your machines or domain then edit by hand any
> message bodies that contain your name, email, or machine or domain names,
> I think it'll be a great help.
> Any scripts you write to do such trimming would be useful too. Maybe
> others would contribute big spam archives if they had a straightforward
> way of trimming them.
I'll probably end up hacking something together with "formail" and/or
Perl. Nothing too complicated, I hope. We'll see...
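
As a rough illustration of the kind of trimming discussed above, here is
a minimal sketch in Python rather than formail or Perl. It is not an
actual script from this thread. The function name scrub_message, the
addresses and the host name are all made up for the example. It drops
any header that mentions a listed machine or domain, swaps the remaining
identifying strings for placeholders, and reports how many substitutions
it made so the changes can be recorded. It handles one message at a
time and does not deal with mbox "From " separator lines.

```python
import re
from email import message_from_string


def scrub_message(raw, replacements, drop_if_contains=()):
    """Return (scrubbed_text, counts) for a single raw message.

    replacements     -- hypothetical mapping of real string -> placeholder
    drop_if_contains -- hypothetical host/domain strings; any header whose
                        value mentions one of them is removed entirely
    """
    msg = message_from_string(raw)

    # Keep only the headers that do not mention a listed host or domain.
    needles = [s.lower() for s in drop_if_contains]
    kept = [(name, value) for name, value in msg.items()
            if not any(n in str(value).lower() for n in needles)]

    # Rebuild the header block from the kept headers, preserving order.
    for name in set(msg.keys()):
        del msg[name]
    for name, value in kept:
        msg[name] = value

    # Substitute placeholders everywhere else (headers and body) and
    # count each substitution so the changes can be written down.
    text = msg.as_string()
    counts = {}
    for real, placeholder in replacements.items():
        text, n = re.subn(re.escape(real),
                          lambda _m, p=placeholder: p,
                          text, flags=re.IGNORECASE)
        counts[real] = n
    return text, counts


if __name__ == "__main__":
    sample = (
        "From: spammer@example.net\n"
        "To: kclark@example.org\n"
        "Received: from mx1.example.org by relay\n"
        "Subject: Hi Kevin\n"
        "\n"
        "Dear Kevin, buy now.\n"
    )
    cleaned, counts = scrub_message(
        sample,
        {"kclark@example.org": "jruser@example.invalid", "Kevin": "Joe"},
        drop_if_contains=["mx1.example.org"],
    )
    print(cleaned)
    print(counts)
```

The counts returned alongside the text are what would go into the record
of changes, so that anyone who sees "Joe" turn up often in a wordlist
can trace it back to the substitution.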
Kevin D. Clark / Cetacean Networks / Portsmouth, N.H. (USA)
cetaceannetworks.com!kclark (GnuPG ID: B280F24E)
For summary digest subscription: bogofilter-digest-subscribe at aotto.com