spam pooling

Tue Dec 17 18:31:17 CET 2002

> Hi,
>
> I know there are some folks on the net who save up their spam and make
> it available, but is any attempt being made to create a pool that
> everyone can contribute to? I certainly am building up a large corpus
> that I'd like to share. It would be great to set my mail client up so
> that a single keypress would send the current message (or all tagged
> messages or whatever) to the central repository, where it would be
> sorted, duplicates removed, and perhaps eyeballed by volunteers before
> passing into the public archive.
>
> Eyeballing could be as simple as this procedure: if an email has never
> been reported before, the system bounces it automatically to some
> number of people in the volunteer pool. End of story. If the system
> receives another copy of the email, either from them or someone else,
> it will say, "ok, two random people agree on this, it must be spam,"
> and add it to the archive. If two is too small a number, the bar could
> be set higher.
>
> The identities of the volunteers, as well as the identity of the person
> who originally sent in the sample, would be remembered. Then whenever
> someone reported, "hey, this piece isn't spam at all!" that could be
> counted against those people, and used to weed out spammers trying to
> degrade the system. I know there are people who would enjoy (well,
> maybe that's not the right word) looking over the entire corpus, and
> reporting on false positives.
>
> Be well,
> Zack
>
> --
> Zack Brown

Sounds like RAZOR to me.

You could use a strategy like this:

For all users, store an MD5 sum of the message BODY and a record of its
status (spam/ham).
Each correction (-S/-N) could be used to store an md5sum(BODY).
Any subsequent email coming into the server would be compared against the
stored md5sum values, and if one matches, just process the message
according to -s/-n and send the spam to /dev/null.
Known email could be processed (-s/-n) in parallel with the email delivery,
which should give better performance on servers with many accounts.
I'm assuming that an md5sum is faster to compute than a bogofilter result.
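
To make the idea concrete, here is a rough Python sketch of the lookup.
It is only an illustration under my own assumptions, not anything
bogofilter ships: the names body_digest, VerdictCache and handle_incoming
are made up, the store is a plain in-memory dict, and the "body" is simply
everything after the first blank line. The classify argument stands in for
whatever actually runs bogofilter.

```python
import hashlib


def body_digest(raw_message: bytes) -> str:
    """Return the MD5 hex digest of the message body (headers excluded)."""
    # Normalize line endings so CRLF and LF copies of one body match.
    normalized = raw_message.replace(b"\r\n", b"\n")
    # Assumption: the body is everything after the first blank line.
    _headers, _sep, body = normalized.partition(b"\n\n")
    return hashlib.md5(body).hexdigest()


class VerdictCache:
    """Hypothetical shared store: body digest -> "spam" or "ham"."""

    def __init__(self):
        self._verdicts = {}

    def record_correction(self, raw_message: bytes, status: str) -> None:
        """Remember a user's verdict (the -S/-N case in the text above)."""
        if status not in ("spam", "ham"):
            raise ValueError("status must be 'spam' or 'ham'")
        self._verdicts[body_digest(raw_message)] = status

    def lookup(self, raw_message: bytes):
        """Return the stored verdict, or None if the body is unknown."""
        return self._verdicts.get(body_digest(raw_message))


def handle_incoming(cache: VerdictCache, raw_message: bytes, classify):
    """Return (verdict, was_cached); run the classifier only on a miss."""
    known = cache.lookup(raw_message)
    if known is not None:
        return known, True
    return classify(raw_message), False
```

Usage would be along the lines of calling
cache.record_correction(message, "spam") whenever a user corrects a
message, and handle_incoming(cache, message, classify) for each new
delivery. Note that an exact digest only matches byte-identical bodies, so
spam that varies a few characters per recipient would miss the cache and
fall through to the normal filter.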