[jm at jmason.org: fully-public corpus of mail available]

Mark M. Hoffman mhoffman at lightlink.com
Thu Oct 10 05:44:39 CEST 2002


----- Forwarded message from Justin Mason <jm at jmason.org> -----

From: jm at jmason.org (Justin Mason)
To: SpamAssassin-talk at lists.sourceforge.net
Cc: SpamAssassin-devel at lists.sourceforge.net,
	Steve Atkins <steve at blighty.com>, ion at aueb.gr, donatespam at archub.org,
	spambayes at python.org
Date: Wed, 09 Oct 2002 13:21:11 +0100
Subject: fully-public corpus of mail available

(Please feel free to forward this message to other possibly-interested
parties.)

Hi all,

One of the big problems in working on spam classification is finding good
mail to test with.  There are few public corpora available; Ion
Androutsopoulos' "Ling-spam" corpus is one (hi Ion!), but unfortunately
it does not contain all of the mail message data, so it would not be useful
to a SpamAssassin-style system (which relies heavily on header data), for
example.

Another effect of not having a common, shared corpus is the difficulty
this introduces in comparing accuracy rates between spam filters;
since everyone tests using different corpora, the resulting statistics
cannot be compared directly.

Building public corpora is difficult, as it typically involves saving your
own (classified) mail.  This brings privacy problems, as your mail senders
may not wish to see this made public.

But what the heck, that's what I've done anyway ;)  Here's a public corpus
I've assembled from my own corpora, removing messages which were not
public in the first place.  Please feel free to download it and use
it for spam-filter development.

It's quite small, but should be big enough for use as a reference corpus,
at least, so that hit-rate statistics can be compared across tools.
Hope it helps.
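
As a rough illustration of how hit rates could be compared across tools on
a shared corpus, here is a minimal Python sketch.  The function names
(hit_rates, contains_html) and the toy "flag anything with HTML" rule are
assumptions made up for this example; they are not part of SpamAssassin or
any other filter.

```python
from typing import Callable, Dict, Iterable, Tuple


def hit_rates(
    messages: Iterable[Tuple[str, bytes]],
    is_spam_guess: Callable[[bytes], bool],
) -> Dict[str, float]:
    """Return the fraction of messages flagged as spam, per corpus part.

    `messages` yields (part, raw_message) pairs, where `part` is a label
    such as "spam", "easy_ham" or "hard_ham".
    """
    flagged: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for part, raw in messages:
        total[part] = total.get(part, 0) + 1
        if is_spam_guess(raw):
            flagged[part] = flagged.get(part, 0) + 1
    return {part: flagged.get(part, 0) / total[part] for part in total}


def contains_html(raw: bytes) -> bool:
    """Toy stand-in for a real filter: flag any message containing HTML."""
    return b"<html" in raw.lower()
```

A high rate on the spam part is good; any non-zero rate on the ham parts is
a false-positive rate.  Running two filters over the same messages gives
numbers that can be set side by side.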

It lives here:

  http://spamassassin.org/publiccorpus/


and here's the README.txt:

Welcome to the SpamAssassin public mail corpus.  This is a selection of mail
messages, suitable for use in testing spam filtering systems.  Pertinent
points:

  - All headers are reproduced in full.  Some address obfuscation has taken
    place; hostnames in some cases have been replaced with "example.com",
    which should have a valid MX record (if I recall correctly).  In most
    cases though, the headers appear as they were received.

  - All of these messages were posted to public fora, were sent to me in the
    knowledge that they may be made public, were sent by me, or originated as
    newsletters from public news web sites.

  - Copyright for the text in the messages remains with the original senders.
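
Since the headers are kept in full, a header-based filter can read them
directly from each message.  The following is a small sketch using Python's
standard "email" package; the function name header_summary and the choice
of fields are assumptions for illustration only.

```python
from email import message_from_bytes
from typing import Dict, Union


def header_summary(raw: bytes) -> Dict[str, Union[str, int]]:
    """Pull a few header fields out of one raw RFC 822 message."""
    msg = message_from_bytes(raw)
    return {
        # str() guards against Header objects returned for non-ASCII data.
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        # Each relay normally adds one Received: line.
        "received_hops": len(msg.get_all("Received", [])),
    }
```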


OK, now onto the corpus description.  It's split into three parts, as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 350 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

The corpora are prefixed with "200210", because that's the date when I
assembled them, so it's as good a version string as anything else ;).  They
are compressed using "bzip2".
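
A sketch of reading one downloaded part in Python follows.  It assumes each
part is a tar archive compressed with bzip2 (the text above only says
"bzip2", so the tar layer is an assumption), and the function name
iter_messages is made up for this example.

```python
import tarfile
from typing import Iterator, Tuple


def iter_messages(archive_path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, raw_message) for each regular file in the archive.

    Assumes `archive_path` points at a bzip2-compressed tar file.
    """
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            if handle is not None:
                yield member.name, handle.read()
```

Messages are returned as raw bytes, so the headers reach the filter exactly
as they are stored in the archive.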

This corpus lives at http://spamassassin.org/publiccorpus/ .  Mail
jm - public - corpus AT jmason dot org if you have questions, or to donate
mail.

(Oct  9 2002 jm)

----- End forwarded message -----


