invalid html warfare
John McCain
jmccain at layer3al.com
Wed May 28 17:46:27 CEST 2003
I keep everything. Especially evil stuff, for just this sort of purpose. The
problem, really, is age.
The absolute biggest corpus I could put together is about 2000 good messages
and something like 1000 spams, but that includes messages so old that I
question their scientific usefulness. I've seen spam change quite a bit over
the time I've been collecting it. The bogofilter corpus on my production
system is about half that size (the more recent half).
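
For anyone who wants to make the same age cut on their own mail, here is a minimal sketch of selecting the more recent half of a corpus by date. It assumes each message is available as raw text with a parsable Date header. The names message_date and newer_half are hypothetical helpers for illustration, not part of bogofilter, and undated messages are simply skipped.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_date(raw):
    """Return the Date header as a datetime, or None if missing or unparsable."""
    header = message_from_string(raw)["Date"]
    if header is None:
        return None
    try:
        return parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError, newer ones ValueError, on a bad date.
        return None


def newer_half(raw_messages):
    """Return the more recent half of the dated messages, oldest first."""
    dated = []
    for raw in raw_messages:
        when = message_date(raw)
        if when is not None:
            # Sort on a POSIX timestamp so naive and aware dates compare safely.
            dated.append((when.timestamp(), raw))
    dated.sort(key=lambda pair: pair[0])
    return [raw for _, raw in dated[len(dated) // 2:]]
```

Sorting on the Date header trusts whatever the sender wrote, which spam often forges, so a received-time source may be a better key where one is available.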
.MSGCOUNT yields 0 for both spam and good (not sure why).
spamlist: wc -l yields 108906
goodlist: wc -l yields 137097
On Wednesday 28 May 2003 10:34 am, David Relson wrote:
> At 10:42 AM 5/28/03, John McCain wrote:
> >I'd love to pitch in, but I don't have access to enough test data.
>
> John,
>
> Do you save all your email, including the evil stuff? How much do you
> have? How big are your wordlists?
>
> Here're some "size" commands to run:
>
> bogoutil -w $BOGODIR .MSGCOUNT
> bogoutil -d $BOGODIR/spamlist.db | wc -l
> bogoutil -d $BOGODIR/goodlist.db | wc -l
>
> If you've got a few thousand messages, you could experiment and develop
> some tests. There are some people on the list with large corpora who might
> well be interested in testing any scripts you might develop.
>
> David
>
>
> ---------------------------------------------------------------------
> FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
> To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
> For summary digest subscription: bogofilter-digest-subscribe at aotto.com
> For more commands, e-mail: bogofilter-help at aotto.com