invalid html warfare

John McCain jmccain at layer3al.com
Wed May 28 17:46:27 CEST 2003


I keep everything.  Especially evil stuff, for just this sort of purpose.  The 
problem, really, is age.

The absolute biggest corpus I could put together is about 2000 good messages 
and something like 1000 spams, but that includes messages so old that I 
question their scientific usefulness.  I've seen spam change quite a bit over 
the time I've been collecting it.  My production system bogofilter corpus 
consists of about half that amount (the more current half).
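
For anyone wanting to make the same "more current half" cut on their own mail, here is a minimal sketch, not a description of how my corpus was actually built. It assumes each message is available as a raw RFC 2822 string with a parseable Date header; the function name newest_half is hypothetical and this is not part of bogofilter.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def newest_half(raw_messages):
    """Return the more recent half of the given raw messages, newest first.

    Hypothetical helper: messages without a usable Date header are skipped,
    and an odd count rounds up in favour of keeping one more message.
    """
    dated = []
    for raw in raw_messages:
        header = message_from_string(raw).get("Date")
        if header is None:
            continue
        try:
            when = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            # Unparseable date: leave the message out rather than guess.
            continue
        if when.tzinfo is None:
            # Assumption: treat dates without a zone as UTC so they compare.
            when = when.replace(tzinfo=timezone.utc)
        dated.append((when.timestamp(), raw))

    dated.sort(key=lambda pair: pair[0], reverse=True)
    keep = (len(dated) + 1) // 2
    return [raw for _, raw in dated[:keep]]
```

The same cut would be applied separately to the spam and the good collections, so that both halves of the training set cover the same period.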

.MSGCOUNT yields 0 for both spam and good, which I can't explain.

spamlist wc -l yields 108906
goodlist wc -l yields 137097

On Wednesday 28 May 2003 10:34 am, David Relson wrote:
> At 10:42 AM 5/28/03, John McCain wrote:
> >I'd love to pitch in, but I don't have access to enough test data.
>
> John,
>
> Do you save all your email, including the evil stuff?  How much do you
> have?  How big are your wordlists?
>
> Here're some "size" commands to run:
>
> bogoutil -w $BOGODIR .MSGCOUNT
> bogoutil -d $BOGODIR/spamlist.db | wc -l
> bogoutil -d $BOGODIR/goodlist.db | wc -l
>
> If you've got a few thousand messages, you could experiment and develop
> some tests.  There are some people on the list with large corpora who might
> well be interested in testing any scripts you might develop.
>
> David




