testing with public-corpus

David Relson relson at osagesoftware.com
Sat Oct 12 15:52:04 CEST 2002


Hello,

I've had the body of messages of the public-corpus sitting on my hard drive 
(gathering dust) and have been wondering what to do with them.  I'd been 
thinking of testing all the messages with _my_ wordlists.  That might be 
interesting, but it didn't seem too useful and it's not something any of 
you could reproduce since each of you has his own lists.

Last night I had an idea :-)

Split each of the three groups (easy_ham, hard_ham, and spam) in 
half.  Use three of the halves (one from each group) to build wordlists and 
then test bogofilter using the other three halves.  The first test sequence 
will just do classification, i.e. "bogofilter -p" or something similar.  The 
second test sequence will do classification and updating, i.e. 
"bogofilter -p -u".
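
A minimal sketch of the split step, in Python.  This is only an 
illustration: the function names, the seeded shuffle, and the idea of 
passing in lists of message filenames are all assumptions, since the plan 
above does not say how the halves would be chosen.  It does not call 
bogofilter; it just decides which messages go to wordlist building and which 
to testing.

```python
import random


def split_in_half(messages, seed=0):
    """Return (training_half, test_half) for one group of messages.

    Assumption: a seeded shuffle decides the halves. Sorting first makes
    the result independent of the order the filenames were listed in.
    """
    shuffled = sorted(messages)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    # With an odd count, the extra message lands in the test half.
    return shuffled[:middle], shuffled[middle:]


def split_corpus(groups, seed=0):
    """Split every group (e.g. easy_ham, hard_ham, spam) in half.

    `groups` maps a group name to a list of message filenames.
    Returns a dict mapping each name to (training_half, test_half).
    """
    return {name: split_in_half(files, seed) for name, files in groups.items()}
```

The training halves would go to building the wordlists and the test halves 
to the two test sequences.  Fixing the seed means anyone with the same 
corpus files can reproduce the same split.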

I'll report on the results when I have them.  FWIW, it takes a while to 
process 175 easy_ham, 175 hard_ham, and 250 spam messages - especially when 
I'm testing several versions of the spamicity computation for each message.

David


For summary digest subscription: bogofilter-digest-subscribe at aotto.com


