contributing datasets, was: Is bogotune helpful?

Thu Dec 11 15:43:50 CET 2003

On 20031211 (Thu) at 0909:47 -0500, David Relson wrote:

> > > Questions:
> > > 
> > > 1) Is it important how training was done?
> 
> We don't know for sure.  Bogotune _does_ check the message counts in the
> wordlist and will balk if there are too few messages (fewer than 2000 ham
> and 2000 spam).  Using the '-F' (force) flag tells it to ignore such
> details and go ahead with tuning.

For the big parameter-determination run, -F will not be acceptable.

> 
> > > 2) Do you need those messagecountfile separated for training
> > > and testing? (Which would then mean a new training is needed
> > > on only a part of the messages.)
> 
> Yes.  Split in half; alternating one for training, one for testing
> works well.

I would prefer that they _not_ be split, because the big run will need
a training db built from the same population as the overall test
corpus, and I don't expect to use anything like half the message pool
for that training db.  However, if training and test messages are
provided separately, I can of course recombine them :)
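
For illustration, here is a minimal sketch of the alternating split
described above and of the recombination, assuming the messages are
already held in a Python list.  The function names are hypothetical;
they are not part of bogofilter or bogotune.

```python
from itertools import chain, zip_longest


def alternate_split(messages):
    """Send messages 0, 2, 4, ... to training and 1, 3, 5, ... to testing."""
    msgs = list(messages)
    return msgs[0::2], msgs[1::2]


def recombine(training, testing):
    """Interleave the two halves back into the original message order."""
    missing = object()  # sentinel for the shorter half when the count is odd
    merged = chain.from_iterable(
        zip_longest(training, testing, fillvalue=missing)
    )
    return [m for m in merged if m is not missing]
```

Recombining this way restores the original order only if the halves
came from a strict one-for-one alternation like the one above.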

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |