contributing datasets, was: Is bogotune helpful?
Greg Louis
glouis at dynamicro.on.ca
Thu Dec 11 15:43:50 CET 2003
On 20031211 (Thu) at 0909:47 -0500, David Relson wrote:
> > > Questions:
> > >
> > > 1) Is it important how training was done?
>
> We don't know for sure. Bogotune _does_ check the message counts in the
> wordlist and will balk if there are too few messages (less than 2000 ham
> and 2000 spam). Using the '-F' (force) flag tells it to ignore such
> details and go ahead with tuning.
For the big parameter-determination run, -F will not be acceptable.
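
To make the threshold concrete, here is a minimal sketch of the kind of check described above (at least 2000 ham and 2000 spam). The function name and the idea of passing the counts in directly are assumptions for illustration only; bogotune does its own check against the wordlist, and this is not its code.

```python
# Threshold quoted in the message above: fewer than 2000 ham or
# fewer than 2000 spam is considered too few.
MIN_PER_CLASS = 2000


def enough_messages(ham_count, spam_count, minimum=MIN_PER_CLASS):
    """Return True when both classes meet the minimum message count.

    Hypothetical helper, not part of bogofilter or bogotune.
    """
    return ham_count >= minimum and spam_count >= minimum
```

A contributed dataset that fails such a check would only be usable with -F, which is why it would not qualify for the big run.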
>
> > > 2) Do you need those messagecountfile separated for training
> > > and testing? (Which would then mean a new training is needed
> > > on only a part of the messages.)
>
> Yes. Split in half - alternating 1 for training 1 for testing works
> well.
I would prefer that they _not_ be split, because the big run will need
a training db built from the same population as the overall test
corpus, and I don't expect to use anything like half the message pool
for that training db. However, if training and test messages are
provided separately, I can of course recombine them :)
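
For anyone who has already split their messages, a minimal sketch of the alternating split David describes, and of putting the two halves back together, might look like this. The function names are hypothetical, and messages are represented as plain list items rather than real mail files.

```python
from itertools import zip_longest

# Sentinel marking "no message" when one half is shorter than the other.
_MISSING = object()


def alternate_split(messages):
    """Send messages alternately to training and testing (1 and 1)."""
    msgs = list(messages)
    return msgs[0::2], msgs[1::2]


def recombine(training, testing):
    """Interleave the two halves back into a single pool."""
    merged = []
    for train_msg, test_msg in zip_longest(training, testing,
                                           fillvalue=_MISSING):
        if train_msg is not _MISSING:
            merged.append(train_msg)
        if test_msg is not _MISSING:
            merged.append(test_msg)
    return merged
```

Recombining in this way restores the original order only if the halves came from an alternating split; for a pool that will be re-sampled anyway, simple concatenation would do just as well.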
--
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |