contributing datasets, was: Is bogotune helpful?

Thu Dec 11 15:09:47 CET 2003

On Thu, 11 Dec 2003 14:19:07 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > Greg Louis wrote:
> > 
> >>> I've got 40,000 spam and hams roughly 50/50. Let me know how I can
> >>> create and get the datasets to you.
> >> 
> >> I'm not sure if any message-count converter is supplied with
> >> bogofilter these days, but running a command of the form
> >> 
> >>     formail -s bogol dbdir <mboxfile >messagecountfile

As a script which runs bogolexer and bogoutil for each message, bogol is
slow.  Bogotune can now do the same job and is much faster.

Old:   formail -s bogol dbdir <mboxfile >messagecountfile

New:   bogotune -M -I mboxfile -d dbdir >messagecountfile

> > Questions:
> > 
> > 1) Is it important how training was done?

We don't know for sure.  Bogotune _does_ check the message counts in the
wordlist and will balk if there are too few messages (fewer than 2000 ham
and 2000 spam).  Using the '-F' (force) flag tells it to ignore such
details and go ahead with tuning.

> > 2) Do you need those messagecountfile separated for training
> > and testing? (Which would then mean a new training is needed
> > on only a part of the messages.)

Yes.  Split the messages in half.  Alternating them, one for training
and one for testing, works well.
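
The alternating split could be done along these lines.  This is only a
minimal sketch, not a tool shipped with bogofilter: the function name
split_mbox and the file names are made up for illustration, and it
assumes a plain mbox in which every message starts with a "From "
separator line and any body line beginning with "From " has already
been escaped.

```shell
#!/bin/sh
# split_mbox INPUT TRAIN HELD  (hypothetical helper, not part of bogofilter)
# Writes odd-numbered messages of INPUT to TRAIN and even-numbered
# messages to HELD, so the two halves interleave over time.
split_mbox() {
    mbox=$1; train=$2; held=$3
    : >"$train"; : >"$held"          # start with empty output files
    awk -v train="$train" -v held="$held" '
        /^From / { n++ }             # an mbox separator starts message n
        n > 0 { print > (n % 2 ? train : held) }
    ' "$mbox"
}
```

For example, "split_mbox spam.mbox spam-train.mbox spam-test.mbox", and
the same again for the ham mailbox.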

> Those are still unanswered. Well, then.

It would be interesting to see what happens when bogotune is given the
same corpus prepared using full training and train-on-error.  After
splitting your messages, you could prepare two wordlists and two
message sets.  I'd be glad to run bogotune with the two test sets.