contributing datasets, was: Is bogotune helpful?

David Relson relson at osagesoftware.com
Thu Dec 11 15:09:47 CET 2003


On Thu, 11 Dec 2003 14:19:07 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Boris 'pi' Piwinger wrote:
> 
> > Greg Louis wrote:
> > 
> >>> I've got 40,000 spam and hams roughly 50/50. Let me know how I can
> >>> create and get the datasets to you.
> >> 
> >> I'm not sure if any message-count converter is supplied with
> >> bogofilter these days, but running a command of the form
> >> 
> >>     formail -s bogol dbdir <mboxfile >messagecountfile

As a script which runs bogolexer and bogoutil for each message, bogol is
slow.  Bogotune can now do the same job and is much faster.

Old:   formail -s bogol dbdir <mboxfile >messagecountfile

New:   bogotune -M -I mboxfile -d dbdir >messagecountfile


> > Questions:
> > 
> > 1) Is it important how training was done?

We don't know for sure.  Bogotune _does_ check the message counts in the
wordlist and will balk if there are too few messages (fewer than 2000 ham
and 2000 spam).  The '-F' (force) flag tells it to skip that check and go
ahead with tuning.

> > 2) Do you need those messagecountfile separated for training
> > and testing? (Which would then mean a new training is needed
> > on only a part of the messages.)

Yes.  Split the messages in half.  Alternating them, one for training
and the next for testing, works well.
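
A minimal sketch of that alternating split, assuming the messages sit in
a single mbox file.  The function name and the file paths are made up for
illustration and are not part of bogofilter; only Python's standard
mailbox module is used.

```python
import mailbox


def split_mbox_alternating(src_path, train_path, test_path):
    """Copy messages from src_path alternately into two mbox files.

    Hypothetical helper, not a bogofilter tool.
    Returns (train_count, test_count).
    """
    src = mailbox.mbox(src_path, create=False)
    train = mailbox.mbox(train_path)
    test = mailbox.mbox(test_path)
    counts = [0, 0]
    try:
        for index, message in enumerate(src):
            # Messages 0, 2, 4, ... go to training; 1, 3, 5, ... to testing.
            target = index % 2
            (train, test)[target].add(message)
            counts[target] += 1
        train.flush()
        test.flush()
    finally:
        src.close()
        train.close()
        test.close()
    return tuple(counts)
```

If ham and spam are kept in separate mbox files, running this once per
file gives a training half and a testing half of each.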

> Those are still unanswered. Well, then.

It would be interesting to see what happens when bogotune is given the
same corpus prepared using full training and train-on-error.  After
splitting your messages, you could prepare two wordlists and two
message sets.  I'd be glad to run bogotune with the two test sets.
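
For anyone unsure of the difference between the two training styles,
here is a rough sketch.  The classify and register callables are
placeholders (assumptions for illustration) standing in for whatever
scores a message and adds it to the wordlist; they are not bogofilter
functions.

```python
def full_training(messages, register):
    """Register every message.  messages yields (text, is_spam) pairs."""
    count = 0
    for text, is_spam in messages:
        register(text, is_spam)
        count += 1
    return count


def train_on_error(messages, classify, register):
    """Register only the messages the current wordlist gets wrong.

    classify(text) -> True if the message is judged to be spam.
    register(text, is_spam) adds the message to the wordlist.
    Returns the number of messages registered.
    """
    registered = 0
    for text, is_spam in messages:
        if classify(text) != is_spam:
            register(text, is_spam)
            registered += 1
    return registered
```

Train-on-error usually produces a much smaller wordlist from the same
messages, which is why comparing the two under bogotune would be
informative.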



