contributing datasets, was: Is bogotune helpful?

David Relson relson at osagesoftware.com
Wed Dec 3 13:47:48 CET 2003


On Wed, 03 Dec 2003 13:20:59 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> Greg Louis wrote:
> 
> >> I've got 40,000 spam and hams roughly 50/50. Let me know how I can
> >> create and get the datasets to you.
> > 
> > I'm not sure if any message-count converter is supplied with
> > bogofilter these days, but running a command of the form
> > 
> >     formail -s bogol dbdir <mboxfile >messagecountfile
> 
> [...]
> 
> Questions:
> 
> 1) Is it important how training was done?
> 
> 2) Do you need those messagecountfile separated for training
> and testing? (Which would then mean a new training is needed
> on only a part of the messages.)
> 
> 3) Is it OK to do that in chunks of 5,000 messages or do you
> want them all together?
> 
> pi

Hi pi,

Greg is masterminding this test, so he'll have to answer your first
question.  Since I did the C implementation of bogotune, I can answer
the other two.

A message count file can hold many messages.  The msg-count.sh script
puts a ".MSG_COUNT spam ham" line at the beginning of each message. 
Bogofilter recognizes these lines and uses them to separate the input
stream into messages.
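
To illustrate the splitting, here is a rough Python sketch (not the
actual bogofilter code, which is in C).  It assumes only what is said
above: each message starts with a line beginning ".MSG_COUNT".  The
function name and the layout of the other lines in the sample are made
up for the example.

```python
from typing import Iterable, Iterator, List

MARKER = ".MSG_COUNT"  # marker text as quoted in this message


def split_messages(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per message, each starting at a marker line.

    Hypothetical helper: lines seen before the first marker are ignored.
    """
    current = None
    for line in lines:
        if line.startswith(MARKER):
            # A new marker closes the previous message, if there was one.
            if current is not None:
                yield current
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        yield current


# Made-up sample: the per-token lines are placeholders, not the real layout.
sample = [
    ".MSG_COUNT 10 20\n",
    "foo 1 2\n",
    ".MSG_COUNT 10 20\n",
    "bar 0 3\n",
    "baz 1 1\n",
]
messages = list(split_messages(sample))
```

With the sample above, `messages` holds two entries, one per marker
line, so a single file can carry any number of messages.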

Chunks of 5,000 are fine.  Likely we'll combine them into two large
files for convenience.  As your initials are bp, the files are likely to
be named bp.1203.ns.mc and bp.1203.sp.mc, i.e. initials.date.type.mc.
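
A small Python sketch of the naming scheme and of combining chunks,
for illustration only.  The helper names are hypothetical, and plain
concatenation is an assumption on my part: it should be safe if every
message carries its own ".MSG_COUNT" line and every chunk ends with a
newline.

```python
import shutil
from pathlib import Path
from typing import Iterable


def mc_filename(initials: str, date: str, kind: str) -> str:
    """Build a name of the form initials.date.type.mc (hypothetical helper)."""
    return f"{initials}.{date}.{kind}.mc"


def combine_chunks(chunks: Iterable[Path], dest: Path) -> None:
    """Concatenate chunk files into dest, in the order given.

    Assumes each chunk ends with a newline, so no two lines get fused.
    """
    with dest.open("wb") as out:
        for chunk in chunks:
            with chunk.open("rb") as src:
                shutil.copyfileobj(src, out)


# The two names mentioned above, built from initials "bp" and date "1203".
names = [mc_filename("bp", "1203", kind) for kind in ("ns", "sp")]
```

Here `names` comes out as bp.1203.ns.mc and bp.1203.sp.mc.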

David



