contributing datasets, was: Is bogotune helpful?

David Relson relson at osagesoftware.com
Sun Dec 7 05:32:05 CET 2003


On Sat, 06 Dec 2003 20:17:18 -0800
Bill Wohler <wohler at newt.com> wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > A message count file can hold many messages.  The msg-count.sh
> > script puts a ".MSG_COUNT spam ham" line at the beginning of each
> > message. Bogofilter recognizes these lines and uses them to separate
> > the input stream into messages.
> 
> I just built these message count files using the formail script that
> Greg provided. The .MSG_COUNT lines seemed nonsensical to me.
> 
> My ham msgbox had 17109 messages. My spam msgbox had 20,000 messages.
> However, a grep of MSG_COUNT revealed 17119 and 20,000 respectively.
> Why would the former be a little off?

Hi Bill,

Sounds familiar.  I don't recall you mentioning the format of your ham
and spam collections.  Are they in mbox format, with "^From " as the
message separators?  What count does 'grep -c "^From " ham.mbx' give?
An educated guess is that formail split the ham collection into 10
extra messages because of extra "^From " lines.
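
A rough way to check that guess, as a sketch only: count the "^From "
lines and list the ones that do not look like mbox envelope lines.  The
function names count_from and suspect_from are made up for this
example, and the "third field is a weekday name" test is a heuristic
assumption about the envelope format, not something formail itself
does.

```shell
# Count every line that starts with "From " (same as the grep above).
count_from() {
    grep -c '^From ' "$1"
}

# List "From " lines whose third field is not a weekday name.
# An mbox envelope line usually looks like:
#   From user@example.com Sun Dec  7 05:32:05 2003
# so a body line such as "From what I can tell ..." gets flagged.
# Heuristic only: unusual envelope formats will be flagged as well.
suspect_from() {
    awk '/^From / && $3 !~ /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)$/ {
        print NR ": " $0
    }' "$1"
}
```

Running 'count_from ham.mbx' and then 'suspect_from ham.mbx' should
show whether the surplus comes from body lines that begin with
"From ".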

> In both the spam *and* ham message count file, all of the MSG_COUNT
> tokens are exactly this:
> 
>   ".MSG_COUNT" 10000 8614
>
> Interesting. I created the message count file with this:
> 
> cat $spam_training_set $spam_tuning_set |
>     formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.sp.mc
> 
> cat $ham_training_set $ham_tuning_set |
>     formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.ns.mc
> 
> The numbers above correspond to the number of messages in
> $spam_training_set and $ham_training_set respectively which were also
> used to populate wordlist.db.

That's correct.  The bogol script uses bogolexer to convert the input
message to tokens, then uses sort to order them and remove duplicates,
then uses bogoutil to look up the tokens in the specified wordlist, then
uses awk to format the output.
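
The shape of that pipeline can be sketched with standard tools.  This
is NOT the real bogol script: tr stands in for bogolexer, a plain text
file with "token spam_count ham_count" lines stands in for the wordlist
that bogoutil reads, and the name mc_line is invented for the example.
It only illustrates why every message gets the same ".MSG_COUNT" line:
the counts are looked up in the wordlist, not derived from the message.

```shell
# Stand-in for the tokenize / sort / look up / format steps.
# $1: plain-text wordlist (hypothetical format: token spam ham)
# stdin: one message
mc_line() {
    tr -cs '[:alnum:]' '\n' |        # crude tokenizer (not bogolexer)
    tr '[:upper:]' '[:lower:]' |
    sort -u |                        # order tokens, drop duplicates
    awk -v wl="$1" '
        BEGIN {
            # Load the stand-in wordlist into two lookup tables.
            while ((getline line < wl) > 0) {
                split(line, f, " ")
                spam[f[1]] = f[2]
                ham[f[1]] = f[3]
            }
            close(wl)
            # The message counts come from the wordlist, so they are
            # identical for every message processed against it.
            printf "\"%s\" %d %d\n", ".MSG_COUNT", spam[".MSG_COUNT"], ham[".MSG_COUNT"]
        }
        # Emit only tokens that the wordlist knows about.
        ($0 in spam) { printf "\"%s\" %d %d\n", $0, spam[$0], ham[$0] }
    '
}
```

For example, 'mc_line words.txt < one-message.txt' prints the
".MSG_COUNT" line followed by one line per known token.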

> Since there is a MSG_COUNT entry for all of the messages, I would have
> expected the counts to be counts of the tokens in the particular
> message, not the total in the training set.

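Bogofilter uses MSG_COUNT values from the database to normalize scores,
so each .MSG_COUNT line carries the wordlist's message totals (here
10000 spam and 8614 ham), not counts for the individual message.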
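
To illustrate what normalizing by the message counts means, here is a
minimal sketch.  It assumes the basic ratio of per-class frequencies,
(b/nbad) / (b/nbad + g/ngood); bogofilter's actual scoring applies
further adjustments on top of this, so treat the function (token_prob,
a made-up name) as an illustration and not as the real computation.

```shell
# $1: spam count for a token    $2: ham count for the same token
# $3: spam messages in wordlist $4: ham messages in wordlist
token_prob() {
    awk -v b="$1" -v g="$2" -v nb="$3" -v ng="$4" 'BEGIN {
        pb = b / nb          # fraction of spam messages with the token
        pg = g / ng          # fraction of ham messages with the token
        if (pb + pg == 0) {  # token never seen: no estimate possible
            print "n/a"
            exit
        }
        printf "%.4f\n", pb / (pb + pg)
    }'
}
```

With the counts from the files above, 'token_prob 100 100 10000 8614'
gives a value slightly below 0.5: a token seen 100 times in each class
is relatively more common in ham, because there are fewer ham messages.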

> Anyway, I passed the files along to Greg and David. Perhaps they can
> view the files and explain why they look like they do.

I think I've answered all your questions.  If not, let me know.

David


-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800