contributing datasets, was: Is bogotune helpful?

Bill Wohler wohler at newt.com
Sun Dec 7 05:17:18 CET 2003


David Relson <relson at osagesoftware.com> writes:

> A message count file can hold many messages.  The msg-count.sh script
> puts a ".MSG_COUNT spam ham" line at the beginning of each message. 
> Bogofilter recognizes these lines and uses them to separate the input
> stream into messages.

I just built these message count files using the formail script that
Greg provided. The .MSG_COUNT lines seemed nonsensical to me.

My ham msgbox had 17109 messages and my spam msgbox had 20000.
However, a grep for MSG_COUNT found 17119 and 20000 lines respectively.
Why would the ham count be off by 10?
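
For anyone who wants to reproduce the comparison, here is a minimal
sketch of the two counts. The function names are hypothetical (they are
not part of bogofilter), and it assumes that messages in the mbox are
separated by lines starting with "From " and that every MSG_COUNT line
contains the literal string MSG_COUNT:

```shell
# Count mbox message separators: lines that begin with "From ".
# Assumption: body lines that begin with "From " are escaped (">From ").
count_from_lines() {
    grep -c '^From ' "$1"
}

# Count the MSG_COUNT lines in a message count file.
count_msg_count_lines() {
    grep -c 'MSG_COUNT' "$1"
}
```

Running count_from_lines on the mbox and count_msg_count_lines on the
matching message count file gives the two numbers to compare. If they
differ, the mbox may contain unescaped "From " lines in message bodies,
but that is only a guess.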

In both the spam *and* ham message count file, all of the MSG_COUNT
tokens are exactly this:

  ".MSG_COUNT" 10000 8614

Interesting. I created the message count file with this:

cat $spam_training_set $spam_tuning_set |
    formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.sp.mc

cat $ham_training_set $ham_tuning_set |
    formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.ns.mc

The numbers above (10000 and 8614) are the message counts of
$spam_training_set and $ham_training_set respectively, which were also
used to populate wordlist.db.

Since there is a MSG_COUNT entry for every message, I would have
expected its counts to describe the tokens in that particular
message, not the totals for the training set.
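
To check that every MSG_COUNT line in a file really carries the same
pair of numbers, something like the following sketch should do. The
function name is hypothetical, and the sed step only strips leading
whitespace in case the lines are indented:

```shell
# Print each distinct MSG_COUNT line once, prefixed by how often it occurs.
distinct_msg_count_lines() {
    grep 'MSG_COUNT' "$1" | sed 's/^[[:space:]]*//' | sort | uniq -c
}
```

A single line of output means all of the MSG_COUNT entries in that file
are identical.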

Anyway, I passed the files along to Greg and David. Perhaps they can
view the files and explain why they look like they do.

-- 
Bill Wohler <wohler at newt.com>  http://www.newt.com/wohler/  GnuPG ID:610BD9AD
Maintainer of comp.mail.mh FAQ and MH-E. Vote Libertarian!
If you're passed on the right, you're in the wrong lane.
