contributing datasets, was: Is bogotune helpful?
Bill Wohler
wohler at newt.com
Sun Dec 7 05:17:18 CET 2003
David Relson <relson at osagesoftware.com> writes:
> A message count file can hold many messages. The msg-count.sh script
> puts a ".MSG_COUNT spam ham" line at the beginning of each message.
> Bogofilter recognizes these lines and uses them to separate the input
> stream into messages.
I just built these message count files using the formail script that
Greg provided. The .MSG_COUNT lines seemed nonsensical to me.
My ham msgbox had 17,109 messages. My spam msgbox had 20,000 messages.
However, a grep for MSG_COUNT found 17,119 and 20,000 lines respectively.
Why would the former be a little off?
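
One way to narrow down a mismatch like this is to count the mbox
separator lines and the MSG_COUNT lines side by side. The sketch below
is only an illustration: the function names and the file names in the
usage note are hypothetical, and it assumes the mailbox is in plain mbox
format, where each message starts with a "From " line. A possible (but
unconfirmed) cause of a small surplus is a body line that begins with an
unescaped "From ", which a splitter may treat as the start of a new
message.

```shell
#!/bin/sh
# Hypothetical helpers; the names are not part of bogofilter or procmail.

# Count mbox separator lines (lines that begin with "From ").
count_from_lines() {
    grep -c '^From ' "$1"
}

# Count the .MSG_COUNT lines in a message count file.
count_msg_count_lines() {
    grep -c 'MSG_COUNT' "$1"
}

# Print both counts and their difference for one mailbox / count-file pair.
compare_counts() {
    mbox_n=$(count_from_lines "$1")
    mc_n=$(count_msg_count_lines "$2")
    echo "mbox: $mbox_n  msg-count: $mc_n  difference: $((mc_n - mbox_n))"
}
```

With the functions loaded, something like `compare_counts ham.mbox
bw.ns.mc` (both file names are placeholders) shows whether the surplus
already exists at the "From " level or only appears after splitting.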
In both the spam *and* ham message count file, all of the MSG_COUNT
tokens are exactly this:
  ".MSG_COUNT" 10000 8614
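
To confirm that every MSG_COUNT line really is identical, the lines can
be collapsed with standard tools. This is a minimal sketch; the function
name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list each distinct MSG_COUNT line in a message
# count file, prefixed by the number of times it occurs.
distinct_msg_counts() {
    grep 'MSG_COUNT' "$1" | sort | uniq -c
}
```

A single line of output means all of the entries carry the same pair of
numbers.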
Interesting. I created the message count file with this:
  cat $spam_training_set $spam_tuning_set |
    formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.sp.mc
  cat $ham_training_set $ham_tuning_set |
    formail -s bogol /home/wohler/.bogofilter.dev > bw.$date.ns.mc
The numbers above (10000 and 8614) correspond to the number of messages
in $spam_training_set and $ham_training_set respectively, which were
also used to populate wordlist.db.
Since there is a MSG_COUNT entry for all of the messages, I would have
expected the counts to be counts of the tokens in the particular
message, not the total in the training set.
Anyway, I passed the files along to Greg and David. Perhaps they can
view the files and explain why they look like they do.
--
Bill Wohler <wohler at newt.com> http://www.newt.com/wohler/ GnuPG ID:610BD9AD
Maintainer of comp.mail.mh FAQ and MH-E. Vote Libertarian!
If you're passed on the right, you're in the wrong lane.