contributing datasets, was: Is bogotune helpful?

Bill Wohler wohler at newt.com
Sun Dec 7 05:47:44 CET 2003


David Relson <relson at osagesoftware.com> writes:

> On Sat, 06 Dec 2003 20:17:18 -0800
> Bill Wohler <wohler at newt.com> wrote:
>
>> David Relson <relson at osagesoftware.com> writes:
>> 
>> > A message count file can hold many messages.  The msg-count.sh
>> > script puts a ".MSG_COUNT spam ham" line at the beginning of each
>> > message. Bogofilter recognizes these lines and uses them to separate
>> > the input stream into messages.
>> 
>> I just built these message count files using the formail script that
>> Greg provided. The .MSG_COUNT lines seemed nonsensical to me.
>> 
>> My ham msgbox had 17,109 messages. My spam msgbox had 20,000 messages.
>> However, a grep of MSG_COUNT revealed 17,119 and 20,000 respectively.
>> Why would the former be a little off?
>
> Hi Bill,
>
> Sounds familiar.  I don't recall you mentioning the format of your ham
> and spam collections.  Are they in mbox format, with "^From " as the
> message separators?  What count does 'grep -c "^From " ham.mbx' give? 
> An educated guess is that formail split the ham collection into 10
> extra messages because of extra "^From " lines.

They are mbox with "^From " separators. Grepping for this is how I came
up with the 17,109 and 20,000 counts.
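
For anyone who wants to repeat the comparison, here is a rough sketch of
it as a pair of shell functions. The function names and the file names in
the usage line are hypothetical, and the pattern assumes each marker line
begins with ".MSG_COUNT" in column one, as described above; none of this is
taken from the msg-count.sh script itself.

```shell
# Sketch only: compare the number of mbox separators in the original
# mailbox with the number of .MSG_COUNT markers in the generated file.

count_lines() {
    # $1 = regular expression, $2 = file; prints the number of matching lines.
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c -- "$1" "$2" || true
}

compare_counts() {
    # $1 = original mbox, $2 = message count file built from it
    from=$(count_lines '^From ' "$1")
    marks=$(count_lines '^\.MSG_COUNT' "$2")
    echo "From_lines=$from MSG_COUNT_lines=$marks diff=$((marks - from))"
}
```

Something like 'compare_counts ham.mbx ham.msg-count' would then print both
counts and their difference on one line; a non-zero diff means the message
count file was split into a different number of messages than the mbox
separators suggest.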

> Bogofilter uses MSG_COUNT values from the database to normalize scores.

If that's what you were expecting, then we're good. Hope you find the
data useful.

-- 
Bill Wohler <wohler at newt.com>  http://www.newt.com/wohler/  GnuPG ID:610BD9AD
Maintainer of comp.mail.mh FAQ and MH-E. Vote Libertarian!
If you're passed on the right, you're in the wrong lane.



