contributing datasets, was: Is bogotune helpful?

David Relson relson at osagesoftware.com
Sun Dec 7 06:08:04 CET 2003


On Sat, 06 Dec 2003 20:47:44 -0800
Bill Wohler <wohler at newt.com> wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > On Sat, 06 Dec 2003 20:17:18 -0800
> > Bill Wohler <wohler at newt.com> wrote:
> >
> >> David Relson <relson at osagesoftware.com> writes:
> >> 
> >> > A message count file can hold many messages.  The msg-count.sh
> >> > script puts a ".MSG_COUNT spam ham" line at the beginning of each
> >> > message. Bogofilter recognizes these lines and uses them to
> >> > separate the input stream into messages.
> >> 
> >> I just built these message count files using the formail script
> >> that Greg provided. The .MSG_COUNT lines seemed nonsensical to me.
> >> 
> >> My ham msgbox had 17109 messages. My spam msgbox had 20,000
> >> messages. However, a grep of MSG_COUNT revealed 17119 and 20,000
> >> respectively. Why would the former be a little off?
> > 

Bill,

Exactly what commands did you run to get your counts of 17109, 17119,
and 20000?

Here are two slightly different grep commands and their results:

### for f in *mc ; do grep -c ^..MSG_COUNT $f ; done
17079
19948

### for f in *mc ; do grep -c MSG_COUNT $f ; done
17090
19948

The difference is normal, since .MSG_COUNT appears periodically in
bogofilter-related messages and goes into the wordlist as "MSG_COUNT"
(without the leading period).  Since bogofilter's parsing doesn't
include the leading period, we were able to use .MSG_COUNT for
storing meta-information, knowing that it wouldn't conflict with a
"real" token.
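
To make the anchored/unanchored difference concrete, here is a minimal
sketch on a made-up sample.  The file layout is an assumption for
illustration only (marker lines starting with a quote and a period before
MSG_COUNT, which is what the "^.." in the first pattern would match); real
.mc files may look different.

```shell
# Hypothetical sample data; NOT taken from a real message count file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
".MSG_COUNT" 0 1
"hello" 0 1
"world" 0 1
".MSG_COUNT" 0 1
"MSG_COUNT" 0 1
"bogofilter" 0 1
EOF

# Anchored: any two characters, then MSG_COUNT, at the start of a line.
# Only the per-message marker lines match.
anchored=$(grep -c '^..MSG_COUNT' "$sample")

# Unanchored: MSG_COUNT anywhere on the line, so the ordinary
# "MSG_COUNT" token from a bogofilter-related message is counted too.
unanchored=$(grep -c 'MSG_COUNT' "$sample")

echo "anchored=$anchored unanchored=$unanchored"
# prints: anchored=2 unanchored=3

rm -f "$sample"
```

In this sample, the one extra unanchored hit is the plain token line, not a
message marker.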

> They are mbox with "^From " separators. Grepping for this is how I
> came up with the 17,109 and 20,000 counts.

'Tis a bit strange that I'm seeing different counts than you reported
for the .mc files.

> 
> > Bogofilter uses MSG_COUNT values from the database to normalize
> > scores.
> 
> If that's what you were expecting, then we're good. Hope you find the
> data useful.

It certainly will be interesting!  Greg's work and home corpora have
different characteristics from mine.  We're curious to see if yours is
similar to any of ours or, if not, how it differs.

Right now, I'm at the end of a long day and it's time for a break. 
Perhaps tomorrow we can figure it out.  

David



