New script to train bogofilter
David Relson
relson at osagesoftware.com
Thu Jul 3 15:16:32 CEST 2003
At 09:04 AM 7/3/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
>[randomtrain]
> > I've used it and think I understand it. First, it creates an index of all
> > the messages. Then it shuffles them. Using the shuffled index, it scores
> > each message and trains on errors.
>
>In that order? Mine always scores with the database after
>training with previous messages. I think this is what
>randomtrain must also do.
>
>I don't really see an advantage of shuffling.
>
> > I'm sure it _does_ look at all messages. Like yours, the resulting
> > wordlists are much smaller. Seems like a small percentage of ham messages
> > trigger training while a fairly large percentage of spam train. I don't
> > remember exact percentages, but I'd guess they were approx 10% and 40%.
>
>That would be way more than mine.
pi,
I'd expect results to be similar.
I found a file of results from May 20. Wanting to see if the error counts
would go to zero, I ran randomtrain several times in succession. Here are
the numbers:
run        type     msg    reg   words     pct
0520.0035  spam  14,485  2,066  37,555  14.26%
           good  34,156    155  58,085   0.45%
0520.0147  spam  14,485    649  39,646   4.48%
           good  34,156     74  64,493   0.22%
0520.0249  spam  14,485    249  40,696   1.72%
           good  34,156     39  65,227   0.11%
0520.0346  spam  14,485    134  40,865   0.93%
           good  34,156      9  65,306   0.03%
0520.0515  spam  14,485    121  40,963   0.84%
           good  34,156      4  65,309   0.01%
"reg" (the number of messages registered) never did make it to zero, but
did continue to go down.
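
For anyone who wants to see the idea in code, here is a rough Python sketch of a train-on-error pass as described in this thread: index the messages, shuffle the index, score each message against the database as it stands at that moment, and register only the ones that are misclassified. It is not the randomtrain script and does not call bogofilter. ToyFilter, train_on_error and the sample messages are made-up names and data for illustration, and the word-count scoring is only a stand-in for bogofilter's real calculation.

```python
import random
from collections import Counter


class ToyFilter:
    """Stand-in word-count classifier (assumption: NOT bogofilter's algorithm)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def classify(self, text):
        # Compare how often the message's words were registered per class.
        words = text.lower().split()
        spam = sum(self.counts["spam"][w] for w in words)
        good = sum(self.counts["good"][w] for w in words)
        return "spam" if spam > good else "good"

    def register(self, text, label):
        # Add the message's words to the wordlist for its true class.
        self.counts[label].update(text.lower().split())


def train_on_error(filt, messages, seed=None):
    """One pass: shuffle the index, score each message, train only on errors."""
    order = list(range(len(messages)))      # index of all messages
    random.Random(seed).shuffle(order)      # shuffled index
    registered = Counter()
    for i in order:
        text, label = messages[i]
        # Scoring uses the database as updated by earlier errors in this pass.
        if filt.classify(text) != label:
            filt.register(text, label)
            registered[label] += 1
    return registered


# Hypothetical sample data; repeated passes mimic running several times in a row.
demo_messages = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "good"),
    ("buy cheap watches now", "spam"),
    ("lunch on monday?", "good"),
]
demo_filter = ToyFilter()
for run in range(3):
    print(run, dict(train_on_error(demo_filter, demo_messages, seed=run)))
```

Because only the errors are registered, each pass adds fewer messages than the one before, which is the same trend as the "reg" column.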
Looking at the above numbers, my memory of 10%/40% is way off. Hopefully
these numbers are closer to yours.
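
As a check on the arithmetic, the pct column appears to be reg divided by msg. The short Python sketch below recomputes it from the figures in the table. The names rows and pct are just for this illustration, and reading "msg" as the number of messages in each corpus is my assumption.

```python
# (run, type, msg, reg) copied from the table above.
rows = [
    ("0520.0035", "spam", 14485, 2066),
    ("0520.0035", "good", 34156, 155),
    ("0520.0147", "spam", 14485, 649),
    ("0520.0147", "good", 34156, 74),
    ("0520.0249", "spam", 14485, 249),
    ("0520.0249", "good", 34156, 39),
    ("0520.0346", "spam", 14485, 134),
    ("0520.0346", "good", 34156, 9),
    ("0520.0515", "spam", 14485, 121),
    ("0520.0515", "good", 34156, 4),
]


def pct(reg, msg):
    """Registered messages as a percentage of all messages of that type."""
    return 100.0 * reg / msg


for run, kind, msg, reg in rows:
    print(f"{run}  {kind}  {pct(reg, msg):6.2f}%")
```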
David