New script to train bogofilter

David Relson relson at osagesoftware.com
Thu Jul 3 15:16:32 CEST 2003


At 09:04 AM 7/3/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
>[randomtrain]
> > I've used it and think I understand it.  First, it creates an index of all
> > the messages.  Then it shuffles them.  Using the shuffled index, it scores
> > each message and trains on errors.
>
>In that order? Mine always scores with the database after
>training with previous messages. I think this is what
>randomtrain must also do.
>
>I don't really see an advantage of shuffling.
>
> > I'm sure it _does_ look at all messages.  Like yours, the resulting
> > wordlists are much smaller.  Seems like a small percentage of ham messages
> > trigger training while a fairly large percentage of spam train.  I don't
> > remember exact percentages, but I'd guess they were approx 10% and 40%.
>
>That would be way more than mine.

pi,

I'd expect results to be similar.

I found a file of results from May 20.  Wanting to see if the error counts 
would go to zero, I ran randomtrain several times in succession.  Here are 
the numbers:

                           msg     reg    words      pct
0520.0035       spam    14,485   2,066   37,555   14.26%
                good    34,156     155   58,085    0.45%

0520.0147       spam    14,485     649   39,646    4.48%
                good    34,156      74   64,493    0.22%

0520.0249       spam    14,485     249   40,696    1.72%
                good    34,156      39   65,227    0.11%

0520.0346       spam    14,485     134   40,865    0.93%
                good    34,156       9   65,306    0.03%

0520.0515       spam    14,485     121   40,963    0.84%
                good    34,156       4   65,309    0.01%

"reg" (the number of messages registered) never did make it to zero, but 
did continue to go down.
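
For anyone who wants to see the shape of the loop, here is a rough sketch of repeated train-on-error passes as I understand them: shuffle the index, score each message against the wordlists as they stand at that moment, and register only the messages that score wrong. This is not randomtrain's code. ToyFilter, its word-count scoring, and the five sample messages are all made up for illustration and have nothing to do with bogofilter's real scoring.

```python
import random
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a wordlist filter; NOT bogofilter's scoring."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}

    def classify(self, text):
        # Compare how often the message's words appear in each wordlist.
        words = set(text.split())
        spam = sum(self.counts["spam"][w] for w in words)
        good = sum(self.counts["good"][w] for w in words)
        if spam == good:
            return "unsure"
        return "spam" if spam > good else "good"

    def register(self, text, label):
        self.counts[label].update(set(text.split()))


def train_on_error_pass(filt, messages, rng):
    """One pass: shuffle the index, score each message, register only errors."""
    index = list(range(len(messages)))
    rng.shuffle(index)
    registered = Counter()
    for i in index:
        text, label = messages[i]
        # Scoring sees the wordlists as updated by earlier messages in this pass.
        if filt.classify(text) != label:
            filt.register(text, label)
            registered[label] += 1
    return registered


# Made-up sample corpus (assumption, for illustration only).
messages = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "good"),
    ("lunch on monday", "good"),
    ("patch for the lexer attached", "good"),
]

filt = ToyFilter()
rng = random.Random(0)  # fixed seed so the shuffles are repeatable
history = [train_on_error_pass(filt, messages, rng) for _ in range(5)]
for n, reg in enumerate(history, start=1):
    print(n, reg["spam"], reg["good"])
```

The toy corpus is trivially separable, so its "reg" count drops to zero after the first pass. Real mail is not that tidy, which fits with my counts shrinking without reaching zero.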

Looking at the above numbers, my memory of 10%/40% is way off.  Hopefully 
these numbers are closer to yours.
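
In case the "pct" column is unclear: it is just reg divided by msg for that list. A quick check of the arithmetic, using only the counts from the table above (the variable and function names are mine):

```python
SPAM_MSGS = 14485
GOOD_MSGS = 34156

# (run label, spam messages registered, good messages registered)
runs = [
    ("0520.0035", 2066, 155),
    ("0520.0147", 649, 74),
    ("0520.0249", 249, 39),
    ("0520.0346", 134, 9),
    ("0520.0515", 121, 4),
]


def pct(reg, msgs):
    """Share of messages that triggered training, as a percentage."""
    return round(100.0 * reg / msgs, 2)


for label, spam_reg, good_reg in runs:
    print(f"{label}  spam {pct(spam_reg, SPAM_MSGS):6.2f}%  "
          f"good {pct(good_reg, GOOD_MSGS):5.2f}%")
```

So the first pass registered about 14% of the spam and under half a percent of the ham, not the 40% and 10% I remembered.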

David




