New script to train bogofilter

David Relson relson at osagesoftware.com
Thu Jul 3 14:56:32 CEST 2003


At 08:27 AM 7/3/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
> > Bogofilter already has a "train on error" script.  It is
> > contrib/randomtrain.
>
>I know. I discussed that with Greg.
>
> > As I view the two scripts, the big difference is that they use
> > different message orders when scoring.  You use the mailbox order,
> > which I presume to be the order of receipt, while randomtrain uses a
> > random ordering.
>
>Exactly.
>
> > Having written build-bogofilter-database.pl, I know that you prefer
> > it.  Question:  have you considered randomtrain?
>
>I had a look at it but could not really understand how it
>works. I guess it does not look at all messages. That might
>be another difference.
>
>pi

I've used it and think I understand it.  First, it creates an index of all
the messages.  Then it shuffles the index.  Using the shuffled index, it
scores each message and trains on errors.  It has a progress display that
shows 4 counts: messages scored and messages trained, each for ham and spam.
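
The steps just described (index, shuffle, score, train on error) can be
sketched roughly as follows.  This is not the randomtrain script itself; it
is a minimal Python illustration of the idea.  The names train_on_error,
ToyFilter, classify and train are hypothetical, and the word-overlap
classifier only stands in for a real filter such as bogofilter.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Message = Tuple[str, bool]  # (text, is_spam); a hypothetical representation


def train_on_error(messages: List[Message],
                   classify: Callable[[str], bool],
                   train: Callable[[str, bool], None],
                   seed: Optional[int] = None) -> Dict[str, int]:
    """Score every message in random order; train only on misclassified ones."""
    index = list(range(len(messages)))      # 1. index all the messages
    random.Random(seed).shuffle(index)      # 2. shuffle the index

    # Four counters, mirroring the progress display described above.
    counts = {"ham_scored": 0, "ham_trained": 0,
              "spam_scored": 0, "spam_trained": 0}

    for i in index:                         # 3. walk the shuffled index
        text, is_spam = messages[i]
        kind = "spam" if is_spam else "ham"
        counts[kind + "_scored"] += 1
        if classify(text) != is_spam:       # 4. wrong verdict: train on it
            train(text, is_spam)
            counts[kind + "_trained"] += 1
    return counts


class ToyFilter:
    """Stand-in classifier (an assumption, not bogofilter's algorithm)."""

    def __init__(self) -> None:
        self.spam_words: set = set()
        self.ham_words: set = set()

    def classify(self, text: str) -> bool:
        # Spam if the message shares more words with known spam than ham.
        words = set(text.lower().split())
        return len(words & self.spam_words) > len(words & self.ham_words)

    def train(self, text: str, is_spam: bool) -> None:
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.lower().split())


if __name__ == "__main__":
    corpus = [("buy pills now", True), ("meeting at noon", False),
              ("buy pills cheap", True), ("lunch meeting today", False)]
    toy = ToyFilter()
    print(train_on_error(corpus, toy.classify, toy.train, seed=1))
```

Every message is scored exactly once, so the two "scored" counters always
add up to the size of the corpus, while the "trained" counters only grow
when the filter gets a message wrong.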

I'm sure it _does_ look at all messages.  As with your script, the
resulting wordlists are much smaller.  It seems that only a small
percentage of ham messages trigger training, while a fairly large
percentage of spam messages do.  I don't remember the exact percentages,
but I'd guess they were roughly 10% and 40%, respectively.




