New script to train bogofilter
David Relson
relson at osagesoftware.com
Thu Jul 3 14:56:32 CEST 2003
At 08:27 AM 7/3/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
> > Bogofilter already has a "train on error" script. It is
> > contrib/randomtrain.
>
>I know. I discussed that with Greg.
>
> > As I view the two scripts, the big difference is that
> > they use different message orders when scoring. You use the mailbox
> > order, which I presume to be the order of receipt, while randomtrain
> > uses a random ordering.
>
>Exactly.
>
> > Having written build-bogofilter-database.pl, I know that you prefer
> > it. Question: have you considered randomtrain?
>
>I had a look at it but could not really understand how it
>works. I guess it does not look at all messages. That might
>be another difference.
>
>pi
I've used it and think I understand it. First, it creates an index of all
the messages. Then it shuffles the index. Using the shuffled index, it scores
each message and trains on errors. Its progress display shows four
counts: messages scored and messages trained, for both ham and spam.
I'm sure it _does_ look at all messages. As with your script, the resulting
wordlists are much smaller. It seems that a small percentage of ham messages
triggers training while a fairly large percentage of spam messages does. I
don't remember the exact percentages, but I'd guess they were approximately
10% and 40%.
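
The procedure described above (index, shuffle, score, train on errors) can be
sketched roughly as follows. This is only an illustration in Python, not the
actual contrib/randomtrain script: the function train_on_error, the ToyFilter
class, and the (text, is_spam) message representation are all hypothetical
names made up for this sketch, and the toy word-count classifier merely stands
in for bogofilter's scoring.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: (message text, True if the message is spam).
Message = Tuple[str, bool]


def train_on_error(
    messages: List[Message],
    classify: Callable[[str], bool],
    train: Callable[[str, bool], None],
    seed: int = 0,
) -> Dict[str, int]:
    """Score every message in random order; train only on misclassifications."""
    # Step 1: build an index of all the messages.
    index = list(range(len(messages)))
    # Step 2: shuffle the index (seeded so the run is repeatable).
    random.Random(seed).shuffle(index)

    # The four counts from the progress display: scored/trained for ham/spam.
    counts = {"ham_scored": 0, "ham_trained": 0,
              "spam_scored": 0, "spam_trained": 0}

    # Step 3: score each message; train only when the score is wrong.
    for i in index:
        text, is_spam = messages[i]
        kind = "spam" if is_spam else "ham"
        counts[kind + "_scored"] += 1
        if classify(text) != is_spam:
            train(text, is_spam)
            counts[kind + "_trained"] += 1
    return counts


class ToyFilter:
    """Stand-in classifier (NOT bogofilter): per-word ham and spam tallies."""

    def __init__(self) -> None:
        self.spam_words: Dict[str, int] = {}
        self.ham_words: Dict[str, int] = {}

    def classify(self, text: str) -> bool:
        # Call a message spam only if its words were seen more often in spam.
        words = text.split()
        spam = sum(self.spam_words.get(w, 0) for w in words)
        ham = sum(self.ham_words.get(w, 0) for w in words)
        return spam > ham

    def train(self, text: str, is_spam: bool) -> None:
        # Register every word of the message in the matching tally.
        table = self.spam_words if is_spam else self.ham_words
        for w in text.split():
            table[w] = table.get(w, 0) + 1
```

Every message is scored exactly once, so the scored counts always add up to
the size of the corpus; only the trained counts depend on how often the
classifier is wrong, which is why the resulting wordlists stay small.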