randomtrain - script to train on errors

David Relson relson at osagesoftware.com
Sun Dec 1 21:04:59 CET 2002


At 02:34 PM 12/1/02, Greg Louis wrote:

>On 20021201 (Sun) at 1253:55 -0500, David Relson wrote:
> >
> > You've been busy!  randomtrain sounds really neat.  Can we add it as a
> > contrib?  I'm thinking of splitting your message into contrib/randomtrain
> > and contrib/README.randomtrain.
>
>Sure.
>
> > Also, randomtrain can take advantage of the new message formatting
> > capability (in CVS).  Its task will be a bit easier with a config file
> > that enables tristate output and puts "Spam", "Ham", "Unsure" (as
> > appropriate) on the X-Bogosity line (instead of plain old "Yes", "No").  I
> > can patch the script to do this.
>
>Well, the 0/1/2 interpretation _is_ a bit more convoluted than it needs
>to be; there's a bit of historical cruft there that I didn't really
>have to keep.  I'd be happier if it would run ok in the _absence_ of a
>config file, though.

Greg,

Shell scripts make it really easy to generate a config file for special
purposes.  The regression tests, i.e. tests/t.*, do it all the time.  I've
attached a short patch that creates randomtrain.cf and uses it whenever
bogofilter is invoked.
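
For what it's worth, the same idea can be done with a temporary file so that nothing is left behind in the working directory.  This is only a sketch: the settings are the ones from the attached patch, while the mktemp/trap handling and the variable name cf are my own assumptions and not part of randomtrain.

```shell
#!/bin/sh
# Sketch: write a throwaway config file and clean it up on exit.
# The settings are the ones from the attached patch; the temp-file
# handling is an illustrative assumption, not part of randomtrain.

cf=$(mktemp "${TMPDIR:-/tmp}/randomtrain.cf.XXXXXX") || exit 1
trap 'rm -f "$cf"' EXIT

# Quoted delimiter: the body is written out literally, with no expansion.
cat > "$cf" <<'EOF'
algorithm = fisher
terse_format = %1.1c %d
spamicity_tags = Spam,Ham,Unsure
spamicity_formats = %6.2e %6.2e %0.6f
EOF

# Every bogofilter call in the script would then name the file, e.g.:
#   bogofilter -d "$bogodir" -c "$cf" <msg.$pid
echo "wrote $(wc -l < "$cf") config lines to $cf"
```

The attached patch keeps the simpler fixed name, randomtrain.cf, which is easier to inspect after a run.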

There are several benefits to doing it this way.  The config file can
explicitly set the algorithm, ham_cutoff, and the labels used for the terse
header.  Adding '-c randomtrain.cf' to the bogofilter command line ensures
those settings are in effect whenever bogofilter runs.  Specifying tristate
output (Spam/Ham/Unsure) means the script will _know_ what to expect from
bogofilter.  Without the '-c', bogofilter reads the default config files,
and any difference between those files and what is being tested will
result in training with the wrong parameters.
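
With the tags fixed by the config file, the script could branch on the tag text instead of the 0/1/2 exit code.  A rough sketch, assuming the line handed to the function begins with one of the tags from spamicity_tags (I have not checked this against real bogofilter output); the function name tag_to_code and the "u" and "e" codes are hypothetical, while "s" and "n" are the codes randomtrain already compares against.

```shell
#!/bin/sh
# Sketch only: map a tristate tag to the one-letter codes that randomtrain
# compares with its expected value ("s" = spam, "n" = non-spam).
# ASSUMPTION: the input line starts with Spam, Ham, or Unsure.
tag_to_code() {
    case "$1" in
        Spam*)   echo s ;;
        Ham*)    echo n ;;
        Unsure*) echo u ;;   # hypothetical code; never equals s or n
        *)       echo e ;;   # hypothetical code for unrecognised input
    esac
}

# A hard-coded line stands in for real bogofilter output here.
line="Spam 0.999"
got=$(tag_to_code "$line")
echo "got=$got"
```

Since "u" never matches the expected "s" or "n", an Unsure message would always be treated as a training error, which matches how the current exit code 2 behaves in the comparison.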

Of course, if you _still_ don't think it's a good idea, I'll keep the
patch for my own use.

David
-------------- next part --------------
--- randomtrain.orig	2002-12-01 12:55:36.000000000 -0500
+++ randomtrain	2002-12-01 14:55:25.000000000 -0500
@@ -43,6 +43,13 @@
 
 # params may be ok, here goes...
 
+cat <<EOF > randomtrain.cf
+algorithm = fisher
+terse_format = %1.1c %d
+spamicity_tags = Spam,Ham,Unsure
+spamicity_formats = %6.2e %6.2e %0.6f
+EOF
+
 # get all the byte offsets in all the files, in one list
 while [ ${#*} -gt 1 ]; do
     indic=${1:0-1:1} ; shift
@@ -81,7 +88,7 @@
 {
     while read expect fnam offset length; do
 	dd if=$fnam bs=1 skip=$offset count=$length 2>/dev/null >msg.$pid
-	bogofilter -d $bogodir <msg.$pid
+	bogofilter -d $bogodir -c randomtrain.cf <msg.$pid
 	got=$?	# 0=spam, 1=good, 2=unknown, 3=err
 	echo -n "bogo=$got, "
 	if [ $got -eq 0 ]; then got="s"; elif [ $got -eq 1 ]; then got="n"; fi
@@ -89,7 +96,7 @@
 	if [ $got != $expect ]; then
 	    echo -n ", reg=$expect"
 	    # comment out the next line for dry-run testing
-	    bogofilter -d $bogodir -$expect <msg.$pid
+	    bogofilter -d $bogodir -c randomtrain.cf -$expect <msg.$pid
 	fi
 	echo
     done


