randomtrain - script to train on errors
David Relson
relson at osagesoftware.com
Sun Dec 1 21:04:59 CET 2002
At 02:34 PM 12/1/02, Greg Louis wrote:
>On 20021201 (Sun) at 1253:55 -0500, David Relson wrote:
> >
> > You've been busy! randomtrain sounds really neat. Can we add it as a
> > contrib? I'm thinking of splitting your message into contrib/randomtrain
> > and contrib/README.randomtrain.
>
>Sure.
>
> > Also, randomtrain can take advantage of the new message formatting
> > capability (in CVS). Its task will be a bit easier with a config file
> > that enables tristate output and puts "Spam", "Ham", "Unsure" (as
> > appropriate) on the X-Bogosity line (instead of plain old "Yes", "No"). I
> > can patch the script to do this.
>
>Well, the 0/1/2 interpretation _is_ a bit more convoluted than it needs
>to be; there's a bit of historical cruft there that I didn't really
>have to keep. I'd be happier if it would run ok in the _absence_ of a
>config file, though.
Greg,
Shell scripts make it really easy to generate a config file for special
purposes. The regression tests, i.e. tests/t.*, do it all the time. I've
attached a short patch that will create randomtrain.cf and use it whenever
bogofilter is invoked.
There are several benefits to doing it this way: the config file explicitly
sets the algorithm, ham_cutoff, and labels used for the terse
header. Adding '-c randomtrain.cf' to the bogofilter command line ensures
that this setup is in effect whenever bogofilter runs. Specifying tristate
output (Spam/Ham/Unsure) means the script will _know_ what to expect from
bogofilter. Without the '-c', bogofilter will read the default config
files, and any difference between those files and what is being tested
will result in training with the wrong parameters.
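As a rough sketch of what "knowing what to expect" could look like, a script might map the tag on the X-Bogosity line to the one-letter codes randomtrain already uses ("s" and "n"). The helper name tag_to_code, the "u" code for Unsure, and the assumption that the tag immediately follows "X-Bogosity: " are all illustrative and not part of the attached patch.

```shell
# Hypothetical helper: map the tag on an X-Bogosity line to a one-letter code.
# Assumes the header looks like "X-Bogosity: <tag>..." where <tag> is one of
# the spamicity_tags values from randomtrain.cf (Spam, Ham, Unsure).
tag_to_code() {
    case "$1" in
        "X-Bogosity: Spam"*)   echo s ;;   # matches the script's "s" code
        "X-Bogosity: Ham"*)    echo n ;;   # matches the script's "n" code
        "X-Bogosity: Unsure"*) echo u ;;   # "u" is an assumed code for Unsure
        *)                     echo "?" ;; # unrecognized or missing header
    esac
}

# Example calls with hand-written header lines (not real bogofilter output).
tag_to_code "X-Bogosity: Spam"
tag_to_code "X-Bogosity: Unsure"
```

Because the tags come from the generated config file rather than from whatever defaults happen to be installed, a comparison like this would not depend on the user's own bogofilter configuration.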
Of course, if you _still_ don't think it's a good idea, I'll just keep the
patch for personal use.
David
-------------- next part --------------
--- randomtrain.orig 2002-12-01 12:55:36.000000000 -0500
+++ randomtrain 2002-12-01 14:55:25.000000000 -0500
@@ -43,6 +43,13 @@
# params may be ok, here goes...
+cat <<EOF > randomtrain.cf
+algorithm = fisher
+terse_format = %1.1c %d
+spamicity_tags = Spam,Ham,Unsure
+spamicity_formats = %6.2e %6.2e %0.6f
+EOF
+
# get all the byte offsets in all the files, in one list
while [ ${#*} -gt 1 ]; do
indic=${1:0-1:1} ; shift
@@ -81,7 +88,7 @@
{
while read expect fnam offset length; do
dd if=$fnam bs=1 skip=$offset count=$length 2>/dev/null >msg.$pid
- bogofilter -d $bogodir <msg.$pid
+ bogofilter -d $bogodir -c randomtrain.cf <msg.$pid
got=$? # 0=spam, 1=good, 2=unknown, 3=err
echo -n "bogo=$got, "
if [ $got -eq 0 ]; then got="s"; elif [ $got -eq 1 ]; then got="n"; fi
@@ -89,7 +96,7 @@
if [ $got != $expect ]; then
echo -n ", reg=$expect"
# comment out the next line for dry-run testing
- bogofilter -d $bogodir -$expect <msg.$pid
+ bogofilter -d $bogodir -c randomtrain.cf -$expect <msg.$pid
fi
echo
done