Testing shows katastrophy
David Relson
relson at osagesoftware.com
Wed Jan 22 13:25:17 CET 2003
Boris,
"Katastrophy" sounds like the right word. I guess I should remove the
special check for univie.ac.at <grin>.
On a more serious note, 0.10.0 _is_ beta software. It has lots of new
features. Prior to the release, the testing was limited. I know I've been
using it successfully for classifying my incoming messages.
The early testing of 0.10 has been good. People have been using it,
encountering problems, reporting them, and they're getting fixed, and the
fixes are going into CVS.
After your big training run, did you check the message counts in the word
lists? A significant error was uncovered in the MIME processing code that
affects training on mailboxes. The error causes an incorrect .MSG_COUNT
value to be computed and stored in the wordlist. This is likely to cause
incorrect spamicity scores because the scores use the ratio of a word's
occurrence to the number of messages. If you still have the bad databases,
run the command "bogoutil -w /wordlist/dir .MSG_COUNT" to display the
counts for .MSG_COUNT.
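
To illustrate why a bad .MSG_COUNT matters, here is a minimal sketch in Python. It is not bogofilter's actual code; the function name and the simplified formula (per-message frequency in spam divided by the sum of the spam and non-spam frequencies) are assumptions chosen only to show how the message count enters the ratio.

```python
def word_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Simplified per-word spamicity (an assumption, not bogofilter's code).

    spam_hits / ham_hits: how often the word was seen in each corpus.
    spam_msgs / ham_msgs: the message counts (the role .MSG_COUNT plays).
    """
    spam_freq = spam_hits / spam_msgs  # occurrences per spam message
    ham_freq = ham_hits / ham_msgs     # occurrences per non-spam message
    return spam_freq / (spam_freq + ham_freq)


# Same word counts, but the second call uses an inflated non-spam
# message count, as a miscounting bug during training might produce.
correct = word_spamicity(50, 50, 1000, 1000)
skewed = word_spamicity(50, 50, 1000, 4000)
print(correct, skewed)
```

With identical word counts, the inflated message count alone moves a neutral word well toward the spam end of the scale, which is why the counts are worth checking before trusting the scores.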
The quickest way to "see" why bogofilter classified a message as it did
(when using Robinson or Robinson-Fisher) is to generate the histogram using
"-vv" on the command line.
As a second detail, your use of "min_dev=0.2" will ignore all words with
spamicities between 0.3 and 0.7. This _may_ be a bit extreme. I use
"min_dev=0.1" with a high degree of success.
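
The effect of min_dev can be sketched as a filter on each word's distance from the neutral value 0.5. This is only an illustration in Python; the function name and the sample scores are made up, and the treatment of a word sitting exactly on the boundary is an assumption.

```python
def passes_min_dev(spamicity, min_dev):
    """True if the word is far enough from 0.5 to be used in scoring.

    Assumption: a word exactly at the boundary is kept.
    """
    return abs(spamicity - 0.5) >= min_dev


# Hypothetical spamicities for six words.
scores = [0.10, 0.25, 0.55, 0.65, 0.75, 0.90]

kept_02 = [s for s in scores if passes_min_dev(s, 0.2)]
kept_01 = [s for s in scores if passes_min_dev(s, 0.1)]
print(kept_02)
print(kept_01)
```

With min_dev=0.2 the words at 0.55 and 0.65 are both dropped; with min_dev=0.1 the word at 0.65 is used again, so more of the moderately informative words contribute to the result.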
The graham problem, i.e. "Internal error in graham.c:158]", is caused by
bogofilter choosing a long mime boundary as one of the 15 extrema
tokens. That flaw has been in 0.9.1.2 since it was released. I can send
you a patch for it.
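
For readers unfamiliar with the "extrema" idea, here is a rough Python sketch of picking the tokens whose spamicity is farthest from 0.5. It is not the code in graham.c and says nothing about the bug itself; the function name, the token names and the scores are hypothetical.

```python
import heapq


def pick_extrema(spamicities, count=15):
    """Return the `count` (token, spamicity) pairs farthest from 0.5.

    `spamicities` maps token -> spamicity. Illustrative only.
    """
    return heapq.nlargest(
        count,
        spamicities.items(),
        key=lambda item: abs(item[1] - 0.5),  # distance from neutral
    )


# Hypothetical tokens and scores; count=2 keeps the output short.
sample = {"viagra": 0.99, "meeting": 0.02, "hello": 0.60, "the": 0.45}
print(pick_extrema(sample, count=2))
```

Any token can end up in this set if its score is extreme enough, including one that is not a real word, such as a MIME boundary string.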
Plans are to release 0.10.1 in the next day or so. I haven't yet gotten to
the "-2" and "-3" options nor have I verified/fixed some other bug reports.
If you can update from CVS, that would be a good thing to do. If your
problems happen with the newer code, I _really_ want to hear about it. If
you can't use CVS, I can build a tarball of 0.10.0.cvs and send it to you.
David