bogominitrain.pl questions

Daniel Teichert danielt at ee.byu.edu
Fri Jun 11 22:22:23 CEST 2004


First off, thanks for a great piece of software! I get a *lot* of email
and a *lot* of spam, and bogofilter has really done a wonderful job!

Second, sorry if this is the wrong forum for this question/comment---or
if it's already been discussed to death.

And third---the actual content ;). I was wondering about a couple of
things with the bogominitrain.pl script. I've found that in addition to
being very helpful/effective for training purposes, running bogominitrain
with the -s option is a very useful way of checking my SPAM/HAM mboxes
for categorization mistakes; that is, very often the emails that are in
bogominitrain.ham.8 or bogominitrain.spam.12 are taking such a long time
to process because in fact I misclassified them. There are a couple of
things, though, that would be really useful when using bogominitrain
as a way of double-checking HAM/SPAM classification in a corpus, and
I'm wondering if it can already do this and I missed it, if there are
perhaps other ways to accomplish this, or whatever. Here are some of
the things I thought of:

* A way to set an upper loop limit (e.g., if you've gone through 10
times quit even if there are still errors).

* A way to set a lower-classification-error limit (e.g., if your false
positives + false negatives < 20, quit... hooks for *either* fp or fn
being under the limit, or for *both* fp and fn being individually under
the limit might also be handy).

(Note that I've tried the -n option as a way of 'quitting
early' but it doesn't do as good a job at picking out the
most-likely-to-be-misclassified ones in my very limited experience.)

* Not as significant but perhaps also handy would be a
"don't bother saving mailboxes until you've got below this limit" and
"only save the last mailbox you produced before hitting the limit."
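
To make the first two wishes concrete, here is a rough sketch of the
stopping rules in Python (not Perl, and not taken from bogominitrain.pl).
Everything in it is hypothetical: StopRules, should_stop, train_until and
the run_pass callback are made-up names, and run_pass merely stands in
for "do one training pass and report the false positive and false
negative counts".

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StopRules:
    """Hypothetical early-exit settings; none of these are real script options."""
    max_passes: Optional[int] = None        # upper loop limit
    max_total_errors: Optional[int] = None  # quit when fp + fn < this
    max_fp: Optional[int] = None            # quit when fp < this ...
    max_fn: Optional[int] = None            # ... and/or fn < this
    require_both: bool = True               # True: both fp and fn; False: either


def should_stop(rules: StopRules, passes_done: int, fp: int, fn: int) -> bool:
    """Return True if training should end after the pass just completed."""
    if fp == 0 and fn == 0:
        return True  # nothing left to train on
    if rules.max_passes is not None and passes_done >= rules.max_passes:
        return True
    if rules.max_total_errors is not None and fp + fn < rules.max_total_errors:
        return True
    fp_ok = rules.max_fp is not None and fp < rules.max_fp
    fn_ok = rules.max_fn is not None and fn < rules.max_fn
    if rules.max_fp is not None and rules.max_fn is not None:
        return (fp_ok and fn_ok) if rules.require_both else (fp_ok or fn_ok)
    return fp_ok or fn_ok  # at most one individual limit is set


def train_until(run_pass: Callable[[int], Tuple[int, int]],
                rules: StopRules) -> Tuple[int, int, int]:
    """Call run_pass(pass_number) -> (fp, fn) until a stop rule fires."""
    passes = 0
    while True:
        passes += 1
        fp, fn = run_pass(passes)
        if should_stop(rules, passes, fp, fn):
            return passes, fp, fn
```

With no limits set at all, this loop ends only when a pass reports zero
errors, which is the situation the infinite-loop question below is about.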

Finally, I'm curious as to whether bogominitrain could ever get into an
infinite training loop if there were, for instance, a copy of the exact
same email in both HAM and SPAM...
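
Whatever the answer turns out to be, the duplicate case can at least be
checked for before training. Below is a minimal Python sketch, assuming
the messages are already available as raw text strings; body_digest and
shared_messages are hypothetical helper names, and comparing only the
whitespace-trimmed body is my own simplification of "exact same email".

```python
import hashlib
from email import message_from_string
from typing import Iterable, List


def body_digest(raw_message: str) -> str:
    """Hash the message body, ignoring headers and trailing whitespace."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    if not isinstance(payload, str):
        # Multipart message: fall back to hashing the whole raw text
        # (headers included), so only fully identical copies will match.
        payload = raw_message
    normalized = "\n".join(line.rstrip() for line in payload.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8", "replace")).hexdigest()


def shared_messages(ham: Iterable[str], spam: Iterable[str]) -> List[int]:
    """Return indexes of spam messages whose body also appears in ham."""
    ham_digests = {body_digest(m) for m in ham}
    return [i for i, m in enumerate(spam) if body_digest(m) in ham_digests]
```

For real mbox files, the standard-library mailbox.mbox class can be
iterated and each message turned into text with as_string() before
being passed in.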

Does anyone else use bogominitrain for checking mis-classifications like
this? If there are any 'tricks' for doing these sorts of things with it
that I've missed, I'd be glad to hear of them.

Thanks again!
-- 
Daniel Teichert <danielt at ee.byu.edu>
ECEn CSR, Brigham Young University, $(cat /dev/std_disclaimers)
"Yea, it is the love of God, which sheddeth itself abroad in the
hearts of the children of men; wherefore, it is the most desirable
above all things." --(from) 1 Nephi 11:22 in _The Book of Mormon_


