bogominitrain.pl questions

Boris 'pi' Piwinger 3.14 at piology.org
Sat Jun 12 10:05:16 CEST 2004


Daniel Teichert <danielt at ee.byu.edu> wrote:

>And third---the actual content ;). I was wondering about a couple of
>things with the bogominitrain.pl script. I've found that in addition to
>being very helpful/effective for training purposes, running bogominitrain
>with the -s option is a very useful way of checking my SPAM/HAM mboxes
>for categorization mistakes; 

That is correct. I have been using it this way for a very long
time. When I first used it I detected lots of errors; now I only
occasionally find errors I have made in the meantime. Actually,
finding those and other problems was the reason for introducing -s.

>that is, very often the emails that are in
>bogominitrain.ham.8 or bogominitrain.spam.12 are taking such a long time
>to process because in fact I misclassified them. 

I am surprised that you get that many iterations. I usually
get by with far fewer and have never reached ten; I am not
sure about nine. In my experience, classification errors
usually show up in the early rounds.

>There are a couple of
>things, though, that would be really useful when using bogominitrain
>as a way of double-checking HAM/SPAM classification in a corpus, and
>I'm wondering if it can already do this and I missed it, if there are
>perhaps other ways to accomplish this, or whatever. Here are some of
>the things I thought of:
>
>* A way to set an upper loop limit (e.g., if you've gone through 10
>times quit even if there are still errors).

That is easily possible with one small change to the code,
which just checks the run count and exits. In the line
} until ($fn+$fp==0 || $hamadd+$spamadd==0 || !$force);
you can just add a further alternative of the form
|| $runs>9 (or any number you like).
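
A minimal sketch of such a run limit follows. It is not the script
itself: simulate_pass, train_with_limit and $max_runs are hypothetical
names, and the training pass is replaced by a toy function that never
reaches zero errors, so only the extra test on $runs ends the loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one training pass. It pretends that each
# pass halves the number of misclassified messages, but that one
# stubborn message never goes away.
# Returns ($fn, $fp, $hamadd, $spamadd).
sub simulate_pass {
    my ($errors) = @_;
    my $left = int($errors / 2);
    $left = 1 if $left < 1;
    return ($left, 0, 0, $errors);
}

# Repeat passes until the usual exit conditions hold or the
# assumed limit $max_runs is reached.
sub train_with_limit {
    my ($max_runs, $force, $errors) = @_;
    my ($fn, $fp, $hamadd, $spamadd);
    my $runs = 0;
    do {
        $runs++;
        ($fn, $fp, $hamadd, $spamadd) = simulate_pass($errors);
        $errors = $fn + $fp;
    } until ($fn+$fp==0 || $hamadd+$spamadd==0 || !$force
             || $runs >= $max_runs);   # the added condition
    return ($runs, $fn + $fp);
}

my ($runs, $left) = train_with_limit(10, 1, 1000);
print "stopped after $runs runs, $left misclassified\n";
```

With a limit of 10, $runs >= $max_runs is the same test as $runs>9.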

>* A way to set a lower-classification-error limit (e.g., if your false
>positive + false negatives < 20, quit... hooks for *either* fp or fn
>being under the limit, or for *both* fp and fn being individually under
>the limit might also be handy).

As you can see in the line above, the loop ends when the sum
$fn+$fp is zero. Change that test as you like, for example
to $fn+$fp<20.
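
The three variants asked about (sum, either count, both counts under
the limit) could be sketched as one helper. The name below_limit and
its $mode argument are made up for illustration and do not exist in
the script.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the error counts are under $limit.
#   'sum'    - false negatives plus false positives under the limit
#   'either' - at least one of the two counts under the limit
#   'both'   - each count individually under the limit
sub below_limit {
    my ($fn, $fp, $limit, $mode) = @_;
    return $fn + $fp < $limit               if $mode eq 'sum';
    return ($fn < $limit || $fp < $limit)   if $mode eq 'either';
    return ($fn < $limit && $fp < $limit)   if $mode eq 'both';
    die "unknown mode '$mode'\n";
}

# A limit of 1 in 'sum' mode matches the existing test $fn+$fp==0.
print "no errors left\n" if below_limit(0, 0, 1, 'sum');
```

In the until line, the expression $fn+$fp==0 would then be replaced by
the chosen test, written inline or as a call to such a helper.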

>* Not as significant but perhaps also handy would be a
>"don't bother saving mailboxes until you've got below this limit" and
>"only save the last mailbox you produced before hitting the limit."

I am not sure I really understand that, but you don't know
in advance how many mailboxes will be produced, so it cannot
work. Anyway, after the initial training the number will
always be small.

>Finally, I'm curious as to whether bogominitrain could ever get into an
>infinite training loop if there were, for instance, a copy of the exact
>same email in both HAM and SPAM...

That would happen. In practice even very similar messages
don't cause problems, so it would really take an identical
message filed in both mailboxes by mistake.
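
One way to look for such a message before training is to compare
message bodies across the two mailboxes. The following is only a
sketch, separate from bogominitrain.pl: the helper names are
hypothetical, the mbox parsing is simplistic (it splits at lines
starting with "From "), and headers are ignored on the assumption
that two copies of one message may differ there.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split mbox-format text into messages at lines starting with "From ".
sub split_mbox {
    my ($text) = @_;
    return grep { length } split /^(?=From )/m, $text;
}

# Digest of the body only: everything after the first blank line,
# with trailing whitespace removed.
sub body_digest {
    my ($msg) = @_;
    my (undef, $body) = split /\n\n/, $msg, 2;
    $body = '' unless defined $body;
    $body =~ s/\s+\z//;
    return md5_hex($body);
}

# Return the spam messages whose body also occurs in the ham text.
sub shared_bodies {
    my ($ham_text, $spam_text) = @_;
    my %in_ham;
    $in_ham{ body_digest($_) } = 1 for split_mbox($ham_text);
    return grep { $in_ham{ body_digest($_) } } split_mbox($spam_text);
}
```

Any message returned by shared_bodies is a candidate for having been
filed in both places and is worth checking by hand.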

pi


