bogominitrain.pl questions

Sat Jun 12 10:05:16 CEST 2004

Daniel Teichert <danielt at ee.byu.edu> wrote:

>And third---the actual content ;). I was wondering about a couple of
>things with the bogominitrain.pl script. I've found that in addition to
>being very helpful/effective for training purposes, running bogominitrain
>with the -s option is a very useful way of checking my SPAM/HAM mboxes
>for categorization mistakes; 

That is correct. I have been using this for a very long time
now. I detected lots of errors when I first used it. Now I
only once in a while find errors I made in the meantime.
Actually, finding those and other problems was the reason
for introducing -s.

>that is, very often the emails that are in
>bogominitrain.ham.8 or bogominitrain.spam.12 are taking such a long time
>to process because in fact I misclassified them.

I am surprised that you get that many iterations. I usually
get by with far fewer and have never reached ten; I am not
sure about nine. Classification errors usually show up in
early rounds, in my experience.

>There are a couple of
>things, though, that would be really useful when using bogominitrain
>as a way of double-checking HAM/SPAM classification in a corpus, and
>I'm wondering if it can already do this and I missed it, if there are
>perhaps other ways to accomplish this, or whatever. Here are some of
>the things I thought of:
>
>* A way to set an upper loop limit (e.g., if you've gone through 10
>times quit even if there are still errors).

That is easily possible with one small code change that
just checks the number of runs and exits. In the line
    } until ($fn+$fp==0 || $hamadd+$spamadd==0 || !$force);
you can just add another || condition of the form $runs>9,
with any number you like.
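
A minimal stand-alone sketch of that change, not the script
itself: training_pass and its canned numbers are hypothetical
stand-ins for a real training round, and $max_runs is an
assumed name for the limit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical results of successive training passes:
# (false negatives, false positives, ham added, spam added).
my @passes = ([40, 12, 30, 9], [15, 4, 11, 3], [6, 1, 5, 1], [0, 0, 0, 0]);

# Stand-in for one real training round; just replays the canned numbers.
sub training_pass {
    my ($run) = @_;
    return @{ $passes[$run - 1] };
}

my $max_runs = 3;   # assumed upper loop limit
my $force    = 1;   # keep looping while errors remain
my $runs     = 0;
my ($fn, $fp, $hamadd, $spamadd);

do {
    $runs++;
    ($fn, $fp, $hamadd, $spamadd) = training_pass($runs);
    print "run $runs: fn=$fn fp=$fp\n";
    # Same condition as quoted above, plus the extra run limit.
} until ($fn+$fp==0 || $hamadd+$spamadd==0 || !$force || $runs>=$max_runs);

print "stopped after $runs runs with fn=$fn fp=$fp\n";
```

With the limit at 3 the loop quits although errors remain;
without it, this toy data would need a fourth pass.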

>* A way to set a lower-classification-error limit (e.g., if your false
>positive + false negatives < 20, quit... hooks for *either* fp or fn
>being under the limit, or for *both* fp and fn being individually under
>the limit might also be handy).

As you can see above, the loop ends when the sum $fn+$fp is
zero. Change that test as you like, e.g. to $fn+$fp<20.
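
The three variants you mention could be sketched roughly like
this. The limit variables and the below_* helpers are
hypothetical names, not part of the script; any one of them
could replace the $fn+$fp==0 test in the until condition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical limits, not options of the real script.
my $max_sum = 20;   # limit for fn+fp together
my $max_fn  = 10;   # limit for false negatives alone
my $max_fp  = 10;   # limit for false positives alone

# True when the combined error count is under the limit.
sub below_sum    { my ($fn, $fp) = @_; return $fn + $fp < $max_sum; }
# True when both counts are individually under their limits.
sub below_both   { my ($fn, $fp) = @_; return $fn < $max_fn && $fp < $max_fp; }
# True when at least one count is under its limit.
sub below_either { my ($fn, $fp) = @_; return $fn < $max_fn || $fp < $max_fp; }

# Made-up (fn, fp) pairs to show how the three tests differ.
for my $counts ([25, 3], [12, 7], [4, 2]) {
    my ($fn, $fp) = @$counts;
    printf "fn=%d fp=%d sum:%s both:%s either:%s\n", $fn, $fp,
        (below_sum($fn, $fp)    ? "stop" : "go on"),
        (below_both($fn, $fp)   ? "stop" : "go on"),
        (below_either($fn, $fp) ? "stop" : "go on");
}
```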

>* Not as significant but perhaps also handy would be a
>"don't bother saving mailboxes until you've got below this limit" and
>"only save the last mailbox you produced before hitting the limit."

I am not sure I really understand that, but you don't know
in advance how many you produce, so it cannot work. Anyway,
after the initial training it will always be small.

>Finally, I'm curious as to whether bogominitrain could ever get into an
>infinite training loop if there were, for instance, a copy of the exact
>same email in both HAM and SPAM...

That would happen. In practice even very similar messages
don't cause problems, so it would really take an identical
message filed in both by mistake.
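
If you want to rule that case out beforehand, a rough check
could look like the sketch below. It is only an illustration
under assumptions: it compares message bodies only (headers
are ignored), splits the mbox at lines starting with "From ",
and uses made-up sample data instead of reading your files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Split mbox text at "From " separator lines and return a hash
# reference mapping body digest -> number of messages with that body.
sub body_digests {
    my ($mbox_text) = @_;
    my %seen;
    for my $msg (split /^(?=From )/m, $mbox_text) {
        next unless $msg =~ /\S/;
        # The body starts after the first blank line.
        my (undef, $body) = split /\n\n/, $msg, 2;
        $body = '' unless defined $body;
        $body =~ s/\s+\z//;          # ignore trailing blank lines
        $seen{ md5_hex($body) }++;
    }
    return \%seen;
}

# Made-up sample corpora; the second ham body is also in spam.
my $ham = <<'EOT';
From alice@example.org Sat Jun 12 10:00:00 2004
Subject: lunch

See you at noon.

From bob@example.org Sat Jun 12 10:01:00 2004
Subject: offer

Buy now!
EOT

my $spam = <<'EOT';
From carol@example.org Sat Jun 12 10:02:00 2004
Subject: fwd offer

Buy now!
EOT

my $ham_digests  = body_digests($ham);
my $spam_digests = body_digests($spam);

# Bodies whose digest occurs in both corpora.
my @both = grep { exists $spam_digests->{$_} } keys %$ham_digests;
print scalar(@both), " body digest(s) found in both ham and spam\n";
```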

pi