how to bogotune?

tallison at tacocat.net
Wed Sep 29 20:05:33 CEST 2004


> Man, bogotune is very difficult to figure out. The man page is confusing
> since it does not clearly state what it means by "wordlist" and "message
> files". After numerous readings I think I have figured out what it wants:
> I'm assuming that the "wordlist" is just my wordlist.db that I've built
> over months of using bogofilter and that the "message files" are some
> group of emails that I have separated into spam and nonspam categories.
> Is this close to correct?
>

Very, very, very close! :)
Actually, you're right on.

> Next question:
>
> If I have already trained bogofilter with the messages in question, can
> bogotune work on them? Or does that screw it up?
>

It does not screw anything up.

> Final question:
>
> The man page appears to say it wants 500+ messages of spam/ham each, with
> 2000+ ham/spam each in the wordlist. I certainly have more than 2000 each
> in my wordlist, and I fed it ~1000 each for messages but it complained
> about "low number / uniformity" of messages and produced no useful results
> (that I can tell).
>
> My guess is that the messages I feed in must NOT be already trained, or
> else they're all going to read 1.0000 and 0.0000 (and that would make
> sense). Is this correct?
>

No, that is not correct.
It's not that simple.

I don't remember the specifics of the wordlist counts and such, so I'll
leave that to better minds.  I have ~3K of each and I don't get any
complaints.  I also use my existing wordlist with existing messages (ones
that bogofilter has already seen).

The inconsistency comes when you have real email that scores 0.9999 and
real spam that scores 0.0001, i.e. messages that land at the wrong end of
the scale (scores run from 0 for ham to 1 for spam).  What
bogofilter/bogotune would like to see is a clean separation between ham
and spam, such that your highest ham might be 0.5 and your lowest spam
might be 0.6.  But when you have maybe 50 spam at 0.0001 and 50 ham at
0.9999, it will come back with cutoff numbers like:
ham_cutoff = 0.435
spam_cutoff = 0.000
Duh?
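
To make the "separation" idea concrete, here is a rough Python sketch.  It
is not what bogotune does internally; the scores and the separation()
helper are made up for illustration.  It just compares the highest ham
score with the lowest spam score, and a negative gap means the two groups
overlap:

```python
# Hypothetical scores (0 = ham-like, 1 = spam-like); not real bogofilter output.
ham_scores = [0.0001, 0.02, 0.10, 0.9999]   # last one is a misscored ham
spam_scores = [0.9999, 0.98, 0.85, 0.0001]  # last one is a misscored spam


def separation(ham, spam):
    """Return (highest ham score, lowest spam score, gap between them).

    A positive gap means every spam scores above every ham.
    A negative gap means the two groups overlap.
    """
    highest_ham = max(ham)
    lowest_spam = min(spam)
    return highest_ham, lowest_spam, lowest_spam - highest_ham


highest_ham, lowest_spam, gap = separation(ham_scores, spam_scores)
print(f"highest ham: {highest_ham}, lowest spam: {lowest_spam}, gap: {gap:.4f}")
if gap < 0:
    print("ham and spam overlap; any suggested cutoffs will look odd")
```

With just one ugly exception on each side the gap goes strongly negative,
which is the situation described above.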

Well, it's math.  It expects things to be well organized.  Yeah, right!

BTW - I get these errors routinely and sometimes have some really whacked
ham/spam cutoff values.  I generally ignore them and maybe make a slight
adjustment to my settings based on what it suggested.

The alternatives are to:
- discard the ugly exceptions, or
- train and retrain the ugly exceptions several times over until
  bogofilter gets them right, or at least scores them more reasonably
  (e.g. ~0.5).
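
For the "discard" route, a minimal sketch of picking out the exceptions
might look like this.  It assumes you have already collected
(filename, score) pairs for each message yourself; the file names, the
split_exceptions() helper and the 0.5 midpoint are all hypothetical, not
anything bogofilter or bogotune provides:

```python
def split_exceptions(scored, is_spam, midpoint=0.5):
    """Split (name, score) pairs into (kept, exceptions).

    A spam message is an exception when it scores below the midpoint;
    a ham message is an exception when it scores above it.
    """
    kept, exceptions = [], []
    for name, score in scored:
        wrong_side = score < midpoint if is_spam else score > midpoint
        (exceptions if wrong_side else kept).append((name, score))
    return kept, exceptions


# Hypothetical message names and scores, for illustration only.
ham = [("ham-001", 0.0001), ("ham-002", 0.12), ("ham-003", 0.9999)]
spam = [("spam-001", 0.9999), ("spam-002", 0.0001), ("spam-003", 0.91)]

ham_kept, ham_bad = split_exceptions(ham, is_spam=False)
spam_kept, spam_bad = split_exceptions(spam, is_spam=True)

print("ham to set aside or retrain:", [name for name, _ in ham_bad])
print("spam to set aside or retrain:", [name for name, _ in spam_bad])
```

The same list of exceptions is what you would feed back for retraining if
you would rather not throw the messages away.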



