Finding problem messages

Fri Apr 23 02:06:29 CEST 2010

On Thu, 22 Apr 2010 12:08:58 -0400
Jonathan Kamens wrote:

> Quoted from David Relson:
> >>   bogofilter -v -d . -n -B -M nonspam.mbx
> >>   bogofilter -v -d . -s -B -M spam.mbx
> >>   bogofilter -v -d . -M -I spam.mbx
> >>   bogofilter -v -d . -M -I nonspam.mbx
> The problem with this approach is that it will not build the same
> word list that bogotune builds when it builds an internal word list,
> at least not if I understand bogotune correctly.
> 
> When bogotune builds an internal word list, it uses half of the
> messages fed to it for building the word list, and then it uses the
> other half of the messages fed to it for scoring and tuning.
> 
> I suppose if I knew exactly how bogotune chooses which messages to
> use for the word list and which ones to use for tuning, I could
> reproduce its behavior by hand.  But since I do not know (I do not
> believe it is documented), I can't do that.
> 
> I could read the source code, sure, but it's easier just to wait
> until I have enough ham and spam messages in my real word list that
> bogotune doesn't have to build an internal one.
> 
>    jik

True.  The above technique doesn't duplicate bogotune.  It's a
technique that can help identify problem messages, so it's worth
a try.

Alternatively, since you have the bogotune source code, you can modify
it to identify problem messages.  Admittedly not a trivial exercise.

Another approach would be to score all your test messages using your
current wordlist and look for the lowest-scoring spam and the
highest-scoring ham.
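
To illustrate that last idea, here is a minimal sketch in Python.  It
assumes the scores have already been collected somehow into
(message id, score) pairs; the function name, the ids, and the scores
are all made up for the example, and nothing here is part of
bogofilter or bogotune.

```python
from typing import List, Tuple

# (message identifier, score); both values are hypothetical sample data.
Scored = Tuple[str, float]


def suspicious(spam: List[Scored], ham: List[Scored], n: int = 5):
    """Return the n lowest-scoring spam and the n highest-scoring ham."""
    low_spam = sorted(spam, key=lambda m: m[1])[:n]
    high_ham = sorted(ham, key=lambda m: m[1], reverse=True)[:n]
    return low_spam, high_ham


# Made-up scores, standing in for whatever the real scoring run produced.
spam_scores = [("spam-1", 0.99), ("spam-2", 0.12), ("spam-3", 0.87)]
ham_scores = [("ham-1", 0.01), ("ham-2", 0.76), ("ham-3", 0.05)]

low_spam, high_ham = suspicious(spam_scores, ham_scores, n=1)
print("lowest-scoring spam:", low_spam)
print("highest-scoring ham:", high_ham)
```

The messages at the top of each list are the ones to inspect by hand
first.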

Lastly, try the flags "-vvvvvv" (six 'v's), which trigger the
SCORE_DETAIL level of verbosity.  That will provide lots of
information (including "message number / bogosity score" pairs).  It's
been years since bogotune was developed and I don't recall exactly
what that output looks like, but it might be helpful.

Regards,

David