bogotune suggests spam_cutoff of 0?

David Relson relson at osagesoftware.com
Thu Apr 22 01:19:13 CEST 2010


On Wed, 21 Apr 2010 09:28:28 -0400
Jonathan Kamens wrote:

> I'm pretty sure that there are no misclassified messages in my ham
> and spam files.
> 
> I can't find out what happens if I remove the high-scoring ham or
> low-scoring spam messages from my files before training, because I
> don't know which messages those are.  This is because bogotune is
> building its own word list rather than using the one I've already
> got, and the scores it's reporting are based on that internal word
> list rather than my actual word list.  As a result, there's no way
> for me to figure out which messages it's claiming have those scores.
> 
> I think the short answer is that I need to wait until I've got a large
> enough body of ham and spam messages to be able to train using my
> real word list instead of the one built by bogotune.
> 
> Thanks for your help.
> 
>   jik

Whether there are misclassified messages or not is certainly for you to
judge.  Bogofilter is merely pointing out that there are very high
scoring non-spam and very low scoring spam -- an indication of possible
errors.

It shouldn't be too hard to find the problem messages.  The following 4
lines will build a wordlist from a pair of mbox files and then score
each message and print its score:

  bogofilter -v -d . -n -B -M nonspam.mbx
  bogofilter -v -d . -s -B -M spam.mbx
  bogofilter -v -d . -M -I spam.mbx
  bogofilter -v -d . -M -I nonspam.mbx
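
To pick the suspect messages out of that output, one option is to number
the score lines and sort them.  This is only a sketch: it assumes each
message produces one line containing "spamicity=<number>," (worth
checking against your own output) and that the lines come out in mbox
order.  rank_scores is a made-up helper name, not part of bogofilter.

```shell
# rank_scores (hypothetical helper): read score lines on stdin and print
# "<spamicity> <message number>" pairs, highest score first.
# Assumes one "spamicity=<number>," field per message line.
rank_scores() {
  awk -F'spamicity=' '/spamicity=/ {
    n++                     # position of the message in the mbox
    split($2, a, ",")       # keep only the number before the next comma
    print a[1], n
  }' | sort -rn
}
```

Piping the nonspam.mbx scoring command through "rank_scores | head"
should show the highest-scoring non-spam by position in the mbox; for
spam.mbx, "rank_scores | tail" should show the lowest-scoring spam.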

HTH,

David


