comments from a new user

Andrew Pimlott andrew at pimlott.net
Sat May 10 05:13:08 CEST 2003


On Fri, May 09, 2003 at 09:08:02PM -0400, David Relson wrote:
> At 08:34 PM 5/9/03, Andrew Pimlott wrote:
> >- The FAQ has an obviously wrong explanation of pgood and pbad.  It calls
> >  pgood the "likelihood that a message containing this token is non-spam"
> >  when (I think) it means the "likelihood that a non-spam message contains
> >  this token".
> 
> This is debatable.  What's presently in the FAQ isn't great, but the 
> correct wording isn't obvious.

Under the FAQ's definition, pgood and pbad would trivially have to
add up to 1, which is clearly not true.  :-)
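
To make the difference between the two readings concrete, here is a
minimal sketch with made-up counts.  The numbers and variable names are
assumptions for illustration only; this is not bogofilter's code or its
actual scoring formula.

```python
from fractions import Fraction

# Hypothetical counts, not taken from any real bogofilter wordlist.
good_msgs = 1000        # registered non-spam messages
bad_msgs = 500          # registered spam messages
good_with_token = 100   # registered non-spam messages containing the token
bad_with_token = 300    # registered spam messages containing the token

# Reading 1: likelihood that a non-spam (or spam) message contains the token.
pgood = Fraction(good_with_token, good_msgs)
pbad = Fraction(bad_with_token, bad_msgs)

# Reading 2 (the FAQ wording): likelihood that a message containing the
# token is non-spam (or spam), estimated from the registered messages.
with_token = good_with_token + bad_with_token
p_nonspam_given_token = Fraction(good_with_token, with_token)
p_spam_given_token = Fraction(bad_with_token, with_token)

print("reading 1 sum:", pgood + pbad)
print("reading 2 sum:", p_nonspam_given_token + p_spam_given_token)
```

Under reading 2 the two values are complementary by construction, so
they always sum to 1.  Under reading 1 they are independent frequencies
and can sum to anything between 0 and 2.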

> Consider a token with a pgood score of 0.1.  That means it appears in 10% 
> of the good messages that have been registered.  Given that there are a 
> gazillion other messages in the world, your statement isn't quite right.

True, but I would think that anyone even roughly familiar with how
bogofilter works would understand that these numbers are based on
the messages that have been registered.

> Perhaps you can suggest a different wording ...

    "the likelihood (extrapolated from your registered non-spam
    messages) that a non-spam message contains this token"

> When bogofilter is in tri-state mode and classifies the message as Unsure, 
> I want to know "why".  The histogram provides information for that.  When 
> the message is clearly ham or spam, the histogram is much less 
> interesting.

I haven't used tri-state mode, so it's not obvious to me that the
histogram is so much more interesting for "unsure" that this should
be the default.  Further, by the time you run "bogofilter -vv",
don't you typically already know how the message has been classified
(because of the mailbox it's in)?  If so, you would only run
"bogofilter -vv" because you want to see "why".

Hmm...  I'm guessing the scenario you have in mind is where you're
testing some tweak against a corpus of messages.  In that case,
shouldn't you be the one who has to add an extra switch, instead of
ordinary users?

Anyway, what a new user sees is that the histogram is not always
printed, and it's not based just on whether the message is spam, so
it appears to be an "unreliable feature".  And looking in the man
page under "-v" doesn't help.

Andrew



