patch for bogofilter status line
David Relson
relson at osagesoftware.com
Sun Dec 1 14:44:20 CET 2002
At 12:42 AM 12/1/02, Graham Wilson wrote:
> > The revisions are working well enough in a test version of bogofilter that it
> > passes the regression tests - with the added feature of tristate
> > (Spam/Ham/Unsure) classifications from Robinson-Fisher.
>
>does that mean bogofilter will return a distinct value representing
>unsure?
>
> > I'm also contemplating what is likely a significant change. It would be
> > good for each of the three classifications (Spam, Ham, Unsure) to have its
> > own label (for example "Yes", "No", "Unsure" or "+", "-", "?" ) and
> > format.
>
>do you mean like this?
>
> %y Yes, No, or Unsure
> %l + - ?
The Robinson-Fisher algorithm produces spamicity scores a bit differently
than the Graham and Robinson algorithms do. With Graham and
Robinson, the scores can be anything from 0 to 1, with a "slightly spammier"
message having a slightly higher score than a "less spammy"
message. Given this characteristic, the spam_cutoff value tends to be
somewhat arbitrary (though "empirical" might be a better description), i.e.
use a value such that most spam is above it and most ham is below it.

With Robinson-Fisher, scores for spam tend to be very close to 1.0 and
scores for ham tend to be very close to 0.0. In between those areas is a
wide region where R-F is unsure of the result. Greg Louis, our
algorithm expert/experimenter/developer/tester, is currently using 0.96f for
his spam_cutoff and 0.10f for ham_cutoff. Anything between those two
values is in the "unsure" region.

In terms of the code, R-F returns RC_SPAM if score >= spam_cutoff. If
ham_cutoff is zero, everything else is scored as RC_HAM. If ham_cutoff is
non-zero, it returns RC_HAM if score <= ham_cutoff and returns RC_UNSURE
otherwise.
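
The rule just described can be sketched in a few lines of C. This is only an illustration, not the bogofilter source: the function name rf_classify, the rc_t typedef, and the numeric values of the RC_* codes are assumptions; only the RC_SPAM/RC_HAM/RC_UNSURE names and the comparisons come from the description above.

```c
/* Return codes named in the message; the numeric values are assumed. */
typedef enum { RC_SPAM = 0, RC_HAM = 1, RC_UNSURE = 2 } rc_t;

/* Hypothetical helper: tristate classification of a spamicity score.
 * A ham_cutoff of zero disables the "unsure" region entirely. */
rc_t rf_classify(double score, double spam_cutoff, double ham_cutoff)
{
    if (score >= spam_cutoff)
        return RC_SPAM;
    if (ham_cutoff == 0.0)
        return RC_HAM;               /* two-state mode: everything else is ham */
    return (score <= ham_cutoff) ? RC_HAM : RC_UNSURE;
}
```

With the cutoffs quoted above (0.96 and 0.10), a score of 0.50 would come back as RC_UNSURE, and the same score with ham_cutoff set to zero would come back as RC_HAM.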
> > The values would come from the config file.
>
>i saw something like this in your original emails, but opted to try to
>keep the number of config variables to a minimum. i think having
>everything in one string, as opposed to multiple config variables is a
>cleaner syntax.
True. Fewer config variables make life simpler. However, R-F values for
ham and spam tend to be _very_ close to 0 and 1, while the unsure values
can be all over the map. "Very close" means that a format like %0.6f will
usually show 0.000000 or 1.000000, so "%6.2f" is more informative. For
example, message 8 in the regression tests has a Graham score of 1.000000,
a Robinson score of 0.692182, and an R-F score of 1.000000 (using %0.6f) or
8.88e-16 (using delta, i.e. 1-score, and %6.2e).
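
The effect is easy to reproduce with plain snprintf(). The sketch below assumes the score is held in a double and uses 1.0 - 8.88e-16 as a stand-in for the message 8 R-F score; the helper names are made up for the illustration and are not taken from bogofilter.

```c
#include <stdio.h>
#include <string.h>

/* Format a score the plain way, with "%0.6f". */
int format_fixed(char *buf, size_t size, double score)
{
    return snprintf(buf, size, "%0.6f", score);
}

/* Format the distance from 1.0 (the "delta") with "%6.2e". */
int format_delta(char *buf, size_t size, double score)
{
    return snprintf(buf, size, "%6.2e", 1.0 - score);
}

/* Return nonzero when two scores print identically under "%0.6f". */
int same_when_fixed(double a, double b)
{
    char sa[32], sb[32];

    format_fixed(sa, sizeof sa, a);
    format_fixed(sb, sizeof sb, b);
    return strcmp(sa, sb) == 0;
}
```

On a platform with IEEE 754 doubles, same_when_fixed(1.0, 1.0 - 8.88e-16) is true because both print as 1.000000, while format_delta() on the second value yields 8.88e-16, so the delta form is the one that keeps the information.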
Given the difference in scores for R-F, I deemed it useful to have
different formats for the 3 states (ham/spam/unsure) and to have config
file options for specifying them. Someone who absolutely doesn't want
to have a config file can modify the source code to use his/her own values.
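
One way to picture per-state formats is a small table indexed by classification. Everything here is hypothetical: the struct, its field names, the compiled-in defaults, and format_status() are invented for the sketch, and in the proposal the values would come from the config file instead.

```c
#include <stdio.h>

typedef enum { RC_SPAM, RC_HAM, RC_UNSURE } rc_t;   /* values assumed */

/* Hypothetical per-state settings (label plus number format). */
struct state_format {
    const char *label;    /* e.g. "Yes", "No", "Unsure" */
    const char *number;   /* printf format applied to the value */
    int use_delta;        /* nonzero: print 1 - score instead of score */
};

/* Assumed defaults, chosen only for the illustration. */
static const struct state_format formats[] = {
    [RC_SPAM]   = { "Yes",    "%6.2e", 1 },
    [RC_HAM]    = { "No",     "%6.2e", 0 },
    [RC_UNSURE] = { "Unsure", "%6.2f", 0 },
};

/* Write "<label>, <value>" into buf; returns the length, or a negative
 * value if snprintf() reports an error. */
int format_status(char *buf, size_t size, rc_t rc, double score)
{
    const struct state_format *f = &formats[rc];
    double value = f->use_delta ? 1.0 - score : score;
    int n = snprintf(buf, size, "%s, ", f->label);

    if (n < 0 || (size_t)n >= size)
        return n;                    /* error, or no room for the number */
    int m = snprintf(buf + n, size - (size_t)n, f->number, value);
    return (m < 0) ? m : n + m;
}
```

Because snprintf() does the numeric conversion, each state can carry whatever width and precision suits its range of values without any extra parsing code.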
> > Letting snprintf() do the conversions means the formatter doesn't need
> > to handle all the flags and parameters (min, precision, zero, etc).
> > That might make the state machine unnecessary.
>
>my original design didnt include a state machine, but it turned out that
>when i added the state machine the code became cleaner and easier to
>expand.
>
>i originally added it so that the user could format the spamicity
>strings to his/her liking. it turns out that it is useful for other
>things also, such as the delta flag (#), and for only printing Y or N
>(%.1y).
>
>like i said, i feel the state machine makes the code cleaner and easier
>to expand for the future, so i would opt to keep. (but then again, i did
>write it :-)
>
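
For readers who have not seen format.c, the idea of a small state machine that handles %y, %l, and a precision such as %.1y, while still leaving the actual conversion to snprintf(), might look roughly like the sketch below. It is not Graham's code: the function name, the three states, and the label tables are assumptions, and the delta flag (#) and numeric directives are left out.

```c
#include <ctype.h>
#include <stdio.h>

typedef enum { RC_SPAM, RC_HAM, RC_UNSURE } rc_t;   /* values assumed */

static const char *const words[] = { "Yes", "No", "Unsure" };   /* %y */
static const char *const marks[] = { "+", "-", "?" };           /* %l */

/* Expand %y, %l and %% in fmt (optionally with a ".N" precision) into out.
 * Returns the number of characters written, not counting the NUL. */
size_t expand_status(char *out, size_t size, const char *fmt, rc_t rc)
{
    enum { TEXT, DIRECTIVE, PRECISION } state = TEXT;
    size_t used = 0;
    int prec = -1;                   /* negative means "no precision given" */

    if (size == 0)
        return 0;

    for (; *fmt != '\0' && used + 1 < size; fmt++) {
        char c = *fmt;

        if (state == TEXT) {
            if (c == '%') { state = DIRECTIVE; prec = -1; }
            else out[used++] = c;
        } else if (state == DIRECTIVE && c == '.') {
            state = PRECISION;
            prec = 0;
        } else if (state == PRECISION && isdigit((unsigned char)c)) {
            prec = prec * 10 + (c - '0');
        } else {
            /* Conversion character: pick the string, let snprintf() trim it.
             * Unknown conversions expand to nothing in this sketch. */
            const char *s = (c == 'y') ? words[rc]
                          : (c == 'l') ? marks[rc]
                          : (c == '%') ? "%" : "";
            int n = snprintf(out + used, size - used, "%.*s", prec, s);

            if (n > 0) {
                size_t room = size - used - 1;
                used += ((size_t)n < room) ? (size_t)n : room;
            }
            state = TEXT;
        }
    }
    out[used] = '\0';
    return used;
}
```

In this sketch "%.1y" expands to a single letter (Y for RC_SPAM) and "%y%l" for RC_HAM expands to "No-"; the state machine only tracks where it is inside a directive, and the trimming itself is done by the "%.*s" conversion.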
> > To preserve your work and in case of future need, format.c and
> > format.h would be committed to CVS as they are now, and then the
> > enhancements would be made.
>
>sounds good.
>
>when are you going to do the cvs commit?
today.
> > So a lot of good stuff has happened and there is more to come. Thank you
> > very much for doing a great job and putting the formatting framework
> > together.
>
>no problem. let me know if any other small additions like that are
>needed. i would be glad to help.
The offer is appreciated, as is the work you have done.
Thank you.
David