patch for bogofilter status line

David Relson relson at osagesoftware.com
Sun Dec 1 14:44:20 CET 2002


At 12:42 AM 12/1/02, Graham Wilson wrote:

> > The revisions are working well enough in a test version of bogofilter that it
> > passes the regression tests - with the added feature of tristate
> > (Spam/Ham/Unsure) classifications from Robinson-Fisher.
>
>does that mean bogofilter will return a distinct value representing
>unsure?
>
> > I'm also contemplating what is likely a significant change.  It would be
> > good for each of the three classifications (Spam, Ham, Unsure) to have its
> > own label (for example "Yes", "No", "Unsure" or "+", "-", "?" ) and
> > format.
>
>do you mean like this?
>
>  %y  Yes, No, or Unsure
>  %l  + - ?

The Robinson-Fisher algorithm produces spamicity scores a bit differently 
than either the Graham or the Robinson algorithm.  With Graham and 
Robinson, the scores can be anything from 0 to 1, with a "slightly spammier" 
message having a slightly higher score than a "less spammy" 
message.  Given this characteristic, the spam_cutoff value tends to be 
somewhat arbitrary (though "empirical" might be a better description), i.e. 
one picks a value such that most spam is above it and most ham is below it.

With Robinson-Fisher, scores for spam tend to be very close to 1.0 and 
scores for ham tend to be very close to 0.0.  Between those two areas is a 
wide region where R-F is unsure of the result.  Greg Louis, our 
algorithm expert/experimenter/developer/tester, is currently using 0.96f for 
his spam_cutoff and 0.10f for ham_cutoff.  Anything between those two 
values is in the "unsure" region.

In terms of the code, R-F returns RC_SPAM if score >= spam_cutoff.  If 
ham_cutoff is zero, everything else is scored as RC_HAM.  If ham_cutoff is 
non-zero, it returns RC_HAM if score <= ham_cutoff and returns RC_UNSURE 
otherwise.
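
The rule just described can be sketched in C as follows.  This is only an 
illustration of the logic, not the code in bogofilter itself: the function 
name classify(), the rc_t type, and the numeric values of the return codes 
are assumptions made for the example; only the names RC_SPAM, RC_HAM, 
RC_UNSURE, spam_cutoff and ham_cutoff come from the description above.

```c
#include <assert.h>

/* Return codes named as in the message; the numeric values are assumed. */
typedef enum { RC_SPAM = 0, RC_HAM = 1, RC_UNSURE = 2 } rc_t;

/*
 * Hypothetical helper: map a Robinson-Fisher score to one of three classes.
 * A ham_cutoff of zero disables the "unsure" class, so every message that
 * is not spam is reported as ham.
 */
static rc_t classify(double score, double spam_cutoff, double ham_cutoff)
{
    /* The cutoffs only make sense if the ham one is not above the spam one. */
    assert(ham_cutoff <= spam_cutoff);

    if (score >= spam_cutoff)
        return RC_SPAM;
    if (ham_cutoff == 0.0)
        return RC_HAM;          /* two-state mode: no "unsure" region */
    if (score <= ham_cutoff)
        return RC_HAM;
    return RC_UNSURE;           /* between ham_cutoff and spam_cutoff */
}
```

With the cutoffs quoted above (0.96 and 0.10), a score of 0.5 would be 
classified as RC_UNSURE, while the same score with ham_cutoff set to zero 
would be classified as RC_HAM.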


> > The values would come from the config file.
>
>i saw something like this in your original emails, but opted to try to
>keep the number of config variables to a minimum. i think having
>everything in one string, as opposed to multiple config variables is a
>cleaner syntax.

True.  Fewer config variables make life simpler.  However, R-F values for 
ham and spam tend to be _very_ close to 0 and 1, while the unsure values 
can be all over the map.  "Very close" means that a format like %0.6f will 
usually show 0.000000 or 1.000000, so "%6.2e" is more informative.  For 
example, message 8 in the regression tests has a Graham score of 1.000000, 
a Robinson score of 0.692182, and an R-F score of 1.000000 (using %0.6f) or 
8.88e-16 (using delta, i.e. 1-score, and %6.2e).
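
To make the difference between the two formats concrete, here is a minimal 
sketch using snprintf().  The helper names format_score() and 
format_delta() are hypothetical and are not part of format.c; they only 
show how the same score prints under %0.6f and how its delta (1 - score) 
prints under %6.2e.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: print the raw score with six decimal places. */
static int format_score(char *buf, size_t len, double score)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%0.6f", score);
}

/* Hypothetical helper: print the distance from 1.0 in exponent notation. */
static int format_delta(char *buf, size_t len, double score)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%6.2e", 1.0 - score);
}
```

For a score that differs from 1.0 by about 8.88e-16, format_score() fills 
the buffer with "1.000000", which is indistinguishable from a perfect 1.0, 
while format_delta() yields "8.88e-16" and so preserves the information.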

Given the difference in scores for R-F, I deemed it useful to have 
different formats for the 3 states (ham/spam/unsure) and to have config 
file options for specifying them.  For someone who absolutely doesn't want 
to have a config file, he/she can modify the source code to use his/her values.

> > Letting snprintf() do the conversions means the formatter doesn't need
> > to handle all the flags and parameters (min, precision, zero, etc).
> > That might make the state machine unnecessary.
>
>my original design didnt include a state machine, but it turned out that
>when i added the state machine the code became cleaner and easier to
>expand.
>
>i originally added it so that the user could format the spamicity
>strings to his/her liking. it turns out that it is useful for other
>things also, such as the delta flag (#), and for only printing Y or N
>(%.1y).
>
>like i said, i feel the state machine makes the code cleaner and easier
>to expand for the future, so i would opt to keep. (but then again, i did
>write it :-)
>
> > To preserve your work and in case of future need, format.c and
> > format.h would be committed to CVS as they are now, and then the
> > enhancements would be made.
>
>sounds good.
>
>when are you going to do the cvs commit?

today.


> > So a lot of good stuff has happened and there is more to come.  Thank you
> > very much for doing a great job and putting the formatting framework
> > together.
>
>no problem. let me know if any other small additions like that are
>needed. i would be glad to help.

The offer is appreciated, as is the work you have done.

Thank you.

David




