robinson-fisher - two states vs three states

Tue Jan 21 14:08:04 CET 2003

Good morning,

I've just read all of the late night commentary on the 
desirability/undesirability of the "unsure" classification.

I think most of you know about the "spam_cutoff" parameter used by 
bogofilter.  For those who don't, it's a number in the range of 0.0 to 
1.0.  After computing a message's spamicity score, bogofilter compares the 
score to the value of spam_cutoff.  If the score is greater than or equal to 
spam_cutoff, the message is spam and the "X-Bogosity" line will say 
so.  Your filters can look at that line in the header and take the action 
you want.

The Robinson-Fisher algorithm has some additional capability.  It can be 
configured to check a second parameter, named "ham_cutoff" and divide the 
remaining messages, i.e. those classified as non-spam using spam_cutoff, 
into two groups.  If the score is less than ham_cutoff, the program is 
certain that the message is not-spam.  For messages with scores between 
these two cutoffs there is insufficient information for bogofilter to be 
sure whether it is ham or spam.  These messages get the "unsure" label.
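
For those who prefer to see the two cutoffs as code, here is a minimal 
Python sketch of the decision described above.  It is an illustration, not 
bogofilter's actual source: the function name and the example cutoff values 
(0.95 and 0.10) are assumptions, and the handling of a score exactly equal 
to ham_cutoff is a guess.  The rule that a ham_cutoff of 0 switches off the 
"unsure" band is the behaviour described for that setting in this message.

```python
def classify(score, spam_cutoff, ham_cutoff=0.0):
    """Return the verdict word for a spamicity score in the range 0.0 to 1.0.

    Hypothetical helper; it mirrors the description in this message only.
    """
    # At or above spam_cutoff the message is spam.
    if score >= spam_cutoff:
        return "Yes"
    # A ham_cutoff of 0 disables the "unsure" band, so every remaining
    # message is reported as non-spam.
    if ham_cutoff == 0.0:
        return "No"
    # Below ham_cutoff the message is confidently non-spam.
    if score < ham_cutoff:
        return "No"
    # Otherwise the score lies between the two cutoffs.
    return "Unsure"


if __name__ == "__main__":
    # Example cutoffs are illustrative assumptions, not recommended values.
    for score in (0.99, 0.50, 0.05):
        print(score, classify(score, spam_cutoff=0.95, ham_cutoff=0.10))
```

Anyone who only tests for "Yes" gets the same answer whether or not 
ham_cutoff is set, since the first comparison does not depend on it.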

What does this mean for people using the Robinson-Fisher algorithm?

First, they can filter on "X-Bogosity: Yes".  If matched, bogofilter has 
classified the message as spam.  Period.  For those wanting a binary 
classification, this is all that they need to check.  It's no different 
than before.

Second, there are the people who autotrain using the '-u' (update) 
flag.  If bogofilter isn't sure whether the message is ham or spam, it 
won't automatically add it to a wordlist.
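
The effect on autotraining can be sketched roughly as follows.  This is an 
assumption-laden illustration in Python, not how bogofilter implements 
'-u': the function name and the "spam"/"ham" wordlist labels are 
hypothetical.

```python
def autotrain_target(verdict):
    """Return the wordlist an automatic update would use, or None to skip.

    Hypothetical helper: "spam" and "ham" are placeholder wordlist names.
    """
    targets = {"Yes": "spam", "No": "ham"}
    # "Unsure" (or anything unexpected) is absent from the table, so no
    # wordlist is updated for that message.
    return targets.get(verdict)


if __name__ == "__main__":
    for verdict in ("Yes", "No", "Unsure"):
        print(verdict, "->", autotrain_target(verdict))
```

The point of the sketch is only that an "unsure" message is left out of 
automatic training, so a doubtful guess is not reinforced.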

Third, there are the people who "train on errors".  They check whether 
bogofilter has correctly classified each message and, for those messages 
where the person and the program disagree, let the program know about the 
discrepancy.  Using a tristate configuration, I've found that the 
Robinson-Fisher ham and spam classifications are very, very accurate.  (The 
only way I know to fool it is to send a message like "look at this 
interesting spam message" and include the spam in-line.)  Bogofilter's 
"unsure" classification is a signal to the human that the message needs 
human judgement.  This is exactly the kind of tag that is wanted by a 
"train on errors" person.

To summarize, bogofilter's ability to classify as ham, spam, or unsure is 
an enhancement that won't have a negative impact on how you deal with 
spam.  If all you want to know is spam or not, simply filter on 
"X-Bogosity: Yes".  The filter won't care if the final word is "No" or 
"Unsure" because all it's interested in is "Yes".  On the other hand, if 
you don't want to check every message for classification accuracy, 
bogofilter lets you know which messages it's unsure about.  You can filter 
on this and handle those messages and ignore the ham messages.
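
As a rough sketch of such a filter, the Python below reads the 
"X-Bogosity" header and picks a destination.  The folder names ("spam", 
"review", "inbox") and the function name are assumptions for illustration, 
and the code assumes only that the header value begins with the verdict 
word; anything after it is ignored.

```python
from email import message_from_string


def route(raw_message):
    """Pick a destination folder from the X-Bogosity header.

    Hypothetical helper; the folder names are placeholders.
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity", "")
    # Keep only the first word of the header value and drop any
    # punctuation that follows it.
    words = header.replace(",", " ").split()
    verdict = words[0].lower() if words else ""
    if verdict == "yes":
        return "spam"
    if verdict == "unsure":
        return "review"
    # "No", a missing header, or anything unexpected is treated as ham.
    return "inbox"


if __name__ == "__main__":
    sample = "From: someone@example.com\nX-Bogosity: Unsure\n\nHello\n"
    print(route(sample))
```

A binary setup would simply drop the "unsure" branch, which is why the 
extra label costs nothing for people who only test for "Yes".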

Lastly, for those who really, really don't ever want to see a message 
classified as "unsure", you can set the value of ham_cutoff to 0 (and 
bogofilter will only say "Yes" or "No") or you can use the Robinson 
algorithm (via the "-r" command line switch or "algorithm=robinson" in your 
config file).

Bogofilter will continue to support the older algorithms.  This 
conversation is about improvement of spam classification by changing the 
default algorithm, not about discontinuing the older algorithms.

I hope that I've helped shed light on this subject and that I haven't bored 
you all to death.

David

That d