robinson-fisher - two states vs three states

Tue Jan 21 21:00:49 CET 2003

At 08:08 AM 2003-01-21 -0500, David Relson wrote:

>Good morning,
>
>I've just read all of the late night commentary on the 
>desirability/undesirability of the "unsure" classification.
>
>I think most of you know about the "spam_cutoff" parameter used by 
>bogofilter.  For those who don't, it's a number in the range of 0.0 to 
>1.0.  After computing a message's spamicity score, bogofilter compares the 
>score to the value of spam_cutoff.  If score is greater than or equal to 
>spam_cutoff, the message is spam and the "X-Bogosity" line will say 
>so.  Your filters can look at that line in the header and take the action 
>you want.
>
>The Robinson-Fisher algorithm has some additional capability.  It can be 
>configured to check a second parameter, named "ham_cutoff" and divide the 
>remaining messages, i.e. those classified as non-spam using spam_cutoff, 
>into two groups.  If the score is less than ham_cutoff, the program is 
>certain that the message is not-spam.  For messages with scores between 
>these two cutoffs there is insufficient information for bogofilter to be 
>sure whether it is ham or spam.  These messages get the "unsure" label.
>
>What does this mean for people using the Robinson-Fisher algorithm?
>
>First, they can filter on "X-Bogosity: Yes".  If matched, bogofilter has 
>classified the message as spam.  Period.  For those wanting a binary 
>classification, this is all that they need to check.  It's no different 
>than before.

OK.  So you suggest delivering all unsure messages.  Fine.
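
For reference, the two-cutoff scheme described in the quoted text can be
sketched roughly as follows.  This is only an illustration in Python, not
bogofilter's actual code: the function names and the cutoff values 0.95
and 0.10 are made up for the example, and treating a ham_cutoff of 0 as
"two states only" follows the quoted description further down rather than
anything checked against the source.

```python
def classify(score, spam_cutoff=0.95, ham_cutoff=0.10):
    """Return the word that would follow "X-Bogosity:" for a spamicity score.

    The cutoff defaults are illustrative only, not bogofilter's defaults.
    """
    if score >= spam_cutoff:
        return "Yes"      # spam
    # Assumption: ham_cutoff == 0 switches the third state off.
    if ham_cutoff == 0 or score < ham_cutoff:
        return "No"       # ham
    return "Unsure"       # between the two cutoffs


def deliver(label):
    """A filter that only matches "X-Bogosity: Yes" delivers everything else."""
    return label != "Yes"


for s in (0.02, 0.50, 0.99):
    label = classify(s)
    print(f"{s:.2f} X-Bogosity: {label:<6} deliver={deliver(label)}")
```

Note that "No" and "Unsure" take exactly the same path through deliver().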

>Second, there are the people that autotrain using the '-u' (update) 
>flag.  If bogofilter isn't sure whether the message is ham or spam, it 
>won't automatically add it to a wordlist.

Which means that when I do determine whether it is ham or spam, I now have 
the added complexity of adding some of the delivered messages to the 
wordlists if I still want to train on all messages.  That will probably 
have to be done synchronously.

>Third, there are the people who "train on errors".  They check whether 
>bogofilter has correctly classified each message and, for those messages 
>where the person and the program disagree, let the program know about the 
>discrepancy.

Which now means that this is at least twice as complex as before.  Before, 
if a message was delivered, and it was incorrectly classified, you needed 
to reclassify the words in it with, say, the S classification to move the 
words.  Now you have not only the possibility of needing to -S the words in 
a misdelivered spam, you might also want to -s the message.  It depends, 
completely, on whether the message was previously classified as spam or 
unknown.  So you have doubled the retraining complexity.
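
To make the doubling concrete, here is a rough sketch of the decision that
now has to be made for each delivered message.  It assumes that
autotraining with -u registered "No" messages as ham and left "Unsure"
messages out of the wordlists, which is one reading of the description
above.  -S and -s are used as described in this message; -n (register as
ham) for the remaining case is an assumption, and the function name is
hypothetical.

```python
def correction_flag(label, actually_spam):
    """Pick the retraining flag for a message that was delivered.

    label: "No" (already autotrained as ham) or "Unsure" (not trained at all).
    Returns None when the wordlists are already right.
    """
    if label == "No":
        # Words sit in the ham list; a spam has to be moved across.
        return "-S" if actually_spam else None
    if label == "Unsure":
        # Words are in neither list; they only need to be added.
        return "-s" if actually_spam else "-n"
    raise ValueError(f"not a delivered label: {label!r}")


for label in ("No", "Unsure"):
    for actually_spam in (True, False):
        print(label, actually_spam, correction_flag(label, actually_spam))
```

With only two states the first branch is the whole function.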

Further, you might have the possibility that a message was classified as 
spam and it is actually ham.  Before, that was always a -N.  Now it might 
be a -u.  Again, twice as complex.

>  Using a tristate configuration, I've found that the Robinson-Fisher ham 
> and spam classifications are very, very accurate.

That is not the question.  The question is, "Is the delivery any more 
accurate?"  Every misdelivered message is (1) an irritation, (2) potential 
lost data, and (3) something that requires manual action.

In other words, I really do not care if a message is 100% likely to 
actually be spam when declared spam, and 100% likely to be non-spam when 
declared nonspam.  If 25% of the messages are declared "unknown" and I have 
made the decision to deliver them, and half of them are spam, then I have a 
12.5% failure rate.
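
The arithmetic, spelled out as a small Python sketch.  The 25% and 50%
figures are the hypothetical numbers from the paragraph above, not
measurements, and the function name is made up for the example.

```python
def misdelivery_rate(unsure_fraction, spam_share_of_unsure):
    """Fraction of all mail that is spam delivered via the "unsure" label.

    Assumes, as in the text, that "Yes" and "No" are never wrong and
    that every unsure message is delivered.
    """
    return unsure_fraction * spam_share_of_unsure


rate = misdelivery_rate(0.25, 0.5)
print(f"{rate:.1%}")   # 12.5%
```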

There really are only two choices:  Deliver or don't.  Any failure to get 
that choice right is what matters, no matter how I have labeled the thing I 
have decided to deliver.  And putting multiple labels on something I am 
going to deliver anyway just makes the recovery path more complex when it 
turns out to be spam, because that is the recovery we are talking about 
here:  You delivered spam and now you have to reclassify it as spam that 
should not be delivered.

So, if you can't tell me what I get in exchange for my more complex 
recovery path, I would still vote that the simpler algorithm should be the 
default:  Either Robinson, or Robinson-Fisher with no middle ground.

>  (The only way I know to fool it is to send a message like "look at this 
> interesting spam message" and include the spam in-line.)  Bogofilter's 
> "unsure" classification is a signal to the human that the message needs 
> human judgement.  This is exactly the kind of tag that is wanted by a 
> "train on errors" person.
>
>To summarize, bogofilter's ability to classify as ham, spam, or unsure is 
>an enhancement that won't have a negative impact on how you deal with 
>spam.  If all you want to know is spam or not, simply filter on 
>"X-Bogosity: Yes".  The filter won't care if the final word is "No" or 
>"Unsure" because all it's interested in is "Yes".  On the other hand, if 
>you don't want to check every message for classification accuracy, 
>bogofilter lets you know which messages it's unsure about.  You can filter 
>on this and handle those messages and ignore the ham messages.
>
>Lastly, for those who really, really don't ever want to see a message 
>classified as "unsure", you can set the value of ham_cutoff to 0 (and 
>bogofilter will only say "Yes" or "No") or you can use the Robinson 
>algorithm (via the "-r" command line switch or "algorithm=robinson" in 
>your config file).

And if you can't tell me what I am gaining by having delivered mail classed 
as unsure, then I would suggest it is a meaningless complication, one that 
should be reserved for people who want to set it.  As you point out, you 
still deliver R-F; you just set the default discriminators the same.

>Bogofilter will continue to support the older algorithms.  This 
>conversation is about improvement of spam classification by changing the 
>default algorithm, not about discontinuing the older algorithms.

Spam classification is meaningless.  You are going to deliver, or you are 
not going to deliver.  If bogofilter adds a "tempfail" return code and a 
way to manage its own queues, then I will agree that this third state might 
have a meaning.  And I would want to support anyone else who wanted 
this.  But if the mail is in my inbox, and it is spam then I want to do 
something simple to get the words into the right list so that subsequent 
filtering is improved.

>I hope that I've helped shed light on this subject and that I haven't 
>bored you all to death.

On everything except the value of the "unknown" classification.

With Graham, I can look at the number and get a very quick feel for how 
close it is.  Sometimes I am interested, but I really do not care beyond 
that.  I have actually considered moving the cutoff a couple of points, but 
now I *think*, pending advice, that I want to switch to Robinson first, or 
instead.  But I still do not see what "unsure" does.  I do not have a 
Schrödinger's cat delivery agent, where the mail can be both delivered and 
not delivered, waiting on someone to observe the delivery state. (OK, that 
is what happens when the UPS guy comes by but does not ring the bell, I 
guess, but it does not apply to e-mail).

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc.  But if it is not all three of Unsolicited,
Bulk, and E-mail, it simply is not spam. Misusing the term plays into the
hands of the spammers, since it causes confusion, and spammers thrive on
confusion.  If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!