robinson-fisher - two states vs three states
Nick Simicich
njs at scifi.squawk.com
Tue Jan 21 21:00:49 CET 2003
At 08:08 AM 2003-01-21 -0500, David Relson wrote:
>Good morning,
>
>I've just read all of the late night commentary on the
>desirability/undesirability of the "unsure" classification.
>
>I think most of you know about the "spam_cutoff" parameter used by
>bogofilter. For those who don't, it's a number in the range of 0.0 to
>1.0. After computing a message's spamicity score, bogofilter compares the
>score to the value of spam_cutoff. If score is greater than or equal to
>spam_cutoff, the message is spam and the "X-Bogosity" line will say
>so. Your filters can look at that line in the header and take the action
>you want.
>
>The Robinson-Fisher algorithm has some additional capability. It can be
>configured to check a second parameter, named "ham_cutoff" and divide the
>remaining messages, i.e. those classified as non-spam using spam_cutoff,
>into two groups. If the score is less than ham_cutoff, the program is
>certain that the message is not-spam. For messages with scores between
>these two cutoffs there is insufficient information for bogofilter to be
>sure whether it is ham or spam. These messages get the "unsure" label.
>
>What does this mean for people using the Robinson-Fisher algorithm?
>
>First, they can filter on "X-Bogosity: Yes". If matched, bogofilter has
>classified the message as spam. Period. For those wanting a binary
>classification, this is all that they need to check. It's no different
>than before.
OK. So you suggest delivering all unsure messages. Fine.
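For reference, the two-cutoff scheme described above comes down to a pair of threshold comparisons. What follows is a minimal sketch in Python, not bogofilter's actual code: the function names and the example cutoff values are made up for illustration, and treating a ham_cutoff of 0 as "tri-state off" simply follows David's description later in his message.

```python
def classify(score, spam_cutoff, ham_cutoff=0.0):
    """Return an X-Bogosity style label for a spamicity score in [0.0, 1.0].

    Hypothetical helper for illustration; not part of bogofilter.
    """
    if score >= spam_cutoff:
        return "Yes"                      # spam
    if ham_cutoff > 0.0 and score >= ham_cutoff:
        return "Unsure"                   # between the two cutoffs
    return "No"                           # below ham_cutoff, or tri-state disabled


def deliver(label):
    """The binary decision: only a "Yes" is kept out of the inbox."""
    return label != "Yes"


# Illustrative cutoffs only; these are not bogofilter's defaults.
for score in (0.95, 0.50, 0.05):
    label = classify(score, spam_cutoff=0.9, ham_cutoff=0.1)
    print(score, label, "deliver" if deliver(label) else "hold")
```

Note that deliver() gives the same answer for "No" and "Unsure": the third label does not change what ends up in the inbox.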
>Second, there are the people that autotrain using the '-u' (update)
>flag. If bogofilter isn't sure whether the message is ham or spam, it
>won't automatically add it to a wordlist.
Which means that now when I do determine whether it is ham or spam, I have
the added complexity of having to add some delivered messages to the
wordlists if I still want to train on all messages. This is likely to have
to be done synchronously.
>Third, there are the people who "train on errors". They check whether
>bogofilter has correctly classified each message and, for those messages
>where the person and the program disagree, let the program know about the
>discrepancy.
Which now means that this is at least twice as complex as before. Before,
if a message was delivered, and it was incorrectly classified, you needed
to reclassify the words in it with, say, the S classification to move the
words. Now you have not only the possibility of needing to -S the words in
a misdelivered spam, you might also want to -s the message. It depends,
completely, on whether the message was previously classified as spam or
unknown. So you have doubled the retraining complexity.
Further, you might have the possibility that a message was classified as
spam and it is actually ham. Before, that was always a -N. Now it might
be a -u. Again, twice as complex.
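To make the extra branching concrete, here is a rough sketch of the decision a retraining script would have to make. It assumes autotraining in which "Yes" and "No" messages are registered automatically and "Unsure" ones are not, as described above. The function name and the action strings are hypothetical, and bogofilter's flags are deliberately left out; the move/add split mirrors the -S versus -s cases.

```python
def correction(previous_label, actually_spam):
    """Pick the wordlist fix for a message once a human has judged it.

    Hypothetical helper: previous_label is "Yes", "No" or "Unsure".
    Assumes "Yes"/"No" messages were auto-registered and "Unsure" ones were not.
    """
    if previous_label == "Unsure":
        # Never registered anywhere, so the words only need adding.
        return "add to spam list" if actually_spam else "add to ham list"
    was_spam = previous_label == "Yes"
    if was_spam == actually_spam:
        return "nothing to do"            # classified (and registered) correctly
    # Registered in the wrong list, so the words have to be moved.
    return "move ham -> spam" if actually_spam else "move spam -> ham"


print(correction("No", True))       # delivered as ham, really spam
print(correction("Unsure", True))   # delivered as unsure, really spam
```

With only two labels the "Unsure" branch disappears, and the fix depends only on what the message turned out to be.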
> Using a tristate configuration, I've found that the Robinson-Fisher ham
> and spam classifications are very, very accurate.
That is not the question. The question is, "Is the delivery any more
accurate?" Every misdelivered message is (1) an irritation, (2) potential
lost data, and (3) something that requires manual action.
In other words, I really do not care if a message is 100% likely to
actually be spam when declared spam, and 100% likely to be non-spam when
declared nonspam. If 25% of the messages are declared "unknown" and I have
made the decision to deliver them, and half of them are spam, then I have a
12.5% failure rate.
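The arithmetic, spelled out as a small Python sketch. The 25% and 50% figures are the hypothetical numbers from the paragraph above, not measurements, and the function name is made up for illustration.

```python
def misdelivery_rate(unsure_fraction, spam_share_of_unsure):
    """Fraction of all mail that is spam delivered via the "unsure" label.

    Assumes every unsure message is delivered, and that the "spam" and
    "ham" labels themselves are never wrong.
    """
    return unsure_fraction * spam_share_of_unsure


# 25% of all messages are unsure, and half of those are spam.
print(misdelivery_rate(0.25, 0.5))  # 0.125, i.e. 12.5% of all mail
```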
There really are only two choices: Deliver or don't. Any failure to get
that choice right is what matters, no matter how I have labeled the thing I
have decided to deliver. And putting multiple labels on something I am
going to deliver anyway just makes the recovery path more complex when it
turns out to be spam, because that is the recovery we are talking about
here: You delivered spam and now you have to reclassify it as spam that
should not be delivered.
So, if you can't tell me what I get in exchange for my more complex
recovery path, I would still vote that the simpler algorithm should be the
default: either Robinson, or Robinson-Fisher with no middle
ground.
> (The only way I know to fool it is to send a message like "look at this
> interesting spam message" and include the spam in-line.) Bogofilter's
> "unsure" classification is a signal to the human that the message needs
> human judgement. This is exactly the kind of tag that is wanted by a
> "train on errors" person.
>
>To summarize, bogofilter's ability to classify as ham, spam, or unsure is
>an enhancement that won't have a negative impact on how you deal with
>spam. If all you want to know is spam or not, simply filter on
>"X-Bogosity: Yes". The filter won't care if the final word is "No" or
>"Unsure" because all it's interested in is "Yes". On the other hand, if
>you don't want to check every message for classification accuracy,
>bogofilter lets you know which messages it's unsure about. You can filter
>on this and handle those messages and ignore the ham messages.
>
>Lastly, for those who really, really don't ever want to see a message
>classified as "unsure", you can set the value of ham_cutoff to 0 (and
>bogofilter will only say "Yes" or "No") or you can use the Robinson
>algorithm (via the "-r" command line switch or "algorithm=robinson" in
>your config file).
And if you can't tell me what I am gaining by having mail delivered that is
being classed as unsure, then I would suggest it is a meaningless
complication, one that should be reserved for people who want to set
it. As you point out, you still deliver R-F, you just set the default
discriminators the same.
>Bogofilter will continue to support the older algorithms. This
>conversation is about improvement of spam classification by changing the
>default algorithm, not about discontinuing the older algorithms.
Spam classification is meaningless. You are going to deliver, or you are
not going to deliver. If bogofilter adds a "tempfail" return code and a
way to manage its own queues, then I will agree that this third state might
have a meaning. And I would want to support anyone else who wanted
this. But if the mail is in my inbox, and it is spam then I want to do
something simple to get the words into the right list so that subsequent
filtering is improved.
>I hope that I've helped shed light on this subject and that I haven't
>bored you all to death.
On everything except the value of the "unknown" classification.
With Graham, I can look at the number and get a very quick feel for how
close it is. Sometimes I am interested, but I really do not care beyond
that. I have actually considered moving the cutoff a couple of points, but
now I *think*, pending advice, that I want to switch to Robinson first, or
instead. But I still do not see what "unsure" does. I do not have a
Schrödinger's Cat delivery agent, where the mail can both be delivered and
not delivered, waiting on someone to observe the delivery state. (OK, that
is what happens when the UPS guy comes by but does not ring the bell, I
guess--but it does not apply to e-mail).
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of Unsolicited,
Bulk, and E-mail, it simply is not spam. Misusing the term plays into the
hands of the spammers, since it causes confusion, and spammers thrive on
confusion. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!