robinson-fisher - two states vs three states
David Relson
relson at osagesoftware.com
Tue Jan 21 14:08:04 CET 2003
Good morning,
I've just read all of the late night commentary on the
desirability/undesirability of the "unsure" classification.
I think most of you know about the "spam_cutoff" parameter used by
bogofilter. For those who don't, it's a number in the range of 0.0 to
1.0. After computing a message's spamicity score, bogofilter compares the
score to the value of spam_cutoff. If the score is greater than or equal to
spam_cutoff, the message is spam and the "X-Bogosity" line will say
so. Your filters can look at that line in the header and take the action
you want.
The Robinson-Fisher algorithm has some additional capability. It can be
configured to check a second parameter, named "ham_cutoff" and divide the
remaining messages, i.e. those classified as non-spam using spam_cutoff,
into two groups. If the score is less than ham_cutoff, the program is
certain that the message is not-spam. For messages with scores between
these two cutoffs there is insufficient information for bogofilter to be
sure whether it is ham or spam. These messages get the "unsure" label.
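
The two-cutoff rule described above can be sketched in a few lines of
Python. This is only an illustration of the decision logic, not bogofilter's
actual code. The function name and the example cutoffs (0.9 and 0.1) are
made up for the sketch. A score exactly equal to ham_cutoff is treated as
"Unsure", which is my reading of "less than ham_cutoff", and a ham_cutoff of
0 is treated as switching the unsure band off.

```python
def classify(score: float, spam_cutoff: float, ham_cutoff: float = 0.0) -> str:
    """Return "Yes", "No", or "Unsure" for a spamicity score.

    Illustrative sketch only; names and boundary handling are assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in the range 0.0 to 1.0")
    # First test: at or above spam_cutoff means spam.
    if score >= spam_cutoff:
        return "Yes"
    # ham_cutoff of 0 disables the third state, so everything else is ham.
    if ham_cutoff == 0.0 or score < ham_cutoff:
        return "No"
    # Scores between the two cutoffs carry too little information.
    return "Unsure"


if __name__ == "__main__":
    # Hypothetical cutoffs, chosen only for the demonstration.
    for s in (0.95, 0.50, 0.05):
        print(s, classify(s, spam_cutoff=0.9, ham_cutoff=0.1))
```

With ham_cutoff left at 0 the same function gives the old two-state
behaviour, so a binary setup needs no special handling.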
What does this mean for people using the Robinson-Fisher algorithm?
First, they can filter on "X-Bogosity: Yes". If matched, bogofilter has
classified the message as spam. Period. For those wanting a binary
classification, this is all that they need to check. It's no different
than before.
Second, there are the people that autotrain using the '-u' (update)
flag. If bogofilter isn't sure whether the message is ham or spam, it
won't automatically add it to a wordlist.
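
As a rough sketch of that autotrain behaviour, assuming the wordlists are
plain token counters (a stand-in for illustration, not bogofilter's real
storage) and that the function and variable names are hypothetical:

```python
from collections import Counter


def autotrain(verdict: str, tokens: list[str],
              spam_words: Counter, ham_words: Counter) -> bool:
    """Update a wordlist only when the verdict is definite.

    Returns True if a wordlist was updated, False for "Unsure".
    Illustrative only; the Counter wordlists are an assumption.
    """
    if verdict == "Yes":
        spam_words.update(tokens)
        return True
    if verdict == "No":
        ham_words.update(tokens)
        return True
    # "Unsure": leave both wordlists untouched.
    return False


if __name__ == "__main__":
    spam_words, ham_words = Counter(), Counter()
    autotrain("Yes", ["cheap", "pills"], spam_words, ham_words)
    autotrain("Unsure", ["meeting", "pills"], spam_words, ham_words)
    print(spam_words, ham_words)
```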
Third, there are the people who "train on errors". They check whether
bogofilter has correctly classified each message and, for those messages
where the person and the program disagree, let the program know about the
discrepancy. Using a tristate configuration, I've found that the
Robinson-Fisher ham and spam classifications are very, very accurate. (The
only way I know to fool it is to send a message like "look at this
interesting spam message" and include the spam in-line.) Bogofilter's
"unsure" classification is a signal to the human that the message needs
human judgement. This is exactly the kind of tag that is wanted by a
"train on errors" person.
To summarize, bogofilter's ability to classify as ham, spam, or unsure is
an enhancement that won't have a negative impact on how you deal with
spam. If all you want to know is spam or not, simply filter on
"X-Bogosity: Yes". The filter won't care if the final word is "No" or
"Unsure" because all it's interested in is "Yes". On the other hand, if
you don't want to check every message for classification accuracy,
bogofilter lets you know which messages it's unsure about. You can filter
on "Unsure", review just those messages, and ignore the ham messages.
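
For anyone who filters with a script rather than a mail filter rule, the
header check might look something like the following Python sketch. It
assumes the X-Bogosity value begins with the verdict word ("Yes", "No", or
"Unsure"). The function name and the folder names are hypothetical.

```python
from email import message_from_string


def route(raw_message: str) -> str:
    """Pick a destination folder from the X-Bogosity header.

    Illustrative sketch; folder names are invented for the example.
    """
    msg = message_from_string(raw_message)
    verdict = (msg.get("X-Bogosity") or "").strip().lower()
    if verdict.startswith("yes"):
        return "spam"
    if verdict.startswith("unsure"):
        return "review"
    # "No", or no header at all, is delivered normally.
    return "inbox"


if __name__ == "__main__":
    sample = "Subject: hello\nX-Bogosity: Unsure\n\nbody text\n"
    print(route(sample))
```

A binary user would simply drop the "unsure" branch. "No" and "Unsure" then
both fall through to normal delivery, which is the point made above.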
Lastly, for those who really, really don't ever want to see a message
classified as "unsure", you can set the value of ham_cutoff to 0 (and
bogofilter will only say "Yes" or "No") or you can use the Robinson
algorithm (via the "-r" command line switch or "algorithm=robinson" in your
config file).
Bogofilter will continue to support the older algorithms. This
conversation is about improvement of spam classification by changing the
default algorithm, not about discontinuing the older algorithms.
I hope that I've helped shed light on this subject and that I haven't bored
you all to death.
David