help with usage
David Relson
relson at osagesoftware.com
Thu Mar 6 22:16:48 CET 2003
At 03:46 PM 3/6/03, Terry Todd wrote:
>I got it working but now I need some help with usage.
>I read the man page and it seems I need to add a -o option.
>
>Here's the resultant line from a spam header that I just received:
> > X-Bogosity: No, tests=bogofilter, spamicity=0.795945, version=0.11.1
>
>Could you give some examples of usage in the man page for how to
>set the values? What are good values to try to start with? Could
>you show an example of setting a spam and ham value?
>
>Thanks,
>Terry Todd
Hi Terry,
Bogofilter's algorithm, known as Robinson-Fisher, has the interesting
characteristic that it knows when there's clear enough information to make a
good classification and when there's insufficient information to make a
clear call whether a message is ham or spam. Some people prefer to run
bogofilter in binary (ham/spam) mode and some prefer that bogofilter tell
them when it's unsure.
These two modes are known as "two-state" and "three-state". In two-state
mode, a message is classified as either spam or ham (and given a "Yes" or
a "No" on the X-Bogosity line). In three-state mode, bogofilter will
classify the message as spam, ham, or unsure.
With the Robinson-Fisher algorithm, when the evidence for spamness is
clear, the spamicity score will be very close to 1.0 (often 0.99 or
above). When the evidence for hamness is clear, the score will be close to
0.0 (often 0.10 or less). This leaves a large middle range where
bogofilter isn't sure. The size of this middle range isn't a problem since
relatively few messages get scores in that area - once bogofilter is trained.
Two cutoff values named spam_cutoff and ham_cutoff determine the mode and
the resulting behavior. If only spam_cutoff is non-zero (which is the
default in the distribution), messages whose spamicity equals or exceeds
spam_cutoff are labeled spam and all others are labeled ham.
If both spam_cutoff and ham_cutoff are set, messages whose spamicity equals
or exceeds spam_cutoff are labeled spam and messages whose spamicity is less
than ham_cutoff are labeled ham. The remaining messages are labeled unsure.
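
The cutoff rules above can be sketched in a few lines of Python. This is
only an illustration of the comparisons described in this message, not
bogofilter's actual code; the function name classify() is made up for the
example, and the default arguments are the distributed values (0.95 and 0.0).

```python
def classify(spamicity, spam_cutoff=0.95, ham_cutoff=0.0):
    """Return 'spam', 'ham', or 'unsure' for a spamicity score.

    Hypothetical helper that mirrors the rules described in the text;
    it is not part of bogofilter.
    """
    # At or above spam_cutoff is always spam.
    if spamicity >= spam_cutoff:
        return "spam"
    # ham_cutoff of 0.0 means two-state mode: everything else is ham.
    if ham_cutoff == 0.0:
        return "ham"
    # Three-state mode: strictly below ham_cutoff is ham.
    if spamicity < ham_cutoff:
        return "ham"
    # Whatever is left falls in the middle range.
    return "unsure"


# The score from the quoted X-Bogosity header:
print(classify(0.795945))              # two-state, distributed cutoffs
print(classify(0.795945, 0.95, 0.10))  # three-state, 0.95 / 0.10
```

With the distributed cutoffs the quoted score of 0.795945 comes out as ham,
which matches the "No" in the header; with 0.95 and 0.10 it would be unsure.
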
As distributed, bogofilter has a ham_cutoff value of 0.0 and a spam_cutoff
of 0.95. This gives a two-state result which is what the majority of the
users wanted (when they were polled). I personally use values of 0.10 and
0.95 and see approx 97% of my incoming mail classified _correctly_ as
either spam or ham. The other 3% can have spam with scores of 0.25 and ham
with scores of 0.85. I manually classify those messages and use them to
further train bogofilter. I've been using these values long enough that
I'm quite confident that bogofilter is correct when it labels a message as
ham or spam. In fact, my MUA (Eudora) messes up on its filtering rules
more often than bogofilter gives either false positives or false negatives.
In conversations with other bogofilter users, it has become apparent that
different numbers work better for different people. If you choose to
experiment, you will sooner or later find values that give results you
are happy with.
Here are some examples of '-o' usage:
"bogofilter -o 0.90" to set spam_cutoff to 0.90
"bogofilter -o 0.80,0.20" to set spam_cutoff to 0.80 and ham_cutoff to 0.20
"bogofilter -o 0.95" to set the default value of spam_cutoff (for two-state)
"bogofilter -o 0.95,0.10" to set the values I use for spam_cutoff and
ham_cutoff (three-state)
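
To make the order of the two numbers explicit, here is a small Python
sketch of how a "-o" value maps onto the two cutoffs. It is not
bogofilter's own option parser; parse_cutoffs() is a hypothetical helper,
and treating a single value as leaving ham_cutoff at 0.0 (two-state) is an
assumption based on the defaults described above.

```python
def parse_cutoffs(value):
    """Split a '-o' style value into (spam_cutoff, ham_cutoff).

    Hypothetical helper for illustration only. The first number is
    spam_cutoff, the optional second number is ham_cutoff.
    """
    parts = [float(p) for p in value.split(",")]
    if len(parts) == 1:
        # Assumption: with one value, ham_cutoff stays at 0.0 (two-state).
        return parts[0], 0.0
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError("expected 'spam_cutoff' or 'spam_cutoff,ham_cutoff'")


print(parse_cutoffs("0.90"))       # spam_cutoff only
print(parse_cutoffs("0.95,0.10"))  # spam_cutoff, then ham_cutoff
```
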
Hope this helps :-)
David