help with usage

Thu Mar 6 22:16:48 CET 2003

At 03:46 PM 3/6/03, Terry Todd wrote:

>I got it working but now I need some help with usage.
>I read the man page and it seems I need to add a -o option.
>
>Here's the resultant line from a spam header that I just received:
> > X-Bogosity: No, tests=bogofilter, spamicity=0.795945, version=0.11.1
>
>Could you give some examples of usage in the man page for how to
>set the values?  What are good values to try to start with?  Could
>you show an example of setting a spam and ham value?
>
>Thanks,
>Terry Todd

Hi Terry,

Bogofilter's algorithm, known as Robinson-Fisher, has the interesting 
characteristic that it knows when there's clear enough information to make 
a good classification and when there's too little information to make a 
clear call on whether a message is ham or spam.  Some people prefer to run 
bogofilter in binary (ham/spam) mode and some prefer that bogofilter tell 
them when it's unsure.

These two modes are known as "two-state" and "three-state".  In two-state 
mode, a message is classified as either spam or ham (and given a "Yes" or 
a "No" on the X-Bogosity line).  In three-state mode, bogofilter will 
classify the message as spam, ham, or unsure.

With the Robinson-Fisher algorithm, when the evidence for spamness is 
clear, the spamicity score will be very close to 1.0 (often 0.99 or 
above).  When the evidence for hamness is clear, the score will be close to 
0.0 (often 0.10 or less).  This leaves a large middle range where 
bogofilter isn't sure.  The size of this middle range isn't a problem since 
relatively few messages get scores in that area - once bogofilter is trained.

Two cutoff values named spam_cutoff and ham_cutoff determine the mode and 
the resulting behavior.  If only spam_cutoff is non-zero (which is the 
default in the distribution), messages whose spamicity equals or exceeds 
spam_cutoff are labeled spam and all others are labeled ham.

If both spam_cutoff and ham_cutoff are set, messages whose spamicity equals 
or exceeds spam_cutoff are labeled spam and messages whose spamicity is less 
than ham_cutoff are labeled ham.  The remaining messages are labeled unsure.
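
To make the two rules above concrete, here is a rough sketch in Python.  It 
is only an illustration of the rule as I described it, not bogofilter's 
actual code; the function name classify() is made up for this example.

```python
def classify(spamicity, spam_cutoff=0.95, ham_cutoff=0.0):
    """Label a message from its spamicity score (illustrative only)."""
    if spamicity >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0.0:
        # Two-state mode: anything below spam_cutoff is ham.
        return "ham"
    if spamicity < ham_cutoff:
        return "ham"
    # Three-state mode: scores between the two cutoffs are unsure.
    return "unsure"


# The score from the header quoted above, under both modes.
print(classify(0.795945))              # default two-state cutoffs
print(classify(0.795945, 0.95, 0.10))  # three-state cutoffs
```

With the default cutoffs the message quoted above comes out as ham, which 
matches the "No" on its X-Bogosity line.  With cutoffs of 0.95 and 0.10 the 
same score falls in the middle range and comes out as unsure.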

As distributed, bogofilter has a ham_cutoff value of 0.0 and a spam_cutoff 
of 0.95.  This gives a two-state result which is what the majority of the 
users wanted (when they were polled).  I personally use values of 0.10 and 
0.95 and see approx 97% of my incoming mail classified _correctly_ as 
either spam or ham.  The other 3% can have spam with scores of 0.25 and ham 
with scores of 0.85.  I manually classify those messages and use them to 
further train bogofilter.  I've been using these values long enough that 
I'm quite confident that bogofilter is correct when it labels a message as 
ham or spam.  In fact, my MUA (Eudora) messes up on its filtering rules 
more often than bogofilter gives either false positives or false negatives.

In conversations with other bogofilter users, it has become apparent that 
different numbers work better for different people.  If you choose to 
experiment, you will sooner or later find values that give results you 
are happy with.

Here are some examples of '-o' usage:

"bogofilter -o 0.90" to set spam_cutoff to 0.90
"bogofilter -o 0.80,0.20" to set spam_cutoff to 0.80 and ham_cutoff to 0.20

"bogofilter -o 0.95" to set the default value of spam_cutoff (for two-state)
"bogofilter -o 0.95,0.10" to set the values I use for spam_cutoff and 
ham_cutoff (three-state)
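
If it helps to see how the argument maps onto the two cutoffs, here is a 
small Python sketch.  The helper parse_cutoffs() is hypothetical and is not 
how bogofilter itself reads the option; it only mirrors the examples above, 
where the first number is spam_cutoff and the optional second number is 
ham_cutoff.

```python
def parse_cutoffs(value):
    """Split a '-o' style argument into (spam_cutoff, ham_cutoff).

    Hypothetical helper for illustration.  A missing second number
    is treated as ham_cutoff = 0.0, i.e. two-state mode.
    """
    parts = [p.strip() for p in value.split(",")]
    if len(parts) not in (1, 2):
        raise ValueError("expected 'spam_cutoff' or 'spam_cutoff,ham_cutoff'")
    spam_cutoff = float(parts[0])
    ham_cutoff = float(parts[1]) if len(parts) == 2 else 0.0
    return spam_cutoff, ham_cutoff


print(parse_cutoffs("0.90"))       # two-state
print(parse_cutoffs("0.95,0.10"))  # three-state
```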

Hope this helps :-)

David