Training question

David Relson relson at osagesoftware.com
Tue Nov 23 01:02:59 CET 2004


On Mon, 22 Nov 2004 13:34:50 -0500
Sean Brown wrote:

> I'd like some input on training.  Thus far I am not getting the results
> I'd like to see from bogofilter (i.e., more SPAM is getting through
> than I expected), so I thought I'd see if the way I am doing training
> is reasonable.
> 
> Background:  qmail and Courier IMAP on RedHat 9.
> 
> I have created a "spam" folder under my inbox.  When spam makes it
> past bogofilter without being caught, I move it to that folder.  Then,
> on a nightly basis I run this cron job:
> 
> /usr/local/bin/bogofilter -vsB /var/qmail/mailnames/sean-brown.com/sean/Maildir/.spam/
> 
> Does this sound reasonable to all of you?  If not, what can I do 
> differently?
> 
> Sean

Hello Sean,

You don't mention which version of bogofilter you have, nor how you have
it configured.  By the way, if you haven't read the GETTING.STARTED
document that ships with 0.93.1 I recommend you do so.  If need be, I
can email a copy to you.

Prior to 0.93.0, bogofilter's default configuration was two-state, with
a conservative default of 0.99 for spam_cutoff: messages scoring 0.99 or
above are considered spam and all other messages are considered ham.
Lowering spam_cutoff catches more spam, but it also increases the
likelihood of false positives (ham messages scored as spam), which we
want to avoid.  In your case, you may want to lower spam_cutoff's value
to reduce the amount of spam classified as ham.  However, don't change
the value until you've read further.

Bogofilter has long been able to operate in a tri-state mode in which
messages are classified as Spam, Ham, or Unsure.  Effective with
release 0.93.0, the default configuration sets spam_cutoff=0.99 (as
before) and ham_cutoff=0.10.  Messages scoring between 0.10 and 0.99 are
classified as Unsure.
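
To make the three-way split concrete, here is a small Python sketch of
the rule described above.  It is only an illustration, not bogofilter's
own code; the function name classify is made up for this example, and
the handling of a score exactly equal to ham_cutoff is an assumption.

```python
# Default cutoffs quoted above for release 0.93.0.
HAM_CUTOFF = 0.10
SPAM_CUTOFF = 0.99


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Illustrative tri-state rule (hypothetical helper, not bogofilter code)."""
    if score >= spam_cutoff:
        # "messages scoring 0.99 or above are considered spam"
        return "Spam"
    if score <= ham_cutoff:
        # Assumption: a score exactly at ham_cutoff counts as ham.
        return "Ham"
    # Everything between the two cutoffs is left undecided.
    return "Unsure"


for s in (0.03, 0.50, 0.995):
    print(s, classify(s))
```

Setting ham_cutoff to 0 in this sketch collapses it back to the older
two-state behaviour for any positive score.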

I'd recommend that you modify your bogofilter.cf file to enable
tri-state mode, i.e. set ham_cutoff=0.10, and that you use the '-p'
(passthrough) option when running bogofilter (to get the X-Bogosity line
added to the message header).  Also, have your mail program check for
"X-Bogosity: Unsure" in order to identify and separate those
messages.  Once you have that, you'll have a group of messages that
bogofilter couldn't readily classify, i.e. messages that _should_ be
used in training so bogofilter can expand its vocabulary.
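
If your mail program can't filter on the header directly, a short script
can do the check.  The sketch below uses Python's standard email module.
It assumes the header looks like "X-Bogosity: Unsure, tests=bogofilter,
spamicity=0.520000, version=0.93.1"; the exact layout is configurable,
so check a real message first.  The function name bogosity and the
sample message are invented for illustration.

```python
import email
import re

# Assumed header layout:
#   X-Bogosity: <class>, tests=bogofilter, spamicity=<score>, version=<v>
_SPAMICITY = re.compile(r"spamicity=([0-9.]+)")


def bogosity(raw_message):
    """Return (classification, score) from X-Bogosity, or None if absent."""
    msg = email.message_from_string(raw_message)
    header = msg.get("X-Bogosity")
    if header is None:
        return None
    # The classification is the text before the first comma.
    label = header.split(",", 1)[0].strip()
    match = _SPAMICITY.search(header)
    score = float(match.group(1)) if match else None
    return label, score


# Made-up sample message, for illustration only.
sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Bogosity: Unsure, tests=bogofilter, spamicity=0.520000, version=0.93.1\n"
    "\n"
    "body text\n"
)

print(bogosity(sample))
```

Messages for which the first element is "Unsure" are the ones to set
aside and sort by hand.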

So, separate the Unsures into the two groups "Unsure, is actually Ham"
and "Unsure, is actually Spam".  Use all those messages for training
(not just the spam).  Also keep all those messages for later. 

In a week or so, look at the X-Bogosity lines of the two groups of
messages and find the highest scoring ham and lowest scoring spam.  You
can use those values to increase the ham_cutoff value and decrease the
spam_cutoff value.  The result of doing this will be more messages
scored as Ham and Spam and fewer scored as Unsure. 
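
As a rough sketch of that review step, assuming you have already pulled
the scores out of the X-Bogosity lines of each hand-sorted group, the
two numbers of interest can be found like this.  The function name
review_scores and the sample scores are hypothetical.

```python
def review_scores(ham_scores, spam_scores):
    """Return (highest_ham, lowest_spam, overlap) for two hand-sorted groups.

    overlap is True when some ham scored at or above some spam, in which
    case the two values cannot simply be used as the new cutoffs.
    """
    highest_ham = max(ham_scores)
    lowest_spam = min(spam_scores)
    overlap = highest_ham >= lowest_spam
    return highest_ham, lowest_spam, overlap


# Made-up scores, for illustration only.
unsure_ham = [0.12, 0.31, 0.45]
unsure_spam = [0.62, 0.80, 0.97]

highest_ham, lowest_spam, overlap = review_scores(unsure_ham, unsure_spam)
print("highest ham:", highest_ham)
print("lowest spam:", lowest_spam)
print("overlap:", overlap)
```

With no overlap, ham_cutoff can be raised toward the highest ham score
and spam_cutoff lowered toward the lowest spam score.  Leaving some
margin on both sides is the cautious choice.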

When you have a lot of messages, i.e. 2000 each of ham and spam in your
wordlist and another 2000 (or more) each of ham and spam, then you'll be
able to run bogotune and have the computer find optimal scoring
parameters for your machine.  Bogotune is an advanced capability that
you needn't worry about now, but might want to keep in mind for the
future.

HTH,

David


