The 2 U's - Unsure and Update [was: wordlist.db problem]

David Relson relson at osagesoftware.com
Fri Jun 18 15:25:52 CEST 2004


On Thu, 17 Jun 2004 21:45:37 -0700
OTR Comm wrote:

...[snip]...

> I don't want to automatically update the database anyway, so this
> problem made me dig a little deeper into how bogofilter works.  Still
> a long way to go, but this has been helpful.  I may be wrong about not
> using the -u switch, but I don't see that it buys me much.  Or does
> it?

...[snip]...

> One other question though, how does bogofilter ever come up with an
> 'Unsure' classification?  It always classifies mine as either Yes or
> No?  I thought that it would have some bound around .5 probability
> that would trigger an 'Unsure' classification.  Is this somewhere in
> bogofilter.cf.example that I missed?

Bogofilter's default configuration will classify a message as spam or
non-spam.  The SPAM_CUTOFF parameter is used for this.  Messages with
scores greater than or equal to SPAM_CUTOFF are classified as spam. 
Other messages are classified as ham.

There is also a HAM_CUTOFF parameter.  When used, messages must have
scores less than or equal to HAM_CUTOFF to be classified as ham.
Messages with scores between HAM_CUTOFF and SPAM_CUTOFF are classified
as unsure.  If you look in /etc/bogofilter.cf, you will see the
following lines:

  #### CUTOFF Values
  #
  #	both ham_cutoff and spam_cutoff are allowed.
  #	setting ham_cutoff to a non-zero value will
  #	enable tristate results (Yes/No/Unsure).
  #
  #ham_cutoff  = 0.00
  #spam_cutoff = 0.99
  #
  ## with Yes/No/Unsure output:
  ## ham_cutoff = 0.45
  ## spam_cutoff= 0.99

To turn on Yes/No/Unsure classification, remove the #'s from the last
two lines.
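
To make the cutoff rules above concrete, here is a small Python sketch
of the decision. It is only a toy model of the rules as described in
this message, not bogofilter's actual code; the function name classify()
is made up for illustration, and the default values are the ones shown
in the config excerpt.

```python
def classify(score, spam_cutoff=0.99, ham_cutoff=0.0):
    """Toy model of the cutoff rules; not bogofilter's real implementation."""
    # Scores at or above spam_cutoff are spam.
    if score >= spam_cutoff:
        return "Yes"
    # With ham_cutoff left at zero, tristate mode is off:
    # everything that is not spam is ham.
    if ham_cutoff == 0.0:
        return "No"
    # Tristate mode: ham must score at or below ham_cutoff.
    if score <= ham_cutoff:
        return "No"
    # Anything between the two cutoffs is unsure.
    return "Unsure"
```

With the defaults, classify(0.5) gives "No"; with ham_cutoff=0.45 the
same score gives "Unsure".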

Once that's done, you may want to set the filtering rules for your mail
program to include rules like:

  if header contains "X-Bogosity: Yes", put in Spam folder
  if header contains "X-Bogosity: Unsure", put in Unsure folder
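
If your mail program can't do this itself, the same two rules can be
sketched in a few lines of Python using the standard email module. The
function name folder_for() and the "Inbox" fallback folder are
assumptions made for the example; only the header name and the Yes and
Unsure values come from the rules above.

```python
from email import message_from_string

def folder_for(raw_message):
    """Pick a folder from the X-Bogosity header (illustrative only)."""
    msg = message_from_string(raw_message)
    # The header may carry extra details after the verdict,
    # so only the leading word is checked.
    verdict = (msg.get("X-Bogosity") or "").strip()
    if verdict.startswith("Yes"):
        return "Spam"
    if verdict.startswith("Unsure"):
        return "Unsure"
    # Assumption: everything else stays in the normal inbox.
    return "Inbox"
```

A message with no X-Bogosity header at all falls through to the inbox,
which is probably what you want if bogofilter was skipped for some
reason.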

Alternatively, /etc/bogofilter.cf has directives for modifying the
Subject: line, i.e.

  #### SPAM_SUBJECT_TAG
  #
  #	tag added to "Subject: " line for identifying spam or unsure
  #	default is to add nothing.
  #
  ##spam_subject_tag=***SPAM***
  ##unsure_subject_tag=???UNSURE???

The "-u" switch (autoupdate) is used to automatically expand the
wordlist.  When this switch is used and bogofilter classifies a message
as Spam or Ham, the message's tokens are added to the wordlist with a
ham/spam tag (as appropriate).

As an example, suppose a new "Refinance now - best Mortgage rates"
message comes in.  It will have some words that bogofilter has seen and
(probably) some new ones as well.  Using '-u' the new words will be
added to the wordlist so that bogofilter can better recognize the next,
related message.
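
As a rough illustration of the idea, here is a toy Python model of
autoupdate. It is not how bogofilter stores its wordlist; the dictionary
layout, the names new_wordlist() and autoupdate(), and counting each
distinct token once per message are all simplifying assumptions for the
sketch.

```python
from collections import defaultdict

def new_wordlist():
    # Hypothetical in-memory wordlist: token -> spam/ham counts.
    return defaultdict(lambda: {"spam": 0, "ham": 0})

def autoupdate(wordlist, tokens, verdict):
    """Add a message's tokens to the wordlist when the verdict is definite."""
    if verdict == "Yes":
        column = "spam"
    elif verdict == "No":
        column = "ham"
    else:
        # Unsure messages are left alone; nothing is learned from them.
        return False
    # Assumption: each distinct token is counted once per message.
    for token in set(tokens):
        wordlist[token][column] += 1
    return True
```

After autoupdate(wl, ["refinance", "mortgage", "rates"], "Yes"), a
previously unseen word such as "refinance" has a spam count of 1, so it
can contribute to the score of the next message that contains it.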

If/when you choose to use '-u', you need to be on the lookout for
classification errors and retrain bogofilter with any messages that have
been classified incorrectly.  An incorrectly classified message that is
auto-updated _may_ cause bogofilter to make additional classification
errors in the future.  This is the same problem as when you (the
sysadmin) incorrectly register a ham message as spam (or vice versa).

HTH,

David


