The 2 U's - Unsure and Update [was: wordlist.db problem]
David Relson
relson at osagesoftware.com
Fri Jun 18 15:25:52 CEST 2004
On Thu, 17 Jun 2004 21:45:37 -0700
OTR Comm wrote:
...[snip]...
> I don't want to automatically update the database anyway, so this
> problem made me dig a little deeper into how bogofilter works. Still
> a long way to go, but this has been helpful. I may be wrong about not
> using the -u switch, but I don't see that it buys me much. Or does
> it?
...[snip]...
> One other question though, how does bogofilter ever come up with an
> 'Unsure' classification? It always classifies mine as either Yes or
> No? I thought that it would have some bound around .5 probability
> that would trigger an 'Unsure' classification. Is this somewhere in
> bogofilter.cf.example that I missed?
Bogofilter's default configuration will classify a message as spam or
non-spam. The SPAM_CUTOFF parameter is used for this. Messages with
scores greater than or equal to SPAM_CUTOFF are classified as spam.
Other messages are classified as ham.
There is also a HAM_CUTOFF parameter. When used, messages must have
scores less than or equal to HAM_CUTOFF to be classified as ham.
Messages with scores between HAM_CUTOFF and SPAM_CUTOFF are classified
as unsure. If you look in /etc/bogofilter.cf, you will see the
following lines:
#### CUTOFF Values
#
# both ham_cutoff and spam_cutoff are allowed.
# setting ham_cutoff to a non-zero value will
# enable tristate results (Yes/No/Unsure).
#
#ham_cutoff = 0.00
#spam_cutoff = 0.99
#
## with Yes/No/Unsure output:
## ham_cutoff = 0.45
## spam_cutoff= 0.99
To turn on Yes/No/Unsure classification, remove the leading #'s from the
last two lines (the ham_cutoff and spam_cutoff settings).
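To make the cutoff rules concrete, here is a minimal Python sketch of the
decision described above. It is not bogofilter's code; the function name
classify() is made up for illustration, and the treatment of
ham_cutoff = 0 as "two-state mode" follows the comment in the config file.

```python
def classify(score, spam_cutoff=0.99, ham_cutoff=0.0):
    """Return 'Yes' (spam), 'No' (ham) or 'Unsure' for a score in [0, 1].

    Hypothetical helper that mirrors the rules in the text, not
    bogofilter's real implementation.
    """
    if score >= spam_cutoff:
        return "Yes"
    if ham_cutoff == 0.0:
        # ham_cutoff not set: plain Yes/No, everything else is ham
        return "No"
    if score <= ham_cutoff:
        return "No"
    # strictly between ham_cutoff and spam_cutoff
    return "Unsure"
```

With the defaults a score of 0.5 comes back as "No"; with
ham_cutoff=0.45 the same score comes back as "Unsure".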
Once that's done, you may want to set the filtering rules for your mail
program to include rules like:
if header contains "X-Bogosity: Yes", put in Spam folder
if header contains "X-Bogosity: Unsure", put in Unsure folder
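If your mail program can't express such rules directly, the same logic
is easy to script. The sketch below uses Python's standard email module;
the function name pick_folder() and the folder names are assumptions for
illustration, and it only looks at how the X-Bogosity value begins.

```python
from email import message_from_string


def pick_folder(raw_message):
    """Choose a folder name from the X-Bogosity header of a raw message.

    Hypothetical helper; folder names are placeholders.
    """
    msg = message_from_string(raw_message)
    verdict = msg.get("X-Bogosity", "").strip()
    if verdict.startswith("Yes"):
        return "Spam"
    if verdict.startswith("Unsure"):
        return "Unsure"
    # "No", or no X-Bogosity header at all
    return "Inbox"
```

Messages with no X-Bogosity header fall through to the inbox, which is
the safe default if bogofilter was not run on them.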
Alternatively, /etc/bogofilter.cf has directives for modifying the
Subject: line, i.e.
#### SPAM_SUBJECT_TAG
#
# tag added to "Subject: " line for identifying spam or unsure
# default is to add nothing.
#
##spam_subject_tag=***SPAM***
##unsure_subject_tag=???UNSURE???
The "-u" switch (autoupdate) is used to automatically expand the
wordlist. When this switch is used and bogofilter classifies a message
as Spam or Ham, the message's tokens are added to the wordlist with a
ham/spam tag (as appropriate).
As an example, suppose a new "Refinance now - best Mortgage rates"
message comes in. It will have some words that bogofilter has seen and
(probably) some new ones as well. Using '-u' the new words will be
added to the wordlist so that bogofilter can better recognize the next,
related message.
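As a rough picture of what '-u' does, here is a toy model in Python. It
is only a sketch: the real wordlist is a database, not a pair of
Counters, and the name autoupdate() is invented for this example. It
assumes, as the description above implies, that an Unsure result adds
nothing.

```python
from collections import Counter


def autoupdate(wordlist, tokens, verdict):
    """Add a message's tokens to the list matching its classification.

    Toy model of '-u'; `wordlist` is a dict with 'spam' and 'ham'
    Counters, which is an assumption made for illustration.
    """
    if verdict == "Yes":
        wordlist["spam"].update(tokens)
    elif verdict == "No":
        wordlist["ham"].update(tokens)
    # any other verdict (e.g. "Unsure"): leave the wordlist alone


wordlist = {"spam": Counter(), "ham": Counter()}
autoupdate(wordlist, ["refinance", "mortgage", "rates"], "Yes")
```

After the call, "refinance", "mortgage" and "rates" each have a spam
count of 1, so the next mortgage spam has more known tokens to score.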
If/when you choose to use '-u', you need to be on the lookout for
classification errors and retrain bogofilter with any messages that have
been classified incorrectly. An incorrectly classified message that is
auto-updated _may_ cause bogofilter to make additional classification
errors in the future. This is the same problem as when you (the sys
admin) incorrectly register a ham message as spam (or vice versa).
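Retraining a misclassified message amounts to undoing the wrong
registration and adding the right one. The toy Python model below shows
the idea; reclassify() and the Counter-based wordlist are assumptions
for illustration, not bogofilter internals. On the command line I
believe this corresponds to the -Sn and -Ns option pairs, but check the
man page for your version.

```python
from collections import Counter


def reclassify(wordlist, tokens, wrong, right):
    """Move a message's tokens from the `wrong` list to the `right` one.

    Toy model; `wordlist` is a dict with 'spam' and 'ham' Counters.
    """
    wordlist[wrong].subtract(tokens)
    # unary plus drops tokens whose count fell to zero or below
    wordlist[wrong] = +wordlist[wrong]
    wordlist[right].update(tokens)
```

Until the correction is made, the ham tokens sitting in the spam list
keep nudging similar legitimate mail toward a spam score.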
HTH,
David