Bogofilter seems to not be working

Wed Mar 26 06:17:08 CET 2003

Hello Jesse,

At 10:21 PM 3/25/03, Jesse Meyer wrote:

>On Tue, Mar 25, 2003 at 11:49:25AM -0800, daniel wrote:
> > I have set up bogofilter with the procmail recipes in the man page:
> >
> > :0fw
> > | bogofilter -u -e -p
>                ^^^
>Note the -u flag, I'll explain my theory in a bit.
>
> >
> > [ Snip rest of procmail configuration ]
> >
> > [ Snip description of spam filter scores approaching 0 over time ]
>
>Here's my (bogofilter-uneducated) theory.  If I recall the man page
>correctly, the -u flag seems to allow bogofilter to continue learning,
>so if it thinks a message is spam, it tries to figure out what new
>spam rules it can learn from that message.  Inversely, if it considers
>the message as ham, it tries to figure out what new non-spam rules
>it can learn from the message.

"-u" needs to be used wisely and carefully.  When this option is used, 
bogofilter classifies the new message and then adds its tokens to the 
wordlist corresponding to the classification.  If bogofilter classifies the 
message as spam, the tokens are added to the spamlist.  If the 
classification is non-spam, the tokens are added to the goodlist.

As you say, this process is learning and extending bogofilter's 
vocabulary.  It also assumes bogofilter is doing a good job (which it can 
do).  However, bogofilter cannot be all-knowing, so there will be messages 
that are incorrectly classified.  When bogofilter is first being used, the 
wordlists are small, and bogofilter's accuracy is at its worst.  When using 
"-u", the sysadmin _must_ monitor what bogofilter is doing.  When 
bogofilter makes a mistake, the sysadmin needs to notify bogofilter and 
have the message removed from one wordlist and added to the other.
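
One rough way to keep an eye on what bogofilter is doing is to tally the 
verdicts it stamps on delivered mail.  The sketch below assumes the "-p" 
passthrough recipe above, so each message carries an X-Bogosity header, and 
it assumes the verdict is the first comma-separated field of that header 
(the exact layout depends on the bogofilter version).  The function name is 
made up for illustration.

```shell
# Tally the verdict of every X-Bogosity header line in a mailbox file.
# Assumption: the verdict is the first comma-separated field of the header.
# Caveat: a body line that starts with "X-Bogosity:" would be counted too.
bogosity_tally() {
    grep '^X-Bogosity:' "$1" | cut -d, -f1 | sort | uniq -c
}
```

Running it as "bogosity_tally /path/to/mailbox" only gives counts.  A tally 
that shifts steadily toward one verdict is a hint to read the messages and 
check for misclassifications; it does not replace reading them.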

When a good message is added (incorrectly) to the spam wordlist, bogofilter 
should be re-run with flags "-S -n" to take the words out of the spam 
wordlist and add them to the good wordlist.  When a spam message is 
incorrectly added to the good wordlist, use flags "-N -s" to correct the 
problem.
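
As a sketch, the two corrections could be wrapped in small shell helpers.  
The function names and the BOGOFILTER variable are made up for 
illustration, and each misfiled message is assumed to have been saved in a 
file of its own.

```shell
# Hypothetical override so a different bogofilter binary can be used.
BOGOFILTER="${BOGOFILTER:-bogofilter}"

# A good message was learned as spam:
# take its words out of the spam wordlist and add them to the good wordlist.
fix_false_positive() {
    $BOGOFILTER -S -n < "$1"
}

# A spam message was learned as good:
# take its words out of the good wordlist and add them to the spam wordlist.
fix_false_negative() {
    $BOGOFILTER -N -s < "$1"
}
```

For example, "fix_false_positive newsletter.msg" would undo the spam 
registration of a legitimate message saved as newsletter.msg (a placeholder 
filename).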

Failure to monitor and correct would indeed lead to the scenario you 
describe and could be the cause of the problem initially reported.

>You received scores in the .40's, originally, and they then slowly
>approached .00 over time.  When I read the documentation, I believe
>it mentioned .54 as the dividing point between spam and ham.

FWIW, 0.54 is the spam_cutoff value used with the Robinson-GM 
algorithm.  Bogofilter now uses the Fisher algorithm, which has a 
spam_cutoff value of 0.95.

>Now here's my theory, which hinges on the assumption that you are
>_not_ continually training bogofilter (you didn't mention it
>doing so).  I believe that bogofilter might have been poorly trained
>in the beginning, so that it classified most spam as ham.  Then, as
>new messages were filtered through procmail and bogofilter, it then
>added more rules to classify the fake-ham as ham, dropping your
>scores near 0.
>
>If this is the problem, then the solution is simple - remove the
>old score files, train bogofilter properly, and continue training it
>when it receives false positives or negatives.
>
>Of course, please note that I have been using bogofilter for
>roughly 48 hours now, so I could be way off.  :)

You may be new to bogofilter, but you've obviously been reading and 
thinking.  Your understanding is very good.

David