Maintaining a snappy bogofilter

Thu Apr 10 20:52:15 CEST 2003

At 02:06 PM 4/10/03, Peter Bishop wrote:

>On 10 Apr 2003, at 13:01, David Relson wrote:
>
> > I run bogofilter with the "-u" (update) option so that messages classified
> > as spam are added to spamlist.db and messages that are classified as ham
> > are added to goodlist.db.  I also use Fisher in tri-state mode so
> > bogofilter will classify a message as Unsure (when it can't say ham/spam
> > with reasonable certainty).  As "unsure" messages don't automatically feed
> > into either database, I manually classify them and have a cron job update
> > the appropriate wordlist.  Using "-u" causes the token's timestamp to be
> > updated to "today", which makes checking for "no recent hits" feasible.
>
>So "unsure" spams would need to be registered manually to keep "hits" up
>to date.

Correct.

>This might not work too well in my case as I have set up a shared wordlist
>and I have a "honeytrap" account (a non-existent user) for attracting spam.
>The honeytrap emails are fed directly into the shared spam wordlist
>(bogofilter -s). The shared wordlist is used to filter all email accounts,
>but I cannot tell whether a hapax in the wordlist is getting lots of hits by
>the users and I don't want to use -u to allow all users to update the list, as
>it requires global access to the wordlist (and what about locking?). And
>anyway I'm using Robinson with no "unsure" state because it seems to
>work better than Fisher.

The honeytrap is a great idea.  Does it get all the same spam as other 
accounts?  How do you train for non-spam?

Let's use the more formal names for the two algorithms:  Robinson-GM 
(geometric mean) and Robinson-Fisher.  The Robinson-Fisher algorithm first 
executes the Robinson-GM algorithm and then does a chi-square test to 
measure the "certainty" of the result.  For example, a high score based on 
a large number of tokens will have a higher measure of certainty than a 
medium high score or a score based on a small number of tokens.  The 
Robinson-Fisher algorithm was suggested by Gary Robinson himself as an 
improvement to the basic GM algorithm.  Experiments indicate that R-F 
_does_ produce higher quality results, which is why it's the default 
algorithm in bogofilter.
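
To make the chi-square step concrete, here is a minimal Python sketch of 
Fisher-style combining as I understand it from Gary Robinson's 
description.  It is not bogofilter's actual code.  The function names, the 
clipping constant, and the assumption that each token already has a spam 
probability between 0 and 1 are all mine, for illustration only.

```python
import math

def chi2_survival(x2, dof):
    """P(X >= x2) for a chi-square variable; closed form valid for even dof only."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Hypothetical helper: token_probs is assumed to hold one spam
    probability per token found in the message.
    """
    if not token_probs:
        return 0.5  # no evidence either way (assumption of this sketch)
    # Clip so that log() never sees exactly 0 (the constant is arbitrary).
    clipped = [min(max(p, 1e-6), 1.0 - 1e-6) for p in token_probs]
    n = len(clipped)
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in clipped)
    ham_stat = -2.0 * sum(math.log(p) for p in clipped)
    spam_evidence = 1.0 - chi2_survival(spam_stat, 2 * n)
    ham_evidence = 1.0 - chi2_survival(ham_stat, 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0
```

With this sketch, twenty tokens at 0.8 score closer to 1.0 than two tokens 
at 0.8, which is the "more tokens, more certainty" effect described above.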

The Robinson-Fisher algorithm can classify messages as Yes/No or as 
Yes/No/Unsure, depending on switches and configuration.  The two-state mode 
just uses the value of spam_cutoff to divide the email into two 
classifications.

In addition to spam_cutoff, bogofilter has a ham_cutoff 
parameter.  Messages with spam scores below ham_cutoff are classified as 
"No", i.e. good, i.e. non-spam.  If the spam score is above spam_cutoff, 
the classification is "Yes", a.k.a. spam.  If the score is between 
ham_cutoff and spam_cutoff, the message is classified as Unsure.  Command 
line switches "-2" and "-3" can also be used to force the two-state and 
three-state modes.
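
The cutoff logic above can be sketched in a few lines of Python.  This is 
only an illustration, not bogofilter's code: the function name is made up, 
passing ham_cutoff=None to mean two-state mode is my own convention, and 
how a score exactly equal to a cutoff is treated is an assumption.

```python
def classify(score, spam_cutoff, ham_cutoff=None):
    """Map a spam score to "Yes", "No" or "Unsure".

    Hypothetical helper.  ham_cutoff=None stands in for two-state mode,
    where spam_cutoff alone divides the mail.
    """
    if score >= spam_cutoff:      # boundary handling is an assumption
        return "Yes"
    if ham_cutoff is None or score <= ham_cutoff:
        return "No"
    return "Unsure"               # strictly between the two cutoffs
```

For example, with made-up cutoffs, classify(0.50, spam_cutoff=0.95, 
ham_cutoff=0.10) gives "Unsure", while the same score in two-state mode, 
classify(0.50, spam_cutoff=0.95), gives "No".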

>On the other hand I have such a weird setup it might not be worth
>spending much time thinking about it as other users are unlikely to benefit.
>
> > P.S.  If you haven't tried subscribing, you should do so.  If you tried and
> > had trouble, send me the details so we can get the problem fixed.
>
>I did subscribe - that's how I got on the list (see info below). I have also
>tried unsubscribing (to clean things up) but it seems to be ignored.
>No idea why.

I'll forward the subscription info to the list admin.

David