Maintaining a snappy bogofilter
David Relson
relson at osagesoftware.com
Thu Apr 10 20:52:15 CEST 2003
At 02:06 PM 4/10/03, Peter Bishop wrote:
>On 10 Apr 2003, at 13:01, David Relson wrote:
>
> > I run bogofilter with the "-u" (update) option so that messages classified
> > as spam are added to spamlist.db and messages that are classified as ham
> > are added to goodlist.db. I also use Fisher in tri-state mode so
> > bogofilter will classify a message as Unsure (when it can't say ham/spam
> > with reasonable certainty). As "unsure" messages don't automatically feed
> > into either database, I manually classify them and have a cron job update
> > the appropriate wordlist. Using "-u" causes the token's timestamp to be
> > updated to "today", which makes checking for "no recent hits" feasible.
>
>So "unsure" spams would need to be registered manually to keep "hits" up
>to date.
Correct.
>This might not work too well in my case as I have set up a shared wordlist
>and I have a "honeytrap" account (a non-existent user) for attracting spam.
>The honeytrap emails are fed directly into the shared spam wordlist
>(bogofilter -s). The shared wordlist is used to filter all email accounts,
>but I cannot tell whether a hapax in the wordlist is getting lots of hits
>by the users, and I don't want to use -u to allow all users to update the
>list, as it requires global access to the wordlist (and what about
>locking?). And anyway I'm using Robinson with no "unsure" state because it
>seems to work better than Fisher.
The honeytrap is a great idea. Does it get all the same spam as other
accounts? How do you train for non-spam?
Let's use the more formal names for the two algorithms: Robinson-GM
(geometric mean) and Robinson-Fisher. The Robinson-Fisher algorithm first
executes the Robinson-GM algorithm and then does a chi-square test to
measure the "certainty" of the result. For example, a high score based on
a large number of tokens will have a higher measure of certainty than a
medium high score or a score based on a small number of tokens. The
Robinson-Fisher algorithm was suggested by Gary Robinson himself as an
improvement to the basic GM algorithm. Experiments indicate that R-F
_does_ produce higher quality results, which is why it's the default
algorithm in bogofilter.
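
To illustrate why a score built from many tokens carries more certainty than one built from a few, here is a minimal sketch of chi-square (Fisher-style) combining of per-token spam probabilities. It is an assumption about the general technique, not bogofilter's actual code; the names chi2_survival and fisher_score are hypothetical, and the per-token probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def chi2_survival(x, dof):
    """P(X >= x) for a chi-square variable with an even, positive dof."""
    assert dof > 0 and dof % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    # Closed-form series that is valid when dof is even.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Hypothetical helper: each probability must be strictly between 0 and 1.
    """
    n = len(token_probs)
    ln_p = sum(math.log(p) for p in token_probs)
    ln_q = sum(math.log(1.0 - p) for p in token_probs)
    # Near 1 when the token probabilities are collectively high.
    s = chi2_survival(-2.0 * ln_p, 2 * n)
    # Near 1 when the token probabilities are collectively low.
    h = chi2_survival(-2.0 * ln_q, 2 * n)
    return (1.0 + s - h) / 2.0

few = fisher_score([0.9] * 2)    # two moderately spammy tokens
many = fisher_score([0.9] * 20)  # twenty tokens with the same probability
print(few, many)
```

With the same per-token probability, the twenty-token message ends up closer to 1 than the two-token message, and a message whose tokens are all at 0.5 stays at 0.5.
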
The Robinson-Fisher algorithm can classify messages in two states (Yes/No)
or in three states (Yes/No/Unsure), depending on switches and
configuration. The two-state mode just uses the value of spam_cutoff to
divide the email into two classifications.
In addition to spam_cutoff, bogofilter has a ham_cutoff
parameter. Messages with spam scores below ham_cutoff are classified as
"No", i.e. good, i.e. non-spam. If the spam score is above spam_cutoff,
the classification is "Yes", a.k.a. spam. If the score is between
ham_cutoff and spam_cutoff, the message is classified as Unsure. Command
line switches "-2" and "-3" can also be used to force the two-state and
three-state modes.
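
The cutoff logic described above can be sketched as follows. This is only an illustration, not bogofilter's code: the function name classify is hypothetical, the cutoff values in the usage lines are made up rather than bogofilter defaults, and the treatment of a score exactly equal to a cutoff is an assumption.

```python
def classify(score, spam_cutoff, ham_cutoff=None):
    """Return "Yes" (spam), "No" (ham), or "Unsure" for a spam score.

    Hypothetical helper. Passing ham_cutoff=None models two-state mode.
    """
    if score >= spam_cutoff:   # assumption: equality counts as spam
        return "Yes"
    if ham_cutoff is None:     # two-state mode: everything else is ham
        return "No"
    if score < ham_cutoff:     # assumption: equality counts as Unsure
        return "No"
    return "Unsure"

# Illustrative cutoffs only, not bogofilter defaults.
print(classify(0.99, spam_cutoff=0.95, ham_cutoff=0.10))
print(classify(0.50, spam_cutoff=0.95, ham_cutoff=0.10))
print(classify(0.50, spam_cutoff=0.95))
```
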
>On the other hand I have such a weird setup it might not be worth
>spending much time thinking about it, as other users are unlikely to benefit.
>
> > P.S. If you haven't tried subscribing, you should do so. If you tried
> > and had trouble, send me the details so we can get the problem fixed.
>
>I did subscribe - that's how I got on the list (see info below). I have also
>tried unsubscribing (to clean things up) but it seems to be ignored.
>No idea why.
I'll forward the subscription info to the list admin.
David