Getting rid of plain obvious spam

Andreas Pardeike andreas at pardeike.net
Wed Apr 7 16:36:41 CEST 2004


On 2004-04-07, at 13.36, David Relson wrote:

> It looks like you're using a combination of bogofilter 0.17.5's new
> scoring parameters and your own values.  "bogofilter -C -Q" shows the
> following parameters (without config file):
>
> robx        = 0.520000  # (5.20e-01)
> robs        = 0.017800  # (1.78e-02)
> min_dev     = 0.375000  # (3.75e-01)
> ham_cutoff  = 0.000000  # (0.00e+00)
> spam_cutoff = 0.990000  # (9.90e-01)
>
> You have the same values, except for robx.

You're right. Four days ago, I found out that our old wordlist.db had
only 1700 ham but 52000 spam messages in it. I am not aware of any
mistakes I made in the past, but who knows. So I thought it would be
good to start over: I collected about 1000 messages that were already
sorted into ham/spam and generated an initial wordlist from that corpus.

Then, I think, I played around with bogoutil and accidentally used it
in a way that updated the values (hence the different robx).

> Suggestion 1:  Use only the default parameters, keep on training, and
> bogofilter will do well (once there's been enough training).
>
> Suggestion 2:  Continue to use your old parameters.  I'm betting that
> they were working well for you.  If that is so, there's no need to
> change. If it ain't broken, don't fix it!

It's difficult for me to decide between 1) and 2) without more info.
Following 1) a month ago got me to the problem that I originally tried
to fix.

> Suggestion 3:  Lower the spam_cutoff to 0.98 or 0.97.  Lower values
> increase the likelihood of a false positive.  You'll have to decide for
> yourself which is more important -- having the message be identified as
> spam or increasing the chances of a false positive.

I see 3) as an additional parameter that I can tweak together with a
choice of 1) or 2).
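
To make the trade-off in 3) concrete, here is a minimal sketch that counts
false positives and missed spam at a few cutoffs. The scores are made up
for illustration, count_errors is a hypothetical helper and not part of
bogofilter, and it assumes a message counts as spam when its score is at
or above spam_cutoff.

```python
def count_errors(scored, spam_cutoff):
    """Return (false_positives, missed_spam) for a list of (score, is_spam)."""
    # Ham scored at or above the cutoff is a false positive.
    false_pos = sum(1 for score, is_spam in scored
                    if score >= spam_cutoff and not is_spam)
    # Spam scored below the cutoff slips through undetected.
    missed = sum(1 for score, is_spam in scored
                 if score < spam_cutoff and is_spam)
    return false_pos, missed


# Hypothetical scores, NOT real bogofilter output.
scored = [
    (0.995, True), (0.985, True), (0.975, True),
    (0.972, False), (0.40, False), (0.01, False),
]

for cutoff in (0.99, 0.98, 0.97):
    fp, missed = count_errors(scored, cutoff)
    print(f"spam_cutoff={cutoff}: false positives={fp}, missed spam={missed}")
```

With these invented numbers, each step down catches one more spam, and the
step to 0.97 also flags one ham message.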


BTW: quick poll - what are you people using to verify the work of
bogofilter?

a) undetected_spam / total_spam   ratio (in %)

or

b) undetected_spam / total_ham    ratio (in %)

(You can see my figures at http://www.pardeike.net/junk.html?ratio)
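
For clarity, the two ratios could be computed like this. It is only a
sketch: miss_rates is a hypothetical helper and the counts are invented,
not taken from my figures.

```python
def miss_rates(undetected_spam, total_spam, total_ham):
    """Return ratios a) and b) from the poll, both as percentages."""
    if total_spam <= 0 or total_ham <= 0:
        raise ValueError("totals must be positive")
    a = 100.0 * undetected_spam / total_spam  # a) share of spam that got through
    b = 100.0 * undetected_spam / total_ham   # b) undetected spam relative to ham
    return a, b


# Invented counts for illustration only.
a, b = miss_rates(undetected_spam=30, total_spam=1500, total_ham=600)
print(f"a) {a:.2f}%   b) {b:.2f}%")
```

Note that a) does not depend on how much ham arrives, while b) changes with
the ham volume even if the filter behaves exactly the same.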

Andreas Pardeike




