Getting rid of plain obvious spam

Wed Apr 7 12:04:52 CEST 2004

Andreas Pardeike wrote:

> I am using bogofilter with success but somehow I can't get it
> to recognise plain spam even if it should be rather simple to
> detect. What's so special about the message that is shown in
> the attachment (reports includes token and histogram output too)
> and what can I do to get better results on obvious buzzwords?
> 
> Simple training seems not to be enough (I run all incoming through
> bogofilter -u and correct with -sN and -Sn accordingly).

It looks like you are in good shape. Your parameters are
probably just too strict:

> bogofilter -vvv < viagra01.txt 
> X-Bogosity: No, tests=bogofilter, spamicity=0.987342, version=0.17.5

This is a very high value. It is still not rated as spam. Why?

> bogofilter -Q
> # bogofilter version 0.17.5
> 
> robx        = 0.644661  # (6.45e-01)
> robs        = 0.017800  # (1.78e-02)
> min_dev     = 0.375000  # (3.75e-01)
> ham_cutoff  = 0.000000  # (0.00e+00)
> spam_cutoff = 0.990000  # (9.90e-01)

The answer is here. You are extremely strict about what you
call spam: the message scores 0.987342, which is just below
your spam_cutoff of 0.990000, so it is not rated as spam.
While this helps to make sure you don't get false positives,
you will get a lot of false negatives, as in your example.
You need to find out which cutoff is still safe for you but
catches a lot of spam.
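
To make the comparison concrete, here is a minimal Python sketch of
the decision. It is not bogofilter's code; it assumes the rule is
simply "spam if spamicity >= spam_cutoff". The function name is_spam
and the trial cutoff of 0.95 are made up for illustration and are not
a recommendation.

```python
def is_spam(spamicity, spam_cutoff):
    """Assumed decision rule: spam when the score reaches the cutoff."""
    return spamicity >= spam_cutoff


reported_spamicity = 0.987342  # from the X-Bogosity header above
current_cutoff = 0.990000      # from the bogofilter -Q output above
trial_cutoff = 0.95            # hypothetical value, for illustration only

# Compare the verdict under the current and the hypothetical cutoff.
print("current cutoff:", is_spam(reported_spamicity, current_cutoff))
print("trial cutoff:  ", is_spam(reported_spamicity, trial_cutoff))
```

Whether a lower cutoff is safe depends on how your own ham scores,
so check it against your mail before relying on it.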

There are basically two approaches:

a) Full training. To get it to work properly you have to find
good parameters. Using bogotune will do that for you. Without
it you will see the kind of results you report.

b) Training to exhaustion. Here the current parameters are
built into the training itself, so no separate tuning is needed.
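
As a rough illustration of approach b), the Python sketch below keeps
re-training on misclassified messages until a full pass over the
corpus produces no errors. Everything in it is an assumption made for
the example: ToyFilter, its averaged token score, the 0.6 cutoff and
the four sample messages have nothing to do with bogofilter's real
scoring or database.

```python
from collections import Counter


class ToyFilter:
    """Toy token-count filter; NOT bogofilter's algorithm."""

    def __init__(self, spam_cutoff=0.6):
        self.spam_cutoff = spam_cutoff  # hypothetical cutoff
        self.spam = Counter()
        self.ham = Counter()

    def score(self, text):
        # Average a smoothed per-token spam ratio; unseen tokens give 0.5.
        probs = []
        for tok in text.lower().split():
            s, h = self.spam[tok], self.ham[tok]
            probs.append((s + 0.5) / (s + h + 1.0))
        return sum(probs) / len(probs) if probs else 0.5

    def is_spam(self, text):
        return self.score(text) >= self.spam_cutoff

    def train(self, text, spam):
        (self.spam if spam else self.ham).update(text.lower().split())


def train_to_exhaustion(filt, corpus, max_passes=50):
    """Train only on errors; stop after a pass with no errors.

    Returns the number of passes used, or None if max_passes is hit.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, spam in corpus:
            if filt.is_spam(text) != spam:
                filt.train(text, spam)
                errors += 1
        if errors == 0:
            return passes
    return None


# Made-up sample corpus: (message text, is it spam?)
corpus = [
    ("cheap viagra buy now", True),
    ("meeting agenda for monday", False),
    ("buy cheap pills online now", True),
    ("lunch on monday", False),
]

filt = ToyFilter()
passes = train_to_exhaustion(filt, corpus)
print("passes needed:", passes)
```

Because the cutoff is used inside the loop, the training adapts to
whatever cutoff is configured, which is why no tuning step follows.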

pi