What is a spamicity of exactly 0.5?

Sun Jan 25 15:20:27 CET 2004

"Jason A. Smith" <jazbo at jazbo.dyndns.org> wrote:

>All of the spam that I get with a lot of random words appended to the
>end get a spamicity score of exactly 0.5.  Why is this happening and
>what does that score mean?  

It is tempting to give a simple, easy to understand, wrong
answer (Grossman's Misquote ;-). So let me first suggest
reading the man page, where the theory section describes
what the spamicity really is.

In some more detail, it depends on several factors. One is
the set of parameters in use, which influence the values of
individual tokens and whether those tokens are used in the
calculation at all. Then, of course, it matters how many
tokens in the message are spammish or hammish, and how
strongly. From this we get two values, one indicating how
strongly the message looks like spam and the other how
strongly it looks like ham. Those are then combined.

Getting exactly .5 means the last two numbers are equal.
Given how many things enter the calculation, there is no
deeper meaning to it.
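
To make that concrete, here is a rough sketch of such a
combining step in Python. It follows the Fisher-style
formula as I understand it from the theory section of the
man page (two chi-square tail values P and Q, combined as
(1 + Q - P) / 2); it is not bogofilter's actual code. The
names chi2q, spamicity and token_scores are made up for
this example, the min_dev argument only imitates the
bogofilter parameter of that name, and the per-token scores
are assumed to be already computed.

```python
import math


def chi2q(x, df):
    """Chance that a chi-square variable with an even number of
    degrees of freedom (df) exceeds x."""
    assert df % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def spamicity(token_scores, min_dev=0.0):
    """Combine per-token scores (each strictly between 0 and 1)
    into a single value between 0 and 1."""
    # Tokens closer than min_dev to the neutral 0.5 are left out.
    used = [f for f in token_scores if abs(f - 0.5) >= min_dev]
    if not used:
        # Assumption of this sketch only: nothing usable means neutral.
        return 0.5
    n = len(used)
    # p is small when the spammish evidence is strong,
    # q is small when the hammish evidence is strong.
    p = chi2q(-2.0 * sum(math.log(1.0 - f) for f in used), 2 * n)
    q = chi2q(-2.0 * sum(math.log(f) for f in used), 2 * n)
    return (1.0 + q - p) / 2.0


# A handful of strongly spammish tokens alone scores close to 1.
print(spamicity([0.99] * 5))

# The same tokens plus equally strong hammish ones: both p and q
# are tiny and nearly equal, so the result lands at about 0.5.
print(spamicity([0.99] * 5 + [0.01] * 5))
```

In this sketch a message with strong evidence on both sides
ends up in the middle just like one with no evidence at
all, which is why the combined value alone says little.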

For example, I use .5 as spam_cutoff, i.e., messages with
exactly .5 are marked as spam. Works great for me, but of
course that depends on the other parameters (like robx=.499)
and my training method.

>I don't understand why they don't get scored as spam

Look at the output of bogofilter -vvv for one of these
messages to see which tokens went into the score.

> since most are advertising the exact same website and come from
>the same source.  Shouldn't those few known spam tokens outweigh the
>random words?

Yes, but you will find hammish words.

>Is there anything that I can do to improve bogofilter's
>detection of spam like that with random words?

I doubt random words are a problem. It might be your
settings or training, though. The FAQ offers some advice on
this.

pi