What is a spamicity of exactly 0.5?

Sun Jan 25 15:33:57 CET 2004

On Sun, 25 Jan 2004 08:50:21 -0500
Jason A. Smith wrote:

> All of the spam that I get with a lot of random words appended to the
> end get a spamicity score of exactly 0.5.  Why is this happening and
> what does that score mean?  I don't understand why they don't get
> scored as spam since most are advertising the exact same website and
> come from the same source.  Shouldn't those few known spam tokens
> outweigh the random words?  Is there anything that I can do to improve
> bogofilter's detection of spam like that with random words?
> 
> ~Jason

Welcome Jason,

Good questions!

The Robinson-Fisher algorithm has, as its last step, a chi-square test
that computes a certainty level from the combined token scores and the
number of tokens.  When a message contains a lot of spam tokens _and_ a
lot of ham tokens, the two kinds of evidence cancel out and the result
is often 0.500000.  Such a score means that the computation can't say
(with any level of certainty) whether the message is ham or spam.
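
To make the cancellation concrete, here is a minimal Python sketch of
Fisher-style combining as Gary Robinson describes it.  It is not
bogofilter's actual code: the function names are made up for this
example, the token probabilities are invented, and it assumes that the
appended random words happen to look hammy in your wordlist.

```python
import math

def chi2_survival(x, df):
    """Probability that a chi-square variable with an even number of
    degrees of freedom `df` is at least `x` (closed-form series)."""
    assert df % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(token_probs):
    """Combine per-token spam probabilities (each strictly between 0
    and 1) into a single score between 0 and 1."""
    n = len(token_probs)
    ln_p = sum(math.log(p) for p in token_probs)
    ln_q = sum(math.log(1.0 - p) for p in token_probs)
    h = chi2_survival(-2.0 * ln_p, 2 * n)  # collapses toward 0 when strong ham tokens are present
    s = chi2_survival(-2.0 * ln_q, 2 * n)  # collapses toward 0 when strong spam tokens are present
    return (1.0 + h - s) / 2.0

# Hypothetical token probabilities, chosen only for illustration.
spam_only = [0.99] * 30               # 30 strongly spammy tokens
mixed = spam_only + [0.01] * 30       # plus 30 words that look hammy

print(fisher_score(spam_only))  # close to 1: clearly spam
print(fisher_score(mixed))      # close to 0.5: the evidence cancels
```

With strong evidence on both sides, both chi-square terms collapse
toward zero, so the combined score lands on 0.5 regardless of which
side has the "better" tokens.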

If you want to see more about how a particular message is scored, run
bogofilter with "-vv" to generate a histogram or with "-vvv" to generate
a list of all the tokens and their individual scores.

You ask about improving bogofilter's detection of spam with random
words.  If you have an archive with several thousand ham and spam
messages, you can run bogotune to compute a set of parameters customized
for _your_ environment and for _your_ mix of ham and spam.

Hope this helps!

David