What is a spamicity of exactly 0.5?

Jason A. Smith jazbo at jazbo.dyndns.org
Sun Jan 25 23:42:02 CET 2004


Thanks for the explanations, both David and pi.

On Sun, 2004-01-25 at 09:33, David Relson wrote:
> On Sun, 25 Jan 2004 08:50:21 -0500
> Jason A. Smith wrote:
> 
> > All of the spam that I get with a lot of random words appended to the
> > end gets a spamicity score of exactly 0.5.  Why is this happening and
> > what does that score mean?  I don't understand why they don't get
> > scored as spam since most are advertising the exact same website and
> > come from the same source.  Shouldn't those few known spam tokens
> > outweigh the random words?  Is there anything that I can do to improve
> > bogofilter's detection of spam like that with random words?
> > 
> > ~Jason
> 
> Welcome Jason,
> 
> Good questions!
> 
> The Robinson-Fisher algorithm has, as its last step, a chi-square test
> which computes a certainty level based on the computed score and the
> number of tokens.  When there are a lot of spam tokens _and_ a lot of
> ham tokens in the message, the result is often 0.500000.  Such a score
> means that the computation can't say (with any level of certainty)
> whether the message is ham or spam.
> 
> If you want to see more about how a particular message is scored, run
> bogofilter with "-vv" to generate a histogram or with "-vvv" to generate
> a list of all the tokens and their individual scores.
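
To make the 0.5 result described above concrete, here is a minimal sketch
of Fisher-style chi-square combining as Robinson describes it. It is not
bogofilter's actual code: the function names (chi2q, fisher_score) and the
per-token probabilities (0.99 for "spammy" tokens, 0.01 for "hammy" random
words) are assumptions chosen for illustration, and bogofilter's own
parameters (such as robx, robs and min_dev) are ignored.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square value for an EVEN number of
    degrees of freedom (closed-form series; fine for small examples, but
    exp() underflows for very large x2)."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(token_probs):
    """Combine per-token spam probabilities f(w), each strictly between
    0 and 1, into one score: near 1 = spam, near 0 = ham, near 0.5 = unsure."""
    n = len(token_probs)
    if n == 0:
        return 0.5
    # Small only when many tokens look hammy (f(w) close to 0).
    h = chi2q(-2.0 * math.fsum(math.log(f) for f in token_probs), 2 * n)
    # Small only when many tokens look spammy (f(w) close to 1).
    s = chi2q(-2.0 * math.fsum(math.log(1.0 - f) for f in token_probs), 2 * n)
    return (1.0 + h - s) / 2.0


# Hypothetical messages: 20 strongly spammy tokens, with and without
# 20 strongly hammy "random word" tokens appended.
spam_only = [0.99] * 20
spam_plus_random = [0.99] * 20 + [0.01] * 20

print("spam only:        %.6f" % fisher_score(spam_only))
print("spam plus random: %.6f" % fisher_score(spam_plus_random))
```

With only the spammy tokens, the score comes out very close to 1. Once an
equal number of strongly hammy tokens is appended, both tail probabilities
collapse toward zero, they cancel in (1 + h - s) / 2, and the score lands
at essentially 0.5, which is the "can't say" case.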

I have tried using these flags before, but I am not sure how to read the
output since I don't know exactly what the histogram is plotting and I'm
not sure what the columns are in -vvv.  Can the man page be updated to
explain the output of the various -v flags in more detail?

> You ask about improving bogofilter's detection of spam with random
> words.  If you have an archive with several thousand ham and spam
> messages, you can run bogotune to compute a set of parameters customized
> for _your_ environment and for _your_ mix of ham and spam.

I can't use bogotune yet since I just started using bogofilter and
haven't saved enough spam yet to reach the minimum 2k threshold.  It
would be nice if bogotune included a flag to disable this enforced
minimum.  New users could then at least start with some numbers besides
the built-in defaults, even though they may not be as accurate as if
they had waited until the 2k limit.  They can always re-run bogotune
later once they build up enough spam.  Depending on how much spam
someone receives daily, it could take weeks or months to reach this
minimum, and during that time the user can only guess at the parameters
or stick with the built-in defaults.

> Hope this helps!
> 
> David
> 

Thanks again,
~Jason





