What is a spamicity of exactly 0.5?

Sun Jan 25 23:53:55 CET 2004

On Sun, 25 Jan 2004 17:42:02 -0500
Jason A. Smith wrote:

> Thanks for the explanations, both David and pi.
> 
> On Sun, 2004-01-25 at 09:33, David Relson wrote:
> > On Sun, 25 Jan 2004 08:50:21 -0500
> > Jason A. Smith wrote:
> > 
> > > All of the spam that I get with a lot of random words appended to
> > > > the end gets a spamicity score of exactly 0.5.  Why is this
> > > happening and what does that score mean?  I don't understand why
> > > they don't get scored as spam since most are advertising the exact
> > > same website and come from the same source.  Shouldn't those few
> > > known spam tokens outweigh the random words?  Is there anything
> > > that I can do to improve bogofilter's detection of spam like that
> > > with random words?
> > > 
> > > ~Jason
> > 
> > Welcome Jason,
> > 
> > Good questions!
> > 
> > The Robinson-Fisher algorithm has, as its last step, a chi-square
> > test which computes a certainty level based on the computed score
> > and the number of tokens.  When there are a lot of spam tokens _and_
> > a lot of ham tokens in the message, the result is often 0.500000. 
> > Such a score means that the computation can't say (with any level of
> > certainty) whether the message is ham or spam.
> > 
> > If you want to see more about how a particular message is scored,
> > run bogofilter with "-vv" to generate a histogram or with "-vvv" to
> > generate a list of all the tokens and their individual scores.
> 
> I have tried using these flags before, but I am not sure how to read
> the output since I don't know exactly what the histogram is plotting
> and I'm not sure what the columns are in -vvv.  Can the man page be
> updated to explain the output of the various -v flags in more detail?

They're described in the FAQ.
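
To illustrate why a message with many strong spam tokens _and_ many
strong ham tokens lands on 0.5, here is a minimal Python sketch of
Fisher-style chi-square combining along the lines Gary Robinson
describes.  It is not bogofilter's actual code: the function names
(chi2q, fisher_score) and the token probabilities are made up for the
example, and it ignores details such as bogofilter's handling of
tokens whose scores are close to 0.5.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; valid for even dof only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0, so clamp it.
    return min(total, 1.0)

def fisher_score(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    ln_f = sum(math.log(p) for p in probs)
    ln_1mf = sum(math.log(1.0 - p) for p in probs)
    # Small when the message contains many strongly hammy tokens.
    q_f = chi2q(-2.0 * ln_f, 2 * n)
    # Small when the message contains many strongly spammy tokens.
    q_1mf = chi2q(-2.0 * ln_1mf, 2 * n)
    return (1.0 + q_f - q_1mf) / 2.0

# Hypothetical token probabilities, chosen only for illustration.
spam_tokens = [0.99] * 20
ham_tokens = [0.01] * 20

print("spam only:", fisher_score(spam_tokens))
print("ham only: ", fisher_score(ham_tokens))
print("mixed:    ", fisher_score(spam_tokens + ham_tokens))
```

With only spammy tokens the score comes out near 1, with only hammy
tokens near 0.  In the mixed case both tail probabilities collapse
toward zero, so the combined value sits at about 0.5: strong evidence
in both directions cancels out instead of averaging toward spam.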

> > You ask about improving bogofilter's detection of spam with random
> > words.  If you have an archive with several thousand ham and spam
> > messages, you can run bogotune to compute a set of parameters
> > customized for _your_ environment and for _your_ mix of ham and
> > spam.
> 
> I can't use bogotune yet since I just started using bogofilter and
> haven't saved enough spam yet to reach the min 2k threshold.  It would
> be nice if bogotune included a flag to disable this enforced minimum. 
> New users could then at least start with some numbers besides the
> built in defaults, even though they may not be as accurate as if they
> had waited till the 2k limit.  They can always re-run bogotune later
> once they build up enough spam.  Depending on how much spam someone
> receives daily, it could take weeks or months to reach this minimum
> and during that time the user can only guess at the parameters or
> stick with the built in defaults.

Hi Jason,

The built-in defaults are more than adequate.  They'll do a very good job
for you.  They're what I used for many months.  

Bogotune used to be a Perl script which ran slowly because it had to run
bogofilter many times for each message.  Switching bogotune from Perl
to C allowed the parsed messages to be cached in RAM and sped up the
whole process by a factor of 100 or so.

Given sufficient information, bogotune will generate better parameters
than the default ones.  The word "sufficient" is the key.  For someone
with low volume, the default parameters will be fine.  Using bogotune to
optimize is like adding chocolate icing to a chocolate cake -- it's
already good and gets even better :-)

David