histogram of wordlist.db

Sat Jan 3 18:49:54 CET 2004

On Sat, 3 Jan 2004 16:01:25 +0100
Matthias Andree <matthias.andree at gmx.de> wrote:

> 
> I'm willing to forward some of the spam off-list.
> 
> The histogram of such spam looks like this:
> 
> X-Bogosity: No, tests=bogofilter, spamicity=0.500000,
>     version=0.16.0.cvs.CVStime_20040102_163533
>    int  cnt   prob  spamicity histogram
>   0.00   95 0.006758 0.002128 ################################
>   0.10   10 0.159268 0.007608 ####
>   0.20   26 0.260562 0.030739 #########
>   0.30   42 0.356437 0.078117 ###############
>   0.40    0 0.000000 0.078117 
>   0.50    0 0.000000 0.078117 
>   0.60   33 0.642097 0.149531 ############
>   0.70   33 0.740073 0.220945 ############
>   0.80   24 0.847833 0.276591 #########
>   0.90  143 0.980940 0.515557 ################################################

Matthias,

Have you thought about running bogotune?  A tuned set of parameters will
help bogofilter do a better job.  With my wordlist histogram, bogotune
recommends a min_dev of 0.435, so the score is based only on those tokens
which are very strongly hammish or spammish.

> This is the kind of Bayes evasion spam, it fills your data base with
> junk you'll never see again (but which looks like natural language
> tokens) and that isn't recognized the first time you'll see such a
> spam.

One technique would be to score each MIME part separately, select the
part with the most indicative score, merge its tokens with the header
tokens and then compute the overall score for the message.  This would
eliminate the effect of an extra, unrelated MIME part.
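
A minimal sketch of that idea in Python, not bogofilter code.  The names
score_message and toy_score are hypothetical, "most indicative" is
assumed to mean "score farthest from 1/2", and toy_score is only a
stand-in for the real scoring function:

```python
EVEN_ODDS = 0.5

# Toy per-token spam probabilities, made up for this illustration.
TOY_PROBS = {"viagra": 0.99, "meeting": 0.01, "the": 0.5}


def toy_score(tokens):
    """Stand-in scorer: mean of per-token probabilities (unknown = 0.5)."""
    if not tokens:
        return EVEN_ODDS
    return sum(TOY_PROBS.get(t, EVEN_ODDS) for t in tokens) / len(tokens)


def score_message(header_tokens, parts, score=toy_score):
    """Score a message using only its most indicative MIME part.

    header_tokens: list of tokens from the headers
    parts: list of token lists, one per MIME part
    score: callable mapping a token list to a spamicity in [0, 1]
    """
    if not parts:
        return score(list(header_tokens))
    # Assumption: "most indicative" means farthest from EVEN_ODDS.
    best = max(parts, key=lambda toks: abs(score(toks) - EVEN_ODDS))
    # Merge the chosen part's tokens with the header tokens and rescore.
    return score(list(header_tokens) + list(best))


headers = ["the"]
parts = [["viagra", "viagra"], ["the", "the", "lorem"]]
per_part = score_message(headers, parts)
whole = toy_score(headers + parts[0] + parts[1])
print(per_part, whole)
```

With these toy numbers the filler part no longer dilutes the score: the
per-part result is higher than the score over all tokens at once.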

> Oh, and it's about the only spam that makes it past
> bogofilter+spamassassin.  I've tried to  update SA, but its self-test
> fails, so I'm not going to install it.
> 
> I wonder if the "unknown token" score, currently 0.415, should be
> raised as more messages are in the data base, or if this can be coerced
> into the model somehow, since "I haven't yet seen this token in ten
> thousand mails" certainly is information we don't handle gently at the
> moment.

Bogotune can answer that question.  So long as robx (the name of the
"unknown token" score) lies within min_dev (default 0.1) of EVEN_ODDS,
i.e. 1/2, unknown tokens aren't included in the message's score.  In a
histogram, the lines with cnt==0 show the effect of min_dev.  With the
default parameters, which you're using, the cnt==0 lines are 0.40 and
0.50, which is exactly the expected effect of robx=0.415 and
min_dev=0.1.
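
As a rough illustration of that rule (a sketch, not bogofilter's actual
code; the function name token_is_scored is made up, and whether the
comparison at the exact boundary is inclusive is an assumption):

```python
EVEN_ODDS = 0.5


def token_is_scored(prob, min_dev=0.1):
    """Return True if a token's probability counts toward the score.

    A token is used only when its probability is at least min_dev away
    from EVEN_ODDS (boundary handling here is an assumption).
    """
    return abs(prob - EVEN_ODDS) >= min_dev


ROBX = 0.415

# With the defaults, an unknown token (prob == robx) is ignored ...
print(token_is_scored(ROBX))
# ... while tokens from the 0.30 and 0.60 histogram lines are used.
print(token_is_scored(0.356437), token_is_scored(0.642097))
# With min_dev=0.435, only very strong tokens are left.
print(token_is_scored(0.847833, min_dev=0.435))
print(token_is_scored(0.980940, min_dev=0.435))
```

Since |0.415 - 0.5| = 0.085 is less than 0.1, the first call reports
that unknown tokens are left out under the default settings.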

David