bogotune suggests spam_cutoff of 0?

David Relson relson at osagesoftware.com
Wed Apr 21 01:22:26 CEST 2010


Hello Jonathan,

The credit for bogotune belongs to Greg Louis, who's the bogofilter
team member with the most advanced math skills.  The initial
implementation was in R and I translated that into C (with Greg's
advice and assistance).  Sad to say, I do not have a good grasp of
the math concepts behind this tool.  My understanding is that, for each
of bogofilter's major tuning parameters, a range of values is tried
(giving a multidimensional search) and then local maxima/minima are
found.  After the coarse scan, a fine scan is run and the best result
is output.
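
For illustration only, here is a rough Python sketch of the
coarse-then-fine search idea described above.  It is not bogotune's
code.  The function names, the grid sizes and the toy cost function are
all made up for the example, and the real tool scores each parameter
set against the test messages rather than against a formula.

```python
import itertools

def grid(lo, hi, steps):
    """Return `steps` evenly spaced values from lo to hi inclusive."""
    if steps == 1:
        return [lo]
    width = (hi - lo) / (steps - 1)
    return [lo + i * width for i in range(steps)]

def scan(cost, axes):
    """Try every combination of axis values; return (best_cost, best_point)."""
    return min((cost(p), p) for p in itertools.product(*axes))

def coarse_then_fine(cost, bounds, coarse_steps=5, fine_steps=9):
    # Coarse scan: a wide grid over the full range of each parameter.
    coarse_axes = [grid(lo, hi, coarse_steps) for lo, hi in bounds]
    _, best = scan(cost, coarse_axes)
    # Fine scan: a denser grid reaching one coarse cell to either side of
    # the coarse winner, clipped to the original bounds.
    fine_axes = []
    for value, (lo, hi) in zip(best, bounds):
        cell = (hi - lo) / (coarse_steps - 1)
        fine_axes.append(
            grid(max(lo, value - cell), min(hi, value + cell), fine_steps))
    return scan(cost, fine_axes)

# Hypothetical stand-in for "how bad is this parameter set":
# a bowl with its minimum at (0.3, 0.7).
def toy_cost(p):
    x, y = p
    return (x - 0.3) ** 2 + (y - 0.7) ** 2

bounds = [(0.0, 1.0), (0.0, 1.0)]
best_cost, best_point = coarse_then_fine(toy_cost, bounds)
print(best_point, best_cost)
```

The coarse pass only gets near the minimum; the fine pass spends its
evaluations in the neighbourhood the coarse pass picked out, which is
cheaper than running one dense grid over the whole range.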

Looking at your results, I notice the warning "test messages include
many high scoring nonspam".  The results show three high scoring
nonspam and one zero scoring spam.  Is it possible that some
misclassifications are present in your data?  What happens if you
"clean" your test messages by removing the high scoring nonspam?  You
also have a low scoring spam.

'Tis possible that the 0.000000 spam is triggering the 0 spam_cutoff
value.  What happens if you remove that message from your inputs?
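
As a sketch of the "cleaning" step, assuming you already have one score
per message in each archive, something like the Python below would
pick out the suspects to review by hand.  The function name and the 0.9
and 0.1 limits are arbitrary choices for the example, not values
bogotune uses.  The first three ham scores and the first spam score
come from the report; the rest are invented filler.

```python
def flag_suspects(ham_scores, spam_scores, ham_limit=0.9, spam_limit=0.1):
    """Return (high scoring ham, low scoring spam) for manual review."""
    high_ham = sorted((s for s in ham_scores if s >= ham_limit), reverse=True)
    low_spam = sorted(s for s in spam_scores if s <= spam_limit)
    return high_ham, low_spam

# Scores taken from the warning, padded with made-up "normal" scores.
ham = [1.000000, 0.992545, 0.992532, 0.12, 0.03]
spam = [0.000000, 0.97, 0.99]

high_ham, low_spam = flag_suspects(ham, spam)
print("ham to recheck: ", high_ham)
print("spam to recheck:", low_spam)
```

Messages flagged this way are candidates for reclassifying or removing
before rerunning bogotune, so you can see whether the odd cutoff goes
away.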

I get at least two different flavors of spam.  One type is the
bigger/faster whatever baldness remedy (presumably totally bogus scam
message).  Another is the (possibly legitimate) product offer for a
real product in which I've never had an interest.  The first type tries
to avoid spam filters because it's BS, while the second uses normal
(hammy) wording.  Trying to catch both types is challenging because
they're so different.

Anyhow, enough rambling...

HTH,

David

P.S.  I enjoyed your postings to the Risks list.

On Tue, 20 Apr 2010 14:28:24 -0400
Jonathan Kamens wrote:

> As the bogotune output below shows (bogofilter 1.2.0), it seems to
> have run properly, but at the end, it recommended a spam_cutoff value
> of 0.000000. That seems absurdly wrong.
> 
>  
> 
> Both the notspam and bogospam archives fed into bogotune are correct,
> i.e., the notspam archive contains only ham and the bogospam archive
> contains only spam.
> 
> 
> Does anybody have any idea what's up with this?  I've never seen it
> before.
> 
>  
> 
> Thanks,
> 
>  
> 
>   Jik
> 
>  
> 
> + bogotune -D -T 0 -n /tmp/notspam -s /tmp/bogospam
> 
> Warning: test messages include many high scoring nonspam.
> 
>          You may wish to reclassify them and rerun.
> 
>     high ham scores:
> 
>        1 1.000000
> 
>        2 0.992545
> 
>        3 0.992532
> 
>     low spam scores:
> 
>        1 0.000000
> 
> Initial x value is 0.520000
> 
> False-positive target is 3 (cutoff 0.975000)
> 
> Performing coarse scan:
> 
> 2940
> [......................................................................]
> 
> Top ten parameter sets from this scan:
> 
>         rs     md    rx    spesf    nsesf    co     fp  fn   fppc    fnpc
>  2096 0.0100 0.060 0.570 0.007517 0.100113 0.6310    3   4  1.2000  1.1799
>  2095 0.0100 0.060 0.570 0.007517 0.237305 0.6363    3   4  1.2000  1.1799
>  2103 0.0100 0.060 0.570 0.003171 0.100113 0.6570    3   4  1.2000  1.1799
>  2094 0.0100 0.060 0.570 0.007517 0.562500 0.6648    3   4  1.2000  1.1799
>  2102 0.0100 0.060 0.570 0.003171 0.237305 0.6789    3   4  1.2000  1.1799
>  1982 0.0100 0.060 0.520 0.042235 0.562500 0.5049    3   5  1.2000  1.4749
>  2341 0.0100 0.140 0.420 0.007517 0.100113 0.5064    3   5  1.2000  1.4749
>  1268 0.1000 0.140 0.470 0.003171 0.562500 0.5084    3   5  1.2000  1.4749
>  1353 0.1000 0.140 0.420 0.017818 0.237305 0.5119    3   5  1.2000  1.4749
>  1465 0.1000 0.220 0.470 0.003171 0.237305 0.5151    3   5  1.2000  1.4749
> 
>  
> 
> Minimum found at s 0.0100, md 0.060, x 0.570, spesf 0.007517, nsesf 0.100113
> 
>         fp 3 (1.2000%), fn 4 (1.1799%)
> 
>  
> 
> Performing fine scan:
> 
> 4410
> [......................................................................]
> 
> Top ten parameter sets from this scan:
> 
>         rs     md    rx    spesf    nsesf    co     fp  fn   fppc    fnpc
>  3895 0.0100 0.062 0.596 0.007517 0.115600 0.6441    3   2  1.2000  0.5900
>  3903 0.0100 0.062 0.596 0.006510 0.100113 0.6443    3   2  1.2000  0.5900
>  3894 0.0100 0.062 0.596 0.007517 0.133484 0.6457    3   2  1.2000  0.5900
>  3886 0.0100 0.062 0.596 0.008680 0.154134 0.6463    3   2  1.2000  0.5900
>  4131 0.0100 0.076 0.596 0.008680 0.154134 0.6356    3   3  1.2000  0.8850
>  4132 0.0100 0.076 0.596 0.008680 0.133484 0.6359    3   3  1.2000  0.8850
>  4124 0.0100 0.076 0.596 0.010023 0.154134 0.6362    3   3  1.2000  0.8850
>  4139 0.0100 0.076 0.596 0.007517 0.133484 0.6363    3   3  1.2000  0.8850
>  4140 0.0100 0.076 0.596 0.007517 0.115600 0.6364    3   3  1.2000  0.8850
>  4125 0.0100 0.076 0.596 0.010023 0.133484 0.6366    3   3  1.2000  0.8850
> 
>  
> 
> 256 outliers encountered.
> 
> Minimum found at s 0.0100, md 0.048, x 0.570, spesf 0.004882, nsesf 0.133484
> 
>         fp 3 (1.2000%), fn 4 (1.1799%)
> 
>  
> 
> Performing final scoring:
> 
> Spam...  Non-Spam...
> 
> 0.002138 0.723810
> 
> 0.495063 0.723777
> 
> 0.609320 0.650470
> 
> 0.633773 0.638916
> 
> 0.656996 0.627063
> 
> 0.666932 0.619914
> 
> 0.675344 0.613948
> 
> 0.711605 0.589086
> 
> 0.715091 0.554175
> 
> 0.723861 0.495431
> 
>  
> 
> Recommendations:
> 
>  
> 
> ---cut---
> 
> db_cachesize=4
> 
> robs=0.0100
> 
> min_dev=0.048
> 
> robx=0.570000
> 
> sp_esf=0.004882
> 
> ns_esf=0.133484
> 
> spam_cutoff=0.000000    # for 0.00% fp (0); expect 0.00% fn (0).
> 
> ham_cutoff=0.100
> 
> ---cut---
> 
>  
> 
> Tuning completed.
> 
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter


