bogotune suggests spam_cutoff of 0?

Jonathan Kamens jik at kamens.brookline.ma.us
Wed Apr 21 15:28:28 CEST 2010


I'm pretty sure that there are no misclassified messages in my ham and spam
files.

I can't find out what happens if I remove the high-scoring ham or
low-scoring spam messages from my files before training, because I don't
know which messages those are.  This is because bogotune is building its
own word list rather than using the one I've already got, and the scores
it's reporting are based on that internal word list rather than my actual
word list.  As a result, there's no way for me to figure out which messages
it's claiming have those scores.

I think the short answer is that I need to wait until I've got a large
enough body of ham and spam messages to be able to tune using my real word
list instead of the one built by bogotune.

Thanks for your help.

  jik

-----Original Message-----
From: bogofilter-bounces+jik=kamens.brookline.ma.us at bogofilter.org
[mailto:bogofilter-bounces+jik=kamens.brookline.ma.us at bogofilter.org] On
Behalf Of David Relson
Sent: Tuesday, April 20, 2010 7:22 PM
Cc: 'bf-users'
Subject: Re: bogotune suggests spam_cutoff of 0?

Hello Jonathan,

The credit for bogotune belongs to Greg Louis, who's the bogofilter
team member with the most advanced math skills.  The initial
implementation was in R and I translated that into C (with Greg's
advice and assistance).  Sad to say, I do not have a good grasp of
the math concepts behind this tool.  My understanding is that, for each
of bogofilter's major tuning parameters, a range of values is tried
(giving a multidimensional search) and then local maxima/minima are
found.  After the coarse scan, a fine scan is run and the best result
is output.
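
The coarse-then-fine idea described above can be sketched roughly as
follows.  This is only an illustration of a two-stage grid search, not
bogotune's actual code: the function names, the toy objective, the
bounds and the grid sizes are all assumptions made for the example.

```python
import itertools


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def grid_search(objective, axes):
    """Try every combination of axis values; return (best_params, best_value)."""
    best = None
    for params in itertools.product(*axes):
        value = objective(params)
        if best is None or value < best[1]:
            best = (params, value)
    return best


def coarse_then_fine(objective, bounds, coarse_points=5, fine_points=5):
    """Coarse scan over the full bounds, then a fine scan around the coarse minimum."""
    coarse_axes = [linspace(lo, hi, coarse_points) for lo, hi in bounds]
    coarse_best, coarse_value = grid_search(objective, coarse_axes)

    fine_axes = []
    for (lo, hi), centre in zip(bounds, coarse_best):
        half = (hi - lo) / (coarse_points - 1)  # width of one coarse step
        fine_axes.append(
            linspace(max(lo, centre - half), min(hi, centre + half), fine_points)
        )
    fine_best, fine_value = grid_search(objective, fine_axes)
    return coarse_best, coarse_value, fine_best, fine_value


def toy_objective(params):
    """Hypothetical stand-in for an error count; smallest at (0.57, 0.06)."""
    x, md = params
    return (x - 0.57) ** 2 + (md - 0.06) ** 2


# Hypothetical parameter ranges, chosen only for the example.
coarse_best, coarse_value, fine_best, fine_value = coarse_then_fine(
    toy_objective, [(0.0, 1.0), (0.0, 0.5)]
)
print("coarse:", coarse_best, "fine:", fine_best)
```

The fine scan revisits the coarse minimum itself, so in this sketch it
can never end up worse than the coarse result.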

Looking at your results, I notice the "warning: test messages include
many high scoring nonspam".  The results show 3 such high nonspam and
a zero scoring spam.  Is it possible that some misclassifications are
present in your data? What happens if you "clean" your test messages by
removing the high scoring nonspam? You also have a low sc

'Tis possible that the 0.000000 spam is triggering the 0 spam_cutoff
value.  What happens if you remove that message from your inputs?
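
If you can get a score for each individual message, picking out the
suspects is easy.  A minimal sketch, assuming you already have
(message id, label, score) triples from somewhere; the ids, the two
thresholds and the function name are made up for illustration and are
not what bogotune uses for its warning:

```python
def flag_outliers(scored, high_ham=0.9, low_spam=0.1):
    """Return (high-scoring ham, low-scoring spam) as lists of (score, id).

    scored is a list of (message_id, label, score) with label "ham" or
    "spam".  The thresholds are assumptions, not bogotune's criteria.
    """
    high = sorted(
        ((score, msg_id) for msg_id, label, score in scored
         if label == "ham" and score >= high_ham),
        reverse=True,
    )
    low = sorted(
        (score, msg_id) for msg_id, label, score in scored
        if label == "spam" and score <= low_spam
    )
    return high, low


# Hypothetical sample data; the message ids are invented.
sample = [
    ("ham-001", "ham", 1.000000),
    ("ham-002", "ham", 0.992545),
    ("ham-003", "ham", 0.120000),
    ("spam-001", "spam", 0.000000),
    ("spam-002", "spam", 0.998000),
]

high, low = flag_outliers(sample)
for score, msg_id in high:
    print("high-scoring ham:", msg_id, score)
for score, msg_id in low:
    print("low-scoring spam:", msg_id, score)
```

The flagged messages are the ones to inspect by hand before rerunning.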

I get at least two different flavors of spam.  One type is the
bigger/faster whatever baldness remedy (presumably totally bogus scam
message).  Another is the (possibly legitimate) product offer for a
real product in which I've never had an interest.  The first type tries
to avoid spam filters because it's BS, while the second uses normal
(hammy) wording.  Trying to catch both types is challenging because
they're so different.

Anyhow, enough rambling...

HTH,

David

P.S.  I enjoyed your postings to the Risks list.

On Tue, 20 Apr 2010 14:28:24 -0400
Jonathan Kamens wrote:

> As the bogotune output below shows (bogofilter 1.2.0), it seems to
> have run properly, but at the end, it recommended a spam_cutoff value
> of 0.000000. That seems absurdly wrong.
> 
>  
> 
> Both the notspam and bogospam archives fed into bogotune are correct,
> i.e., the notspam archive contains only ham and the bogospam archive
> contains only spam.
> 
> 
> Does anybody have any idea what's up with this?  I've never seen it
> before.
> 
>  
> 
> Thanks,
> 
>  
> 
>   Jik
> 
>  
> 
> + bogotune -D -T 0 -n /tmp/notspam -s /tmp/bogospam
> 
> Warning: test messages include many high scoring nonspam.
> 
>          You may wish to reclassify them and rerun.
> 
>     high ham scores:
> 
>        1 1.000000
> 
>        2 0.992545
> 
>        3 0.992532
> 
>     low spam scores:
> 
>        1 0.000000
> 
> Initial x value is 0.520000
> 
> False-positive target is 3 (cutoff 0.975000)
> 
> Performing coarse scan:
> 
> 2940
> [......................................................................]
> 
> Top ten parameter sets from this scan:
> 
>         rs     md    rx    spesf    nsesf    co     fp  fn   fppc    fnpc
>  2096 0.0100 0.060 0.570 0.007517 0.100113 0.6310    3   4  1.2000  1.1799
>  2095 0.0100 0.060 0.570 0.007517 0.237305 0.6363    3   4  1.2000  1.1799
>  2103 0.0100 0.060 0.570 0.003171 0.100113 0.6570    3   4  1.2000  1.1799
>  2094 0.0100 0.060 0.570 0.007517 0.562500 0.6648    3   4  1.2000  1.1799
>  2102 0.0100 0.060 0.570 0.003171 0.237305 0.6789    3   4  1.2000  1.1799
>  1982 0.0100 0.060 0.520 0.042235 0.562500 0.5049    3   5  1.2000  1.4749
>  2341 0.0100 0.140 0.420 0.007517 0.100113 0.5064    3   5  1.2000  1.4749
>  1268 0.1000 0.140 0.470 0.003171 0.562500 0.5084    3   5  1.2000  1.4749
>  1353 0.1000 0.140 0.420 0.017818 0.237305 0.5119    3   5  1.2000  1.4749
>  1465 0.1000 0.220 0.470 0.003171 0.237305 0.5151    3   5  1.2000  1.4749
> 
>  
> 
> Minimum found at s 0.0100, md 0.060, x 0.570, spesf 0.007517, nsesf 0.100113
> 
>         fp 3 (1.2000%), fn 4 (1.1799%)
> 
>  
> 
> Performing fine scan:
> 
> 4410
> [......................................................................]
> 
> Top ten parameter sets from this scan:
> 
>         rs     md    rx    spesf    nsesf    co     fp  fn   fppc    fnpc
>  3895 0.0100 0.062 0.596 0.007517 0.115600 0.6441    3   2  1.2000  0.5900
>  3903 0.0100 0.062 0.596 0.006510 0.100113 0.6443    3   2  1.2000  0.5900
>  3894 0.0100 0.062 0.596 0.007517 0.133484 0.6457    3   2  1.2000  0.5900
>  3886 0.0100 0.062 0.596 0.008680 0.154134 0.6463    3   2  1.2000  0.5900
>  4131 0.0100 0.076 0.596 0.008680 0.154134 0.6356    3   3  1.2000  0.8850
>  4132 0.0100 0.076 0.596 0.008680 0.133484 0.6359    3   3  1.2000  0.8850
>  4124 0.0100 0.076 0.596 0.010023 0.154134 0.6362    3   3  1.2000  0.8850
>  4139 0.0100 0.076 0.596 0.007517 0.133484 0.6363    3   3  1.2000  0.8850
>  4140 0.0100 0.076 0.596 0.007517 0.115600 0.6364    3   3  1.2000  0.8850
>  4125 0.0100 0.076 0.596 0.010023 0.133484 0.6366    3   3  1.2000  0.8850
> 
>  
> 
> 256 outliers encountered.
> 
> Minimum found at s 0.0100, md 0.048, x 0.570, spesf 0.004882, nsesf 0.133484
> 
>         fp 3 (1.2000%), fn 4 (1.1799%)
> 
>  
> 
> Performing final scoring:
> 
> Spam...  Non-Spam...
> 
> 0.002138 0.723810
> 
> 0.495063 0.723777
> 
> 0.609320 0.650470
> 
> 0.633773 0.638916
> 
> 0.656996 0.627063
> 
> 0.666932 0.619914
> 
> 0.675344 0.613948
> 
> 0.711605 0.589086
> 
> 0.715091 0.554175
> 
> 0.723861 0.495431
> 
>  
> 
> Recommendations:
> 
>  
> 
> ---cut---
> 
> db_cachesize=4
> 
> robs=0.0100
> 
> min_dev=0.048
> 
> robx=0.570000
> 
> sp_esf=0.004882
> 
> ns_esf=0.133484
> 
> spam_cutoff=0.000000    # for 0.00% fp (0); expect 0.00% fn (0).
> 
> ham_cutoff=0.100
> 
> ---cut---
> 
>  
> 
> Tuning completed.
> 
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter