troublesome false negative

David Relson relson at osagesoftware.com
Mon Nov 4 23:09:56 CET 2002


At 09:06 AM 11/4/02, you wrote:

>On 20021104 (Mon) at 0750:18 -0500, David Relson wrote:
>
> > This morning I ran the message with ROBX values of 0.200 and 0.400 and
> > using Graham.  Here are the 3 status lines:
> >
> >       X-Bogosity: No,  tests=bogofilter, spamicity=0.478232
> >       X-Bogosity: No,  tests=bogofilter, spamicity=0.495448
> >       X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000
> >
>
> > FWIW, the calculated .ROBX for my wordlist is approx 0.19.
>
>This should be calculated with scaled counts when the wordlist sizes
>differ.

What do you mean by "scaled counts"?  The 0.19 figure is what both the 
robx.pl script and bogoutil's -x option calculate.  If it needs to be 
scaled, then something is missing in bogofilter.
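
For anyone following along, here is a rough sketch of where x enters the 
score.  It assumes Robinson's f(w) formula; the function name and the s=1.0 
weight are illustrative only and are not taken from bogofilter's source.

```python
def robinson_fw(p_w, n, x, s=1.0):
    """Robinson's f(w): blend the raw token estimate p_w with the prior x.

    n is how many times the token has been seen; s is the weight given to
    the prior x (s=1.0 is an illustrative value, not bogofilter's default).
    """
    return (s * x + n * p_w) / (s + n)


# An unseen token (n=0) scores exactly x, while a well-known token barely
# moves, so changing ROBX mostly shifts the rare tokens.
for x in (0.2, 0.4):
    print(x, robinson_fw(0.9, 0, x), round(robinson_fw(0.9, 10, x), 4))
```

If this is right, a message full of rarely seen tokens would be the most 
sensitive to the choice between 0.200 and 0.400.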

> > Could the two of you generate the histograms and send them to me?  I use
> > "bogofilter -r -v -v < msg.1103.txt" to generate my histograms.
>
>I wasn't going to download rc1 but it wasn't a big deal to do so and
>build it, so...
># ./bogofilter -r -v -v </root/msg.1103.txt >/root/msg.1103.hist
>ar[131(259)]/usr/local/src/bogofilter-0.8.0.rc1
># less /root/msg.1103.hist
>version=0.8.0.rc1
>           int  cnt    prob   spamicity  histogram
>          0.00    3  0.000098  0.000011  ###
>          0.10    6  0.178340  0.030855  ######
>          0.20   47  0.205908  0.111934  #########################################
>          0.30   17  0.361586  0.153788  ###############
>          0.40   36  0.453732  0.241106  ################################
>          0.50   39  0.544669  0.321817  ##################################
>          0.60   21  0.641065  0.364417  ###################
>          0.70   20  0.752761  0.410426  ##################
>          0.80   27  0.850332  0.472689  ########################
>          0.90   58  0.974240  0.618225  ##################################################
>
>I don't know what the increasing "spamicity" figures in that column are
>telling us; I assume the difference between the final reported value of
>0.618225 and my bogofilter's report of 0.649718 is caused by the
>different x values; anyhow, that's what you get with my training set.

The spamicity column is cumulative: each row shows the message's spamicity 
after the tokens in that histogram row and all the rows above it have been 
counted.  For me, the final entry in that column matches the value in the 
X-Bogosity line.  I wonder why you're getting a different result.
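
To illustrate the running total, here is a minimal sketch.  It assumes the 
Robinson geometric-mean combination and per-token values strictly between 
0 and 1; the function names are made up for this example and the code is 
not lifted from bogofilter.

```python
import math


def robinson_spamicity(probs):
    """Combine per-token f(w) values with Robinson's geometric-mean formula."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # p grows with spammy tokens, q grows with hammy tokens.
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0


def running_spamicity(probs):
    """Spamicity after each token, taking tokens from lowest f(w) to highest."""
    ordered = sorted(probs)
    return [robinson_spamicity(ordered[:i]) for i in range(1, len(ordered) + 1)]


# Three hypothetical tokens: one hammy, one neutral, one spammy.
print(running_spamicity([0.9, 0.1, 0.5]))
```

The last value of the running list is the score for the whole message, 
which is why I expect the bottom row of the histogram to agree with the 
reported spamicity.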

Here's my output (originally posted with the message that started this thread):

[root@nic spam-fixups]# bogofilter -r -v -v < spam.1103.1838.txt
X-Bogosity: No, tests=bogofilter, spamicity=0.497731, version=0.8.0-1102.1447
#      int  cnt    prob   spamicity  histogram
#     0.00   22  0.010732  0.002003  ##################
#     0.10   15  0.167809  0.021700  ############
#     0.20   64  0.222184  0.094994  ##################################################
#     0.30   36  0.348368  0.146837  #############################
#     0.40   24  0.441698  0.186142  ###################
#     0.50   23  0.556350  0.231152  ##################
#     0.60   12  0.651680  0.257642  ##########
#     0.70   14  0.745020  0.292841  ###########
#     0.80   23  0.855185  0.356823  ##################
#     0.90   41  0.962410  0.497731  #################################


>I did
># bogofilter -R <msg.1103.txt | sort -k +5 >msg.1103.tbl
>
>There are 275 tokens, of which 107 have f(w) values under 0.5 and
>166 have values over 0.5.  Just as a rough indicator, only 6 tokens
>have f(w) values under 0.2, but 80 have f(w) values over 0.8.
>
>Wanna send me the output of the same command as seen with your word
>lists?  That might be interesting...
>
>Regards............
>--
>| G r e g  L o u i s          | gpg public key:      |
>|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |




