troublesome false negative
David Relson
relson at osagesoftware.com
Mon Nov 4 23:09:56 CET 2002
At 09:06 AM 11/4/02, you wrote:
>On 20021104 (Mon) at 0750:18 -0500, David Relson wrote:
>
> > This morning I ran the message with ROBX values of 0.200 and 0.400 and
> > using Graham. Here are the 3 status lines:
> >
> > X-Bogosity: No, tests=bogofilter, spamicity=0.478232
> > X-Bogosity: No, tests=bogofilter, spamicity=0.495448
> > X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000
> >
>
> > FWIW, the calculated .ROBX for my wordlist is approx 0.19.
>
>This should be calculated with scaled counts when the wordlist sizes
>differ.
And "scaled counts" means what? The 0.19 figure is the figure calculated
by the robx.pl script and by bogoutil's -x option. If it needs to be
scaled, then there's something missing in bogofilter.
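
My guess at what "scaled counts" could mean, as a minimal sketch: divide each token's count by the number of messages in its list before forming p(w), so that a larger spam or non-spam list does not dominate. This is an assumption on my part, not a description of what robx.pl or bogoutil actually do, and the function names below are made up for illustration.

```python
def raw_prob(spam_count, ham_count):
    """p(w) from raw token counts, ignoring how large each wordlist is."""
    return spam_count / (spam_count + ham_count)


def scaled_prob(spam_count, ham_count, spam_msgs, ham_msgs):
    """p(w) from per-message rates (hypothetical reading of "scaled counts")."""
    spam_rate = spam_count / spam_msgs   # occurrences per spam message
    ham_rate = ham_count / ham_msgs      # occurrences per non-spam message
    return spam_rate / (spam_rate + ham_rate)


def mean_scaled_prob(words, spam_msgs, ham_msgs):
    """Average scaled p(w) over (spam_count, ham_count) pairs.

    Assumes every pair has at least one nonzero count and that both
    message totals are positive.
    """
    probs = [scaled_prob(s, h, spam_msgs, ham_msgs) for s, h in words]
    return sum(probs) / len(probs)
```

With equal list sizes the two functions agree; with 100 spam messages and 1000 non-spam messages, a token seen 10 times in each list moves from 0.5 (raw) to about 0.91 (scaled).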
> > Could the two of you generate the histograms and send them to me? I use
> > "bogofilter -r -v -v < msg.1103.txt" to generate my histograms.
>
>I wasn't going to download rc1 but it wasn't a big deal to do so and
>build it, so...
># ./bogofilter -r -v -v </root/msg.1103.txt >/root/msg.1103.hist
>ar[131(259)]/usr/local/src/bogofilter-0.8.0.rc1
># less /root/msg.1103.hist
>version=0.8.0.rc1
> int cnt prob spamicity histogram
> 0.00 3 0.000098 0.000011 ###
> 0.10 6 0.178340 0.030855 ######
> 0.20 47 0.205908 0.111934 #########################################
> 0.30 17 0.361586 0.153788 ###############
> 0.40 36 0.453732 0.241106 ################################
> 0.50 39 0.544669 0.321817 ##################################
> 0.60 21 0.641065 0.364417 ###################
> 0.70 20 0.752761 0.410426 ##################
> 0.80 27 0.850332 0.472689 ########################
> 0.90 58 0.974240 0.618225 ##################################################
>
>I don't know what the increasing "spamicity" figures in that column are
>telling us; I assume the difference between the final reported value of
>0.618225 and my bogofilter's report of 0.649718 is caused by the
>different x values; anyhow, that's what you get with my training set.
The purpose of the spamicity column is to show how the histogram entries
contribute to the final spamicity of the message. For me, the final
spamicity entry is what's shown in the X-Bogosity message. I wonder why
you're getting a different result.
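
To illustrate the idea of a running spamicity column, here is a rough sketch. It assumes Robinson's geometric-mean combining, S = (1 + (P - Q)/(P + Q))/2, and assumes the running value uses the message's total token count as the root at every row. Neither assumption has been checked against the bogofilter source, and the function names are hypothetical.

```python
import math

EPS = 1e-12  # guard against log(0) for probabilities of exactly 0 or 1


def combine(log_p, log_q, n):
    """Robinson-style score from summed logs of p and (1 - p) over n tokens."""
    P = 1.0 - math.exp(log_q / n)   # 1 - nth root of prod(1 - p)
    Q = 1.0 - math.exp(log_p / n)   # 1 - nth root of prod(p)
    if P + Q == 0.0:
        return 0.5                  # no evidence either way
    return (1.0 + (P - Q) / (P + Q)) / 2.0


def robinson_spamicity(probs):
    """Score for the whole message in one step."""
    if not probs:
        return 0.5
    log_p = sum(math.log(max(p, EPS)) for p in probs)
    log_q = sum(math.log(max(1.0 - p, EPS)) for p in probs)
    return combine(log_p, log_q, len(probs))


def histogram(probs, buckets=10):
    """Rows of (interval start, count, mean prob, running spamicity)."""
    n = len(probs)
    if n == 0:
        return []
    groups = [[] for _ in range(buckets)]
    for p in probs:
        groups[min(int(p * buckets), buckets - 1)].append(p)
    rows = []
    log_p = log_q = 0.0
    for i, group in enumerate(groups):
        # Fold this interval's tokens into the running sums.
        for p in group:
            log_p += math.log(max(p, EPS))
            log_q += math.log(max(1.0 - p, EPS))
        mean = sum(group) / len(group) if group else 0.0
        rows.append((i / buckets, len(group), mean, combine(log_p, log_q, n)))
    return rows
```

Under these assumptions the last row's running value equals the score for the whole message, which is the behavior I'd expect from the X-Bogosity line.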
Here's my output (originally posted with the message that started this thread):
[root at nic spam-fixups]# bogofilter -r -v -v < spam.1103.1838.txt
X-Bogosity: No, tests=bogofilter, spamicity=0.497731, version=0.8.0-1102.1447
# int cnt prob spamicity histogram
# 0.00 22 0.010732 0.002003 ##################
# 0.10 15 0.167809 0.021700 ############
# 0.20 64 0.222184 0.094994 ##################################################
# 0.30 36 0.348368 0.146837 #############################
# 0.40 24 0.441698 0.186142 ###################
# 0.50 23 0.556350 0.231152 ##################
# 0.60 12 0.651680 0.257642 ##########
# 0.70 14 0.745020 0.292841 ###########
# 0.80 23 0.855185 0.356823 ##################
# 0.90 41 0.962410 0.497731 #################################
>I did
># bogofilter -R <msg.1103.txt | sort -k +5 >msg.1103.tbl
>
>There are 275 tokens, of which 107 have f(w) values under 0.5 and
>166 have values over 0.5. Just as a rough indicator, only 6 tokens
>have f(w) values under 0.2, but 80 have f(w) values over 0.8.
>
>Wanna send me the output of the same command as seen with your word
>lists? That might be interesting...
>
>Regards............
>--
>| G r e g L o u i s | gpg public key: |
>| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
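
The rough tally quoted above (tokens under and over 0.5, under 0.2, over 0.8) can be reproduced from a list of f(w) values with a few lines. This sketch assumes the f(w) values have already been extracted from the "bogofilter -R" output into a list of floats; the parsing step is left out because the column layout is not shown here, and the function name is made up.

```python
def tally(fw_values):
    """Count tokens on either side of the thresholds used in the thread."""
    return {
        "total": len(fw_values),
        "under_0.5": sum(1 for v in fw_values if v < 0.5),
        "over_0.5": sum(1 for v in fw_values if v > 0.5),
        "under_0.2": sum(1 for v in fw_values if v < 0.2),
        "over_0.8": sum(1 for v in fw_values if v > 0.8),
    }
```

Values of exactly 0.5 land in neither the "under" nor the "over" group, which is one way those two counts can add up to less than the total.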