preliminary splitter results
David Relson
relson at osagesoftware.com
Sun Jan 19 02:29:20 CET 2003
Hi,
This afternoon, just after Gyepi's report on the spam conference, I
received a message that bogofilter classified as Unsure. As it arrived on
my mail server, it was 74 lines (4768 bytes) of MIME multipart/mixed, with
the html base64 encoded and its text split by html comments.
This is what bogofilter-0.9.1.2 says about it:
X-Bogosity: No, tests=bogofilter, spamicity=0.535714, version=0.9.1.2
Here's what the current cvs code (with mime processing, etc) says:
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499999, version=0.9.2.cvs
Here's what my bleeding edge, under development, version says:
X-Bogosity: Yes, tests=bogofilter, spamicity=0.998592, version=0.9.2.tst
It's probably wrong to call these "splitter" results. What the code
actually does is remove html comments, so that text which was split by
html comments is rejoined and the fragments are combined into whole
tokens. Whatever it should be called, the code will be released next week.
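As a rough illustration of the comment-removal idea described above, here
is a minimal Python sketch. It is not bogofilter's code (bogofilter is
written in C); the function name strip_html_comments and the sample text
are hypothetical, and it assumes the base64 body has already been decoded
to html text.

```python
import re

# Non-greedy match so each comment is removed on its own; DOTALL lets a
# single comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Delete html comments so that fragments on either side are rejoined.

    Hypothetical helper for illustration only, not bogofilter's API.
    """
    return HTML_COMMENT.sub("", html)


# Made-up sample: words broken up by comments so that a tokenizer sees
# only meaningless fragments such as "che" and "ap".
sample = "Buy che<!-- xyz -->ap pi<!--q-->lls"
print(strip_html_comments(sample))  # prints "Buy cheap pills"
```

Once the comments are gone, a tokenizer sees "cheap" and "pills" rather
than the fragments, which is consistent with the tst version seeing more
tokens with high spam probabilities.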
David
P.S. Using Robinson-Fisher and options "-vv" gives a nice view of how
bogofilter computes the spamicity value. Below are the histograms from the
three versions. Notice how the cvs version's mime processing increases
the number of tokens seen and how the tst version sees many more tokens
having high spam probabilities:
X-Bogosity: No, tests=bogofilter, spamicity=0.535714, version=0.9.1.2
int   cnt  prob      spamicity  histogram
0.00    1  0.078825  0.035166   #
0.10    1  0.105482  0.047958   #
0.20    0  0.000000  0.000000
0.30    2  0.359764  0.172040   ##
0.40    0  0.000000  0.000000
0.50    0  0.000000  0.000000
0.60    2  0.616191  0.323661   ##
0.70    1  0.716650  0.383957   #
0.80    3  0.813375  0.509258   ###
0.90    0  0.000000  0.000000
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499999, version=0.9.2.cvs
int   cnt  prob      spamicity  histogram
0.00   13  0.000319  0.000079   #############
0.10    3  0.127952  0.007650   ###
0.20    4  0.264263  0.028891   ####
0.30    7  0.357219  0.075857   #######
0.40    0  0.000000  0.075857
0.50    0  0.000000  0.075857
0.60   11  0.636480  0.199897   ###########
0.70    6  0.740439  0.261592   ######
0.80   12  0.831541  0.364888   ############
0.90    9  0.993811  0.481582   #########
X-Bogosity: Yes, tests=bogofilter, spamicity=0.998592, version=0.9.2.tst
int   cnt  prob      spamicity  histogram
0.00    2  0.000311  0.000043   ##
0.10    1  0.105479  0.006836   #
0.20    3  0.275468  0.052960   ###
0.30   13  0.342932  0.180693   #############
0.40    0  0.000000  0.180693
0.50    0  0.000000  0.180693
0.60   14  0.643083  0.362825   ##############
0.70   13  0.749465  0.469316   #############
0.80   15  0.835697  0.548869   ###############
0.90    9  0.993962  0.629116   #########
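
For readers unfamiliar with the Robinson-Fisher method mentioned in the
P.S., the following Python sketch shows the general idea of combining
per-token spam probabilities with Fisher's chi-square technique, as
described by Gary Robinson. It is an assumption-laden illustration, not
bogofilter's implementation: the names chi2_survival and fisher_spamicity
are hypothetical, the input probabilities are made up, and the sketch does
not try to reproduce the histogram columns above.

```python
import math


def chi2_survival(x: float, dof: int) -> float:
    """P(chi-square with an even number of degrees of freedom exceeds x)."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    # Closed-form series that is valid for even degrees of freedom.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine token spam probabilities, each strictly between 0 and 1.

    Hypothetical helper; returning 0.5 for an empty list is an assumption
    made here to represent "no evidence either way".
    """
    n = len(probs)
    if n == 0:
        return 0.5
    ln_p = sum(math.log(p) for p in probs)
    ln_q = sum(math.log(1.0 - p) for p in probs)
    high = chi2_survival(-2.0 * ln_p, 2 * n)  # near 1 when tokens look spammy
    low = chi2_survival(-2.0 * ln_q, 2 * n)   # near 1 when tokens look hammy
    return (1.0 + high - low) / 2.0


# Made-up inputs: mostly spammy tokens, then mostly hammy tokens.
print(fisher_spamicity([0.99, 0.95, 0.90, 0.85, 0.60]))
print(fisher_spamicity([0.01, 0.05, 0.10, 0.15, 0.40]))
```

Because the two chi-square terms pull in opposite directions, a message
whose tokens are evenly balanced comes out near 0.5, while many tokens
with high spam probabilities push the result toward 1.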