preliminary splitter results

David Relson relson at osagesoftware.com
Sun Jan 19 02:29:20 CET 2003


Hi,

This afternoon, just after Gyepi's report on the spam conference, I 
received a message that bogofilter classified as Unsure.  As it arrived on 
my mail server, it was 74 lines (4768 bytes) of MIME multipart/mixed 
containing base64-encoded, split HTML.

This is what bogofilter-0.9.1.2 says about it:

	X-Bogosity: No, tests=bogofilter, spamicity=0.535714, version=0.9.1.2


Here's what the current CVS code (with MIME processing, etc.) says:

	X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499999, version=0.9.2.cvs


Here's what my bleeding-edge, under-development version says:

	X-Bogosity: Yes, tests=bogofilter, spamicity=0.998592, version=0.9.2.tst

It's probably wrong to call these "splitter" results, since what's really 
happening is that I'm removing HTML comments, which rejoins text that was 
split by comments by combining the fragments.  Whatever it should be 
called, the code will be released next week.
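
To illustrate the idea only (this is not the bogofilter code, which is 
written in C), here is a minimal Python sketch of comment removal.  The 
function name and the sample text are made up for the example, and it 
assumes the HTML has already been base64-decoded:

```python
import re

# Matches one HTML comment; DOTALL lets a comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text: str) -> str:
    """Delete HTML comments so the fragments on either side join up."""
    return HTML_COMMENT.sub("", text)


# Hypothetical comment-split text of the kind described above.
sample = "Low mort<!-- zq -->gage ra<!--x-->tes"
print(strip_html_comments(sample))  # prints: Low mortgage rates
```

With the comments gone, the tokenizer sees "mortgage" and "rates" as whole 
words instead of unfamiliar fragments such as "mort" and "gage".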

David

P.S.  Using Robinson-Fisher and options "-vv" gives a nice view of how 
bogofilter computes the spamicity value.  Below are the histograms from the 
three versions.  Notice how the CVS version's MIME processing increases 
the number of tokens seen and how the tst version sees many more tokens 
having high spam probabilities:

X-Bogosity: No, tests=bogofilter, spamicity=0.535714, version=0.9.1.2
	  int  cnt    prob   spamicity  histogram
	 0.00    1  0.078825  0.035166  #
	 0.10    1  0.105482  0.047958  #
	 0.20    0  0.000000  0.000000
	 0.30    2  0.359764  0.172040  ##
	 0.40    0  0.000000  0.000000
	 0.50    0  0.000000  0.000000
	 0.60    2  0.616191  0.323661  ##
	 0.70    1  0.716650  0.383957  #
	 0.80    3  0.813375  0.509258  ###
	 0.90    0  0.000000  0.000000

X-Bogosity: Unsure, tests=bogofilter, spamicity=0.499999, version=0.9.2.cvs
	  int  cnt    prob   spamicity  histogram
	 0.00   13  0.000319  0.000079  #############
	 0.10    3  0.127952  0.007650  ###
	 0.20    4  0.264263  0.028891  ####
	 0.30    7  0.357219  0.075857  #######
	 0.40    0  0.000000  0.075857
	 0.50    0  0.000000  0.075857
	 0.60   11  0.636480  0.199897  ###########
	 0.70    6  0.740439  0.261592  ######
	 0.80   12  0.831541  0.364888  ############
	 0.90    9  0.993811  0.481582  #########

X-Bogosity: Yes, tests=bogofilter, spamicity=0.998592, version=0.9.2.tst
	  int  cnt    prob   spamicity  histogram
	 0.00    2  0.000311  0.000043  ##
	 0.10    1  0.105479  0.006836  #
	 0.20    3  0.275468  0.052960  ###
	 0.30   13  0.342932  0.180693  #############
	 0.40    0  0.000000  0.180693
	 0.50    0  0.000000  0.180693
	 0.60   14  0.643083  0.362825  ##############
	 0.70   13  0.749465  0.469316  #############
	 0.80   15  0.835697  0.548869  ###############
	 0.90    9  0.993962  0.629116  #########
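
For anyone unfamiliar with how per-token probabilities turn into a single 
score, here is a rough Python sketch of Fisher's method as Gary Robinson 
describes it.  It is an illustration under assumptions, not bogofilter's 
implementation: the function names are made up, every probability is 
assumed to lie strictly between 0 and 1, and token selection and smoothing 
are left out entirely.

```python
import math


def chi2q(x: float, df: int) -> float:
    """Chi-square survival function P(X >= x) for an even number of
    degrees of freedom, using the closed-form series."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_spam = sum(math.log(p) for p in probs)
    ln_ham = sum(math.log(1.0 - p) for p in probs)
    h = chi2q(-2.0 * ln_spam, 2 * n)  # near 1 when the tokens look spammy
    s = chi2q(-2.0 * ln_ham, 2 * n)   # near 1 when the tokens look hammy
    return (1.0 + h - s) / 2.0


print(fisher_spamicity([0.99] * 10))  # close to 1
print(fisher_spamicity([0.01] * 10))  # close to 0
print(fisher_spamicity([0.5] * 8))    # 0.5
```

In this sketch a pile of tokens near either extreme pushes the score hard 
toward 0 or 1, while a balanced mix of high and low tokens lands near 0.5.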





More information about the bogofilter-dev mailing list