testing parsing changes

David Relson relson at osagesoftware.com
Fri Nov 7 23:45:38 CET 2003


Greetings,

I've just completed test runs comparing standard bogofilter
(specifically 0.15.8 as it now exists in CVS) with two slightly
modified parsers, to see how well the modified parsers did
relative to the standard one.

Modification D:

	Recognize <!DOCTYPE HTML PUBLIC.*> as the beginning of html text.

Modification T:

	Accept two character tokens, e.g. "AB", "sp", ...

I ran two sets of tests.  The first test set used my accumulated 2003
email for January through September (29,500 spam and 41,500 ham).
After running it I realized that I hadn't used my October messages so
I re-ran with the whole year's data (35,500 spam and 47,000 ham).

For testing, I took the spam and ham messages and divided them into 4
parts.  Part 1, with half the messages, was used to create the database
(wordlist.db).  Parts 2, 3, and 4 each contained 1/6 of the messages
and were used for scoring.
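
A minimal Python sketch of such a split follows.  The function name,
the shuffling, and the round-robin dealing are assumptions made for
illustration; they are not taken from the actual test scripts.

```python
import random

def split_corpus(messages, seed=0):
    """Split messages into a half-size training part and three scoring parts."""
    shuffled = list(messages)
    # Assumption: messages are shuffled before splitting; the original
    # scripts may have assigned messages to parts differently.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    train = shuffled[:half]      # part 1: builds the wordlist
    rest = shuffled[half:]
    # Deal the remaining half round-robin into parts 2, 3, and 4,
    # so each holds about 1/6 of all messages.
    scoring = [rest[i::3] for i in range(3)]
    return train, scoring

train, (part2, part3, part4) = split_corpus(range(12))
print(len(train), len(part2), len(part3), len(part4))  # 6 2 2 2
```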

Three wordlists were created, one each for standard bogofilter,
bogofilter-D (with the DOCTYPE modification), and bogofilter-T (with
the two character token modification).

To establish a baseline result, parts 2, 3, and 4 of the spam messages
were scored using bogofilter's default parameters (spam_cutoff=0.95,
min_dev=0.100, robs=0.010, robx=0.415).  The numbers of false
negatives are printed for each of the 3 parts, as well as a total
count.  These numbers provide an indication of how accurately
bogofilter scores spam (though without an indication of the ham
scoring).

The more interesting results are found next, using the following
method: The ham messages are scored and the results are sorted.  A
target false positive rate of 0.25% (of the ham messages scored, i.e.
52 for test 1 and 59 for test 2) is used to find the cutoff value that
gives 0.25% false positives.  This cutoff value is then used in
scoring the 3 sets of spam to see how many of them are scored below
the cutoff, i.e. how many false negatives occur using the cutoff value.
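
The method just described can be sketched in Python roughly as
follows.  All function names are hypothetical, the sketch assumes a
message counts as spam when its score is at or above the cutoff, and
it ignores tied scores; the real test scripts may differ on each point.

```python
def target_count(n_ham, rate=0.0025):
    """Number of ham messages allowed as false positives at the given rate."""
    return round(n_ham * rate)

def cutoff_for_target(ham_scores, n_fp):
    """Return the score of the n_fp-th highest-scoring ham message.

    Using this value as the cutoff puts n_fp ham messages at or above it
    (more if scores tie, which this sketch does not handle).
    """
    if not 1 <= n_fp <= len(ham_scores):
        raise ValueError("n_fp must be between 1 and the number of ham scores")
    ranked = sorted(ham_scores, reverse=True)
    return ranked[n_fp - 1]

def count_false_negatives(spam_scores, cutoff):
    """Count spam messages that score below the cutoff."""
    return sum(1 for score in spam_scores if score < cutoff)

# Half of the ham is scored: 20,750 messages in test 1, 23,500 in test 2.
print(target_count(20750), target_count(23500))  # 52 59

# Toy scores, invented purely for illustration.
cutoff = cutoff_for_target([0.01, 0.02, 0.10, 0.40, 0.97], 1)
print(cutoff, count_false_negatives([0.99, 0.98, 0.5, 0.96], cutoff))  # 0.97 2
```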

Lastly, I looked at all the results to see the effects of the
parsing modifications.


*** results for standard bogofilter ***

	The smaller test set has 216,030 tokens in the spam messages
	and 259,287 in the ham messages.

	The larger test set has 264,388 tokens in the spam messages
	and 290,914 in the ham messages.

	The default parameters gave 469 false negatives for the
	smaller test set and 527 for the larger test set.  The 0.25%
	target (described above) gives counts of 52 and 59 messages
	for the two tests.  Using it to determine the cutoff value,
	the numbers of false negatives change to 218 and 292, which
	illustrates the importance of the spam_cutoff value.

*** results for modification T (2 character tokens)  ***

	For the smaller test set, the token count for the spam messages
	increased by 3,892 and for the ham messages by 7,939.

	For the larger test set, the increases are 4,345 and 8,022.

	Both increases seem pretty minor.

	For scoring, the default parameters give 477 and 521 false
	negatives, i.e. there was a slight degradation for the smaller test
	and a slight improvement for the larger test.
	
	Using the targeted cutoff, the false negative counts were 211
	and 294 (for small and large tests).  So, there was an
	improvement for the small test and a degradation for the large
	test.
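
The differences from standard bogofilter can be tallied directly from
the false negative counts reported in this message.  The short Python
sketch below does only that arithmetic; the variable and function names
are made up for illustration.

```python
# False negative counts from this message: (small test, large test).
standard = {"default": (469, 527), "targeted": (218, 292)}
mod_t = {"default": (477, 521), "targeted": (211, 294)}

def deltas(base, modified):
    """Return modified minus base per setting; negative means fewer false negatives."""
    return {k: tuple(m - b for b, m in zip(base[k], modified[k])) for k in base}

print(deltas(standard, mod_t))
# {'default': (8, -6), 'targeted': (-7, 2)}
```

The signs flip between the two cutoff settings and between the two
test sets, so no consistent trend shows up for modification T.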

*** results for modification D (DOCTYPE as html indicator)  ***

	For both test sets, the token count for the spam messages
	decreased by 4 and for the ham messages by 93.

	This is an improvement in both cases but is insignificant
	(given the 200,000 to 300,000 words in the database).

	For scoring, the numbers of false negatives were identical to
	those generated with the standard bogofilter.

*** Conclusion ***

	In my tests neither of these modifications showed a useful
	effect.  Past experience indicates that having more
	information available helps bogofilter do a better job.
	Modification T (two character tokens) would seem to fit in the
	"more info" category, but the tests don't support this idea.
	Intuitively, modification D (DOCTYPE means html) would seem to
	produce more accurate parsing.  Again, the tests don't support
	this conclusion.

For anyone interested, the test scripts are available ...

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.1107.small.out
Type: application/octet-stream
Size: 1044 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031107/fea32abc/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.1107.large.out
Type: application/octet-stream
Size: 1045 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031107/fea32abc/attachment-0001.obj>

