results of comparing algorithms
David Relson
relson at osagesoftware.com
Fri Sep 20 20:08:58 CEST 2002
<x-flowed>
Having implemented several varieties of the spamicity calculation
algorithm, I decided it was time to test them with messages not previously
processed by bogofilter. The goal was to see how changes to the algorithm
affect the classification of each message. The 3 algorithms are:
1: original
2: sort/merge
3: weighted
Two initial message sets were used. The first test set contains messages
from a file of manually classified spam, covering approximately the last
month. The second test set contains the messages received by my domain
yesterday (Sep 19). All messages were classified according to the 3
algorithms:
Test #1 -- 293 messages from an mbox file containing manually classified spam.
267 messages were considered spam by all three algorithms
6 messages were considered spam by algorithms 1 & 2, but not 3
1 message was considered good by algorithms 1 & 2, but not 3
19 messages were considered good by all three algorithms
Test #2 -- 118 messages received on Sep 19, not yet classified as spam or good
112 messages were classified as good by all three algorithms.
1 message was classified as good by algorithms 1 & 2, but not 3
5 messages were classified as spam by all three algorithms.
The SPAM_CUTOFF was 0.9f for algorithms 1 & 2 ("original" and
"sort/merge") and 0.5f for #3 ("weighted").
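
To make the role of the cutoff concrete, here is a minimal Python sketch of
the thresholding step. It is not bogofilter's code: the function name
is_spam and the use of ">=" (rather than ">") are assumptions made for
illustration; only the cutoff values come from this message.

```python
# Cutoffs quoted above. How bogofilter compares a score against the
# cutoff (">" versus ">=") is an assumption here.
SPAM_CUTOFF = {"original": 0.9, "sort/merge": 0.9, "weighted": 0.5}


def is_spam(spamicity, algorithm):
    """Return True when the score reaches the cutoff for that algorithm."""
    return spamicity >= SPAM_CUTOFF[algorithm]


# Scores copied from the result lines for msg.073 and msg.275.
print(is_spam(0.942744, "original"))
print(is_spam(0.410380, "weighted"))
print(is_spam(0.503011, "weighted"))
```

Because the cutoffs differ, a score of about 0.5 means spam under
"weighted" but is far from spam under the other two algorithms.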
As can be seen from the numbers above, the original algorithm and its two
variants give roughly the same results. With a word list built from 2641
spam messages and 26489 non-spam messages, the results are correct most of
the time.
The 32 incorrectly or inconsistently classified messages seem "interesting"
and have become test #3. I figure the variation of results indicates the
messages exhibit boundary conditions or something else that is odd.
The messages have been given names that preserve (for posterity and
regression testing) the origin of the message, how it should be classified,
and the 3 classifications. Origin and classification are represented by
letters s, u, and g in the name (with meanings of spam, unclassified, and
good). The 3 algorithm results are represented by letters Y and N (for
spam=Yes,No).
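
As an illustration of the naming scheme, the following Python sketch splits
a name such as msg.073.s-s.yyn into its parts. The exact pattern is
inferred from the names listed in this message, and parse_name, MEANING and
ALGORITHMS are hypothetical helpers, not part of bogofilter.

```python
import re

# Letter meanings as described above.
MEANING = {"s": "spam", "u": "unclassified", "g": "good"}
ALGORITHMS = ("original", "sort/merge", "weighted")

# Assumed layout: msg.<number>.<origin>-<expected>.<three y/n letters>
NAME_RE = re.compile(r"^msg\.(\d+)\.([sug])-([sug])\.([yn]{3})$")


def parse_name(name):
    """Decode origin, expected class, and per-algorithm results."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized message name: {name!r}")
    number, origin, expected, flags = match.groups()
    return {
        "number": int(number),
        "origin": MEANING[origin],
        "expected": MEANING[expected],
        # True means the algorithm called the message spam.
        "results": {alg: flag == "y" for alg, flag in zip(ALGORITHMS, flags)},
    }


info = parse_name("msg.073.s-s.yyn")
print(info["origin"], info["expected"], info["results"])
```

Keeping the expected class in the name means a later run can be checked
against it without consulting a separate list.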
To test for changes in bogofilter, a script was written that would run
bogofilter for all the messages in a directory, test each one using all 3
algorithms, and print a results line for each message. The printout gives
the message name, the three Y/N results, and the spamicity and
classification (Yes or No) for each algorithm.
The script has been run on this directory, with the results re-ordered by
increasing spamicity. Here are the numbers:
msg.070.s-g.nnn NNN 0.000000 No 0.000000 No 0.045012 No
msg.199.s-g.nnn NNN 0.000000 No 0.000000 No 0.113984 No
msg.011.s-g.nnn NNN 0.000000 No 0.000000 No 0.143061 No
msg.160.s-g.nnn NNN 0.000000 No 0.000000 No 0.208042 No
msg.007.s-g.nnn NNN 0.000000 No 0.000000 No 0.312467 No
msg.098.s-g.nnn NNN 0.000000 No 0.000000 No 0.413358 No
msg.173.s-g.nnn NNN 0.000000 No 0.000000 No 0.270965 No
msg.290.s-g.nnn NNN 0.000000 No 0.000000 No 0.306994 No
msg.291.s-g.nnn NNN 0.000000 No 0.000000 No 0.306994 No
msg.275.s-s.nny NNY 0.000000 No 0.000000 No 0.503011 Yes
msg.009.u-s.nny NNY 0.000000 No 0.000000 No 0.528822 Yes
msg.236.s-s.nnn NNN 0.000003 No 0.000003 No 0.342897 No
msg.044.s-s.nnn NNN 0.000011 No 0.000011 No 0.289796 No
msg.282.s-s.nnn NNN 0.000011 No 0.000011 No 0.289705 No
msg.026.s-s.nnn NNN 0.000544 No 0.000544 No 0.422974 No
msg.088.s-s.nnn NNN 0.002420 No 0.002420 No 0.423802 No
msg.144.s-s.nnn NNN 0.011391 No 0.011391 No 0.297213 No
msg.014.s-s.nnn NNN 0.016080 No 0.016080 No 0.297725 No
msg.262.s-s.nnn NNN 0.034299 No 0.034299 No 0.339199 No
msg.025.s-s.nnn NNN 0.035400 No 0.035400 No 0.423005 No
msg.047.s-s.nnn NNN 0.161002 No 0.161002 No 0.405246 No
msg.073.s-s.yyn YYN 0.942744 Yes 0.942744 Yes 0.410380 No
msg.216.s-s.yyn YYN 0.961042 Yes 0.961042 Yes 0.453430 No
msg.224.s-s.yyn YYN 0.961042 Yes 0.961042 Yes 0.453430 No
msg.284.s-s.yyn YYN 0.999558 Yes 0.999558 Yes 0.474782 No
msg.039.s-s.yyn YYN 0.999893 Yes 0.999893 Yes 0.458921 No
msg.040.s-s.yyn YYN 0.999985 Yes 0.999985 Yes 0.432105 No
msg.008.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.660104 Yes
msg.006.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.742529 Yes
msg.004.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.778667 Yes
msg.001.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.893935 Yes
msg.007.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.904471 Yes
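
For regression testing, result lines like the ones above can be checked
against the expected classification stored in each name. The Python sketch
below does this for four of the lines; it is not the script described in
this message, and parse_line and count_correct are hypothetical names.

```python
# Four result lines copied from the listing above.
SAMPLE = """\
msg.070.s-g.nnn NNN 0.000000 No 0.000000 No 0.045012 No
msg.275.s-s.nny NNY 0.000000 No 0.000000 No 0.503011 Yes
msg.073.s-s.yyn YYN 0.942744 Yes 0.942744 Yes 0.410380 No
msg.008.u-s.yyy YYY 1.000000 Yes 1.000000 Yes 0.660104 Yes
"""

ALGORITHMS = ("original", "sort/merge", "weighted")


def parse_line(line):
    """Split one result line into name, expected class, scores, verdicts."""
    fields = line.split()
    name = fields[0]
    # In msg.NNN.o-c.rrr the letter after '-' is the expected class.
    expected_spam = name.split(".")[2].split("-")[1] == "s"
    scores = [float(x) for x in fields[2::2]]
    verdicts = [v == "Yes" for v in fields[3::2]]
    return name, expected_spam, scores, verdicts


def count_correct(text):
    """Count, per algorithm, the lines whose verdict matches the name."""
    correct = dict.fromkeys(ALGORITHMS, 0)
    for line in text.splitlines():
        if not line.strip():
            continue
        _, expected_spam, _, verdicts = parse_line(line)
        for alg, verdict in zip(ALGORITHMS, verdicts):
            if verdict == expected_spam:
                correct[alg] += 1
    return correct


print(count_correct(SAMPLE))
```

Running the same check over the full listing gives one number per
algorithm that can be compared from one bogofilter change to the next.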
</x-flowed>