header tagging helps a bit

Greg Louis glouis at dynamicro.on.ca
Sun Sep 28 01:27:15 CEST 2003


Hi, everybody:

Current bogofilter, if told to do so, tags tokens in From:, To:, and a
few other headers specifically.  Version 0.15.4 also tags the remaining
headers with the generic prefix "head:".  David and I have been working
on evaluating (1) whether this is worth its associated overhead ("le
calembour, c'est la fiente de l'esprit qui vole" -- the pun's the
dropping of the flying spirit -- Victor Hugo), and (2) how to put it
into a production system without rebuilding the training db.

Executive summary: (1) probably; (2) start tagging now but don't
actually use the tags till there are enough of them.  In the meantime,
evaluate header tokens by combining tagged and untagged token counts.
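
As a rough illustration of what "combining tagged and untagged token
counts" could look like, here is a minimal R sketch.  It assumes that
combining simply means adding the two counts; the token, the counts and
the helper names (count_of, header_count) are all hypothetical, and
bogofilter's actual lookup logic may well differ.

```r
# Hypothetical token counts: names and numbers are invented for illustration.
spam_counts <- c("head:offer" = 3, "offer" = 40)
good_counts <- c("head:offer" = 0, "offer" = 2)

# Look up a count, treating a token that is absent from the db as zero.
count_of <- function(counts, token) {
  if (token %in% names(counts)) counts[[token]] else 0
}

# With degeneration: add the head:-tagged and the untagged counts.
# Without it: use the head:-tagged count only.
header_count <- function(counts, token, degenerate = TRUE) {
  tagged <- count_of(counts, paste0("head:", token))
  if (degenerate) tagged + count_of(counts, token) else tagged
}

print(header_count(spam_counts, "offer", degenerate = TRUE))
print(header_count(spam_counts, "offer", degenerate = FALSE))
```

While few messages have been registered with head: tagging, the tagged
count alone is tiny, which is why the untagged count is borrowed in the
meantime.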

The description that follows can actually be read into, and executed
within, an R session:

---8<--
# Three training db were produced from a corpus of 22283 spam and 18549
# nonspam.  The db were built in directories named all, half and none,
# the names reflecting the portion of the messages that were registered
# with head: tagging.
# 
# A second corpus of 16273 spam and 9659 nonspam was divided into three
# files containing spam and three containing nonspam, in the usual way.
# These were classified with what has been called "degeneration"
# (meaning, in this case, that header tokens were evaluated by combining
# tagged and untagged token counts) and without it (header tokens were
# evaluated with head: counts only).  The classification was performed
# once with each of the three training databases; in factorial terms, we
# have
# 
# training database: all, half, none
# degeneration:      yes, no
# replication:       0, 1, 2
# 
# Because any change of this kind alters the distribution of spam scores,
# it's important to take both fp and fn into account; just to keep things
# simple, I used %fn + %fp as the error variable.  It's ok to do that if
# the variation in fp within a given run is small, as is generally the
# case in the present experiment.  The raw numbers are given in the
# following table:
# 
# database -->          all        half       none
#                     ------------------------------
# degeneration  run    fn  fp     fn  fp     fn  fp
#
#     yes        0     31 172     27 178     33 190
#     yes        1     39 179     37 184     41 183
#     yes        2     34 172     32 172     38 176
#
#     no         0     33 171     31 174     49 185
#     no         1     41 176     40 183     51 178
#     no         2     36 175     34 177     53 176
# 
# The means of %fn + %fp are (degen: with degeneration; tag: head:
# counts only):
#   all/degen  4.29
#   all/tag    4.35
#   half/degen 4.28
#   half/tag   4.37
#   none/degen 4.53
#   none/tag   4.90

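hdrtag <- read.table("/root/hdrtag.tbl")
# Error variable, %fn + %fp; the divisors are the per-file message
# counts (each corpus was split into three files).
hdrtag$percent <- with(hdrtag, fn*100/(9659/3) + fp*100/(16273/3))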

# print(hdrtag, digits=3)
#      db degen run fn  fp percent
# 1   all   yes   0 31 172    4.13
# 2   all   yes   1 39 179    4.51
# 3   all   yes   2 34 172    4.23
# 4   all    no   0 33 171    4.18
# 5   all    no   1 41 176    4.52
# 6   all    no   2 36 175    4.34
# 7  half   yes   0 27 178    4.12
# 8  half   yes   1 37 184    4.54
# 9  half   yes   2 32 172    4.16
# 10 half    no   0 31 174    4.17
# 11 half    no   1 40 183    4.62
# 12 half    no   2 34 177    4.32
# 13 none   yes   0 33 190    4.53
# 14 none   yes   1 41 183    4.65
# 15 none   yes   2 38 176    4.42
# 16 none    no   0 49 185    4.93
# 17 none    no   1 51 178    4.87
# 18 none    no   2 53 176    4.89

hdrtag$run <- factor(hdrtag$run)
summary(aov(percent ~ db + degen + run, data=hdrtag))

#             Df  Sum Sq Mean Sq F value    Pr(>F)    
# db           2 0.62235 0.31118 18.2774 0.0002279 ***
# degen        1 0.13116 0.13116  7.7039 0.0167870 *  
# run          2 0.25229 0.12614  7.4092 0.0080257 ** 
# Residuals   12 0.20430 0.01703                      

# We conclude:

# 1.  Tagging makes a small but significant reduction in the error
#     (%fn + %fp): the db effect has p = 0.0002.
# 2.  As expected, the benefit of degeneration vanishes as the training
#     database acquires significant counts: the gap between the tag and
#     degen means is 0.37 for none, 0.09 for half and 0.06 for all.
#
# It follows that bogoadmins should implement tagging-on-registration
# immediately and tagging-on-classification once the number of tagged
# messages in the training database has become adequate to support it.
---8<--
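
For readers without access to /root/hdrtag.tbl, the analysis can be
re-run from the numbers printed above.  The sketch below rebuilds the
data frame by hand from the print(hdrtag) listing, recomputes the
error variable and the cell means, and fits the same additive model.
The second model, with a db:degen interaction term, is an addition
that is not in the original script; it is one possible way to look at
conclusion 2 directly, not something the post itself ran.

```r
# Rebuild the data frame from the print(hdrtag) listing in the post
# (values copied by hand; no external file is needed).
hdrtag <- data.frame(
  db    = factor(rep(c("all", "half", "none"), each = 6)),
  degen = factor(rep(rep(c("yes", "no"), each = 3), times = 3)),
  run   = factor(rep(0:2, times = 6)),
  fn    = c(31, 39, 34, 33, 41, 36,
            27, 37, 32, 31, 40, 34,
            33, 41, 38, 49, 51, 53),
  fp    = c(172, 179, 172, 171, 176, 175,
            178, 184, 172, 174, 183, 177,
            190, 183, 176, 185, 178, 176)
)

# Same error variable as in the post.
hdrtag$percent <- with(hdrtag, fn * 100 / (9659 / 3) + fp * 100 / (16273 / 3))

# Mean error for each database / degeneration combination.
cell_means <- with(hdrtag, tapply(percent, list(db, degen), mean))
print(round(cell_means, 2))

# Additive model, as in the post.
fit_add <- aov(percent ~ db + degen + run, data = hdrtag)
print(summary(fit_add))

# Assumption / addition: also allow the effect of degeneration to
# differ between databases (db:degen interaction).
fit_int <- aov(percent ~ db * degen + run, data = hdrtag)
print(summary(fit_int))
```

The "yes" column of cell_means corresponds to the degen rows in the
table of means, and the "no" column to the tag rows.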

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |

Header information for this message:
Subject: header tagging helps a bit
     To: bogofilter <bogofilter at aotto.com>
   From: Greg Louis <glouis at dynamicro.on.ca>